A collection of the small, miscellaneous problems I run into in everyday study and work.

Linux Services

Nginx

xxx.so is not binary compatible in /etc/nginx/nginx.conf
This usually occurs after compiling an Nginx module, when loading it via load_module fails with this error. The fix is simple: add --with-compat to the configure flags, e.g.
./configure --with-http_image_filter_module=dynamic --with-compat
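
Once the module is rebuilt this way, it should load cleanly via load_module. A quick sanity check, with an illustrative module path that depends on the install layout:

$ grep load_module /etc/nginx/nginx.conf
load_module modules/ngx_http_image_filter_module.so;
$ sudo nginx -t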

Kubernetes

Database & Middleware Issues

MongoDB

The ReplicaSetNoPrimary problem

Error log:

server selection error: server selection timeout, current topology: { Type: ReplicaSetNoPrimary, Servers: [{ Addr: xxx:27017, Type: Unknown, Last error: connection() error occured during connection handshake: connection(xxx:27017[-127]) socket was unexpectedly closed: EOF }, { Addr: xxx:27017, Type: Unknown, Last error: connection() error occured during connection handshake: connection(xxx:27017[-128]) socket was unexpectedly closed: EOF }, ] }

On the surface, once the SDK connects to Mongo, the server hands back a replica-set member list containing addresses the client cannot reach (it returned the cluster Pod names). By default the mongo-go-driver performs server discovery and attempts to connect to the whole replica set, so it ends up chasing those unreachable internal names. The solution: to connect to a single node instead, specify connect=direct in the connection URI. In PyMongo the equivalent is passing directConnection=True, and most other drivers offer a similar option.
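
As a minimal sketch, assuming the node is reachable at the (elided) address from the log above, a direct connection looks like this in mongosh; the same URI option works for most modern drivers:

$ mongosh "mongodb://xxx:27017/?directConnection=true"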

Storage Issues

Longhorn PVC cannot be mounted after creation

Symptoms

  • The Pod stays stuck in ContainerCreating or an Init state
  • The PVC itself has been created correctly

kubectl describe on the Pod shows: /dev/longhorn/pvc-f741d71f-c210-4707-85c6-39dc9e85d835 is apparently in use by the system; will not make a filesystem here!

Check the CSI plugin logs:

$ kubectl logs -n longhorn-system longhorn-csi-plugin-nzzzp -c longhorn-csi-plugin

E0407 06:08:27.528594 3158170 mount_linux.go:184] Mount failed: exit status 32
Mounting command: mount
Mounting arguments: -t ext4 -o defaults /dev/longhorn/pvc-daf9bf27-633f-4115-9ff2-cc7c4911ce73 /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-daf9bf27-633f-4115-9ff2-cc7c4911ce73/globalmount
Output: mount: /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-daf9bf27-633f-4115-9ff2-cc7c4911ce73/globalmount: /dev/longhorn/pvc-daf9bf27-633f-4115-9ff2-cc7c4911ce73 already mounted or mount point busy.

time="2022-04-07T06:08:27Z" level=error msg="NodeStageVolume: err: rpc error: code = Internal desc = mount failed: exit status 32\nMounting command: mount\nMounting arguments: -t ext4 -o defaults /dev/longhorn/pvc-daf9bf27-633f-4115-9ff2-cc7c4911ce73 /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-daf9bf27-633f-4115-9ff2-cc7c4911ce73/globalmount\nOutput: mount: /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-daf9bf27-633f-4115-9ff2-cc7c4911ce73/globalmount: /dev/longhorn/pvc-daf9bf27-633f-4115-9ff2-cc7c4911ce73 already mounted or mount point busy.\n"

Cause

This is caused by multipath creating a multipath device for any eligible device path including every Longhorn volume device not explicitly blacklisted.
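
If that is the case, the claim should be visible on the node; multipath -ll lists every device multipath has taken over, and the Longhorn volume would appear among them:

$ sudo multipath -ll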

Solution

  1. Find the major:minor number of the Longhorn device. On the node, try ls -l /dev/longhorn/. The major:minor number will be shown as e.g. 8, 32 before the device name.
  2. Find what’s the device generated by Linux for the same major:minor number. Use ls -l /dev and find the device for the same major:minor number, e.g. /dev/sde.
  3. Find the process. Use lsof to get the list of file handlers in use, then grep for the device name (e.g. sde or /dev/longhorn/xxx); you should find one there, as in the session below.
# Problem volume: pvc-daf9bf27-633f-4115-9ff2-cc7c4911ce73
$ ls /dev/longhorn -l | grep pvc-daf9bf27-633f-4115-9ff2-cc7c4911ce73
brw-rw---- 1 root root 65, 224 Apr  7 05:57 pvc-daf9bf27-633f-4115-9ff2-cc7c4911ce73

$ ls /dev -l | grep "65, 224"
brw-rw----  1 root disk     65, 224 Apr  7 05:57 sdae
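
# Step 3: find the process holding the device open; in this scenario it
# is expected to be multipathd (command illustrative, output omitted)
$ sudo lsof /dev/sdae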

$ sudo vim /etc/multipath.conf
# Append (or create) the following blacklist section
blacklist {
    devnode "^sda[a-z0-9]+"
}

$ sudo systemctl restart multipathd.service
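
After the restart, verifying that the device was released and then deleting the stuck Pod lets kubelet retry the mount; the Pod name and namespace below are placeholders:

$ sudo multipath -ll | grep sdae || echo "released"
$ kubectl delete pod <stuck-pod> -n <namespace>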

Cluster Issues

MetricsServer network issues

Symptoms

"Failed to scrape node" err="Get \"https://10.10.3.118:10250/stats/summary?only_cpu_and_memory=true\": dial tcp 10.10.3.118:10250: connect: no route to host" node="moying-r350-1"
I1226 06:31:31.849631       1 server.go:188] "Failed probe" probe="metric-storage-ready" err="not metrics to serve"
I1226 06:31:33.849166       1 server.go:188] "Failed probe" probe="metric-storage-ready" err="not metrics to serve"
I1226 06:31:35.849019       1 server.go:188] "Failed probe" probe="metric-storage-ready" err="not metrics to serve"
I1226 06:31:37.849142       1 server.go:188] "Failed probe" probe="metric-storage-ready" err="not metrics to serve"

Meanwhile the readiness probe keeps returning 500, and the component never becomes ready.

Troubleshooting

$ curl "https://10.10.3.118:10250/stats/summary?only_cpu_and_memory=true" -k
Unauthorized%

Hitting the URL from the log by hand works fine (the 401 Unauthorized shows the kubelet endpoint itself is up), so the problem is that the cluster cannot route to it. That points at iptables or the CNI; IP forwarding in /etc/sysctl.conf was checked and is already enabled.
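
A quick way to confirm the routing suspicion from the failing side is to ask the kernel which route it would pick for the kubelet address in the log:

$ ip route get 10.10.3.118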

Solution

Switch Flannel from the default VXLAN backend to host-gw mode, then flush the stale firewall rules:

$ sudo vim `systemctl status k3s 2>&1|grep loaded|awk -F '[(;]' '{print $2}'`
# add the following flags to the unit file
...
'--flannel-backend' \
        'host-gw'
...
$ sudo systemctl daemon-reload 
$ sudo systemctl restart k3s.service
$ sudo cat /var/lib/rancher/k3s/agent/etc/flannel/net-conf.json
{
        "Network": "10.42.0.0/16",
        "EnableIPv6": false,
        "IPv6Network": "::/0",
        "Backend": {
                "Type": "host-gw"
        }
}
$ sudo iptables -t nat --flush
$ sudo iptables --flush
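
Once Flannel is back up in host-gw mode, metrics-server should start scraping again; kubectl top is a quick end-to-end check:

$ kubectl top nodes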

Deleting a Namespace stuck in Terminating

In some cases, after issuing a delete for a Namespace, it stays in the Terminating state indefinitely. Inspect the namespace details:

$ kubectl get ns longhorn-system -o yaml
...
        instances, nodes.longhorn.io has 1 resource instances'
      reason: SomeResourcesRemain
      status: "True"
      type: NamespaceContentRemaining
    - lastTransitionTime: "2022-03-29T01:40:54Z"
      message: 'Some content in the namespace has finalizers remaining: longhorn.io
        in 2 resource instances'
      reason: SomeFinalizersRemain
      status: "True"
      type: NamespaceFinalizersRemaining
    phase: Terminating

List and delete the remaining resources in the namespace the ordinary way:

$ kubectl api-resources -o name --verbs=list --namespaced | xargs -n 1 kubectl get --show-kind --ignore-not-found -n longhorn-system
NAME                                  STATE      IMAGE                                    REFCOUNT   BUILDDATE   AGE
engineimage.longhorn.io/ei-b907910b   deployed   longhornio/longhorn-engine:master-head   0          25d         16m
NAME                                   READY   ALLOWSCHEDULING   SCHEDULABLE   AGE
node.longhorn.io/debian-gnu-linux-10   True    true              True          16m
$ kubectl delete crd engineimages.longhorn.io
customresourcedefinition.apiextensions.k8s.io "engineimages.longhorn.io" deleted

Most of the time, however, these resources cannot be cleaned up, even with a force delete (in my case it hung while deleting the CRD):

$ kubectl delete ns longhorn-system --force --grace-period=0
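
When the hang is caused by finalizers on the remaining objects, it is sometimes enough to clear those finalizers directly before touching the namespace itself; a sketch for the CRD above:

$ kubectl patch crd engineimages.longhorn.io --type=merge -p '{"metadata":{"finalizers":[]}}'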

Deleting via the raw API

Dump the Namespace definition as JSON, then edit the file and delete the spec section (which holds the kubernetes finalizer):

$ kubectl get ns longhorn-system -o json > ns.json
$ vim ns.json
 "kubernetes.io/metadata.name": "longhorn-system"
        },
        "name": "longhorn-system",
        "resourceVersion": "9604",
        "uid": "18036516-0a03-4a40-8353-ba8e68729fbf"
    },
    "spec": {
        "finalizers": [
            "kubernetes"
        ]
    },
    "status": {
        "conditions": [
            {
                "lastTransitionTime": "2022-03-29T01:40:54Z",
                "message": "All resources successfully discovered",
                "reason": "ResourcesDiscovered",
                "status": "False",
                "type": "NamespaceDeletionDiscoveryFailure"
            },

$ curl  -k -H "Content-Type:application/json" -X PUT --data-binary @ns.json https://127.0.0.1:6443/api/v1/namespaces/longhorn-system/finalize
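
If the API server rejects this anonymous request on port 6443, a common variant is to go through kubectl proxy, which attaches the kubeconfig credentials (8001 is the proxy's default port):

$ kubectl proxy &
$ curl -H "Content-Type: application/json" -X PUT --data-binary @ns.json http://127.0.0.1:8001/api/v1/namespaces/longhorn-system/finalize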