Preface

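The SQLite datastore bundled with K3s is more than enough for ordinary small embedded services. For our company's high-frequency scheduling workloads, which routinely eat several cores, it barely keeps up with day-to-day requests and occasionally produces strange bugs. More than once I have hit a case where the default ServiceAccount could not be bound to the resources in a Namespace in time, so SpringCloud never started properly. A restart works around it, but that is a long way from a highly available edge deployment.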

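Below is the simplified architecture diagram from the official K3s documentation. We previously replaced the default LB with MetalLB and were very happy with the result, so this time we patch another piece. According to the documentation, K3s supports the following external datastores: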


K3s supports the following datastore options

  • Embedded SQLite
  • PostgreSQL (certified against versions 10.7, 11.5, and 14.2)
  • MySQL (certified against versions 5.7 and 8.0)
  • MariaDB (certified against version 10.6.8)
  • Etcd (certified against version 3.5.4)
  • Embedded etcd for High Availability

[Figure: k3s-architecture]


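Because I want to prepare the interfaces for Cluster Metrics later, I am not considering embedded etcd for now and run etcd externally instead. That gives a direct view of how the whole cluster is running and makes monitoring easier, since K3s cuts a lot of things to stay lightweight.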

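As we all know, the default datastore for K8s is etcd. K3s uses a component called Kine to translate etcd K/V operations into relational database statements. Kine exposes its own interface to the K3s API server, so from the cluster components' point of view they are still reading and writing etcd. If we configure a real etcd backend, however, Kine steps aside and hands the real etcd servers directly to the K3s API server.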
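As a rough illustration of the routing described above, the sketch below classifies a --datastore-endpoint value by its URL scheme. The function name datastore_backend is hypothetical and only mirrors the documented endpoint formats; it is not how K3s is implemented internally.

```shell
# Classify a K3s --datastore-endpoint value by its URL scheme.
# datastore_backend is a made-up helper for illustration only.
datastore_backend() {
  case "$1" in
    "")                  echo "embedded sqlite (via kine)" ;;
    postgres://*)        echo "postgresql (via kine)" ;;
    mysql://*)           echo "mysql/mariadb (via kine)" ;;
    http://*|https://*)  echo "etcd (direct, kine bypassed)" ;;
    *)                   echo "unknown" ;;
  esac
}

datastore_backend "https://127.0.0.1:2379"   # prints: etcd (direct, kine bypassed)
```

The endpoint used later in this post falls into the last real branch, which is why the Kine log lines disappear once etcd is wired in.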

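PS: If you do not want to operate etcd yourself, or do not need that level of control, the embedded HA datastore is the recommended route. Older K3s releases used DQLite (a highly available SQLite) for this; the linked page now describes embedded etcd, which replaced it in newer releases. See the official documentation (https://docs.k3s.io/installation/ha-embedded)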

You can also see Kine's status in the startup log

$ sudo k3s server
INFO[0000] Starting k3s v1.24.4+k3s1 (c3f830e9)
INFO[0000] Configuring sqlite3 database connection pooling: maxIdleConns=2, maxOpenConns=0, connMaxLifetime=0s
INFO[0000] Configuring database table schema and indexes, this may take a moment...
INFO[0000] Database tables and indexes are up to date
INFO[0000] Kine available at unix://kine.sock
INFO[0000] Reconciling bootstrap data between datastore and disk

Deployment

First, locate the current SQLite files so they can be backed up or inspected later

$ sudo ls /var/lib/rancher/k3s/server/db/ -alh
total 22M
drwx------ 2 root root 4.0K Aug 30 02:19 .
drwx------ 8 root root 4.0K Sep 12 08:47 ..
-rw-r--r-- 1 root root  11M Nov 18 07:03 state.db
-rw-r--r-- 1 root root  32K Nov 18 07:06 state.db-shm
-rw-r--r-- 1 root root  11M Nov 18 07:06 state.db-wal
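Since the point of locating these files is to back them up, here is a minimal sketch of such a backup. The helper name backup_k3s_sqlite and the default destination are my own assumptions; stopping K3s first (sudo systemctl stop k3s) is advisable so that state.db and its WAL file are consistent.

```shell
# Copy state.db together with its -shm and -wal companions.
# backup_k3s_sqlite is a hypothetical helper, not part of K3s.
backup_k3s_sqlite() {
  src="${1:-/var/lib/rancher/k3s/server/db}"
  dest="${2:-./k3s-db-backup-$(date +%Y%m%d-%H%M%S)}"
  mkdir -p "$dest" || return 1
  # The glob matches state.db, state.db-shm and state.db-wal.
  cp -a "$src"/state.db* "$dest"/ || return 1
  echo "$dest"
}

# Usage (needs root, because the db directory is mode 0700):
#   sudo sh -c '. ./backup.sh; backup_k3s_sqlite'
```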

Installing etcd

Download the binaries

$ mkdir etcd && cd etcd
$ wget https://github.com/etcd-io/etcd/releases/download/v3.4.22/etcd-v3.4.22-linux-amd64.tar.gz
$ tar xvf etcd-v3.4.22-linux-amd64.tar.gz
$ cd etcd-v3.4.22-linux-amd64/
$ sudo cp etcd* /usr/bin/

$ etcdctl version
etcdctl version: 3.4.22
API version: 3.4

$ etcd --version
etcd Version: 3.4.22
Git SHA: 1f05498
Go Version: go1.16.15
Go OS/Arch: linux/amd64

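Generate a CA key pair (optional)

By default etcd accepts unauthenticated, unencrypted connections. Here we change it so that a certificate is required to connect. If you have no security requirements, you can skip this step.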

$ mkdir ~/bin
$ curl -s -L -o ~/bin/cfssl https://pkg.cfssl.org/R1.2/cfssl_linux-amd64
$ curl -s -L -o ~/bin/cfssljson https://pkg.cfssl.org/R1.2/cfssljson_linux-amd64
$ chmod +x ~/bin/{cfssl,cfssljson}
$ export PATH=$PATH:~/bin

Configure the CA information

$ mkdir ~/cfssl
$ cd ~/cfssl

Write the following into the config files, setting the expiry to 50 years (438000h)

ca-config.json

{
    "signing": {
        "default": {
            "expiry": "438000h"
        },
        "profiles": {
            "server": {
                "expiry": "438000h",
                "usages": [
                    "signing",
                    "key encipherment",
                    "server auth"
                ]
            },
            "client": {
                "expiry": "438000h",
                "usages": [
                    "signing",
                    "key encipherment",
                    "client auth"
                ]
            },
            "peer": {
                "expiry": "438000h",
                "usages": [
                    "signing",
                    "key encipherment",
                    "server auth",
                    "client auth"
                ]
            }
        }
    }
}
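The 438000h value in ca-config.json is just 50 years expressed in hours, ignoring leap days. If you prefer a different lifetime, a tiny helper like the one below does the conversion; years_to_hours is a name I made up for this sketch.

```shell
# Convert a lifetime in years to the "<hours>h" form cfssl expects.
# Uses 365-day years, so leap days are ignored.
years_to_hours() {
  echo "$(( $1 * 365 * 24 ))h"
}

years_to_hours 50   # prints 438000h
```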

ca-csr.json

{
    "CN": "EtcdCA",
    "key": {
        "algo": "ecdsa",
        "size": 256
    },
    "names": [
        {
            "C": "US",
            "L": "CA",
            "ST": "San Francisco"
        }
    ]
}

Next, generate the CA certificate from the config files declared above

$ cfssl gencert -initca ca-csr.json | cfssljson -bare ca -
2022/11/21 02:21:47 [INFO] generating a new CA key and certificate from CSR
2022/11/21 02:21:47 [INFO] generate received request
2022/11/21 02:21:47 [INFO] received CSR
2022/11/21 02:21:47 [INFO] generating key: ecdsa-256
2022/11/21 02:21:47 [INFO] encoded CSR
2022/11/21 02:21:47 [INFO] signed certificate with serial number 674210528988775972581910506961594971423963387318
$ ls -al
total 28
drwxrwxr-x  2 example example 4096 Nov 21 02:21 .
drwxr-x--- 11 example example 4096 Nov 21 02:20 ..
-rw-rw-r--  1 example example  836 Nov 21 02:17 ca-config.json
-rw-r--r--  1 example example  420 Nov 21 02:21 ca.csr
-rw-rw-r--  1 example example  211 Nov 21 02:20 ca-csr.json
-rw-------  1 example example  227 Nov 21 02:21 ca-key.pem
-rw-rw-r--  1 example example  733 Nov 21 02:21 ca.pem

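Issue the server and client certificates from the CA (optional)

The CA certificate is only an authority that we signed ourselves. To actually use it, we need to issue a server certificate signed by that CA, and the certificate config has to include identifying information such as the IPs and hostnames it will be served from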

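$ cfssl print-defaults csr > server.json # generate the default config
$ cat server.json # add the addresses to sign for in the hosts field; if the IP or hostname changes later the certificate must be reissued; include the machine's hostname as well, otherwise the log keeps reporting errors
{
    "CN": "etcd",
    "hosts": [
        "127.0.0.1",
        "10.10.3.104",
        "localhost",
        "etcd-cluster",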
        "example-104"
    ],
    "key": {
        "algo": "ecdsa",
        "size": 256
    },
    "names": [
        {
            "C": "US",
            "L": "CA",
            "ST": "San Francisco"
        }
    ]
}
$ cfssl gencert -ca=ca.pem \
	-ca-key=ca-key.pem -config=ca-config.json \
	-profile=server server.json | cfssljson -bare server # issue the certificate from the config file and the CA
2022/11/21 02:33:23 [INFO] generate received request
2022/11/21 02:33:23 [INFO] received CSR
2022/11/21 02:33:23 [INFO] generating key: ecdsa-256
2022/11/21 02:33:23 [INFO] encoded CSR
2022/11/21 02:33:23 [INFO] signed certificate with serial number 496743626869441272572263826912386629066258177258
2022/11/21 02:33:23 [WARNING] This certificate lacks a "hosts" field. This makes it unsuitable for
websites. For more information see the Baseline Requirements for the Issuance and Management
of Publicly-Trusted Certificates, v.1.1.6, from the CA/Browser Forum (https://cabforum.org);
specifically, section 10.2.3 ("Information Requirements").

$ cfssl print-defaults csr > client.json
$ cat client.json # leave hosts empty so that no restriction is placed on the client
{
    "CN": "client",
    "hosts": [""],
    "key": {
        "algo": "ecdsa",
        "size": 256
    },
    "names": [
        {
            "C": "US",
            "L": "CA",
            "ST": "San Francisco"
        }
    ]
}
$ cfssl gencert -ca=ca.pem \
	-ca-key=ca-key.pem \
	-config=ca-config.json \
	-profile=client client.json | cfssljson -bare client
2022/11/21 03:46:49 [INFO] generate received request
2022/11/21 03:46:49 [INFO] received CSR
2022/11/21 03:46:49 [INFO] generating key: ecdsa-256
2022/11/21 03:46:49 [INFO] encoded CSR
2022/11/21 03:46:49 [INFO] signed certificate with serial number 585341329955110412865129565126433228043943555598
2022/11/21 03:46:49 [WARNING] This certificate lacks a "hosts" field. This makes it unsuitable for
websites. For more information see the Baseline Requirements for the Issuance and Management
of Publicly-Trusted Certificates, v.1.1.6, from the CA/Browser Forum (https://cabforum.org);
specifically, section 10.2.3 ("Information Requirements").

$ ls -al
total 60
drwxrwxr-x  2 example example 4096 Nov 21 03:46 .
drwxr-x--- 11 example example 4096 Nov 21 03:45 ..
-rw-rw-r--  1 example example  836 Nov 21 02:17 ca-config.json
-rw-r--r--  1 example example  420 Nov 21 02:21 ca.csr
-rw-rw-r--  1 example example  211 Nov 21 02:20 ca-csr.json
-rw-------  1 example example  227 Nov 21 02:21 ca-key.pem
-rw-rw-r--  1 example example  733 Nov 21 02:21 ca.pem
-rw-r--r--  1 example example  460 Nov 21 03:46 client.csr
-rw-rw-r--  1 example example  230 Nov 21 03:45 client.json
-rw-------  1 example example  227 Nov 21 03:46 client-key.pem
-rw-rw-r--  1 example example  774 Nov 21 03:46 client.pem
-rw-r--r--  1 example example  509 Nov 21 02:33 server.csr
-rw-rw-r--  1 example example  305 Nov 21 02:33 server.json
-rw-------  1 example example  227 Nov 21 02:33 server-key.pem
-rw-rw-r--  1 example example  818 Nov 21 02:33 server.pem
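Because cfssl prints the "lacks a hosts field" warning even for the server certificate, it is worth checking what was actually issued. The sketch below uses plain openssl commands; the wrapper name check_cert is hypothetical, and it assumes openssl is installed.

```shell
# Verify a certificate against the CA and print its subject, expiry and SANs.
# check_cert is a hypothetical wrapper around standard openssl subcommands.
check_cert() {
  ca="$1"
  cert="$2"
  if [ ! -r "$ca" ] || [ ! -r "$cert" ]; then
    echo "cannot read $ca or $cert" >&2
    return 1
  fi
  # Fails (non-zero) if the certificate was not signed by this CA.
  openssl verify -CAfile "$ca" "$cert" || return 1
  openssl x509 -in "$cert" -noout -subject -enddate
  # Prints the SAN list; a client cert with empty hosts prints nothing here.
  openssl x509 -in "$cert" -noout -text | grep -A1 'Subject Alternative Name' || true
}

# Usage, from inside ~/cfssl:
#   check_cert ca.pem server.pem
#   check_cert ca.pem client.pem
```

For server.pem the SAN line should list the same IPs and hostnames that were put into the hosts field of server.json.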

Keep this directory somewhere safe so it can be reused later
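The etcd and K3s commands later in this post read the certificates from /etc/etcd, so they have to be copied there at some point. A minimal sketch, assuming the files were generated in ~/cfssl as above; install_etcd_certs is a hypothetical helper, and the modes are my own choice (private keys restricted to the owner).

```shell
# Copy the CA, server and client material into the directory etcd reads from.
# install_etcd_certs is a hypothetical helper; paths default to the ones used in this post.
install_etcd_certs() {
  src="${1:-$HOME/cfssl}"
  dest="${2:-/etc/etcd}"
  mkdir -p "$dest" || return 1
  for f in ca.pem server.pem client.pem; do
    install -m 0644 "$src/$f" "$dest/$f" || return 1
  done
  # Private keys: readable by the owner only.
  for f in server-key.pem client-key.pem; do
    install -m 0600 "$src/$f" "$dest/$f" || return 1
  done
}

# Usage: run as root so the files end up owned by root, e.g.
#   sudo sh -c '. ./certs.sh; install_etcd_certs /home/example/cfssl /etc/etcd'
```

With the keys at mode 0600 and owned by root, commands such as curl or etcdctl that use client-key.pem need to run under sudo; loosen the mode only if that is acceptable in your environment.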

Write the systemd service file

This is the simplest unit file, without any TLS encryption

$ cat etcd.service
[Unit]
Description=Etcd DataStore Service
After=network.target
StartLimitIntervalSec=0

[Service]
Type=simple
Restart=always
RestartSec=1
User=root
ExecStart=/usr/bin/etcd

[Install]
WantedBy=multi-user.target

$ sudo cp etcd.service /etc/systemd/system/
$ sudo systemctl daemon-reload
$ sudo systemctl restart etcd.service
$ sudo systemctl status etcd.service
● etcd.service - Etcd DataStore Service
     Loaded: loaded (/etc/systemd/system/etcd.service; disabled; vendor preset: enabled)
     Active: active (running) since Fri 2022-11-18 07:23:11 UTC; 6s ago
   Main PID: 1507 (etcd)
      Tasks: 10 (limit: 18879)
     Memory: 5.7M
        CPU: 98ms
     CGroup: /system.slice/etcd.service
             └─1507 /usr/bin/etcd

Nov 18 07:23:12 example-104 etcd[1507]: raft2022/11/18 07:23:12 INFO: 8e9e05c52164694d became candidate at term 2
Nov 18 07:23:12 example-104 etcd[1507]: raft2022/11/18 07:23:12 INFO: 8e9e05c52164694d received MsgVoteResp from 8e9e05c5216>
Nov 18 07:23:12 example-104 etcd[1507]: raft2022/11/18 07:23:12 INFO: 8e9e05c52164694d became leader at term 2
Nov 18 07:23:12 example-104 etcd[1507]: raft2022/11/18 07:23:12 INFO: raft.node: 8e9e05c52164694d elected leader 8e9e05c5216>
Nov 18 07:23:12 example-104 etcd[1507]: setting up the initial cluster version to 3.4
Nov 18 07:23:12 example-104 etcd[1507]: published {Name:default ClientURLs:[http://localhost:2379]} to cluster cdf818194e3a8>
Nov 18 07:23:12 example-104 etcd[1507]: ready to serve client requests
Nov 18 07:23:12 example-104 etcd[1507]: serving insecure client requests on 127.0.0.1:2379, this is strongly discouraged!
Nov 18 07:23:12 example-104 etcd[1507]: set the initial cluster version to 3.4
Nov 18 07:23:12 example-104 etcd[1507]: enabled capabilities for version 3.4

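If you use TLS, you can pick whichever authentication mode you prefer:

  • Client-to-server transport security with HTTPS: etcd starts with a server certificate, and incoming clients verify it against the CA
  • Client-to-server authentication with HTTPS client certificates: the client presents a certificate to the server, and the server checks that it was signed by the CA before deciding whether to serve the request

Each has its advantages. With the first, you can place the CA certificate in the system's trusted certificate directory, usually /etc/pki/tls/certs or /etc/ssl/certs, which suits a fully trusted environment because every application on the system then passes verification. The second is more secure: the client has to present its certificate on every connection, and you simply configure it when starting K3s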

Startup with TLS configured

$ cat etcd.service
[Unit]
Description=Etcd DataStore Service
After=network.target
StartLimitIntervalSec=0

[Service]
Type=simple
Restart=always
RestartSec=1
User=root
ExecStart=/usr/bin/etcd --name infra0 --data-dir infra0 \
        --client-cert-auth \
        --trusted-ca-file=/etc/etcd/ca.pem \
        --cert-file=/etc/etcd/server.pem \
        --key-file=/etc/etcd/server-key.pem \
        --listen-client-urls=https://127.0.0.1:2379 \
        --listen-peer-urls=http://127.0.0.1:2380 \
        --advertise-client-urls=https://127.0.0.1:2379 \
        --initial-cluster=infra0=http://127.0.0.1:2380 \
        --initial-advertise-peer-urls=http://127.0.0.1:2380

[Install]
WantedBy=multi-user.target

$ sudo systemctl status etcd.service
● etcd.service - Etcd DataStore Service
     Loaded: loaded (/etc/systemd/system/etcd.service; disabled; vendor preset: enabled)
     Active: active (running) since Mon 2022-11-21 03:30:05 UTC; 2s ago
   Main PID: 6305 (etcd)
      Tasks: 9 (limit: 18879)
     Memory: 5.7M
        CPU: 99ms
     CGroup: /system.slice/etcd.service
             └─6305 /usr/bin/etcd --name infra0 --data-dir infra0 --client-cert-auth --trusted-ca-file=/etc/etcd/ca.pem --cert-file=/etc/etcd/server.pem --key-file=/etc>

Nov 21 03:30:05 example-104 etcd[6305]: listening for peers on 127.0.0.1:2380
Nov 21 03:30:07 example-104 etcd[6305]: raft2022/11/21 03:30:07 INFO: 8e9e05c52164694d is starting a new election at term 3
Nov 21 03:30:07 example-104 etcd[6305]: raft2022/11/21 03:30:07 INFO: 8e9e05c52164694d became candidate at term 4
Nov 21 03:30:07 example-104 etcd[6305]: raft2022/11/21 03:30:07 INFO: 8e9e05c52164694d received MsgVoteResp from 8e9e05c52164694d at term 4
Nov 21 03:30:07 example-104 etcd[6305]: raft2022/11/21 03:30:07 INFO: 8e9e05c52164694d became leader at term 4
Nov 21 03:30:07 example-104 etcd[6305]: raft2022/11/21 03:30:07 INFO: raft.node: 8e9e05c52164694d elected leader 8e9e05c52164694d at term 4
Nov 21 03:30:07 example-104 etcd[6305]: published {Name:infra0 ClientURLs:[https://127.0.0.1:2379]} to cluster cdf818194e3a8c32
Nov 21 03:30:07 example-104 etcd[6305]: ready to serve client requests
Nov 21 03:30:07 example-104 etcd[6305]: serving client requests on 127.0.0.1:2379

Also, if you use the first mode, the following flags are all you need

$ etcd --name infra0 --data-dir infra0 \
  --cert-file=/path/to/server.crt --key-file=/path/to/server.key \
  --advertise-client-urls=https://127.0.0.1:2379 --listen-client-urls=https://127.0.0.1:2379

Verification

$ curl https://127.0.0.1:2379 # direct connection, without the CA
curl: (60) SSL certificate problem: unable to get local issuer certificate
More details here: https://curl.se/docs/sslcerts.html

curl failed to verify the legitimacy of the server and therefore could not
establish a secure connection to it. To learn more about this situation and
how to fix it, please visit the web page mentioned above.

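# With the CA and the client certificate and key, the TLS handshake succeeds (the 404 only means the root path has no handler)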
$ curl --cacert /etc/etcd/ca.pem --cert /etc/etcd/client.pem --key /etc/etcd/client-key.pem https://127.0.0.1:2379/ -I
HTTP/2 404
access-control-allow-headers: accept, content-type, authorization
access-control-allow-methods: POST, GET, OPTIONS, PUT, DELETE
access-control-allow-origin: *
content-type: text/plain; charset=utf-8
x-content-type-options: nosniff
content-length: 19
date: Mon, 21 Nov 2022 03:52:44 GMT
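A 404 on the root path proves that TLS works but says little about etcd itself. etcd also serves a /health endpoint on the client port, so a slightly more telling check could look like the sketch below. The helper names is_healthy and etcd_health are hypothetical, and the certificate paths are the ones assumed throughout this post.

```shell
# Return 0 when an etcd /health response body reports a healthy member.
is_healthy() {
  printf '%s' "$1" | grep -q '"health" *: *"true"'
}

# Query /health over TLS with the client certificate and evaluate the answer.
etcd_health() {
  endpoint="${1:-https://127.0.0.1:2379}"
  body=$(curl -s --cacert /etc/etcd/ca.pem \
    --cert /etc/etcd/client.pem --key /etc/etcd/client-key.pem \
    "$endpoint/health") || return 1
  is_healthy "$body"
}

# Usage:
#   etcd_health && echo "etcd is healthy"
```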

Cluster deployment

With the external datastore in place, it is time to deploy the cluster itself

$ K3S_VERSION=v1.22.7+k3s1
$ curl -sfL https://get.k3s.io/ | K3S_TOKEN=SECRET \
	INSTALL_K3S_VERSION=$K3S_VERSION sh -s - --disable=servicelb \
	--write-kubeconfig-mode=0644 --no-deploy traefik \
	--datastore-endpoint="https://127.0.0.1:2379" \
	--datastore-cafile="/etc/etcd/ca.pem" \
	--datastore-certfile="/etc/etcd/client.pem" \
	--datastore-keyfile="/etc/etcd/client-key.pem"

$ cat /var/log/syslog # optional, shows the details when something goes wrong

$ kubectl get nodes # check that the cluster is ready
NAME         STATUS   ROLES                  AGE     VERSION
example-104   Ready    control-plane,master   3m19s   v1.22.7+k3s1
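One thing the systemctl status output earlier hints at: etcd.service is shown as disabled, so after a reboot K3s would come up without its datastore. A possible fix, sketched under the assumption that the install script created a unit named k3s.service, is to enable etcd and add a drop-in that orders K3s after it. The helper name and the drop-in file name are my own.

```shell
# Write a systemd drop-in so that k3s.service starts after etcd.service.
# write_k3s_etcd_dropin and 10-after-etcd.conf are hypothetical names.
write_k3s_etcd_dropin() {
  dir="${1:-/etc/systemd/system/k3s.service.d}"
  mkdir -p "$dir" || return 1
  cat > "$dir/10-after-etcd.conf" <<'EOF'
[Unit]
After=etcd.service
Wants=etcd.service
EOF
}

# Usage (as root):
#   write_k3s_etcd_dropin
#   systemctl enable etcd.service
#   systemctl daemon-reload
```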

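Some handy etcd commands

Since the datastore is now external, the component has to be operated on its own. A few related commands can be set up as follows; I will not list more than these

# Append the environment variables to your profile or equivalent shell config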

$ vim ~/.zshrc
export ETCDCTL_ENDPOINTS='https://127.0.0.1:2379'
export ETCDCTL_CACERT='/etc/etcd/ca.pem'
export ETCDCTL_CERT='/etc/etcd/client.pem'
export ETCDCTL_KEY='/etc/etcd/client-key.pem'
export ETCDCTL_API=3

$ etcdctl member list -w table # show the etcd member status
+------------------+---------+--------+-----------------------+------------------------+------------+
|        ID        | STATUS  |  NAME  |      PEER ADDRS       |      CLIENT ADDRS      | IS LEARNER |
+------------------+---------+--------+-----------------------+------------------------+------------+
| 8e9e05c52164694d | started | infra0 | http://localhost:2380 | https://127.0.0.1:2379 |      false |
+------------------+---------+--------+-----------------------+------------------------+------------+

$ etcdctl check perf # performance test
 60 / 60 Boooooooooooooooooooooo! 100.00% 1m0s
PASS: Throughput is 150 writes/s
PASS: Slowest request took 0.381208s
PASS: Stddev is 0.034952s
PASS

$ etcdctl endpoint status -w table
+------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|        ENDPOINT        |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| https://127.0.0.1:2379 | 8e9e05c52164694d |  3.4.22 |   24 MB |      true |      false |         8 |      17010 |              17010 |        |
+------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+

$ etcdctl endpoint health -w table # cluster health check
+------------------------+--------+------------+-------+
|        ENDPOINT        | HEALTH |    TOOK    | ERROR |
+------------------------+--------+------------+-------+
| https://127.0.0.1:2379 |   true | 3.957756ms |       |
+------------------------+--------+------------+-------+
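To confirm that K3s is really writing into this etcd, you can list the keys under /registry, the prefix Kubernetes stores its objects under. The sketch below summarizes that listing per resource type; summarize_registry is a hypothetical helper, and it assumes the ETCDCTL_* variables above are exported.

```shell
# Count keys per resource type from `etcdctl get /registry --prefix --keys-only`.
# Reads the key listing on stdin; blank separator lines are ignored.
summarize_registry() {
  awk -F/ '/^\/registry\// { count[$3]++ }
           END { for (k in count) print k, count[k] }' | sort
}

# Usage:
#   etcdctl get /registry --prefix --keys-only | summarize_registry
```

A freshly installed cluster should already show entries such as namespaces, serviceaccounts and secrets; an empty result means K3s is not pointed at this etcd.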

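When etcd stops listening after an IP change

Once etcd is set up for high availability, changing the environment, for example the node IP, breaks etcd's listeners and leaves the cluster stuck in the api-server not ready state. We can correct it manually as follows. The certificate paths below are the ones K3s generates for its embedded etcd; for the external etcd set up above, substitute the files under /etc/etcd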

$ memberid=$(sudo ETCDCTL_ENDPOINTS='https://127.0.0.1:2379' \
ETCDCTL_CACERT='/var/lib/rancher/k3s/server/tls/etcd/server-ca.crt' \
ETCDCTL_CERT='/var/lib/rancher/k3s/server/tls/etcd/server-client.crt' \
ETCDCTL_KEY='/var/lib/rancher/k3s/server/tls/etcd/server-client.key' \
ETCDCTL_API=3 etcdctl member list | awk -F',' '{print $1}')

$ sudo ETCDCTL_ENDPOINTS='https://127.0.0.1:2379' \
ETCDCTL_CACERT='/var/lib/rancher/k3s/server/tls/etcd/server-ca.crt' \
ETCDCTL_CERT='/var/lib/rancher/k3s/server/tls/etcd/server-client.crt' \
ETCDCTL_KEY='/var/lib/rancher/k3s/server/tls/etcd/server-client.key' \
ETCDCTL_API=3 \
etcdctl member update $memberid --peer-urls="https://<NEW_IP>:2380"
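Note that the awk call above prints the first column of every line, so with more than one member the memberid variable would hold several IDs. A small sketch that picks a single member by name from the default (comma-separated) etcdctl member list output; member_id_by_name is a hypothetical helper.

```shell
# Pick one member ID by member name from `etcdctl member list` output.
# Expects the default "simple" format: ID, status, name, peer URLs, client URLs, learner flag.
member_id_by_name() {
  awk -F', ' -v name="$1" '$3 == name { print $1 }'
}

# Usage:
#   memberid=$(etcdctl member list | member_id_by_name infra0)
```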