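Preface
The SQLite datastore bundled with K3s is more than enough for ordinary small embedded services. For our company's high-frequency scheduling workloads, which routinely consume several CPU cores, it can just about keep up with day-to-day responsiveness, but strange bugs show up from time to time. More than once I have hit a case where the default ServiceAccount could not be bound to resources in the Namespace in time, which kept SpringCloud from starting. A restart fixes it, but that is far from a highly available edge deployment.
The official K3s documentation includes a simple architecture diagram of the project. We previously replaced the default LB with MetalLB and were very happy with the result, so this time we patch up another piece. According to the documentation, K3s supports the following external datastores: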
K3s supports the following datastore options
- Embedded SQLite
- PostgreSQL (certified against versions 10.7, 11.5, and 14.2)
- MySQL (certified against versions 5.7 and 8.0)
- MariaDB (certified against version 10.6.8)
- Etcd (certified against version 3.5.4)
- Embedded etcd for High Availability
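Because I want to prepare an interface for Cluster Metrics later, I am setting aside embedded Etcd for now and attaching an external one instead. That gives a direct view of how the whole cluster is running and makes monitoring easier, since K3s strips out a lot in exchange for being lightweight.
As we all know, the default datastore of K8s is Etcd. K3s uses a component called Kine that translates Etcd K/V operations into relational database statements and exposes its own interface to the K3s ApiServer. From the point of view of the cluster components, they are still reading from and writing to Etcd. If we configure a real Etcd backend, however, Kine steps aside and exposes the real Etcd servers directly to the K3s ApiServer.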
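PS: If you would rather not operate Etcd yourself, or do not need that much control, use the high-availability datastore built into K3s instead. Older K3s releases offered an experimental DQLite (a highly available variant of SQLite) for this; since v1.19.1 it has been replaced by embedded etcd, which is what the official documentation describes (https://docs.k3s.io/installation/ha-embedded)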
Kine's status is also visible in the startup log:
$ sudo k3s server
INFO[0000] Starting k3s v1.24.4+k3s1 (c3f830e9)
INFO[0000] Configuring sqlite3 database connection pooling: maxIdleConns=2, maxOpenConns=0, connMaxLifetime=0s
INFO[0000] Configuring database table schema and indexes, this may take a moment...
INFO[0000] Database tables and indexes are up to date
INFO[0000] Kine available at unix://kine.sock
INFO[0000] Reconciling bootstrap data between datastore and disk
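To make the Kine behaviour described above concrete, here is a minimal sketch that guesses which backend a given `--datastore-endpoint` value selects. The scheme-to-backend mapping follows the endpoint formats listed in the K3s documentation; the function name `datastore_backend` and its output labels are made up for illustration and are not part of K3s.

```shell
# Hypothetical helper: classify a --datastore-endpoint value by its URL scheme.
# An empty value means K3s falls back to the embedded SQLite database.
datastore_backend() {
  case "$1" in
    "")                  echo "sqlite (embedded, via Kine)" ;;
    postgres://*)        echo "postgresql (via Kine)" ;;
    mysql://*)           echo "mysql/mariadb (via Kine)" ;;
    http://*|https://*)  echo "etcd (direct, Kine bypassed)" ;;
    *)                   echo "unknown" ;;
  esac
}

datastore_backend "https://127.0.0.1:2379"
```

The last line classifies the endpoint used later in this post, which falls into the Etcd branch.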
Deployment
First, locate the current SQLite database so it can be backed up or inspected later
$ sudo ls /var/lib/rancher/k3s/server/db/ -alh
total 22M
drwx------ 2 root root 4.0K Aug 30 02:19 .
drwx------ 8 root root 4.0K Sep 12 08:47 ..
-rw-r--r-- 1 root root 11M Nov 18 07:03 state.db
-rw-r--r-- 1 root root 32K Nov 18 07:06 state.db-shm
-rw-r--r-- 1 root root 11M Nov 18 07:06 state.db-wal
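Since the point of locating the database is to back it up, a backup step could look roughly like the sketch below. The function name and the default backup directory `/var/backups/k3s` are assumptions for illustration; only the source directory comes from the listing above.

```shell
# Hypothetical helper: copy the K3s SQLite files into a timestamped directory.
# Stop K3s first (e.g. `sudo systemctl stop k3s`) so that state.db and its
# -wal/-shm companion files are consistent with each other.
backup_k3s_sqlite() {
  local src="${1:-/var/lib/rancher/k3s/server/db}"
  local dst="${2:-/var/backups/k3s}/sqlite-$(date +%Y%m%d-%H%M%S)"
  [ -f "$src/state.db" ] || { echo "no state.db under $src" >&2; return 1; }
  mkdir -p "$dst" || return 1
  # The glob state.db* also matches state.db-wal and state.db-shm
  cp -p "$src"/state.db* "$dst"/ || return 1
  echo "$dst"
}
```

Because the database directory is readable by root only, the function needs to run in a root shell. On success it prints the directory that now holds the copy.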
Installing Etcd
Download the binaries
$ mkdir etcd && cd etcd
$ wget https://github.com/etcd-io/etcd/releases/download/v3.4.22/etcd-v3.4.22-linux-amd64.tar.gz
$ tar xvf etcd-v3.4.22-linux-amd64.tar.gz
$ cd etcd-v3.4.22-linux-amd64/
$ sudo cp etcd* /usr/bin/
$ etcdctl version
etcdctl version: 3.4.22
API version: 3.4
$ etcd --version
etcd Version: 3.4.22
Git SHA: 1f05498
Go Version: go1.16.15
Go OS/Arch: linux/amd64
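The datastore list quoted earlier certifies Etcd 3.5.4, while the commands above install 3.4.22. If you want to switch versions without editing three commands, the release URL can be built from a single variable. This is only a sketch: the helper name `etcd_release_url` is made up, and the URL pattern is simply the one used in the `wget` command above.

```shell
# Hypothetical helper: build the GitHub release URL for a given Etcd version.
etcd_release_url() {
  local ver="$1" arch="${2:-amd64}"
  echo "https://github.com/etcd-io/etcd/releases/download/${ver}/etcd-${ver}-linux-${arch}.tar.gz"
}

ETCD_VER=v3.4.22
etcd_release_url "$ETCD_VER"
# To download and unpack (same steps as above, parameterized):
#   wget "$(etcd_release_url "$ETCD_VER")"
#   tar xvf "etcd-${ETCD_VER}-linux-amd64.tar.gz"
```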
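Generating a CA key pair (optional)
By default Etcd accepts plain, unauthenticated connections. Here we change it so that clients must authenticate with certificates. If you have no security requirements, skip this step.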
$ mkdir ~/bin
$ curl -s -L -o ~/bin/cfssl https://pkg.cfssl.org/R1.2/cfssl_linux-amd64
$ curl -s -L -o ~/bin/cfssljson https://pkg.cfssl.org/R1.2/cfssljson_linux-amd64
$ chmod +x ~/bin/{cfssl,cfssljson}
$ export PATH=$PATH:~/bin
Set up the CA configuration
$ mkdir ~/cfssl
$ cd ~/cfssl
Write the following into the configuration files; the expiry is set to 50 years from now (438000h)
ca-config.json
{
"signing": {
"default": {
"expiry": "438000h"
},
"profiles": {
"server": {
"expiry": "438000h",
"usages": [
"signing",
"key encipherment",
"server auth"
]
},
"client": {
"expiry": "438000h",
"usages": [
"signing",
"key encipherment",
"client auth"
]
},
"peer": {
"expiry": "438000h",
"usages": [
"signing",
"key encipherment",
"server auth",
"client auth"
]
}
}
}
}
ca-csr.json
{
"CN": "EtcdCA",
"key": {
"algo": "ecdsa",
"size": 256
},
"names": [
{
"C": "US",
"L": "CA",
"ST": "San Francisco"
}
]
}
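As a quick sanity check on the expiry value, the conversion from hours to years is plain integer arithmetic. The helper name below is made up, and the calculation ignores leap days.

```shell
# Convert a cfssl-style expiry such as "438000h" into whole years (365-day years).
expiry_years() {
  local hours="${1%h}"   # strip the trailing "h"
  echo $(( hours / 24 / 365 ))
}

expiry_years 438000h   # prints 50
```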
Then generate the CA certificate from these configuration files
$ cfssl gencert -initca ca-csr.json | cfssljson -bare ca -
2022/11/21 02:21:47 [INFO] generating a new CA key and certificate from CSR
2022/11/21 02:21:47 [INFO] generate received request
2022/11/21 02:21:47 [INFO] received CSR
2022/11/21 02:21:47 [INFO] generating key: ecdsa-256
2022/11/21 02:21:47 [INFO] encoded CSR
2022/11/21 02:21:47 [INFO] signed certificate with serial number 674210528988775972581910506961594971423963387318
$ ls -al
total 28
drwxrwxr-x 2 example example 4096 Nov 21 02:21 .
drwxr-x--- 11 example example 4096 Nov 21 02:20 ..
-rw-rw-r-- 1 example example 836 Nov 21 02:17 ca-config.json
-rw-r--r-- 1 example example 420 Nov 21 02:21 ca.csr
-rw-rw-r-- 1 example example 211 Nov 21 02:20 ca-csr.json
-rw------- 1 example example 227 Nov 21 02:21 ca-key.pem
-rw-rw-r-- 1 example example 733 Nov 21 02:21 ca.pem
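Generating server and client certificates from the CA (optional)
The CA certificate only represents a certificate authority that we issued ourselves. To actually use it, we need to sign a server certificate with the CA, and the certificate configuration must include identifying information such as the addresses the server will be reached at.
$ cfssl print-defaults csr > server.json # generate the default configuration
$ cat server.json # add the addresses to be certified to the hosts field; if the IP or hostname changes later, the certificate must be reissued; include the machine's hostname, otherwise the log keeps reporting errors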
{
"CN": "etcd",
"hosts": [
"127.0.0.1",
"10.10.3.104",
"localhost",
"etcd-cluster",
"example-104"
],
"key": {
"algo": "ecdsa",
"size": 256
},
"names": [
{
"C": "US",
"L": "CA",
"ST": "San Francisco"
}
]
}
$ cfssl gencert -ca=ca.pem \
-ca-key=ca-key.pem -config=ca-config.json \
-profile=server server.json | cfssljson -bare server # sign the certificate using the config file and the CA
2022/11/21 02:33:23 [INFO] generate received request
2022/11/21 02:33:23 [INFO] received CSR
2022/11/21 02:33:23 [INFO] generating key: ecdsa-256
2022/11/21 02:33:23 [INFO] encoded CSR
2022/11/21 02:33:23 [INFO] signed certificate with serial number 496743626869441272572263826912386629066258177258
2022/11/21 02:33:23 [WARNING] This certificate lacks a "hosts" field. This makes it unsuitable for
websites. For more information see the Baseline Requirements for the Issuance and Management
of Publicly-Trusted Certificates, v.1.1.6, from the CA/Browser Forum (https://cabforum.org);
specifically, section 10.2.3 ("Information Requirements").
$ cfssl print-defaults csr > client.json
$ cat client.json # leave hosts empty so that clients are not restricted
{
"CN": "client",
"hosts": [""],
"key": {
"algo": "ecdsa",
"size": 256
},
"names": [
{
"C": "US",
"L": "CA",
"ST": "San Francisco"
}
]
}
$ cfssl gencert -ca=ca.pem \
-ca-key=ca-key.pem \
-config=ca-config.json \
-profile=client client.json | cfssljson -bare client
2022/11/21 03:46:49 [INFO] generate received request
2022/11/21 03:46:49 [INFO] received CSR
2022/11/21 03:46:49 [INFO] generating key: ecdsa-256
2022/11/21 03:46:49 [INFO] encoded CSR
2022/11/21 03:46:49 [INFO] signed certificate with serial number 585341329955110412865129565126433228043943555598
2022/11/21 03:46:49 [WARNING] This certificate lacks a "hosts" field. This makes it unsuitable for
websites. For more information see the Baseline Requirements for the Issuance and Management
of Publicly-Trusted Certificates, v.1.1.6, from the CA/Browser Forum (https://cabforum.org);
specifically, section 10.2.3 ("Information Requirements").
$ ls -al
total 60
drwxrwxr-x 2 example example 4096 Nov 21 03:46 .
drwxr-x--- 11 example example 4096 Nov 21 03:45 ..
-rw-rw-r-- 1 example example 836 Nov 21 02:17 ca-config.json
-rw-r--r-- 1 example example 420 Nov 21 02:21 ca.csr
-rw-rw-r-- 1 example example 211 Nov 21 02:20 ca-csr.json
-rw------- 1 example example 227 Nov 21 02:21 ca-key.pem
-rw-rw-r-- 1 example example 733 Nov 21 02:21 ca.pem
-rw-r--r-- 1 example example 460 Nov 21 03:46 client.csr
-rw-rw-r-- 1 example example 230 Nov 21 03:45 client.json
-rw------- 1 example example 227 Nov 21 03:46 client-key.pem
-rw-rw-r-- 1 example example 774 Nov 21 03:46 client.pem
-rw-r--r-- 1 example example 509 Nov 21 02:33 server.csr
-rw-rw-r-- 1 example example 305 Nov 21 02:33 server.json
-rw------- 1 example example 227 Nov 21 02:33 server-key.pem
-rw-rw-r-- 1 example example 818 Nov 21 02:33 server.pem
Keep this directory somewhere safe so it can be reused later
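Because the server certificate has to be reissued whenever the IP or hostname changes, it can help to generate `server.json` from a list of hosts instead of editing it by hand. The following is a sketch only: the function name `make_server_csr` is hypothetical, and the fixed fields are copied from the `server.json` shown above.

```shell
# Hypothetical helper: print a cfssl CSR JSON for the Etcd server certificate.
# Usage: make_server_csr HOST [HOST...]
make_server_csr() {
  local hosts="" h
  for h in "$@"; do
    # Append each host as a quoted JSON string, comma-separated
    hosts="${hosts:+$hosts, }\"$h\""
  done
  cat <<EOF
{
  "CN": "etcd",
  "hosts": [$hosts],
  "key": { "algo": "ecdsa", "size": 256 },
  "names": [ { "C": "US", "L": "CA", "ST": "San Francisco" } ]
}
EOF
}

make_server_csr 127.0.0.1 10.10.3.104 localhost etcd-cluster example-104
```

Redirect the output to `server.json` and rerun the `cfssl gencert ... -profile=server` command from above to issue a new certificate.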
Writing the Systemd service file
This is the simplest unit file, without any TLS encryption
$ cat etcd.service
[Unit]
Description=Etcd DataStore Service
After=network.target
StartLimitIntervalSec=0
[Service]
Type=simple
Restart=always
RestartSec=1
User=root
ExecStart=/usr/bin/etcd
[Install]
WantedBy=multi-user.target
$ sudo cp etcd.service /etc/systemd/system/
$ sudo systemctl daemon-reload
$ sudo systemctl restart etcd.service
$ sudo systemctl status etcd.service
● etcd.service - Etcd DataStore Service
Loaded: loaded (/etc/systemd/system/etcd.service; disabled; vendor preset: enabled)
Active: active (running) since Fri 2022-11-18 07:23:11 UTC; 6s ago
Main PID: 1507 (etcd)
Tasks: 10 (limit: 18879)
Memory: 5.7M
CPU: 98ms
CGroup: /system.slice/etcd.service
└─1507 /usr/bin/etcd
Nov 18 07:23:12 example-104 etcd[1507]: raft2022/11/18 07:23:12 INFO: 8e9e05c52164694d became candidate at term 2
Nov 18 07:23:12 example-104 etcd[1507]: raft2022/11/18 07:23:12 INFO: 8e9e05c52164694d received MsgVoteResp from 8e9e05c5216>
Nov 18 07:23:12 example-104 etcd[1507]: raft2022/11/18 07:23:12 INFO: 8e9e05c52164694d became leader at term 2
Nov 18 07:23:12 example-104 etcd[1507]: raft2022/11/18 07:23:12 INFO: raft.node: 8e9e05c52164694d elected leader 8e9e05c5216>
Nov 18 07:23:12 example-104 etcd[1507]: setting up the initial cluster version to 3.4
Nov 18 07:23:12 example-104 etcd[1507]: published {Name:default ClientURLs:[http://localhost:2379]} to cluster cdf818194e3a8>
Nov 18 07:23:12 example-104 etcd[1507]: ready to serve client requests
Nov 18 07:23:12 example-104 etcd[1507]: serving insecure client requests on 127.0.0.1:2379, this is strongly discouraged!
Nov 18 07:23:12 example-104 etcd[1507]: set the initial cluster version to 3.4
Nov 18 07:23:12 example-104 etcd[1507]: enabled capabilities for version 3.4
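If you use TLS, pick the authentication mode you prefer:
- Client-to-server transport security with HTTPS: Etcd starts with the server certificate, and incoming clients verify it against the CA
- Client-to-server authentication with HTTPS client certificates: the client presents a certificate to the server, and the server checks that it is signed by the CA before deciding whether to serve the request
Each has its advantages. With the first, you can place the CA certificate in the system's trusted certificate directory, usually /etc/pki/tls/certs or /etc/ssl/certs. That suits a fully trusted environment, because every application on the system then passes verification. The second is more secure: the client must present its certificate on every connection, and K3s only needs to be configured with it at startup.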
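The difference between the two modes comes down to two extra Etcd flags. The sketch below prints the TLS-related flags for each mode; the function name, the mode labels `transport` and `client-auth`, and the default certificate directory `/etc/etcd` are assumptions for illustration, while the flags themselves are standard Etcd options.

```shell
# Hypothetical helper: print the Etcd TLS flags for one of the two modes.
# Usage: etcd_tls_flags transport|client-auth [cert-dir]
etcd_tls_flags() {
  local mode="$1" dir="${2:-/etc/etcd}"
  local base="--cert-file=$dir/server.pem --key-file=$dir/server-key.pem"
  case "$mode" in
    transport)
      # Server presents its certificate; clients only need to trust the CA
      echo "$base" ;;
    client-auth)
      # Additionally require every client to present a CA-signed certificate
      echo "$base --client-cert-auth --trusted-ca-file=$dir/ca.pem" ;;
    *)
      echo "usage: etcd_tls_flags transport|client-auth [cert-dir]" >&2
      return 1 ;;
  esac
}

etcd_tls_flags client-auth
```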
Startup configuration with TLS enabled
$ cat etcd.service
[Unit]
Description=Etcd DataStore Service
After=network.target
StartLimitIntervalSec=0
[Service]
Type=simple
Restart=always
RestartSec=1
User=root
ExecStart=/usr/bin/etcd --name infra0 --data-dir infra0 \
--client-cert-auth \
--trusted-ca-file=/etc/etcd/ca.pem \
--cert-file=/etc/etcd/server.pem \
--key-file=/etc/etcd/server-key.pem \
--listen-client-urls=https://127.0.0.1:2379 \
--listen-peer-urls=http://127.0.0.1:2380 \
--advertise-client-urls=https://127.0.0.1:2379 \
--initial-cluster=infra0=http://127.0.0.1:2380 \
--initial-advertise-peer-urls=http://127.0.0.1:2380
[Install]
WantedBy=multi-user.target
$ sudo systemctl status etcd.service
● etcd.service - Etcd DataStore Service
Loaded: loaded (/etc/systemd/system/etcd.service; disabled; vendor preset: enabled)
Active: active (running) since Mon 2022-11-21 03:30:05 UTC; 2s ago
Main PID: 6305 (etcd)
Tasks: 9 (limit: 18879)
Memory: 5.7M
CPU: 99ms
CGroup: /system.slice/etcd.service
└─6305 /usr/bin/etcd --name infra0 --data-dir infra0 --client-cert-auth --trusted-ca-file=/etc/etcd/ca.pem --cert-file=/etc/etcd/server.pem --key-file=/etc>
Nov 21 03:30:05 example-104 etcd[6305]: listening for peers on 127.0.0.1:2380
Nov 21 03:30:07 example-104 etcd[6305]: raft2022/11/21 03:30:07 INFO: 8e9e05c52164694d is starting a new election at term 3
Nov 21 03:30:07 example-104 etcd[6305]: raft2022/11/21 03:30:07 INFO: 8e9e05c52164694d became candidate at term 4
Nov 21 03:30:07 example-104 etcd[6305]: raft2022/11/21 03:30:07 INFO: 8e9e05c52164694d received MsgVoteResp from 8e9e05c52164694d at term 4
Nov 21 03:30:07 example-104 etcd[6305]: raft2022/11/21 03:30:07 INFO: 8e9e05c52164694d became leader at term 4
Nov 21 03:30:07 example-104 etcd[6305]: raft2022/11/21 03:30:07 INFO: raft.node: 8e9e05c52164694d elected leader 8e9e05c52164694d at term 4
Nov 21 03:30:07 example-104 etcd[6305]: published {Name:infra0 ClientURLs:[https://127.0.0.1:2379]} to cluster cdf818194e3a8c32
Nov 21 03:30:07 example-104 etcd[6305]: ready to serve client requests
Nov 21 03:30:07 example-104 etcd[6305]: serving client requests on 127.0.0.1:2379
Alternatively, for the first mode, only the following flags are needed
$ etcd --name infra0 --data-dir infra0 \
--cert-file=/path/to/server.crt --key-file=/path/to/server.key \
--advertise-client-urls=https://127.0.0.1:2379 --listen-client-urls=https://127.0.0.1:2379
Verification
$ curl https://127.0.0.1:2379 # direct connection
curl: (60) SSL certificate problem: unable to get local issuer certificate
More details here: https://curl.se/docs/sslcerts.html
curl failed to verify the legitimacy of the server and therefore could not
establish a secure connection to it. To learn more about this situation and
how to fix it, please visit the web page mentioned above.
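# pass the CA plus the client certificate and key: the TLS handshake now succeeds (the 404 only means nothing is served at /)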
$ curl --cacert /etc/etcd/ca.pem --cert /etc/etcd/client.pem --key /etc/etcd/client-key.pem https://127.0.0.1:2379/ -I
HTTP/2 404
access-control-allow-headers: accept, content-type, authorization
access-control-allow-methods: POST, GET, OPTIONS, PUT, DELETE
access-control-allow-origin: *
content-type: text/plain; charset=utf-8
x-content-type-options: nosniff
content-length: 19
date: Mon, 21 Nov 2022 03:52:44 GMT
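The 404 above confirms the TLS setup but says little about Etcd itself. Etcd also serves a `/health` endpoint on the client URL, which returns a small JSON body. A minimal check could look like the sketch below; the function names are made up, and the certificate paths are the ones used in this post.

```shell
# Succeeds when the JSON body reports "health":"true".
health_is_true() {
  printf '%s' "$1" | grep -q '"health" *: *"true"'
}

# Hypothetical wrapper: query /health with the client certificate from this post.
etcd_health() {
  local endpoint="${1:-https://127.0.0.1:2379}" body
  body=$(curl -s --cacert /etc/etcd/ca.pem \
              --cert /etc/etcd/client.pem \
              --key /etc/etcd/client-key.pem \
              "$endpoint/health") || return 1
  health_is_true "$body"
}
```

Calling `etcd_health && echo ok` on the Etcd host should then print `ok` when the member is healthy, assuming the certificates are readable by the calling user.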
Cluster deployment
With the external datastore in place, it is time to deploy the cluster itself
$ K3S_VERSION=v1.22.7+k3s1
$ curl -sfL https://get.k3s.io/ | K3S_TOKEN=SECRET \
INSTALL_K3S_VERSION=$K3S_VERSION sh -s - --disable=servicelb \
--write-kubeconfig-mode=0644 --no-deploy traefik \
--datastore-endpoint="https://127.0.0.1:2379" \
--datastore-cafile="/etc/etcd/ca.pem" \
--datastore-certfile="/etc/etcd/client.pem" \
--datastore-keyfile="/etc/etcd/client-key.pem"
$ cat /var/log/syslog # optional, for details when something goes wrong
$ kubectl get nodes # check that the cluster is ready
NAME STATUS ROLES AGE VERSION
example-104 Ready control-plane,master 3m19s v1.22.7+k3s1
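The same server options can also live in the K3s configuration file instead of the install command line, which makes later changes easier to track. The sketch below writes such a file; the function name is hypothetical, the keys mirror the CLI flags used above, and `/etc/rancher/k3s/config.yaml` is the default location K3s reads. Treat it as a starting point rather than a verified drop-in replacement.

```shell
# Hypothetical helper: write the datastore-related K3s server options to a config file.
write_k3s_config() {
  local path="${1:-/etc/rancher/k3s/config.yaml}"
  mkdir -p "$(dirname "$path")" || return 1
  cat > "$path" <<'EOF'
write-kubeconfig-mode: "0644"
disable:
  - servicelb
  - traefik
datastore-endpoint: "https://127.0.0.1:2379"
datastore-cafile: "/etc/etcd/ca.pem"
datastore-certfile: "/etc/etcd/client.pem"
datastore-keyfile: "/etc/etcd/client-key.pem"
EOF
}
```

Run it as root before installing K3s, for example `write_k3s_config` with no arguments, or pass another path to inspect the result first.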
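Some handy Etcd commands
An external datastore means operating that component separately. It helps to set up the relevant commands; only a few are shown here.
# append the environment variables to your profile or equivalent shell configuration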
$ vim ~/.zshrc
export ETCDCTL_ENDPOINTS='https://127.0.0.1:2379'
export ETCDCTL_CACERT='/etc/etcd/ca.pem'
export ETCDCTL_CERT='/etc/etcd/client.pem'
export ETCDCTL_KEY='/etc/etcd/client-key.pem'
export ETCDCTL_API=3
$ etcdctl member list -w table # show Etcd member status
+------------------+---------+--------+-----------------------+------------------------+------------+
| ID | STATUS | NAME | PEER ADDRS | CLIENT ADDRS | IS LEARNER |
+------------------+---------+--------+-----------------------+------------------------+------------+
| 8e9e05c52164694d | started | infra0 | http://localhost:2380 | https://127.0.0.1:2379 | false |
+------------------+---------+--------+-----------------------+------------------------+------------+
$ etcdctl check perf # performance test
60 / 60 Boooooooooooooooooooooo! 100.00% 1m0s
PASS: Throughput is 150 writes/s
PASS: Slowest request took 0.381208s
PASS: Stddev is 0.034952s
PASS
$ etcdctl endpoint status -w table
+------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| ENDPOINT | ID | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| https://127.0.0.1:2379 | 8e9e05c52164694d | 3.4.22 | 24 MB | true | false | 8 | 17010 | 17010 | |
+------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
$ etcdctl endpoint health -w table # cluster health check
+------------------------+--------+------------+-------+
| ENDPOINT | HEALTH | TOOK | ERROR |
+------------------------+--------+------------+-------+
| https://127.0.0.1:2379 | true | 3.957756ms | |
+------------------------+--------+------------+-------+
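Operating Etcd yourself also means taking your own backups. A minimal sketch using `etcdctl snapshot save` follows; it relies on the `ETCDCTL_*` environment variables exported above, and the function names and the default directory `/var/backups/etcd` are assumptions.

```shell
# Build a timestamped snapshot path under the given directory.
snapshot_path() {
  printf '%s/etcd-snapshot-%s.db\n' "${1:-/var/backups/etcd}" "$(date +%Y%m%d-%H%M%S)"
}

# Hypothetical helper: save a snapshot and print the file it was written to.
etcd_snapshot() {
  local file
  file=$(snapshot_path "$1")
  mkdir -p "$(dirname "$file")" || return 1
  # etcdctl progress messages go to stderr so stdout carries only the path
  etcdctl snapshot save "$file" >&2 || return 1
  printf '%s\n' "$file"
}
```

A cron job or systemd timer could call `etcd_snapshot` periodically; restoring from a snapshot is a separate procedure that is not covered here.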
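Etcd listener fails to start after an IP change
After setting up Etcd high availability, changing the environment, for example the IP address, breaks Etcd's listener and leaves the cluster stuck in the api-server not ready state. It can be adjusted manually as follows (the certificate paths below are the ones K3s generates for its embedded Etcd):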
$ memberid=$(sudo ETCDCTL_ENDPOINTS='https://127.0.0.1:2379' \
ETCDCTL_CACERT='/var/lib/rancher/k3s/server/tls/etcd/server-ca.crt' \
ETCDCTL_CERT='/var/lib/rancher/k3s/server/tls/etcd/server-client.crt' \
ETCDCTL_KEY='/var/lib/rancher/k3s/server/tls/etcd/server-client.key' \
ETCDCTL_API=3 etcdctl member list | awk -F',' '{print $1}')
$ sudo ETCDCTL_ENDPOINTS='https://127.0.0.1:2379' \
ETCDCTL_CACERT='/var/lib/rancher/k3s/server/tls/etcd/server-ca.crt' \
ETCDCTL_CERT='/var/lib/rancher/k3s/server/tls/etcd/server-client.crt' \
ETCDCTL_KEY='/var/lib/rancher/k3s/server/tls/etcd/server-client.key' \
ETCDCTL_API=3 \
etcdctl member update "$memberid" --peer-urls="https://<new-IP>:2380"
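The `awk` extraction above prints the ID of every member, which is fine for a single node but ambiguous once there are several. A sketch that selects the ID by member name instead is shown below. It assumes the default comma-separated output of `etcdctl member list` (ID, status, name, peer URLs, client URLs, learner flag); the function name and the second sample line are made up for illustration.

```shell
# Read `etcdctl member list` output on stdin and print the ID of the named member.
member_id_by_name() {
  awk -F', ' -v name="$1" '$3 == name { print $1 }'
}

# Sample input: the first line matches the single-node setup in this post,
# the second is an invented extra member.
printf '%s\n' \
  '8e9e05c52164694d, started, infra0, http://localhost:2380, https://127.0.0.1:2379, false' \
  'a1b2c3d4e5f60718, started, infra1, http://10.10.3.105:2380, https://10.10.3.105:2379, false' \
  | member_id_by_name infra0
```

In practice the input would come from `etcdctl member list | member_id_by_name <name>`, with the same environment variables as in the commands above.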