One of our production database servers, due to be delivered shortly, suddenly became extremely slow at handling asynchronous requests, and a colleague also reported lag when installing software on the machine.
Troubleshooting process
Initial attempt
My first move was to restart the PgSQL service: its asynchronous processing latency had climbed to a full second, and nothing depends heavily on this service, so a restart seemed low-risk.
$ kubectl rollout restart -n databases statefulset postgres-postgresql
Then the trouble started: the Pod got stuck in the Terminating state. At first I assumed a PreStop hook hadn't finished and was blocking the shutdown, but the PgSQL logs had already gone silent several minutes earlier, so it looked like the cluster simply wasn't processing the termination at all.
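To rule out the PreStop theory quickly, the grace period and the hook definition can be read straight from the Pod spec; a small check along these lines (pod name taken from the StatefulSet, output omitted):
$ kubectl get pod postgres-postgresql-0 -n databases -o jsonpath='{.spec.terminationGracePeriodSeconds}'
$ kubectl get pod postgres-postgresql-0 -n databases -o jsonpath='{.spec.containers[*].lifecycle.preStop}'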
$ kubectl get event -n databases
47m Warning Unhealthy pod/postgres-postgresql-0 Liveness probe errored: rpc error: code = NotFound desc = failed to exec in container: failed to load task: no running task found: task dbd0aae87722bdababfdf720ec389aaeaab040f594b782451ea39edfbc1261bb not found: not found
47m Warning Unhealthy pod/postgres-postgresql-0 Readiness probe errored: rpc error: code = NotFound desc = failed to exec in container: failed to load task: no running task found: task dbd0aae87722bdababfdf720ec389aaeaab040f594b782451ea39edfbc1261bb not found: not found
The error events make the cause clear: the task that tears down the Container is not being handled properly, so let's clean it up by hand.
$ sudo k3s ctr c ls | grep postgres
d2a4902b4cef8355a3d62ff969e73d1fa8037235c0fb3c7165eeb9aacdec8236 docker.io/bitnami/postgresql:14.2.0-debian-10-r35 io.containerd.runc.v2
That located the underlying Container. Simply deleting it at this point gets you nowhere, though; the request will be refused as long as a task is still registered for the container.
$ ps aux | grep $(sudo k3s ctr task ls | grep d2a4902b4cef8355a3d62ff969e73d1fa8037235c0fb3c7165eeb9aacdec8236 | awk '{print $2}')
1001 520183 0.1 0.0 82456 16784 ? Ss 08:45 0:04 /opt/bitnami/postgresql/bin/postgres -D /bitnami/postgresql/data --config-file=/opt/bitnami/postgresql/conf/postgresql.conf --external_pid_file=/opt/bitnami/postgresql/tmp/postgresql.pid --hba_file=/opt/bitnami/postgresql/conf/pg_hba.conf
That gives us the task's PID as it appears on the host, and confirms it really is the PostgreSQL process.
$ sudo kill -9 520183
$ sudo k3s ctr c rm d2a4902b4cef8355a3d62ff969e73d1fa8037235c0fb3c7165eeb9aacdec8236
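Had the rm still been refused, the thing to check is whether containerd is still tracking a task for that container; a minimal check, reusing the container ID from above:
$ sudo k3s ctr task ls | grep d2a4902b4cef8355a3d62ff969e73d1fa8037235c0fb3c7165eeb9aacdec8236 || echo "no running task"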
Back in the cluster, the Pod was now deleted cleanly. After the restart, however, the latency had not improved at all.
Warming up
Since restarting the service didn't help, the machine itself was the next suspect. A quick look at the IO load with iostat:
$ iostat -xdm 2
Device r/s rMB/s rrqm/s %rrqm r_await rareq-sz w/s wMB/s wrqm/s %wrqm w_await wareq-sz d/s dMB/s drqm/s %drqm d_await dareq-sz f/s f_await aqu-sz %util
dm-0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-2 0.00 0.00 0.00 0.00 0.00 0.00 4.00 0.02 0.00 0.00 120.00 4.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.48 11.00
dm-3 0.00 0.00 0.00 0.00 0.00 0.00 46.50 2.04 0.00 0.00 115.57 44.90 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 5.37 93.00
dm-4 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-5 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
loop0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
loop1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
loop2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
loop3 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
loop4 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
loop5 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
loop6 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
loop7 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sda 0.00 0.00 0.00 0.00 0.00 0.00 18.00 2.06 35.50 66.36 140.00 117.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 2.52 89.40
sdb 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sr0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
The numbers make it obvious that the load, about 89% utilization, lands on /dev/sda, and I suspect the busy dm-3 above it is an LVM logical volume sitting on /dev/sda. Let's find out which mount point is generating all that load.
$ sudo dmsetup ls
ubuntu--vg-lv--0 (253:0)
ubuntu--vg-lv--1 (253:1)
ubuntu--vg-lv--2 (253:2)
ubuntu--vg-lv--3 (253:3)
ubuntu--vg-lv--4 (253:4)
ubuntu--vg-lv--5 (253:5)
$ sudo df -h | grep ubuntu--vg-lv--3
/dev/mapper/ubuntu--vg-lv--3 787G 37G 710G 5% /var/lib
$ sudo df -h | grep /var/lib
/dev/mapper/ubuntu--vg-lv--3 787G 37G 711G 5% /var/lib
tmpfs 32G 12K 32G 1% /var/lib/kubelet/pods/8bbcb717-74df-4d28-8000-05eb79889481/volumes/kubernetes.io~projected/kube-api-access-zhlwb
tmpfs
...
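As a cross-check on the LVM guess, lsblk prints the whole chain from /dev/sda down to the logical volumes and their mount points in one view (the column selection here is just a convenient one):
$ lsblk -o NAME,TYPE,SIZE,MOUNTPOINT /dev/sda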
This gives us a first conclusion: the load is coming from the cluster itself, which lets us roughly rule out a virus or a system bug. Unfortunately there is no direct way to tell which PVC is responsible, because K3s's local-path-provisioner CSI does not expose metrics. We can still reason it out from processes, though; after all, we already reached into a Container above by killing its PID on the host.
$ iotop
Total DISK READ: 0.00 B/s | Total DISK WRITE: 2.06 M/s
Current DISK READ: 0.00 B/s | Current DISK WRITE: 2.22 M/s
PID PRIO USER DISK READ DISK WRITE SWAPIN IO> COMMAND
472 be/3 root 0.00 B/s 1328.16 B/s ?unavailable? [jbd2/dm-0-8]
1072 be/4 root 0.00 B/s 12.97 K/s ?unavailable? k3s server
1309 be/4 root 0.00 B/s 3.89 K/s ?unavailable? containerd -c /var/lib/rancher/k3s/agent/etc/containerd/config.toml~ate /run/k3s/containerd --root /var/lib/rancher/k3s/agent/containerd
2236 be/4 root 0.00 B/s 0.00 B/s ?unavailable? containerd-shim-runc-v2 -namespace k8s.io -id efb0846028315d6940575~6b5cc33b8ef66edceaa3a0f -address /run/k3s/containerd/containerd.sock
2288 be/4 1001 0.00 B/s 27.24 K/s ?unavailable? redis-server *:6379
3371 be/4 1001 0.00 B/s 1328.16 B/s ?unavailable? beam.smp -W w -MBas ageffcbf -MHas ageffcbf -MBlmbcs 512 -MHlmbcs 5~slog_error_logger false -kernel prevent_overlapping_partitions false
14330 be/4 1001 0.00 B/s 2.01 M/s ?unavailable? mongod --config=/opt/bitnami/mongodb/conf/mongodb.conf
215311 be/4 root 0.00 B/s 1328.16 B/s ?unavailable? java -Xmx512M -Xms32M -Duser.timezone=Asia/Shanghai -XX:TieredStopA~ervice.gateway.GatewayServerApplication --spring.profiles.active=mmf
536565 be/4 root 0.00 B/s 1328.16 B/s ?unavailable? python3.8 -m celery -A traffic_control worker -B -Q traffic_control~ntrol-85dfd8ddc5-dzngx --executable=/work/django_py3.8/bin/python3.8
612037 be/4 root 0.00 B/s 2.59 K/s ?unavailable? java -Xmx6000M -Xms1024M -Duser.timezone=Asia/Shanghai -XX:+HeapDum~ai.moying.cloud.service.CoreApplication --spring.profiles.active=mmf
1 be/4 root 0.00 B/s 0.00 B/s ?unavailable? init
2 be/4 root 0.00 B/s 0.00 B/s ?unavailable? [kthreadd]
iotop is only a rough reference, but at this point the preliminary finding, Mongo generating the heavy IO load, was solid enough to hand over to my colleagues for a first round of self-checks.
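To tie a PID from the iotop listing back to a concrete Pod, the container ID embedded in the process's cgroup path can be resolved through the CRI; a rough sketch, using the mongod PID 14330 from above (io.kubernetes.pod.name is the standard pod-name label on CRI runtimes):
$ cat /proc/14330/cgroup | grep -oE '[0-9a-f]{64}' | head -n1   # the container ID is buried in the cgroup path
$ sudo k3s crictl inspect <container-id> | grep io.kubernetes.pod.name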
Digging deeper
Borrowing the idea of the SideCar pattern, spin up a Pod just for debugging:
$ kubectl run -i --tty debug -n databases --image=mongo --restart=Never -- sh
If you don't see a command prompt, try pressing enter
$ apt update
$ apt install netcat
$ mongosh "mongodb://mongo-mongodb:27017/?directConnection=true&serverSelectionTimeoutMS=2000&appName=mongosh+1.6.0"
Current Mongosh Log ID: 634542bf6b5210ca53845522
Connecting to: mongodb://mongo-mongodb:27017/?directConnection=true&serverSelectionTimeoutMS=2000&appName=mongosh+1.6.0
Using MongoDB: 4.4.13
Using Mongosh: 1.6.0
test> use admin
admin> db.auth("username","xxxx")
{ ok: 1 }
That logs us into the remote Mongo, where the investigation can continue.
admin> db.serverStatus().connections
{
current: 165,
available: 838695,
totalCreated: 234251,
active: 9,
exhaustIsMaster: 2,
exhaustHello: 6,
awaitingTopologyChanges: 8
}
admin> db.serverStatus().wiredTiger.cache['maximum bytes configured'] / 1024 / 1024 / 1024
15.0986328125
These two checks cover the connection count and the CacheSize. Mongo manages its memory through the CacheSize allotted to the WiredTiger storage engine; if the CacheSize is set too low while MongoDB is running frequent complex queries, latency follows. The usual guideline is about half of system RAM, and since our host has 32G of memory, the 15.1 GiB reported above is roughly what you would expect.
db.serverStatus().wiredTiger.concurrentTransactions
db.adminCommand({setParameter:1, wiredTigerEngineRuntimeConfig:'cache_size=4G'}) # manually set the cache size
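Before touching cache_size it is worth checking how full the cache actually is; a sketch run from the debug Pod, comparing used versus configured bytes (connection string and credentials are the ones used above):
$ mongosh "mongodb://mongo-mongodb:27017/admin" -u username -p 'xxxx' --quiet --eval '
    const c = db.serverStatus().wiredTiger.cache;
    print((c["bytes currently in the cache"] / 2**30).toFixed(1) + " GiB used of " +
          (c["maximum bytes configured"] / 2**30).toFixed(1) + " GiB configured")'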
Performance analysis
Now let's run a performance analysis on Mongo.
admin> db.setProfilingLevel(2) # enable the profiler (level 2 records all operations)
{ was: 0, slowms: 100, sampleRate: 1, ok: 1 }
admin> use core
core> db.system.profile.find({millis:{$gt:10}}).pretty(); # list operations that took longer than 10 ms
[
{
op: 'command',
ns: 'core.device',
command: {
findAndModify: 'device',
query: { _id: '391a82cc-d825-44d1-8485-1c84b677b9b0' },
fields: {},
new: true,
update: {
'$set': {
'status.lidarData': [ { lidarType: 'MAIN', value: [Object] } ],
'status.lidarDataTimestamp': Long("1666077698408"),
statusTimestamp: Long("1666077698408"),
statusLatency: Long("27"),
operationStatus: 'IDLE',
operationStatusSorter: 0,
online: true,
modified: {
principalId: 'anonymous',
principalType: 'ANONYMOUS',
anonymous: true,
timestamp: Long("1666077698435")
}
}
},
'$db': 'core',
lsid: { id: new UUID("31d73b48-620c-4b1a-8683-c5527480e96f") }
},
keysExamined: 1,
docsExamined: 1,
nMatched: 1,
nModified: 1,
keysInserted: 1,
keysDeleted: 1,
numYield: 0,
locks: {
ParallelBatchWriterMode: { acquireCount: { r: Long("1") } },
ReplicationStateTransition: { acquireCount: { w: Long("1") } },
Global: { acquireCount: { w: Long("1") } },
Database: { acquireCount: { w: Long("1") } },
Collection: { acquireCount: { w: Long("1") } },
Mutex: { acquireCount: { r: Long("1") } }
},
flowControl: { acquireCount: Long("1"), timeAcquiringMicros: Long("2") },
responseLength: 289353,
protocol: 'op_msg',
millis: 42,
planSummary: 'IDHACK',
execStats: {
stage: 'UPDATE',
nReturned: 1,
executionTimeMillisEstimate: 42,
works: 1,
advanced: 1,
needTime: 0,
needYield: 0,
saveState: 0,
restoreState: 0,
isEOF: 1,
nMatched: 1,
nWouldModify: 1,
wouldInsert: false,
inputStage: {
stage: 'IDHACK',
nReturned: 1,
executionTimeMillisEstimate: 0,
works: 1,
advanced: 1,
needTime: 0,
needYield: 0,
saveState: 1,
restoreState: 1,
isEOF: 1,
keysExamined: 1,
docsExamined: 1
}
},
ts: ISODate("2022-10-18T07:21:38.483Z"),
client: '10.42.0.211',
allUsers: [ { user: 'admin', db: 'admin' } ],
user: 'admin@admin'
},
{
op: 'command',
ns: 'core.device',
command: {
findAndModify: 'device',
query: { _id: '391a82cc-d825-44d1-8485-1c84b677b9b0' },
fields: {},
new: true,
update: {
'$set': {
'status.lidarData': [ { lidarType: 'MAIN', value: [Object] } ],
'status.lidarDataTimestamp': Long("1666077699070"),
statusTimestamp: Long("1666077699070"),
statusLatency: Long("34"),
operationStatus: 'IDLE',
operationStatusSorter: 0,
online: true,
modified: {
principalId: 'anonymous',
principalType: 'ANONYMOUS',
anonymous: true,
timestamp: Long("1666077699104")
}
}
},
'$db': 'core',
lsid: { id: new UUID("31d73b48-620c-4b1a-8683-c5527480e96f") }
},
keysExamined: 1,
docsExamined: 1,
nMatched: 1,
nModified: 1,
keysInserted: 1,
keysDeleted: 1,
numYield: 0,
locks: {
ParallelBatchWriterMode: { acquireCount: { r: Long("1") } },
ReplicationStateTransition: { acquireCount: { w: Long("1") } },
Global: { acquireCount: { w: Long("1") } },
Database: { acquireCount: { w: Long("1") } },
Collection: { acquireCount: { w: Long("1") } },
Mutex: { acquireCount: { r: Long("1") } }
},
flowControl: { acquireCount: Long("1"), timeAcquiringMicros: Long("3") },
responseLength: 289361,
protocol: 'op_msg',
millis: 24,
planSummary: 'IDHACK',
execStats: {
stage: 'UPDATE',
nReturned: 1,
executionTimeMillisEstimate: 22,
works: 1,
advanced: 1,
needTime: 0,
needYield: 0,
saveState: 0,
restoreState: 0,
isEOF: 1,
nMatched: 1,
nWouldModify: 1,
wouldInsert: false,
inputStage: {
stage: 'IDHACK',
nReturned: 1,
executionTimeMillisEstimate: 0,
works: 1,
advanced: 1,
needTime: 0,
needYield: 0,
saveState: 1,
restoreState: 1,
isEOF: 1,
keysExamined: 1,
docsExamined: 1
}
},
ts: ISODate("2022-10-18T07:21:39.133Z"),
client: '10.42.0.211',
allUsers: [ { user: 'admin', db: 'admin' } ],
user: 'admin@admin'
}
]
The profile shows that the collection hit by this command had no proper index, and the documents in the Device collection are far too large. After my colleague added the index and optimized the code, the API became nearly twice as fast.
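For reference, the fix on the Mongo side looks roughly like the lines below; the indexed field is purely illustrative, since the real index has to match the fields the slow queries actually filter on, and the profiler should be switched back off once the investigation is over:
$ mongosh "mongodb://mongo-mongodb:27017/core" -u username -p 'xxxx' --authenticationDatabase admin --quiet --eval 'db.device.createIndex({ operationStatus: 1 })'   # illustrative field name
$ mongosh "mongodb://mongo-mongodb:27017/core" -u username -p 'xxxx' --authenticationDatabase admin --quiet --eval 'db.setProfilingLevel(0)'   # turn the profiler off again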