Etcd Backup and Restore

Configure Etcd Environment Variables

Create a file named etcd.sh under the /etc/profile.d directory and write the following content into it.

alias etcd_v2='etcdctl --cert-file /etc/kubernetes/pki/etcd/healthcheck-client.crt \
--key-file /etc/kubernetes/pki/etcd/healthcheck-client.key \
--ca-file /etc/kubernetes/pki/etcd/ca.crt \
--endpoints https://10.15.1.3:2379,https://10.15.1.4:2379,https://10.15.1.5:2379'

alias etcd_v3='ETCDCTL_API=3 \
etcdctl \
--cert /etc/kubernetes/pki/etcd/healthcheck-client.crt \
--key /etc/kubernetes/pki/etcd/healthcheck-client.key \
--cacert /etc/kubernetes/pki/etcd/ca.crt \
--endpoints https://10.15.1.3:2379,https://10.15.1.4:2379,https://10.15.1.5:2379'
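
For the aliases to take effect in the current shell, source the file (new login shells will pick it up automatically):

# Load the new aliases into the current shell
source /etc/profile.d/etcd.sh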

Check the Etcd API version with the following command:

[root@k8s-m1 profile.d]# etcd_v3 version
etcdctl version: 3.2.24
API version: 3.2

Etcd Backup

[root@k8s-m1 overlord]# etcd_v3 snapshot save test.db
Snapshot saved at test.db
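
Before relying on the snapshot, it is worth verifying it; etcdctl can report the snapshot's hash, revision, total key count, and size:

# Sanity-check the snapshot file just written
ETCDCTL_API=3 etcdctl snapshot status test.db --write-out=table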

Etcd Restore

Stop etcd (stop the etcd service on all etcd hosts):

[root@k8s-m1 overlord]# systemctl stop etcd

Delete the etcd data directory:

[root@k8s-m1 overlord]# rm -rf /var/lib/etcd
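
If you would rather keep a fallback, a variant is to move the old data directory aside instead of deleting it (the .bak path is just an example):

# Keep the old data around in case the snapshot turns out to be unusable
mv /var/lib/etcd /var/lib/etcd.bak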

Distribute the backup file:

[root@k8s-m1 overlord]# scp test.db overlord@10.15.1.3:/home/overlord/
[root@k8s-m1 overlord]# scp test.db overlord@10.15.1.4:/home/overlord/
[root@k8s-m1 overlord]# scp test.db overlord@10.15.1.5:/home/overlord/

Restore the backup (run on every etcd host; note that the commands below read the snapshot from /root/test.db, so adjust the path to wherever the file was actually copied):

ETCDCTL_API=3 etcdctl  \
--name k8s-m1 --initial-advertise-peer-urls https://10.15.1.3:2380 \
--initial-cluster k8s-m1=https://10.15.1.3:2380,k8s-m2=https://10.15.1.4:2380,k8s-m3=https://10.15.1.5:2380 \
snapshot restore /root/test.db \
--data-dir=/var/lib/etcd/

ETCDCTL_API=3 etcdctl  \
--name k8s-m2 --initial-advertise-peer-urls https://10.15.1.4:2380 \
--initial-cluster k8s-m1=https://10.15.1.3:2380,k8s-m2=https://10.15.1.4:2380,k8s-m3=https://10.15.1.5:2380 \
snapshot restore /root/test.db \
--data-dir=/var/lib/etcd/

ETCDCTL_API=3 etcdctl  \
--name k8s-m3 --initial-advertise-peer-urls https://10.15.1.5:2380 \
--initial-cluster k8s-m1=https://10.15.1.3:2380,k8s-m2=https://10.15.1.4:2380,k8s-m3=https://10.15.1.5:2380 \
snapshot restore /root/test.db \
--data-dir=/var/lib/etcd/
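
snapshot restore rebuilds the data directory from scratch; a quick check that it worked is to confirm the member directory (with its snap and wal subdirectories) was recreated:

# The restore should have produced /var/lib/etcd/member/{snap,wal}
ls /var/lib/etcd/member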

Start the Etcd service (on all etcd hosts):

systemctl start etcd

Check the Etcd status:

[root@k8s-m1 lib]# etcd_v3 --write-out=table endpoint status
+------------------------+------------------+---------+---------+-----------+-----------+------------+
|        ENDPOINT        |        ID        | VERSION | DB SIZE | IS LEADER | RAFT TERM | RAFT INDEX |
+------------------------+------------------+---------+---------+-----------+-----------+------------+
| https://10.15.1.3:2379 | daad158099a5cdca |  3.2.24 |  9.7 MB |      true |         2 |        218 |
| https://10.15.1.4:2379 | 2717bbfe0f23b647 |  3.2.24 |  9.7 MB |     false |         2 |        218 |
| https://10.15.1.5:2379 | 81183085587ceed2 |  3.2.24 |  9.7 MB |     false |         2 |        218 |
+------------------------+------------------+---------+---------+-----------+-----------+------------+

Verify that the Kubernetes cluster is usable:

[root@k8s-m1 lib]# kubectl get nodes -o wide
NAME     STATUS   ROLES    AGE    VERSION   INTERNAL-IP   EXTERNAL-IP   OS-IMAGE                KERNEL-VERSION          CONTAINER-RUNTIME
k8s-m1   Ready    master   112d   v1.13.5   10.15.1.3     <none>        CentOS Linux 7 (Core)   3.10.0-514.el7.x86_64   docker://18.6.3
k8s-m2   Ready    master   112d   v1.13.5   10.15.1.4     <none>        CentOS Linux 7 (Core)   3.10.0-514.el7.x86_64   docker://18.6.3
k8s-m3   Ready    master   112d   v1.13.5   10.15.1.5     <none>        CentOS Linux 7 (Core)   3.10.0-514.el7.x86_64   docker://18.6.3
k8s-n1   Ready    node     112d   v1.13.5   10.15.1.6     <none>        CentOS Linux 7 (Core)   3.10.0-514.el7.x86_64   docker://18.6.3
k8s-n2   Ready    node     112d   v1.13.5   10.15.1.7     <none>        CentOS Linux 7 (Core)   3.10.0-514.el7.x86_64   docker://18.6.3

Etcd Data Compaction

By default etcd does not compact its data automatically. Because etcd keeps the history of every key, frequent changes accumulate more and more revisions, and the database grows accordingly. The etcd database size limit defaults to 2GB; when the following log entry appears, the database space is exhausted and you need to compact the data to free space.

Error from server: etcdserver: mvcc: database space exceeded
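
A preventive alternative is to let etcd compact old revisions on its own; a sketch, assuming you can edit etcd's startup arguments (in a kubeadm-style cluster, which the certificate paths above suggest, these live in the etcd static pod manifest):

# Add this flag to etcd's startup arguments; the value is an example,
# and etcd 3.2 interprets it as a number of hours of history to retain
--auto-compaction-retention=1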

Get the current revision number (run the following on an etcd host):

[root@k8s-m1 overlord]# ver=$(ETCDCTL_API=3 etcdctl --write-out="json" endpoint status | egrep -o '"revision":[0-9]*' | egrep -o '[0-9].*')

Compact away the old revisions (run on all etcd hosts; compaction is a cluster-wide operation, so runs after the first may report that the revision has already been compacted):

[root@k8s-m1 overlord]# ETCDCTL_API=3 etcdctl compact $ver

Defragment (run on all etcd hosts):

[root@k8s-m1 overlord]# ETCDCTL_API=3 etcdctl defrag
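
Defragmentation is a per-member operation, but as a convenience the etcd_v3 alias defined earlier can reach every member from one host, since etcdctl defragments each endpoint in its --endpoints list in turn:

# Defragment all three members in one call
etcd_v3 defrag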

Check the database size after compaction:

[root@k8s-m1 overlord]# etcd_v3 -w=table endpoint status
+------------------------+------------------+---------+---------+-----------+-----------+------------+
|        ENDPOINT        |        ID        | VERSION | DB SIZE | IS LEADER | RAFT TERM | RAFT INDEX |
+------------------------+------------------+---------+---------+-----------+-----------+------------+
| https://10.15.1.3:2379 | daad158099a5cdca |  3.2.24 |  4.1 MB |      true |         2 |       9044 |
| https://10.15.1.4:2379 | 2717bbfe0f23b647 |  3.2.24 |  5.6 MB |     false |         2 |       9044 |
| https://10.15.1.5:2379 | 81183085587ceed2 |  3.2.24 |  5.5 MB |     false |         2 |       9044 |
+------------------------+------------------+---------+---------+-----------+-----------+------------+

Disarm etcd Alarms

Running ETCDCTL_API=3 etcdctl alarm list shows etcd's current alarms. If an alarm is present, etcd remains read-only even after space has been freed. Once all of the operations above have completed, run the following on any one etcd host to clear the alarm:

[root@k8s-m1 overlord]# ETCDCTL_API=3 etcdctl alarm disarm
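
It is worth confirming afterwards that no alarms remain; an empty result means the cluster is writable again:

# Should print nothing once all alarms are cleared
ETCDCTL_API=3 etcdctl alarm list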

Troubleshooting

After the restore in this lab, the following failure may appear: every Pod, once deleted, is recreated but stays stuck in Pending status.

(screenshot: etcd-1)

Running kubectl describe pod shows no entries at all under Events.

(screenshot: etcd-2)

The scheduler logs show errors connecting to the API server.

(screenshot: etcd-3)

The API server logs report the error "the object has been modified; please apply your changes to the latest version and try again", and at this point kubectl api-resources simply hangs with no output. Deleting the metrics-server API service registration restores the cluster.
(screenshot: etcd-4)
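
A sketch of that recovery step (the APIService name below is the conventional metrics-server registration, v1beta1.metrics.k8s.io, which is an assumption here; check the actual name first):

# List aggregated API services, then remove the metrics-server entry
kubectl get apiservices
kubectl delete apiservice v1beta1.metrics.k8s.io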

Namespaces Cannot Be Deleted

[root@k8s-m1 ~]# kubectl get ns
NAME                STATUS        AGE
cattle-prometheus   Active        29d
cattle-system       Terminating   29d
cj-test             Active        14d
cxx1                Active        79d
default             Active        118d
ingress-nginx       Active        39d
iov-paas            Active        78d
iov-test            Active        104d
istio-app-test      Active        35d
istio-system        Active        54d

Start an HTTP proxy to the API server:

[root@k8s-m1 ~]# kubectl proxy
Starting to serve on 127.0.0.1:8001

Open a second terminal and declare a variable in it:

[root@k8s-m1 overlord]# NAMESPACE=cattle-system

Dump the namespace that cannot be deleted as JSON:

kubectl get namespace $NAMESPACE -o json | jq '.spec = {"finalizers":[]}' > temp.json

Edit the temp.json file:

{
    "apiVersion": "v1",
    "kind": "Namespace",
    "metadata": {
        "annotations": {
            "cattle.io/status": "{\"Conditions\":[{\"Type\":\"ResourceQuotaInit\",\"Status\":\"True\",\"Message\":\"\",\"LastUpdateTime\":\"2019-08-19T08:21:42Z\"},{\"Type\":\"InitialRolesPopulated\",\"Status\":\"True\",\"Message\":\"\",\"LastUpdateTime\":\"2019-08-19T08:21:47Z\"}]}",
            "field.cattle.io/projectId": "c-zcf9v:p-lwgnw",
            "kubectl.kubernetes.io/last-applied-configuration": "{\"apiVersion\":\"v1\",\"kind\":\"Namespace\",\"metadata\":{\"annotations\":{},\"name\":\"cattle-system\"}}\n",
            "lifecycle.cattle.io/create.namespace-auth": "true"
        },
        "creationTimestamp": "2019-08-19T08:21:31Z",
        "deletionGracePeriodSeconds": 0,
        "deletionTimestamp": "2019-09-17T10:15:32Z",
        "finalizers": [
            "controller.cattle.io/namespace-auth"
        ],
        "labels": {
            "field.cattle.io/projectId": "p-lwgnw"
        },
        "name": "cattle-system",
        "resourceVersion": "16595026",
        "selfLink": "/api/v1/namespaces/cattle-system",
        "uid": "59412a0d-c25a-11e9-ba0d-00505680cc08"
    },
    "spec": {
        "finalizers": []
    },
    "status": {
        "phase": "Terminating"
    }
}

Delete the entire "resourceVersion": "16595026", line, and empty out the contents of the "finalizers" array under metadata:

{
    "apiVersion": "v1",
    "kind": "Namespace",
    "metadata": {
        "annotations": {
            "cattle.io/status": "{\"Conditions\":[{\"Type\":\"ResourceQuotaInit\",\"Status\":\"True\",\"Message\":\"\",\"LastUpdateTime\":\"2019-08-19T08:21:42Z\"},{\"Type\":\"InitialRolesPopulated\",\"Status\":\"True\",\"Message\":\"\",\"LastUpdateTime\":\"2019-08-19T08:21:47Z\"}]}",
            "field.cattle.io/projectId": "c-zcf9v:p-lwgnw",
            "kubectl.kubernetes.io/last-applied-configuration": "{\"apiVersion\":\"v1\",\"kind\":\"Namespace\",\"metadata\":{\"annotations\":{},\"name\":\"cattle-system\"}}\n",
            "lifecycle.cattle.io/create.namespace-auth": "true"
        },
        "creationTimestamp": "2019-08-19T08:21:31Z",
        "deletionGracePeriodSeconds": 0,
        "deletionTimestamp": "2019-09-17T10:15:32Z",
        "finalizers": [],
        "labels": {
            "field.cattle.io/projectId": "p-lwgnw"
        },
        "name": "cattle-system",
        "selfLink": "/api/v1/namespaces/cattle-system",
        "uid": "59412a0d-c25a-11e9-ba0d-00505680cc08"
    },
    "spec": {
        "finalizers": []
    },
    "status": {
        "phase": "Terminating"
    }
}
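
Both manual edits can also be made in one pass; a sketch of an equivalent jq filter that produces the same temp.json directly:

# Drop resourceVersion and empty both finalizer lists in a single step
kubectl get namespace $NAMESPACE -o json \
  | jq 'del(.metadata.resourceVersion) | .metadata.finalizers = [] | .spec.finalizers = []' \
  > temp.json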

Run the following:

curl -k -H "Content-Type: application/json" -X PUT --data-binary @temp.json 127.0.0.1:8001/api/v1/namespaces/$NAMESPACE/finalize

Check the result with kubectl:

[root@k8s-m1 overlord]# kubectl get ns
NAME                STATUS   AGE
cattle-prometheus   Active   29d
cj-test             Active   14d
cxx1                Active   79d
default             Active   118d
ingress-nginx       Active   39d
iov-paas            Active   78d
iov-test            Active   104d
istio-app-test      Active   35d
istio-system        Active   54d
kube-public         Active   118d
kube-system         Active   118d

If the method above still fails to delete it, a second approach is to delete the underlying data directly from etcd:

# Delete the pod named pod-to-be-deleted-0 in the default namespace
ETCDCTL_API=3 etcdctl del /registry/pods/default/pod-to-be-deleted-0

# Delete the namespace that needs to be removed
ETCDCTL_API=3 etcdctl del /registry/namespaces/NAMESPACENAME
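
Before deleting anything this way, it is prudent to check what actually lives under the key prefix; etcdctl can list keys without printing their values:

# List namespace keys under the registry prefix, values omitted
ETCDCTL_API=3 etcdctl get /registry/namespaces/ --prefix --keys-only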
