Analysis of kube-proxy Working Modes

As we know, kube-proxy supports two modes: iptables and ipvs. The ipvs mode was introduced in Kubernetes v1.8, reached beta in v1.9, and became generally available in v1.11. The iptables mode was added back in v1.1 and has been kube-proxy's default mode since v1.2. Both ipvs and iptables are built on netfilter, so what are the differences between the two modes?

  • ipvs offers better scalability and performance for large clusters
  • ipvs supports more sophisticated load-balancing algorithms than iptables (least load, least connections, weighted, and so on)
  • ipvs supports server health checking, connection retries, and similar features
  • ipset sets can be modified dynamically, even while iptables rules are referencing them

ipvs depends on iptables

ipvs cannot provide packet filtering, SNAT, or masquerading on its own, so in certain scenarios (such as implementing NodePort) it still has to work together with iptables. ipvs uses ipset to store the source or destination addresses of traffic that must be DROPped or masqueraded, which keeps the number of iptables rules constant. Suppose we want to block tens of thousands of IPs from reaching our servers: with plain iptables we would have to add rules one by one, producing a huge rule set; with ipset we simply add the relevant IP addresses (or CIDRs) to an ipset set, and only a handful of iptables rules are needed to achieve the same goal.
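As a minimal sketch of that idea (the set name blocklist and the addresses below are made up for illustration, not something kube-proxy creates):

# create a set and fill it with the networks to block
ipset create blocklist hash:net
ipset add blocklist 203.0.113.0/24
ipset add blocklist 198.51.100.0/24

# one iptables rule covers every entry in the set, and the set
# can keep changing without touching iptables again
iptables -A INPUT -m set --match-set blocklist src -j DROP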

Running kube-proxy in ipvs mode

Install the dependency packages on every machine:

[root@k8s-m1 ~]# yum install ipvsadm ipset sysstat conntrack libseccomp -y

On every machine, configure the kernel modules that must be loaded at boot. The following are the modules required by ipvs mode, set to load automatically on startup:

[root@k8s-m1 ~]# :> /etc/modules-load.d/ipvs.conf
module=(
ip_vs
ip_vs_rr
ip_vs_wrr
ip_vs_sh
nf_conntrack
br_netfilter
)
for kernel_module in ${module[@]};do
/sbin/modinfo -F filename $kernel_module |& grep -qv ERROR && echo $kernel_module >> /etc/modules-load.d/ipvs.conf || :
done
systemctl enable --now systemd-modules-load.service

If the systemctl enable command above fails, run systemctl status -l systemd-modules-load.service to see which kernel module could not be loaded, comment it out in /etc/modules-load.d/ipvs.conf, and try enable again.
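Once the service is up, a quick sanity check that the modules were actually loaded (the exact list depends on your kernel version):

lsmod | grep -E 'ip_vs|nf_conntrack'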

Every machine also needs the kernel parameters below, set in /etc/sysctl.d/k8s.conf.

[root@k8s-m1 ~]# cat <<EOF > /etc/sysctl.d/k8s.conf
# https://github.com/moby/moby/issues/31208
# ipvsadm -l --timeout
# fix long-lived connection timeouts in ipvs mode; keep tcp_keepalive_time below 900
net.ipv4.tcp_keepalive_time = 600
net.ipv4.tcp_keepalive_intvl = 30
net.ipv4.tcp_keepalive_probes = 10
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1
net.ipv4.neigh.default.gc_stale_time = 120
net.ipv4.conf.all.rp_filter = 0
net.ipv4.conf.default.rp_filter = 0
net.ipv4.conf.default.arp_announce = 2
net.ipv4.conf.lo.arp_announce = 2
net.ipv4.conf.all.arp_announce = 2
net.ipv4.ip_forward = 1
net.ipv4.tcp_max_tw_buckets = 5000
net.ipv4.tcp_syncookies = 1
net.ipv4.tcp_max_syn_backlog = 1024
net.ipv4.tcp_synack_retries = 2
# let iptables see bridged traffic (required by kube-proxy and most CNI plugins)
net.bridge.bridge-nf-call-ip6tables = 1
net.bridge.bridge-nf-call-iptables = 1
net.bridge.bridge-nf-call-arptables = 1
net.netfilter.nf_conntrack_max = 2310720
fs.inotify.max_user_watches=89100
fs.may_detach_mounts = 1
fs.file-max = 52706963
fs.nr_open = 52706963
vm.swappiness = 0
vm.overcommit_memory=1
vm.panic_on_oom=0
EOF


[root@k8s-m1 ~]# sysctl --system

Edit the kube-proxy configuration and set mode to ipvs (how to apply the change in a running cluster is sketched after the config below):

hostnameOverride: k8s-m1
iptables:
  masqueradeAll: true
  masqueradeBit: 14
  minSyncPeriod: 0s
  syncPeriod: 30s
ipvs:
  excludeCIDRs: null
  minSyncPeriod: 0s
  scheduler: ""
  syncPeriod: 30s
kind: KubeProxyConfiguration
metricsBindAddress: 192.168.0.200:10249
mode: "ipvs"
nodePortAddresses: null
oomScoreAdj: -999
portRange: ""
resourceContainer: /kube-proxy
udpIdleTimeout: 250ms
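In a kubeadm-provisioned cluster this configuration normally lives in the kube-proxy ConfigMap (an empty ipvs.scheduler falls back to rr). A sketch of applying the change, assuming the default kubeadm ConfigMap name and pod label:

# switch mode to "ipvs" in the config
kubectl -n kube-system edit configmap kube-proxy

# recreate the kube-proxy pods so they pick up the new mode
kubectl -n kube-system delete pod -l k8s-app=kube-proxy

# confirm which proxier is in use
kubectl -n kube-system logs -l k8s-app=kube-proxy | grep -i proxier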

When a ClusterIP Service is created, the IPVS proxier does three things:

  • make sure a dummy interface exists on the node, kube-ipvs0 by default
  • bind the Service IP address to the dummy interface
  • create an IPVS virtual server for each Service IP address

Here is an example:

[root@k8s-m1 ~]# kubectl describe svc tomcat-service
Name: tomcat-service
Namespace: default
Labels: <none>
Annotations: kubectl.kubernetes.io/last-applied-configuration:
{"apiVersion":"v1","kind":"Service","metadata":{"annotations":{},"name":"tomcat-service","namespace":"default"},"spec":{"ports":[{"port":8...
Selector: app=tomcat
Type: ClusterIP
IP: 10.106.88.77
Port: <unset> 8080/TCP
TargetPort: 8080/TCP
Endpoints: 10.244.0.48:8080
Session Affinity: None
Events: <none>
[root@k8s-m1 ~]# ip -4 a
8: kube-ipvs0: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN group default
inet 10.96.0.10/32 brd 10.96.0.10 scope global kube-ipvs0
valid_lft forever preferred_lft forever
inet 10.101.68.42/32 brd 10.101.68.42 scope global kube-ipvs0
valid_lft forever preferred_lft forever
inet 10.96.0.1/32 brd 10.96.0.1 scope global kube-ipvs0
valid_lft forever preferred_lft forever
inet 10.107.7.203/32 brd 10.107.7.203 scope global kube-ipvs0
valid_lft forever preferred_lft forever
inet 10.106.88.77/32 brd 10.106.88.77 scope global kube-ipvs0
valid_lft forever preferred_lft forever
inet 10.98.230.124/32 brd 10.98.230.124 scope global kube-ipvs0
valid_lft forever preferred_lft forever
inet 10.103.49.63/32 brd 10.103.49.63 scope global kube-ipvs0
valid_lft forever preferred_lft forever
[root@k8s-m1 ~]# ipvsadm -ln
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
-> RemoteAddress:Port Forward Weight ActiveConn InActConn
TCP 172.17.0.1:30024 rr
-> 10.244.4.21:3000 Masq 1 0 0
TCP 192.168.0.200:30024 rr
-> 10.244.4.21:3000 Masq 1 0 0
TCP 192.168.0.200:30040 rr
-> 10.244.4.28:9090 Masq 1 0 0
TCP 10.96.0.1:443 rr
-> 192.168.0.200:6443 Masq 1 0 0
-> 192.168.0.201:6443 Masq 1 1 0
-> 192.168.0.202:6443 Masq 1 0 0

Deleting a Kubernetes Service triggers removal of the corresponding IPVS virtual server, its real servers, and the IP address bound to the dummy interface.
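A quick way to watch this happen with the tomcat-service from the example above (10.106.88.77 is the ClusterIP shown there and will differ in your cluster):

kubectl delete svc tomcat-service

# the virtual server should be gone ...
ipvsadm -ln | grep 10.106.88.77

# ... and the address should no longer be bound to kube-ipvs0
ip addr show kube-ipvs0 | grep 10.106.88.77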

Port mapping:

IPVS has three packet-forwarding modes: NAT (masq), IPIP, and DR. Only NAT supports port mapping, so kube-proxy uses NAT mode. The following example shows IPVS mapping Service port 8080 to Pod port 80.

TCP  10.107.7.203:8080 rr
-> 10.244.4.14:80 Masq 1 0 0
-> 10.244.4.15:80 Masq 1 0 0
-> 10.244.4.16:80 Masq 1 0 0
-> 10.244.4.20:80 Masq 1 0 0
-> 10.244.4.22:80 Masq 1 0 0
-> 10.244.4.23:80 Masq 1 0 0
-> 10.244.4.24:80 Masq 1 0 0

Session affinity:

IPVS supports client-IP session affinity (persistent connections). When a Service specifies session affinity, the IPVS proxier sets a timeout on the IPVS virtual server (180 minutes = 10800 seconds by default). For example:

[root@k8s-m1 ~]# kubectl describe svc nginx-service
Name: nginx-service
...
IP: 10.102.128.4
Port: http 3080/TCP
Session Affinity: ClientIP


[root@k8s-m1 ~]# ipvsadm -ln
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
-> RemoteAddress:Port Forward Weight ActiveConn InActConn
TCP 10.102.128.4:3080 rr persistent 10800
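For reference, session affinity is configured on the Service itself; a minimal manifest sketch (names and ports are illustrative, and timeoutSeconds overrides the 10800 s default):

kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Service
metadata:
  name: nginx-service
spec:
  selector:
    app: nginx
  ports:
  - name: http
    port: 3080
    targetPort: 80
  sessionAffinity: ClientIP
  sessionAffinityConfig:
    clientIP:
      timeoutSeconds: 10800
EOF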

The ipvs proxier falls back on iptables in the following five cases:

  • kube-proxy is started with --masquerade-all=true
  • kube-proxy is started with --cluster-cidr=<cidr>
  • Services of type LoadBalancer
  • Services of type NodePort
  • Services with externalIPs specified

kube-proxy started with --masquerade-all=true

If kube-proxy is started with --masquerade-all=true, ipvs masquerades all traffic destined for Service Cluster IPs, matching the iptables mode behaviour. The iptables rules added by ipvs look like this:

[root@k8s-m1 ~]# iptables -t nat -nL

Chain PREROUTING (policy ACCEPT)
target prot opt source destination
KUBE-SERVICES all -- 0.0.0.0/0 0.0.0.0/0 /* kubernetes service portals */

Chain OUTPUT (policy ACCEPT)
target prot opt source destination
KUBE-SERVICES all -- 0.0.0.0/0 0.0.0.0/0 /* kubernetes service portals */

Chain POSTROUTING (policy ACCEPT)
target prot opt source destination
KUBE-POSTROUTING all -- 0.0.0.0/0 0.0.0.0/0 /* kubernetes postrouting rules */

Chain KUBE-MARK-MASQ (2 references)
target prot opt source destination
MARK all -- 0.0.0.0/0 0.0.0.0/0 MARK or 0x4000

Chain KUBE-POSTROUTING (1 references)
target prot opt source destination
MASQUERADE all -- 0.0.0.0/0 0.0.0.0/0 /* kubernetes service traffic requiring SNAT */ mark match 0x4000/0x4000
MASQUERADE all -- 0.0.0.0/0 0.0.0.0/0 match-set KUBE-LOOP-BACK dst,dst,src

Chain KUBE-SERVICES (2 references)
target prot opt source destination
KUBE-MARK-MASQ all -- 0.0.0.0/0 0.0.0.0/0 match-set KUBE-CLUSTER-IP dst,dst
ACCEPT all -- 0.0.0.0/0 0.0.0.0/0 match-set KUBE-CLUSTER-IP dst,dst

kube-proxy started with a cluster CIDR

If kube-proxy is started with --cluster-cidr=<cidr>, ipvs masquerades off-cluster traffic destined for Service Cluster IPs, again matching the iptables mode. Assuming kube-proxy is given a cluster CIDR of 10.244.16.0/24, the iptables rules added by ipvs look like this:

[root@k8s-m1 ~]# iptables -t nat -nL

Chain PREROUTING (policy ACCEPT)
target prot opt source destination
KUBE-SERVICES all -- 0.0.0.0/0 0.0.0.0/0 /* kubernetes service portals */

Chain OUTPUT (policy ACCEPT)
target prot opt source destination
KUBE-SERVICES all -- 0.0.0.0/0 0.0.0.0/0 /* kubernetes service portals */

Chain POSTROUTING (policy ACCEPT)
target prot opt source destination
KUBE-POSTROUTING all -- 0.0.0.0/0 0.0.0.0/0 /* kubernetes postrouting rules */

Chain KUBE-MARK-MASQ (3 references)
target prot opt source destination
MARK all -- 0.0.0.0/0 0.0.0.0/0 MARK or 0x4000

Chain KUBE-POSTROUTING (1 references)
target prot opt source destination
MASQUERADE all -- 0.0.0.0/0 0.0.0.0/0 /* kubernetes service traffic requiring SNAT */ mark match 0x4000/0x4000
MASQUERADE all -- 0.0.0.0/0 0.0.0.0/0 match-set KUBE-LOOP-BACK dst,dst,src

Chain KUBE-SERVICES (2 references)
target prot opt source destination
KUBE-MARK-MASQ all -- !10.244.16.0/24 0.0.0.0/0 match-set KUBE-CLUSTER-IP dst,dst
ACCEPT all -- 0.0.0.0/0 0.0.0.0/0 match-set KUBE-CLUSTER-IP dst,dst

Services of type LoadBalancer

For a Service of type LoadBalancer, ipvs installs iptables rules that match the KUBE-LOAD-BALANCER ipset. In particular, when the Service specifies LoadBalancerSourceRanges or externalTrafficPolicy=Local, ipvs creates the ipset sets KUBE-LOAD-BALANCER-LOCAL/KUBE-LOAD-BALANCER-FW/KUBE-LOAD-BALANCER-SOURCE-CIDR and adds the corresponding iptables rules, shown below:

# iptables -t nat -nL

Chain PREROUTING (policy ACCEPT)
target prot opt source destination
KUBE-SERVICES all -- 0.0.0.0/0 0.0.0.0/0 /* kubernetes service portals */

Chain OUTPUT (policy ACCEPT)
target prot opt source destination
KUBE-SERVICES all -- 0.0.0.0/0 0.0.0.0/0 /* kubernetes service portals */

Chain POSTROUTING (policy ACCEPT)
target prot opt source destination
KUBE-POSTROUTING all -- 0.0.0.0/0 0.0.0.0/0 /* kubernetes postrouting rules */

Chain KUBE-FIREWALL (1 references)
target prot opt source destination
RETURN all -- 0.0.0.0/0 0.0.0.0/0 match-set KUBE-LOAD-BALANCER-SOURCE-CIDR dst,dst,src
KUBE-MARK-DROP all -- 0.0.0.0/0 0.0.0.0/0

Chain KUBE-LOAD-BALANCER (1 references)
target prot opt source destination
KUBE-FIREWALL all -- 0.0.0.0/0 0.0.0.0/0 match-set KUBE-LOAD-BALANCER-FW dst,dst
RETURN all -- 0.0.0.0/0 0.0.0.0/0 match-set KUBE-LOAD-BALANCER-LOCAL dst,dst
KUBE-MARK-MASQ all -- 0.0.0.0/0 0.0.0.0/0

Chain KUBE-MARK-DROP (1 references)
target prot opt source destination
MARK all -- 0.0.0.0/0 0.0.0.0/0 MARK or 0x8000

Chain KUBE-MARK-MASQ (2 references)
target prot opt source destination
MARK all -- 0.0.0.0/0 0.0.0.0/0 MARK or 0x4000

Chain KUBE-POSTROUTING (1 references)
target prot opt source destination
MASQUERADE all -- 0.0.0.0/0 0.0.0.0/0 /* kubernetes service traffic requiring SNAT */ mark match 0x4000/0x4000
MASQUERADE all -- 0.0.0.0/0 0.0.0.0/0 match-set KUBE-LOOP-BACK dst,dst,src

Chain KUBE-SERVICES (2 references)
target prot opt source destination
KUBE-LOAD-BALANCER all -- 0.0.0.0/0 0.0.0.0/0 match-set KUBE-LOAD-BALANCER dst,dst
ACCEPT all -- 0.0.0.0/0 0.0.0.0/0 match-set KUBE-LOAD-BALANCER dst,dst

Services of type NodePort

For a Service of type NodePort, ipvs adds iptables rules that match the KUBE-NODE-PORT-TCP/KUBE-NODE-PORT-UDP ipsets. When externalTrafficPolicy=Local is specified, ipvs creates the ipset sets KUBE-NODE-PORT-LOCAL-TCP/KUBE-NODE-PORT-LOCAL-UDP and installs the corresponding iptables rules, shown below (assuming the Service uses a TCP nodePort):

# iptables -t nat -nL

Chain PREROUTING (policy ACCEPT)
target prot opt source destination
KUBE-SERVICES all -- 0.0.0.0/0 0.0.0.0/0 /* kubernetes service portals */

Chain OUTPUT (policy ACCEPT)
target prot opt source destination
KUBE-SERVICES all -- 0.0.0.0/0 0.0.0.0/0 /* kubernetes service portals */

Chain POSTROUTING (policy ACCEPT)
target prot opt source destination
KUBE-POSTROUTING all -- 0.0.0.0/0 0.0.0.0/0 /* kubernetes postrouting rules */

Chain KUBE-MARK-MASQ (2 references)
target prot opt source destination
MARK all -- 0.0.0.0/0 0.0.0.0/0 MARK or 0x4000

Chain KUBE-NODE-PORT (1 references)
target prot opt source destination
RETURN all -- 0.0.0.0/0 0.0.0.0/0 match-set KUBE-NODE-PORT-LOCAL-TCP dst
KUBE-MARK-MASQ all -- 0.0.0.0/0 0.0.0.0/0

Chain KUBE-POSTROUTING (1 references)
target prot opt source destination
MASQUERADE all -- 0.0.0.0/0 0.0.0.0/0 /* kubernetes service traffic requiring SNAT */ mark match 0x4000/0x4000
MASQUERADE all -- 0.0.0.0/0 0.0.0.0/0 match-set KUBE-LOOP-BACK dst,dst,src

Chain KUBE-SERVICES (2 references)
target prot opt source destination
KUBE-NODE-PORT all -- 0.0.0.0/0 0.0.0.0/0 match-set KUBE-NODE-PORT-TCP dst

Services with externalIPs

For a Service with externalIPs specified, ipvs installs iptables rules that match the KUBE-EXTERNAL-IP ipset. Assuming we have such a Service, the iptables rules look like this:

# iptables -t nat -nL

Chain PREROUTING (policy ACCEPT)
target prot opt source destination
KUBE-SERVICES all -- 0.0.0.0/0 0.0.0.0/0 /* kubernetes service portals */

Chain OUTPUT (policy ACCEPT)
target prot opt source destination
KUBE-SERVICES all -- 0.0.0.0/0 0.0.0.0/0 /* kubernetes service portals */

Chain POSTROUTING (policy ACCEPT)
target prot opt source destination
KUBE-POSTROUTING all -- 0.0.0.0/0 0.0.0.0/0 /* kubernetes postrouting rules */

Chain KUBE-MARK-MASQ (2 references)
target prot opt source destination
MARK all -- 0.0.0.0/0 0.0.0.0/0 MARK or 0x4000

Chain KUBE-POSTROUTING (1 references)
target prot opt source destination
MASQUERADE all -- 0.0.0.0/0 0.0.0.0/0 /* kubernetes service traffic requiring SNAT */ mark match 0x4000/0x4000
MASQUERADE all -- 0.0.0.0/0 0.0.0.0/0 match-set KUBE-LOOP-BACK dst,dst,src

Chain KUBE-SERVICES (2 references)
target prot opt source destination
KUBE-MARK-MASQ all -- 0.0.0.0/0 0.0.0.0/0 match-set KUBE-EXTERNAL-IP dst,dst
ACCEPT all -- 0.0.0.0/0 0.0.0.0/0 match-set KUBE-EXTERNAL-IP dst,dst PHYSDEV match ! --physdev-is-in ADDRTYPE match src-type !LOCAL
ACCEPT all -- 0.0.0.0/0 0.0.0.0/0 match-set KUBE-EXTERNAL-IP dst,dst ADDRTYPE match dst-type LOCAL

Packet flow in IPVS mode

Inbound traffic

Inbound traffic is traffic that reaches a Service from outside the cluster.
Its iptables chain path is PREROUTING@nat -> INPUT@nat.

ClusterIP

In the PREROUTING stage, traffic jumps to the KUBE-SERVICES chain:

[root@k8s-n-1920168091021 overlord]# iptables -L -n -t nat
Chain PREROUTING (policy ACCEPT)
target prot opt source destination
KUBE-SERVICES all -- 0.0.0.0/0 0.0.0.0/0 /* kubernetes service portals */
DOCKER all -- 0.0.0.0/0 0.0.0.0/0 ADDRTYPE match dst-type LOCAL

The KUBE-SERVICES chain looks like this:

Chain KUBE-SERVICES (2 references)
target prot opt source destination
KUBE-MARK-MASQ all -- !10.244.0.0/16 0.0.0.0/0 /* Kubernetes service cluster ip + port for masquerade purpose */ match-set KUBE-CLUSTER-IP dst,dst
KUBE-NODE-PORT all -- 0.0.0.0/0 0.0.0.0/0 ADDRTYPE match dst-type LOCAL
ACCEPT all -- 0.0.0.0/0 0.0.0.0/0 match-set KUBE-CLUSTER-IP dst,dst

Traffic to a ClusterIP Service is handed to KUBE-MARK-MASQ; the match is against the kernel ipset named KUBE-CLUSTER-IP (packets whose source address is not in 10.244.0.0/16 are sent to KUBE-MARK-MASQ).

The next step is to mark these packets:

Chain KUBE-MARK-MASQ (3 references)
target prot opt source destination
MARK all -- 0.0.0.0/0 0.0.0.0/0 MARK or 0x4000

At this point the packet has finished its path through iptables; there is no subsequent DNAT to a backend endpoint here, because that step is handled by IPVS. (All this rule does is mark packets matching the KUBE-CLUSTER-IP ipset with 0x4000; packets carrying this mark are later MASQUERADEd in the KUBE-POSTROUTING chain.)
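To convince yourself that the DNAT really happens inside IPVS rather than in iptables, you can inspect a single virtual server and the conntrack table (10.96.0.1:443 is the apiserver VIP from the earlier examples and will differ per cluster):

# show only the virtual server for the kubernetes API Service
ipvsadm -Ln -t 10.96.0.1:443

# conntrack shows the real (node/pod) destination chosen for that VIP
conntrack -L -d 10.96.0.1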

Checking the ipvs proxy rules

You can use the ipvsadm tool to check whether kube-proxy is maintaining the expected ipvs rules. For example, suppose the cluster has the following Services:

# kubectl get svc --all-namespaces
NAMESPACE NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
default kubernetes ClusterIP 10.0.0.1 <none> 443/TCP 1d
kube-system kube-dns ClusterIP 10.0.0.10 <none> 53/UDP,53/TCP 1d

We should then see ipvs rules like these:

 # ipvsadm -ln
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
-> RemoteAddress:Port Forward Weight ActiveConn InActConn
TCP 10.0.0.1:443 rr persistent 10800
-> 192.168.0.1:6443 Masq 1 1 0
TCP 10.0.0.10:53 rr
-> 172.17.0.2:53 Masq 1 0 0
UDP 10.0.0.10:53 rr
-> 172.17.0.2:53 Masq 1 0 0

Outbound traffic

Outbound traffic is traffic from a pod inside the cluster to a Service.
Its iptables chain path is OUTPUT@nat -> POSTROUTING@nat.
The OUTPUT chain is shown below; just as for inbound traffic, everything jumps to the KUBE-SERVICES chain:

[root@k8s-n-1920168091021 overlord]# iptables -L -n -t nat
Chain OUTPUT (policy ACCEPT)
target prot opt source destination
KUBE-SERVICES all -- 0.0.0.0/0 0.0.0.0/0 /* kubernetes service portals */
DOCKER all -- 0.0.0.0/0 !127.0.0.0/8 ADDRTYPE match dst-type LOCAL

What happens next is the same as for inbound traffic: whether the target is a ClusterIP Service or a NodePort Service, the packet is marked with 0x4000. The difference is that the inbound iptables path ends there, whereas outbound traffic still passes through the POSTROUTING chain of the nat table, defined as follows:

[root@k8s-n-1920168091021 overlord]# iptables -L -n -t nat
Chain POSTROUTING (policy ACCEPT)
target prot opt source destination
KUBE-POSTROUTING all -- 0.0.0.0/0 0.0.0.0/0 /* kubernetes postrouting rules */
MASQUERADE all -- 172.17.0.0/16 0.0.0.0/0
MASQUERADE all -- 233.233.5.0/24 0.0.0.0/0
RETURN all -- 10.244.0.0/16 10.244.0.0/16
MASQUERADE all -- 10.244.0.0/16 !224.0.0.0/4
RETURN all -- !10.244.0.0/16 10.244.16.0/24
MASQUERADE all -- !10.244.0.0/16 10.244.0.0/16

From there it jumps into the KUBE-POSTROUTING chain:

[root@k8s-n-1920168091021 overlord]# iptables -L -n -t nat
Chain KUBE-POSTROUTING (1 references)
target prot opt source destination
MASQUERADE all -- 0.0.0.0/0 0.0.0.0/0 /* kubernetes service traffic requiring SNAT */ mark match 0x4000/0x4000
MASQUERADE all -- 0.0.0.0/0 0.0.0.0/0 /* Kubernetes endpoints dst ip:port, source ip for solving hairpin purpose */ match-set KUBE-LOOP-BACK dst,dst,src

Here, outbound packets previously marked with 0x4000 hit the MASQUERADE target, an SNAT-like operation that rewrites their source IP to the ClusterIP or the node IP.

How marked traffic is handled (filter table)

[root@k8s-n-1920168091021 overlord]# iptables -L -n

Chain KUBE-FIREWALL (2 references)
target prot opt source destination
DROP all -- 0.0.0.0/0 0.0.0.0/0 /* kubernetes firewall for dropping marked packets */ mark match 0x8000/0x8000

Chain KUBE-FORWARD (1 references)
target prot opt source destination
ACCEPT all -- 0.0.0.0/0 0.0.0.0/0 /* kubernetes forwarding rules */ mark match 0x4000/0x4000
ACCEPT all -- 10.244.0.0/16 0.0.0.0/0 /* kubernetes forwarding conntrack pod source rule */ ctstate RELATED,ESTABLISHED
ACCEPT all -- 0.0.0.0/0 10.244.0.0/16 /* kubernetes forwarding conntrack pod destination rule */ ctstate RELATED,ESTABLISHED

Using the ipset command (the addresses matched by match-set in the iptables rules come from these sets):

[root@k8s-n-1920168091021 overlord]# ipset list
Name: KUBE-CLUSTER-IP
Type: hash:ip,port
Revision: 2
Header: family inet hashsize 1024 maxelem 65536
Size in memory: 17584
References: 2
Members:
10.105.97.136,tcp:80
10.105.235.100,tcp:8452
10.98.53.160,tcp:80
10.97.204.141,tcp:80
10.108.115.91,tcp:80
10.98.118.117,tcp:80
10.96.0.1,tcp:443
10.101.26.124,tcp:443
10.98.88.140,tcp:8080
10.108.210.26,tcp:3306
10.96.0.10,tcp:9153
10.96.164.37,tcp:443
10.109.162.103,tcp:80
10.110.237.2,tcp:80
10.101.206.6,tcp:7030
10.111.154.57,tcp:8451
10.110.94.131,tcp:1111
10.98.146.210,tcp:7020
10.103.144.159,tcp:44134
10.96.0.10,tcp:53
10.98.88.140,tcp:8081
10.100.77.215,tcp:80
10.111.2.26,tcp:80
10.104.58.177,tcp:2181
10.97.58.7,tcp:80
10.111.11.67,tcp:8080
10.109.196.230,tcp:9090
10.98.39.12,tcp:5672
10.98.254.44,tcp:6379
10.96.0.10,udp:53
10.100.189.66,tcp:80
10.96.160.63,tcp:7010
10.97.217.217,tcp:3306
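Besides dumping a whole set, ipset can answer point queries, which is handy when debugging a single Service (the entry tested below is the apiserver VIP from the listing above):

# list only the names of the sets kube-proxy maintains
ipset list -n

# check whether a specific ip,port entry is present in a set
ipset test KUBE-CLUSTER-IP 10.96.0.1,tcp:443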

Official documentation: https://github.com/kubernetes/kubernetes/blob/master/pkg/proxy/ipvs/README.md

Pitfalls other users have run into:

Running a performance test with ab: after only a handful of requests, the Kubernetes node started logging this error:
kernel: nf_conntrack: table full, dropping packet

This is a classic error. A quick check showed nf_conntrack_max was only 131072, which is clearly not enough; the CentOS 7.3 default should be 65536*4=262144. Something must have changed the value, but searching everywhere turned up nothing. A look at the kube-proxy logs finally showed that kube-proxy itself had changed it!

[root@k8s-m-1 ~]# kubectl logs kube-proxy-q2s4h -n kube-system
W0110 09:32:36.679540 1 server_others.go:263] Flag proxy-mode="" unknown, assuming iptables proxy
I0110 09:32:36.681946 1 server_others.go:117] Using iptables Proxier.
I0110 09:32:36.699112 1 server_others.go:152] Tearing down inactive rules.
I0110 09:32:36.860064 1 conntrack.go:98] Set sysctl 'net/netfilter/nf_conntrack_max' to 131072
I0110 09:32:36.860138 1 conntrack.go:52] Setting nf_conntrack_max to 131072
I0110 09:32:36.860192 1 conntrack.go:98] Set sysctl 'net/netfilter/nf_conntrack_tcp_timeout_established' to 86400
I0110 09:32:36.860230 1 conntrack.go:98] Set sysctl 'net/netfilter/nf_conntrack_tcp_timeout_close_wait' to 3600
I0110 09:32:36.860480 1 config.go:102] Starting endpoints config controller

Finding the culprit

Digging through the source code shows this is a preset value, controlled by kube-proxy flags:

  • --conntrack-max-per-core int32: Maximum number of NAT connections to track per CPU core (0 to leave the limit as-is and ignore conntrack-min). Default 32768, so the total is 32768 * number of CPU cores.
  • --conntrack-min int32: Minimum number of conntrack entries to allocate, regardless of conntrack-max-per-core (set conntrack-max-per-core=0 to leave the limit as-is). Default 131072; kube-proxy effectively applies max(conntrack-min, conntrack-max-per-core * cores), so on a node with 4 or fewer cores the result is 131072 (32768 * 4 = 131072), exactly the value in the log above.

Solution

  • Now that the cause is known, how do we change it? Either raise the kube-proxy flag:
    --conntrack-min=1048576
  • or add the following values to sysctl.conf (kernel parameters; kube-proxy would otherwise settle on 131072). The commands after this list show how to check what is currently in effect.
    net.netfilter.nf_conntrack_max=1048576
    net.nf_conntrack_max=1048576
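Either way, these commands show the values currently in effect on a node (the pod label is the kubeadm default and may differ in your cluster):

# configured limit and current usage of the conntrack table
sysctl net.netfilter.nf_conntrack_max
cat /proc/sys/net/netfilter/nf_conntrack_count

# what kube-proxy decided at startup is visible in its logs
kubectl -n kube-system logs -l k8s-app=kube-proxy | grep -i conntrack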

Tracking down a TCP timeout caused by IPVS

[root@k8s-n-192016801100151 overlord]# ipvsadm -Lnc       
IPVS connection entries
pro expire state source virtual destination
TCP 14:46 ESTABLISHED 192.168.20.163:38150 192.168.110.151:30401 10.244.17.24:80
TCP 01:46 TIME_WAIT 192.168.20.163:37798 192.168.110.151:30401 10.244.17.24:80
TCP 00:01 TIME_WAIT 192.168.20.163:37150 192.168.110.151:30401 10.244.18.30:80
TCP 13:57 ESTABLISHED 192.168.20.163:37890 192.168.110.151:30401 10.244.18.30:80
TCP 14:59 ESTABLISHED 192.168.20.163:38218 192.168.110.151:30401 10.244.18.30:80
TCP 00:51 TIME_WAIT 192.168.20.163:37442 192.168.110.151:30401 10.244.18.30:80
TCP 00:46 TIME_WAIT 192.168.20.163:37424 192.168.110.151:30401 10.244.17.24:80
[root@k8s-n-192016801100151 overlord]# ipvsadm -l --timeout
Timeout (tcp tcpfin udp): 900 120 300

This basically pins down the problem: IPVS keeps its connection entries for the VIP with an idle timeout of roughly 15 minutes. Does that interact with the system's default tcp_keepalive_time? And what is the system's default TCP keepalive, anyway?
IPVS expires idle connection entries after 900 s (15 minutes) by default, while the operating system's default tcp_keepalive_time is 7200 s. When a TCP connection has been idle for 900 s, IPVS drops its entry first, but the OS still considers the connection alive, so the client keeps sending requests over a connection IPVS no longer tracks, resulting in "Lost Connection" errors. The fix is to lower the system's tcp_keepalive_time, for example to 600 s, so that a keepalive probe is sent while the IPVS entry is still alive; the probe also resets the IPVS idle timer back to 15 minutes.

Add the following parameters:

  • How long a connection must sit idle before TCP starts sending keepalive probes. The default is 2 hours; here it is lowered to 10 minutes.
    net.ipv4.tcp_keepalive_time = 600
  • Total number of keepalive probes to send.
    net.ipv4.tcp_keepalive_probes = 10
  • Interval between keepalive probes, in seconds.
    net.ipv4.tcp_keepalive_intvl = 30

With these kernel parameters (or the equivalent daemon-side or client-side options) in place, TCP sessions are torn down according to them: keepalive probes start after 600 seconds of idle time, then one probe is sent every 30 seconds, up to 10 times. If the peer never answers during that window, the session is considered broken and is terminated. Why 600 seconds? Any value smaller than the IPVS default of 900 will do.
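A sketch of applying the settings, plus the alternative of raising the IPVS idle timeout above the 7200 s keepalive default instead (newer kube-proxy versions also expose --ipvs-tcp-timeout / --ipvs-tcpfin-timeout / --ipvs-udp-timeout flags for the same purpose):

cat <<EOF > /etc/sysctl.d/keepalive.conf
net.ipv4.tcp_keepalive_time = 600
net.ipv4.tcp_keepalive_intvl = 30
net.ipv4.tcp_keepalive_probes = 10
EOF
sysctl --system

# alternative: raise the IPVS tcp / tcpfin / udp timeouts (tcp > 7200)
ipvsadm --set 7800 120 300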


Original post: "kube-proxy 工作模式分析" by Mr.Ye, published 2019-08-26: https://system51.github.io/2019/08/26/kube-proxy/