Background
Some time ago a staragent exception caused a Yunxiao (云效) pipeline deployment to fail:
Thinking back, it only started happening after microk8s was enabled, so microk8s became the prime suspect.
kubectl describe node izwz9h8m2chowowqckbcy0z
Several SystemOOM warnings show up under Events.
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Starting 5m57s kube-proxy, izwz9h8m2chowowqckbcy0z Starting kube-proxy.
Normal Starting 5m46s kubelet, izwz9h8m2chowowqckbcy0z Starting kubelet.
Warning InvalidDiskCapacity 5m46s kubelet, izwz9h8m2chowowqckbcy0z invalid capacity 0 on image filesystem
Warning SystemOOM 5m46s kubelet, izwz9h8m2chowowqckbcy0z System OOM encountered, victim process: pip3, pid: 23143
Warning SystemOOM 5m46s kubelet, izwz9h8m2chowowqckbcy0z System OOM encountered, victim process: pip3, pid: 24161
Warning SystemOOM 5m46s kubelet, izwz9h8m2chowowqckbcy0z System OOM encountered, victim process: pip3, pid: 25349
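As an aside, these warnings can also be pulled cluster-wide by filtering events on their reason; a minimal sketch:

# List only SystemOOM events (all namespaces)
kubectl get events --all-namespaces --field-selector reason=SystemOOM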
According to dmesg, this pip3 process had claimed more than 2 GB of virtual memory (what!!!).
# dmesg | grep 23143
[1375956.718910] [23143] 0 23143 528095 435853 3997696 39263 0 pip3
[1375956.718916] Out of memory: Kill process 23143 (pip3) score 613 or sacrifice child
[1375956.719841] Killed process 23143 (pip3) total-vm:2112380kB, anon-rss:1743412kB, file-rss:0kB, shmem-rss:0kB
[1375957.190126] oom_reaper: reaped process 23143 (pip3), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
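A quick sanity check on the numbers: total_vm and rss in the task dump are counted in 4 KiB pages, which matches the kB figures in the kill message:

# 4 KiB pages -> kB
echo $((528095 * 4))   # 2112380, matches total-vm:2112380kB
echo $((435853 * 4))   # 1743412, matches anon-rss:1743412kB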
Unfortunately, the victim is not a staragent-related process, so the initial hypothesis does not hold. What is puzzling is that swap was mostly free, yet the kernel still went straight to a SystemOOM kill, which feels rather heavy-handed.
# swapon
NAME TYPE SIZE USED PRIO
/swapfile file 947.2M 18.6M -2
/swapfile.new file 2G 48K -3
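One plausible factor (a guess, not verified on this node) is vm.swappiness: with a low value the kernel is reluctant to move anonymous pages to swap, so the OOM killer can fire even while swap is largely free. Checking it is straightforward:

# How willing the kernel is to swap (0 = avoid swapping anonymous memory)
sysctl vm.swappiness
cat /proc/sys/vm/swappiness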
So I went through some material to understand how Kubernetes deals with swap.
--fail-swap-on
By default, having swap enabled on the node makes the kubelet fail to start, unless the kubelet flag --fail-swap-on is set to false.
--fail-swap-on    Makes the Kubelet fail to start if swap is enabled on the node. (default true) (DEPRECATED: This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.)
When it is true (the default), swap on the node must be turned off; when it is false, the kubelet starts successfully even with swap enabled on the host, and pods are then allowed to use swap.
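A minimal sketch of both ways to relax the check; the flag form is deprecated, and the config-file form is what the notice above recommends (the exact config path depends on how the kubelet is launched):

# Deprecated flag form:
kubelet --fail-swap-on=false

# Preferred: add this to the KubeletConfiguration file passed via --config:
#   failSwapOn: false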
The discussion around swap memory
A 2017 issue discussed the swap problem: Kubelet/Kubernetes should work with Swap Enabled #53533. It boils down to these points:
- Enabling swap makes memory limits and pod scheduling more complicated. How should swap capacity be measured and accounted for? How should the scheduler factor swap into its placement decisions?
having swap available has very strange and bad interactions with memory limits. For example, a container that hits its memory limit would then start spilling over into swap
- Kubernetes was not designed with swap in mind. Given how complex pod memory usage is, Kubernetes lacks a smart enough policy to arbitrate memory/swap usage across pods, and upstream feels the return on investment is too low, preferring to spend the time on improving stability. (The QoS classes named in the quote below can be inspected as shown in the sketch after this list.)
Support for swap is non-trivial. Guaranteed pods should never require swap. Burstable pods should have their requests met without requiring swap. BestEffort pods have no guarantee. The kubelet right now lacks the smarts to provide the right amount of predictable behavior here across pods. We discussed this topic at the resource mgmt face to face earlier this year. We are not super interested in tackling this in the near term relative to the gains it could realize.
- In real-world scenarios, if swap is disabled, an application has to request enough memory for its peak usage, which can waste resources.
We have a cron job that occasionally runs into high memory usage (>30GB) and we don’t want to permanently allocate 40+GB nodes.
- Since v1.8, the kubelet refuses to run with swap enabled by default. Upstream does not recommend swap; if you turn it on, you own the consequences and should not expect official support.
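For reference, the Guaranteed/Burstable/BestEffort classes quoted above are derived from each container's requests and limits and are recorded on the pod status; a small sketch for inspecting them (the pod name is hypothetical):

# Print the QoS class Kubernetes assigned to a pod
kubectl get pod my-batch-job -o jsonpath='{.status.qosClass}'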
A compromise approach
For apps such as batch jobs that easily hit high peak memory, enabling swap is worth considering.
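A rough sketch of what that looks like on the node, assuming root access and that failSwapOn has already been set to false as above (the file name /swapfile2 is hypothetical):

# Create and enable an extra 2G swapfile for bursty batch workloads
fallocate -l 2G /swapfile2
chmod 600 /swapfile2
mkswap /swapfile2
swapon /swapfile2
swapon   # verify it shows up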