Talos HostDNS and Calico

Ever since upgrading from Talos 1.6.x to Talos 1.7.1, I have wanted to enable the hostDNS feature together with forwardKubeDNSToHost. Given that this will be the default in Talos 1.8.x once it is released, there is also some urgency to getting it working.

Unfortunately, every time I attempted this, the pods inside my cluster suddenly could no longer resolve names. On my last attempt, I noticed in the CoreDNS logs that CoreDNS could not reach the DNS server on the host. Back then, I blamed that on Calico - I am not sure how many people are using Calico on top of Talos. But this morning I thought: “Maybe this is somehow related to that whole topic of certificates I just ran into…”

Latest Attempt

I just made another attempt: I went through the nodes, enabled forwardKubeDNSToHost, and restarted all CoreDNS pods. Same result as before:

[ERROR] plugin/errors: 2 9156655801114656930.4985833156107857953. HINFO: read udp 10.244.151.132:54822->10.96.0.9:53: i/o timeout
[INFO] 127.0.0.1:59243 - 63620 "HINFO IN 9156655801114656930.4985833156107857953. udp 57 false 512" - - 0 4.002302478s
[ERROR] plugin/errors: 2 9156655801114656930.4985833156107857953. HINFO: read udp 10.244.151.132:52801->10.96.0.9:53: i/o timeout
[INFO] 127.0.0.1:58276 - 32757 "HINFO IN 9156655801114656930.4985833156107857953. udp 57 false 512" - - 0 2.000879759s
[ERROR] plugin/errors: 2 9156655801114656930.4985833156107857953. HINFO: read udp 10.244.151.132:51477->10.96.0.9:53: i/o timeout
[INFO] 127.0.0.1:60880 - 42081 "HINFO IN 9156655801114656930.4985833156107857953. udp 57 false 512" - - 0 2.000720577s
[ERROR] plugin/errors: 2 9156655801114656930.4985833156107857953. HINFO: read udp 10.244.151.132:53465->10.96.0.9:53: i/o timeout
[INFO] 127.0.0.1:35883 - 18883 "HINFO IN 9156655801114656930.4985833156107857953. udp 57 false 512" - - 0 2.002205095s
[ERROR] plugin/errors: 2 9156655801114656930.4985833156107857953. HINFO: read udp 10.244.151.132:55449->10.96.0.9:53: i/o timeout
[INFO] 127.0.0.1:58355 - 29032 "HINFO IN 9156655801114656930.4985833156107857953. udp 57 false 512" - - 0 2.001254107s

So apparently there is a communication issue here: every query CoreDNS sends to 10.96.0.9:53 times out. Looking at the Talos issue tracker on GitHub, I found “Enabling forwardKubeDNSToHost results in i/o timeout”, which describes a similar issue with Cilium. (Note to myself: looking for other people with the same problem works.) Reading through that issue, at least for Cilium the problem seems to be related to eBPF, something that I have also enabled for Calico. Unfortunately, Calico has no setting that sounds even vaguely similar to the one that helped fix this with Cilium. The only thing I remembered was turning on Direct Server Return (DSR) when enabling eBPF, but changing that back to Tunnel did not fix the problem.
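For reference, the DSR/Tunnel switch mentioned above is, as far as I can tell, the bpfExternalServiceMode field of Calico's FelixConfiguration. The following is only a minimal sketch, assuming the eBPF data plane is configured through the default FelixConfiguration resource; a cluster managed differently (for example, entirely through the operator) may look different.

```yaml
# Sketch only: assumes the cluster-wide "default" FelixConfiguration is in use.
apiVersion: projectcalico.org/v3
kind: FelixConfiguration
metadata:
  name: default
spec:
  # eBPF data plane stays enabled
  bpfEnabled: true
  # Switched back from DSR to Tunnel; this did not fix the DNS timeouts
  bpfExternalServiceMode: Tunnel
```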

For now, I’ve configured all the nodes to explicitly set forwardKubeDNSToHost to false so that I do not run into any issue when upgrading to Talos 1.8.x…
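A minimal sketch of what such a machine config patch can look like, assuming hostDNS itself stays enabled and only the forwarding is pinned to off. The file name hostdns-patch.yaml is just an illustration.

```yaml
# hostdns-patch.yaml (hypothetical file name)
machine:
  features:
    hostDNS:
      # Keep the host-level DNS caching resolver enabled
      enabled: true
      # Explicitly off, so a changed default in Talos 1.8.x does not flip it on
      forwardKubeDNSToHost: false
```

Setting forwardKubeDNSToHost to true in the same patch is what the attempt above enabled. A patch like this can be applied per node, for example with talosctl patch machineconfig.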

Last modified: 29 July 2024