Calico

More Networking Woes...

More Problems

After fixing the issues introduced by enabling the Talos ingress firewall, I quickly realized that I wasn’t completely back to normal yet: some of the deployed services had problems connecting to resources on the local network. The first thing I noticed was that this seemed to be limited to pods with an address from the Pod CIDR; pods with host networking configured seemed to work properly. Assuming this was somehow related to the recent changes, I revisited everything I had changed as part of the ingress firewall configuration and the unfortunate re-deployment of Calico:

- The talhelper configuration, also switching off the firewall
- Both the Tigera and Felix configurations, disabling and enabling BPF
- Any form of Global Network Policy that might be causing the problem

In addition to that, I also rebooted the whole cluster - luckily I am still pretty much in the setup phase - as well as my internet gateway. Nothing.

Finding the Issue

After some googling, I was about to create an issue over at Calico’s GitHub, for which I wanted to provide as much information as possible. While collecting that information, I realized that the problem was limited to one subnet in particular; all the other local networks were accessible just fine. After spending some time thinking about what’s special about this particular network, I went through all the network-related configuration in the cluster and found a MetalLB ipaddresspool that was clearly exposing IP addresses out of that problematic network CIDR - I had set it up months ago when I wanted to expose services as part of that network. Luckily, none of those addresses were in use and I could simply delete the pool. Because I configured MetalLB in such a way that it relies on Calico to do the BGP announcements, I knew I also had to modify the bgpconfiguration for Calico. Looking at that, I noticed it announced 192.168.1.0/24 - the problematic network - via the serviceLoadBalancerIPs property. I removed that as well, and everything started to work again almost instantaneously.

What’s Bothering Me

As always, I am glad I could resolve the issue all by myself, as this is the best way to learn things. However, in this case I don’t feel as if the problem is completely understood. The configuration I had to remove had been in place for months. It didn’t cause any issues through Calico updates, cluster reboots, or even complete restarts of network routers and switches. So it is still a mystery to me why it started causing problems all of a sudden.
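For illustration, this is roughly the shape of the two pieces of configuration mentioned above - the resource names and the exact address range shown here are placeholders, not my actual values:

apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: legacy-pool            # hypothetical name for the old, unused pool
  namespace: metallb-system
spec:
  addresses:
    - 192.168.1.0/24           # the problematic local network
---
apiVersion: projectcalico.org/v3
kind: BGPConfiguration
metadata:
  name: default
spec:
  serviceLoadBalancerIPs:
    - cidr: 192.168.1.0/24     # removing this entry (and the pool above) restored connectivity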

Talos Ingress Firewall and Calico...

So I’ve been trying to set up the Ingress Firewall for Talos as described in the documentation. To play it safe, I started with the firewall for the worker nodes, since it would be easier to recover from any problem if I screwed up the configuration. Initially, everything looked just fine and I rolled out the changes to all the nodes. Again, I was wrong…

Turns out that it’s especially important to read the documentation carefully, especially when working with firewalls. While it had been pretty clear about where to use CLUSTER_SUBNET, I had misinterpreted it as CLUSTER_CIDR, which obviously did not yield the same result. Once that was fixed - which required resetting all worker nodes - I was ready to also apply the firewall changes to the control plane nodes.

Upgrading Calico by Mistake

Today, luckily just a day after my firewall work, I’ve been working on my ArgoCD configuration, which resulted in an update to my Calico configuration. When that happens, the Tigera operator rolls its DaemonSets, everything restarts, and that did not end up in a happy place.

Problem 1: bird: Unable to open configuration file…

The first problem I ran into was calico-node pods not coming up properly. Looking at the logs, I noticed the following, repeated over and over:

/etc/calico/confd/config/bird.cfg: No such file or directory
/etc/calico/confd/config/bird6.cfg: No such file or directory

After some googling, everything pointed to the Typha instances. While they seemed to be running properly, I noticed log output that indicated a communication problem. Why? Because calico-node tries to contact Typha via its CLUSTER_SUBNET address on port 5473, which was blocked by the firewall rules. Adding another rule for both worker and control plane nodes fixed that, and I thought I’d be done. Not so fast…

Problem 2: Some calico-nodes not becoming ready…

The second thing that happened was calico-node pods not becoming ready on some of my cluster nodes. Since I am a heavy user of k9s, that was easily spotted: those pods were stuck being shown in red. When I looked at the readiness probe configuration, I noticed that the following command is executed to determine the state:

$ calico-node -felix-ready -bird-ready

Shelling into one of the non-ready pods, I was able to get some output that helped move things in the right direction: BGP. The bird process uses BGP between the different Calico nodes, and that again requires another port to be opened up: the one for BGP, which is TCP 179.

Summary

While the documentation for the Talos Ingress Firewall points out that UDP 4789 needs to be opened up in order for Calico’s VXLAN to work, it unfortunately mentions neither TCP 179 for BGP nor TCP 5473 for Typha.
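For reference, the additional ingress rules I ended up with look roughly like this - a sketch based on the NetworkRuleConfig document format from the Talos documentation, with 172.20.0.0/24 standing in for the actual node subnet (the CLUSTER_SUBNET):

apiVersion: v1alpha1
kind: NetworkRuleConfig
name: calico-vxlan             # UDP 4789 - documented requirement for Calico VXLAN
portSelector:
  ports:
    - 4789
  protocol: udp
ingress:
  - subnet: 172.20.0.0/24
---
apiVersion: v1alpha1
kind: NetworkRuleConfig
name: calico-typha             # TCP 5473 - calico-node talking to Typha (Problem 1)
portSelector:
  ports:
    - 5473
  protocol: tcp
ingress:
  - subnet: 172.20.0.0/24
---
apiVersion: v1alpha1
kind: NetworkRuleConfig
name: calico-bgp               # TCP 179 - bird’s BGP sessions between nodes (Problem 2)
portSelector:
  ports:
    - 179
  protocol: tcp
ingress:
  - subnet: 172.20.0.0/24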

Talos HostDNS and Calico

Ever since upgrading from Talos 1.6.x to Talos 1.7.1, I had wanted to enable the hostDNS feature together with forwardKubeDNSToHost. Given that this will be the default for Talos 1.8.x once released, there is also some minor urgency associated with this. Unfortunately, every time I attempted it, I ran into weird issues because the pods inside my cluster suddenly had problems with name resolution. When I tried it the last time, I noticed in the CoreDNS logs that communication with the DNS server on the host was not possible for some reason. Back then, I blamed that on Calico - I am not sure how many people are using Calico on top of Talos. But this morning I thought: “Maybe this is somehow related to that whole topic of certificates I just ran into…”

Latest Attempt

I just made another attempt: I went through the nodes, enabled forwardKubeDNSToHost, and restarted all coredns pods. Same result as before:

[ERROR] plugin/errors: 2 9156655801114656930.4985833156107857953. HINFO: read udp 10.244.151.132:54822->10.96.0.9:53: i/o timeout
[INFO] 127.0.0.1:59243 - 63620 "HINFO IN 9156655801114656930.4985833156107857953. udp 57 false 512" - - 0 4.002302478s
[ERROR] plugin/errors: 2 9156655801114656930.4985833156107857953. HINFO: read udp 10.244.151.132:52801->10.96.0.9:53: i/o timeout
[INFO] 127.0.0.1:58276 - 32757 "HINFO IN 9156655801114656930.4985833156107857953. udp 57 false 512" - - 0 2.000879759s
[ERROR] plugin/errors: 2 9156655801114656930.4985833156107857953. HINFO: read udp 10.244.151.132:51477->10.96.0.9:53: i/o timeout
[INFO] 127.0.0.1:60880 - 42081 "HINFO IN 9156655801114656930.4985833156107857953. udp 57 false 512" - - 0 2.000720577s
[ERROR] plugin/errors: 2 9156655801114656930.4985833156107857953. HINFO: read udp 10.244.151.132:53465->10.96.0.9:53: i/o timeout
[INFO] 127.0.0.1:35883 - 18883 "HINFO IN 9156655801114656930.4985833156107857953. udp 57 false 512" - - 0 2.002205095s
[ERROR] plugin/errors: 2 9156655801114656930.4985833156107857953. HINFO: read udp 10.244.151.132:55449->10.96.0.9:53: i/o timeout
[INFO] 127.0.0.1:58355 - 29032 "HINFO IN 9156655801114656930.4985833156107857953. udp 57 false 512" - - 0 2.001254107s

So apparently there is some communication issue going on here. Looking at the Talos issue tracker on GitHub, I found “Enabling forwardKubeDNSToHost results in i/o timeout”, which describes a similar issue with Cilium. (Note to myself: looking for other people with the same problem works.) Reading through that issue, at least for Cilium the problem seems to be related to eBPF - something I have also enabled for Calico. Unfortunately, there is no setting that sounds even vaguely similar to the one that fixed this for Cilium. The only thing I remembered was turning on Direct Server Return (DSR) when enabling eBPF, but changing that back to Tunnel did not fix the problem. For now, I’ve configured all the nodes to explicitly set forwardKubeDNSToHost to false so that I do not run into any issues when upgrading to Talos 1.8.x…
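For completeness, the relevant part of the machine configuration now looks roughly like this on every node - a sketch only; whether hostDNS itself stays enabled is my assumption, the important bit is the explicit false:

machine:
  features:
    hostDNS:
      enabled: true                  # hostDNS itself can stay on
      forwardKubeDNSToHost: false    # explicitly off until the eBPF interaction is understood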

Talos Cluster Lost Connectivity...

Why the nodes in my Talos cluster could not talk to each other anymore...