TuringPi

New Talos Nodes Not Responding

After my positive experiences with both the TuringPi and Talos, I finally decided to put my second TuringPi to work. This required some preparation on the hardware side, like ordering new RK1 modules and a 19" case for the two TuringPis, but also on the software side. The Talos cluster I am operating is still the initial one I had spun up, and of course my understanding of how to operate Talos was quite limited back then. I think the most severe mistake was using talosctl edit mc ... too often, instead of creating patch files to apply to all the nodes. So in preparation for adding another four nodes to that cluster, I decided to take a look at talhelper, which promises to help with a few shortcomings of the current talosctl-based approach. In general, transitioning from the current configuration to talhelper worked rather flawlessly, and I was able to apply all the generated configurations within a few hours. The main work was just comparing existing configurations with the ones generated by talhelper, then fixing things until they were identical.

Adding New Nodes

With all the new configuration files in place and the second TuringPi set up, it was time to add the new nodes one by one. So I entered the corresponding talosctl apply-config --insecure ... command for the first node, the configuration was applied, and then… nothing. The node would not show up in the list of Kubernetes nodes. Hmm… So I tried the obvious things: I restarted the node, I put it back into maintenance mode and re-applied the configuration. I looked at the configuration file. Nothing. If this hadn’t been Talos, I would have tried to SSH into the system, but that wasn’t an option. And because the node did not show up in Kubernetes, using the workaround described in How to SSH into Talos Linux didn’t work either.

After some more investigation, I found out that the node at least showed up in the member list of the Talos cluster, but it did not respond to any talosctl commands. For example, I wasn’t able to run talosctl logs kubelet to see if there were any obvious errors in the logs. After I had exhausted my ideas, I tried to set up a different node type, in this case a control-plane node, hoping that I might get different information out of that. As with the worker node, the control-plane node would not show up in Kubernetes, but it would at least appear as a learner in the etcd members list. However, it was stuck as a learner forever. Running out of ideas, I reached out to the Talos Slack as well as the TuringPi Discord server to hunt for ideas, but not much was coming back there. The fact that it was over the weekend probably didn’t help much either. Time to give up for the day.

The Next Morning

As so often, getting some distance from the problem helped me solve it. When I was looking at the configuration file the next morning, I noticed that I had increased the MTU size for all of the nodes to 3000. That was the upper limit that I could get to work on the first TuringPi node. Was it possible that this was the culprit? I quickly changed the MTU for the node that I wanted to add to the cluster, applied the configuration, and… it worked.

What’s Next?

So apparently, on the new TuringPi the nodes cannot communicate properly when the MTU is set to 3000, whereas on the old TuringPi they can. One possible explanation is that this is related to the newer TuringPi firmware that I have installed on the second TuringPi, which exposes the network switch to the BMC, allowing people to configure it via the BMC in the future.
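For reference, the MTU is just a field on the network interface in the Talos machine configuration, so reverting it is a small patch. A minimal sketch, assuming eth0 is the device name (the actual name on the RK1 modules may differ; talosctl get links shows it):

    machine:
      network:
        interfaces:
          - interface: eth0   # device name is an assumption, check with talosctl get links
            dhcp: true
            mtu: 1500         # back to the standard MTU instead of 3000

In my setup this would end up in the talhelper-generated configuration and get re-applied with talosctl apply-config; applying it directly with something like talosctl patch mc should work as well.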
Still haven’t made up my mind whether I should look into this further or just stick to the standard MTU to avoid future issues…
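For anyone running into something similar, here is a rough recap of the checks I went through before finding the MTU issue; the node IP below is a placeholder:

    # Does the node show up in Kubernetes at all?
    kubectl get nodes

    # Is the node at least visible as a member of the Talos cluster?
    talosctl get members

    # For a control-plane node: did it join etcd, or is it stuck as a learner?
    talosctl etcd members

    # Try to read service logs from the new node (this did not respond in my case)
    talosctl --nodes 10.0.0.41 logs kubelet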

Talos Cluster Lost Connectivity...

Why the nodes in my Talos cluster could not talk to each other anymore...