Migration to containerd and away from docker
Open, High, Public

Assigned To

Authored By

	akosiaris
	Apr 12 2024, 1:44 PM

Description

Per T269684, we need to move away from docker. In February 2024, the serviceops team announced the results of its evaluation of the candidate replacement engines. The results and criteria are documented in Kubernetes/CRE. The chosen container runtime engine is containerd. This task describes the plan for the migration and tracks the migration process itself.

Plan

containerd upgrade

We'll probably need a new profile, profile::containerd or similar.
Create proper cgroups config for containerd (https://kubernetes.io/docs/setup/production-environment/container-runtimes/#containerd)
Handle pulling of restricted images with containerd (provide authentication credentials etc)
Test integration with dragonfly/dfget
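
For the cgroups item above, the linked Kubernetes documentation describes enabling the systemd cgroup driver for runc in containerd's CRI plugin configuration. The following is a minimal Python sketch that renders such a fragment. It assumes the containerd 1.x configuration schema (version 2); the function name is hypothetical, and this is not the puppet template used for the actual migration.

```python
def render_containerd_cgroup_config(systemd_cgroup: bool = True) -> str:
    """Render a minimal containerd config fragment (config schema version 2).

    Only the runc cgroup driver setting is covered; a real config.toml
    would carry many more settings (registries, sandbox image, ...).
    """
    value = "true" if systemd_cgroup else "false"
    lines = [
        "version = 2",
        "",
        '[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]',
        '  runtime_type = "io.containerd.runc.v2"',
        "",
        '[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]',
        f"  SystemdCgroup = {value}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_containerd_cgroup_config(), end="")
```

The kubelet's cgroup driver has to match whatever is configured here, so both settings would normally be managed from the same place.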

For the actual upgrade

Run some workers (4 in codfw as a start) with bookworm, to surface potential OS related issues
1. wikikube-worker2085 (R440)
2. wikikube-worker2086 (R440)
3. wikikube-worker2088 (Supermicro)
4. wikikube-worker2089 (R450)
Create puppetization for the configuration required by kubernetes
Reimage some nodes with bookworm + containerd (>=1.6)
Upgrade all clusters to the newer containerd, rolling-reimage of nodes

nerdctl

Docker has a relatively user-friendly CLI; containerd doesn't. The ctr tool it ships with is a lower-level, albeit useful, tool. nerdctl is a CLI released by the containerd project that is compatible with the docker CLI.

Package nerdctl, probably utilizing our Upstream binaries policy to avoid the onus of having to build every single dependency.
Use puppet to install the package and populate a nerdctl configuration file (/etc/nerdctl/nerdctl.toml) that defaults to the k8s.io namespace.
Test and approve.
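
As an illustration of the configuration step above, here is a minimal Python sketch that writes a nerdctl.toml defaulting to the k8s.io namespace. It relies on nerdctl's `namespace` configuration key; the helper name is hypothetical, and in practice puppet would manage this file rather than a script.

```python
import tempfile
from pathlib import Path


def write_nerdctl_config(path: Path, namespace: str = "k8s.io") -> str:
    """Write a nerdctl.toml that only sets the default namespace.

    Returns the content that was written so callers can inspect it.
    """
    content = f'namespace = "{namespace}"\n'
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(content)
    return content


if __name__ == "__main__":
    # Write into a temporary directory instead of /etc/nerdctl for the demo.
    with tempfile.TemporaryDirectory() as tmp:
        target = Path(tmp) / "nerdctl" / "nerdctl.toml"
        print(write_nerdctl_config(target), end="")
```

With this default in place, running nerdctl without `--namespace` would list the containers created by the kubelet instead of an empty default namespace.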

crictl

Kubernetes builds crictl (cri-tools, https://github.com/kubernetes-sigs/cri-tools/tree/master) to interact with a CRI the way the kubelet would. In my initial tests, nerdctl did not completely honor all containerd configuration (like the registry mirrors and authentication we require for dragonfly), so I decided to also package crictl and have it installed on all nodes.

Kubelet (the above are a prereq)

Amend puppet to put the following 2 parameters behind a feature flag:

--container-runtime-endpoint=unix:///run/containerd/containerd.sock 
--container-runtime=remote
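
The feature flag logic can be sketched in Python as follows. This is only an illustration of the intended behaviour, not the puppet code: the function and parameter names are hypothetical, and it assumes a kubelet version (such as 1.23) that falls back to its built-in docker integration when neither flag is passed.

```python
from typing import List

# Socket path taken from the task description above.
CONTAINERD_SOCKET = "unix:///run/containerd/containerd.sock"


def kubelet_runtime_args(use_containerd: bool) -> List[str]:
    """Return the extra kubelet arguments for the selected container runtime.

    With the flag off, no runtime arguments are added and the kubelet keeps
    its default (docker) behaviour.
    """
    if not use_containerd:
        return []
    return [
        f"--container-runtime-endpoint={CONTAINERD_SOCKET}",
        "--container-runtime=remote",
    ]


if __name__ == "__main__":
    for flag in (False, True):
        print(flag, kubelet_runtime_args(flag))
```

Keeping both parameters behind a single flag means a node is either fully on docker or fully on containerd, never in between.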

Metrics

Replace kubelet_docker_operations_* with kubelet_runtime_operations_*
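
A small Python sketch of the rename, useful for example when rewriting dashboard or alert expressions in bulk. It assumes a plain prefix substitution is sufficient; the helper name is hypothetical, and each rewritten expression should still be checked against the metrics the kubelet actually exports.

```python
import re

# Match the old prefix only at the start of a metric name.
_DOCKER_OPS = re.compile(r"\bkubelet_docker_operations_")


def migrate_metric_names(expr: str) -> str:
    """Rewrite kubelet_docker_operations_* to kubelet_runtime_operations_*."""
    return _DOCKER_OPS.sub("kubelet_runtime_operations_", expr)


if __name__ == "__main__":
    print(migrate_metric_names("rate(kubelet_docker_operations_errors_total[5m])"))
```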

Log processing

Parsing of logs does not work properly with containerd nodes. Logs that usually have the k8s_docker_log_field_parsed tag don't have it anymore:

T377132: containerd logs are not properly parsed during ingestion to logstash
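
One likely factor (an assumption; this task does not state the root cause) is that containerd writes container logs in the CRI text format (`<timestamp> <stream> <P|F> <message>`), whereas docker's json-file driver writes one JSON object per line. The Python sketch below parses both formats into a common shape. The function name and the returned field names are hypothetical, and this is not the logstash pipeline configuration.

```python
import json
from typing import Dict, Optional


def parse_container_log_line(line: str) -> Optional[Dict[str, object]]:
    """Parse one container log line in docker json-file or CRI format.

    Returns None if the line matches neither format.
    """
    line = line.rstrip("\n")

    # docker json-file: {"log": "...", "stream": "stdout", "time": "..."}
    if line.startswith("{"):
        try:
            entry = json.loads(line)
        except ValueError:
            entry = None
        if isinstance(entry, dict) and {"log", "stream", "time"} <= entry.keys():
            return {
                "format": "docker-json",
                "time": entry["time"],
                "stream": entry["stream"],
                "partial": False,
                "message": str(entry["log"]).rstrip("\n"),
            }

    # CRI: "<RFC 3339 timestamp> <stdout|stderr> <P|F> <message>"
    parts = line.split(" ", 3)
    if len(parts) >= 3 and parts[1] in ("stdout", "stderr") and parts[2] in ("P", "F"):
        return {
            "format": "cri",
            "time": parts[0],
            "stream": parts[1],
            "partial": parts[2] == "P",  # "P" marks a partial (split) line
            "message": parts[3] if len(parts) == 4 else "",
        }
    return None


if __name__ == "__main__":
    print(parse_container_log_line('2024-10-18T19:47:00.123456789Z stdout F {"level":"info"}'))
```

Note that with the CRI format an application's own JSON payload ends up in the message part, so any structured-field parsing has to happen after the prefix is stripped.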

Things to do after all k8s nodes have been migrated off of docker

Remove puppet classes no longer in use (if there are any)
Ensure all profile::docker::engine related hiera keys are gone (as well as profile::kubernetes::node::docker_kubernetes_user_password)

How to migrate to containerd

https://wikitech.wikimedia.org/wiki/Kubernetes/Administration/containerd_migration

Details

Subject	Repo	Branch	Lines +/-
k8s.pool-depool-node: Add support for multiple nodes	operations/cookbooks	master	+161 -106
Migrate wikikube-worker208[5689] to containerd	operations/puppet	production	+5 -16
Add a cookbook to roll-reimage stacked k8s control planes	operations/cookbooks	master	+305 -0
etcd::v3: Don't set trusted-ca-file if client-cert-auth is false	operations/puppet	production	+8 -0
etcd::v3: Ensure trusted-ca-file is not set on first puppet run with 3.4	operations/puppet	production	+2 -2
wikikube: Remove explicit container_runtime config	operations/puppet	production	+0 -2
wikikube: Prepare clusters for containerd workers	operations/puppet	production	+24 -9
containerd: Remove container log line length limit	operations/puppet	production	+4 -0
dragonfly::dfdaemon: Refactor docker integration	operations/puppet	production	+27 -23
dragonfly::dfdaemon: Enable by default when profile is included	operations/puppet	production	+1 -5
Remove role kubernetes::staging::worker_containerd	labs/private	master	+0 -1
cumin/aliases: Merge worker_containerd back to worker role	operations/puppet	production	+4 -4
kubernetes: Create profile::kubernetes::container_runtime	operations/puppet	production	+56 -82
Merge worker_containerd back to worker role	labs/private	master	+2 -1
cumin/aliases: Remove P{O:kubernetes::staging::worker}	operations/puppet	production	+1 -1
Migrate kubestage1004 to containerd	operations/puppet	production	+1 -53
Migrate kubestage1003 to containerd	operations/puppet	production	+7 -3
k8s/kubelet: Remove absent containerd specific systemd override	operations/puppet	production	+0 -7
k8s/kubelet: Make kubelet.service depend on container runtime	operations/puppet	production	+1 -1
k8s/kubelet: Make kubelet.service depend on container runtime	operations/puppet	production	+14 -3
Remove kubelet systemd unit dependency to docker.service	operations/debs/kubernetes	v1.23	+10 -3
wikikube-staging-codfw: Migrate kubestage2002 to containerd	operations/puppet	production	+1 -1
wikikube-staging-codfw: Migrate kubestage2002 to containerd	operations/puppet	production	+1 -6
kubernetes/staging: Add role master_stacked_containerd	operations/puppet	production	+60 -0
containerd: Enable unprivileged icmp and binding to ports < 1024	operations/puppet	production	+15 -0
cumin/aliases: Add containerd roles to wikikube aliases	operations/puppet	production	+4 -4
kubelet: Remove --pod-infra-container-image when using containerd	operations/puppet	production	+4 -3
kubelet/containerd: Fix registry authentication	operations/puppet	production	+17 -5
kubernetes::worker_containerd: Fix registry_auth hiera key	labs/private	master	+2 -2
kubelet/containerd: Fix runc config and kubelet systemd unit	operations/puppet	production	+12 -5
Initial commit of containerd puppet code	operations/puppet	production	+317 -8
kubernetes::worker_containerd: Fix registry_auth hiera key	labs/private	master	+2 -2
kubernetes::worker_containerd: Add registry 'secrets'	labs/private	master	+2 -0
Rename kubernets2009,2010,2035,2054, reimage to bookworm	operations/puppet	production	+10 -10

Related Objects

Status	Assigned	Task
Open	None	T341984 Update Kubernetes clusters to >1.25
Open	JMeybohm	T269684 [EPIC] Docker deprecation as a container runtime enginer for kubernetes.
Open	JMeybohm	T362408 Migration to containerd and away from docker
Resolved	JMeybohm	T375488 prometheus node exporter filesystem metrics exclude /var/lib/docker and /var/lib/kubelet
Resolved	JMeybohm	T377132 containerd logs are not properly parsed during ingestion to logstash
Open	kamila	T377857 Cookbook to roll-reimage k8s nodes
Open	None	T377875 Migrate dse-k8s cluster from docker to containerd
Open	None	T377876 Migrate wikikube-eqiad to containerd
Open	None	T377877 Migrate wikikube-codfw to containerd

Event Timeline


Change #1078417 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] wikikube-staging-codfw: Migrate kubestage2002 to containerd

https://gerrit.wikimedia.org/r/1078417

Change #1078417 merged by JMeybohm:

[operations/puppet@production] wikikube-staging-codfw: Migrate kubestage2002 to containerd

https://gerrit.wikimedia.org/r/1078417

Cookbook cookbooks.sre.hosts.reimage was started by jayme@cumin1002 for host kubestage2002.codfw.wmnet with OS bookworm

JMeybohm updated the task description. (Mon, Oct 7, 2:25 PM)

JMeybohm updated the task description.

Maintenance_bot removed a project: Patch-For-Review. (Mon, Oct 7, 2:30 PM)

JMeybohm updated the task description. (Mon, Oct 7, 2:36 PM)

Cookbook cookbooks.sre.hosts.reimage started by jayme@cumin1002 for host kubestage2002.codfw.wmnet with OS bookworm completed:

kubestage2002 (PASS)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Add puppet_version metadata to Debian installer
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bookworm OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202410071443_jayme_2779980_kubestage2002.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB

Change #1078447 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/debs/kubernetes@v1.23] Remove kubelet systemd unit dependency to docker.service

https://gerrit.wikimedia.org/r/1078447

gerritbot added a project: Patch-For-Review. (Mon, Oct 7, 4:19 PM)

Change #1078450 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] k8s/kubelet: Make kubelet.service depend on container runtime

https://gerrit.wikimedia.org/r/1078450

Change #1078451 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] k8s/kubelet: Remove absent containerd specific systemd override

https://gerrit.wikimedia.org/r/1078451

Change #1078447 merged by JMeybohm:

[operations/debs/kubernetes@v1.23] Remove kubelet systemd unit dependency to docker.service

https://gerrit.wikimedia.org/r/1078447

Mentioned in SAL (#wikimedia-operations) [2024-10-08T09:25:48Z] <jayme> imported kubernetes 1.23.14-4 to component/kubernetes123 (buster, bullseye, bookworm) - T362408

Change #1078450 merged by JMeybohm:

[operations/puppet@production] k8s/kubelet: Make kubelet.service depend on container runtime

https://gerrit.wikimedia.org/r/1078450

Change #1078626 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] k8s/kubelet: Make kubelet.service depend on container runtime

https://gerrit.wikimedia.org/r/1078626

Change #1078626 merged by JMeybohm:

[operations/puppet@production] k8s/kubelet: Make kubelet.service depend on container runtime

https://gerrit.wikimedia.org/r/1078626

Mentioned in SAL (#wikimedia-operations) [2024-10-08T10:36:33Z] <jayme> updated kubernetes 1.23.14-3 -> 1.23.14-4 on P:kubernetes::node - T362408

JMeybohm updated the task description. (Tue, Oct 8, 10:44 AM)

Cookbook cookbooks.sre.hosts.reimage was started by jayme@cumin1002 for host kubestage2002.codfw.wmnet with OS bookworm

Change #1078451 merged by JMeybohm:

[operations/puppet@production] k8s/kubelet: Remove absent containerd specific systemd override

https://gerrit.wikimedia.org/r/1078451

Cookbook cookbooks.sre.hosts.reimage started by jayme@cumin1002 for host kubestage2002.codfw.wmnet with OS bookworm completed:

kubestage2002 (PASS)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Add puppet_version metadata to Debian installer
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bookworm OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202410081109_jayme_2939229_kubestage2002.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB

Maintenance_bot removed a project: Patch-For-Review. (Tue, Oct 8, 11:30 AM)

Cookbook cookbooks.sre.hosts.reimage was started by jayme@cumin1002 for host kubestage2001.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by jayme@cumin1002 for host kubestage2001.codfw.wmnet with OS bookworm completed:

kubestage2001 (PASS)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Add puppet_version metadata to Debian installer
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bookworm OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202410081201_jayme_2952057_kubestage2001.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB

Change #1078677 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] Migrate kubestage1003 to containerd

https://gerrit.wikimedia.org/r/1078677

Change #1078678 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] Migrate kubestage1004 to containerd

https://gerrit.wikimedia.org/r/1078678

There are some hardware refreshes planned which should go Bookworm + containerd right away:

{T376171}
{T376185}
{T376170}

Change #1078677 merged by JMeybohm:

[operations/puppet@production] Migrate kubestage1003 to containerd

https://gerrit.wikimedia.org/r/1078677

Cookbook cookbooks.sre.hosts.reimage was started by jayme@cumin1002 for host kubestage1003.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by jayme@cumin1002 for host kubestage1003.eqiad.wmnet with OS bookworm completed:

kubestage1003 (PASS)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Add puppet_version metadata to Debian installer
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bookworm OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202410100933_jayme_3362138_kubestage1003.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB

Change #1078678 merged by JMeybohm:

[operations/puppet@production] Migrate kubestage1004 to containerd

https://gerrit.wikimedia.org/r/1078678

Cookbook cookbooks.sre.hosts.reimage was started by jayme@cumin1002 for host kubestage1004.eqiad.wmnet with OS bookworm

In T362408#10216554, @JMeybohm wrote:

There are some hardware refreshes planned which should go Bookworm + containerd right away:

{T376171}

{T376185}

{T376170}

As well as expansions:

{T376307}
{T376665}

Maintenance_bot removed a project: Patch-For-Review. (Thu, Oct 10, 12:31 PM)

Cookbook cookbooks.sre.hosts.reimage started by jayme@cumin1002 for host kubestage1004.eqiad.wmnet with OS bookworm completed:

kubestage1004 (PASS)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Add puppet_version metadata to Debian installer
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bookworm OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202410101219_jayme_3392864_kubestage1004.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB

Change #1079276 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] cumin/aliases: Remove P{O:kubernetes::staging::worker}

https://gerrit.wikimedia.org/r/1079276

gerritbot added a project: Patch-For-Review. (Thu, Oct 10, 1:01 PM)

Change #1079276 merged by JMeybohm:

[operations/puppet@production] cumin/aliases: Remove P{O:kubernetes::staging::worker}

https://gerrit.wikimedia.org/r/1079276

Maintenance_bot removed a project: Patch-For-Review. (Thu, Oct 10, 1:30 PM)

Cookbook cookbooks.sre.hosts.reimage was started by jayme@cumin1002 for host kubestage1003.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by jayme@cumin1002 for host kubestage1003.eqiad.wmnet with OS bookworm completed:

kubestage1003 (PASS)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Add puppet_version metadata to Debian installer
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bookworm OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202410101623_jayme_3430289_kubestage1003.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB

Change #1079935 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] kubernetes: Create profile::kubernetes::container_runtime

https://gerrit.wikimedia.org/r/1079935

gerritbot added a project: Patch-For-Review. (Mon, Oct 14, 8:47 AM)

Change #1079955 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[labs/private@master] Merge worker_containerd values back into worker

https://gerrit.wikimedia.org/r/1079955

Change #1079955 merged by JMeybohm:

[labs/private@master] Merge worker_containerd back to worker role

https://gerrit.wikimedia.org/r/1079955

JMeybohm mentioned this in rLPRI68b0f377be2e: Merge worker_containerd back to worker role. (Mon, Oct 14, 10:03 AM)

Change #1079935 merged by JMeybohm:

[operations/puppet@production] kubernetes: Create profile::kubernetes::container_runtime

https://gerrit.wikimedia.org/r/1079935

Change #1079960 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] cumin/aliases: Merge worker_containerd back to worker role

https://gerrit.wikimedia.org/r/1079960

Change #1079960 merged by JMeybohm:

[operations/puppet@production] cumin/aliases: Merge worker_containerd back to worker role

https://gerrit.wikimedia.org/r/1079960

Change #1079961 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[labs/private@master] Remove role kubernetes::staging::worker_containerd

https://gerrit.wikimedia.org/r/1079961

Change #1079961 merged by JMeybohm:

[labs/private@master] Remove role kubernetes::staging::worker_containerd

https://gerrit.wikimedia.org/r/1079961

JMeybohm mentioned this in rLPRI45c64ec1774d: Remove role kubernetes::staging::worker_containerd. (Mon, Oct 14, 10:17 AM)

Change #1079970 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] wikikube: Prepare clusters for containerd workers

https://gerrit.wikimedia.org/r/1079970

JMeybohm updated the task description. (Mon, Oct 14, 11:25 AM)

Change #1080038 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] dragonfly::dfdaemon: Enable by default when profile is included

https://gerrit.wikimedia.org/r/1080038

Change #1080042 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] dragonfly::dfdaemon: Refactor docker integration

https://gerrit.wikimedia.org/r/1080042

Change #1080071 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] containerd: Remove container log line length limit

https://gerrit.wikimedia.org/r/1080071

Change #1080038 merged by JMeybohm:

[operations/puppet@production] dragonfly::dfdaemon: Enable by default when profile is included

https://gerrit.wikimedia.org/r/1080038

Change #1080042 merged by JMeybohm:

[operations/puppet@production] dragonfly::dfdaemon: Refactor docker integration

https://gerrit.wikimedia.org/r/1080042

Change #1080071 merged by JMeybohm:

[operations/puppet@production] containerd: Remove container log line length limit

https://gerrit.wikimedia.org/r/1080071

Change #1079970 merged by JMeybohm:

[operations/puppet@production] wikikube: Prepare clusters for containerd workers

https://gerrit.wikimedia.org/r/1079970

Change #1080554 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] wikikube-staging: Migrate control planes to containerd

https://gerrit.wikimedia.org/r/1080554

Change #1080554 merged by JMeybohm:

[operations/puppet@production] wikikube: Remove explicit container_runtime config

https://gerrit.wikimedia.org/r/1080554

Change #992629 had a related patch set uploaded (by JMeybohm; author: Mxmxchere):

[operations/puppet@production] etcd::v3: Don't set trusted-ca-file if client-cert-auth is false

https://gerrit.wikimedia.org/r/992629

Change #992629 merged by JMeybohm:

[operations/puppet@production] etcd::v3: Don't set trusted-ca-file if client-cert-auth is false

https://gerrit.wikimedia.org/r/992629

Change #1081224 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] etcd::v3: Ensure trusted-ca-file is not set on first puppet run with 3.4

https://gerrit.wikimedia.org/r/1081224

Change #1081377 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/cookbooks@master] Add a cookbook to roll-reimage stacked k8s control planes

https://gerrit.wikimedia.org/r/1081377

Change #1081224 merged by JMeybohm:

[operations/puppet@production] etcd::v3: Ensure trusted-ca-file is not set on first puppet run with 3.4

https://gerrit.wikimedia.org/r/1081224

Cookbook cookbooks.sre.hosts.reimage was started by jayme@cumin1002 for host kubestagemaster2005.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by jayme@cumin1002 for host kubestagemaster2005.codfw.wmnet with OS bookworm completed:

kubestagemaster2005 (PASS)
- Disabled Puppet
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via gnt-instance
- Host up (Debian installer)
- Add puppet_version metadata to Debian installer
- Set boot media to disk
- Host up (new fresh bookworm OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202410181121_jayme_1191485_kubestagemaster2005.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB

JMeybohm updated the task description. (Fri, Oct 18, 2:37 PM)

Cookbook cookbooks.sre.hosts.reimage was started by jayme@cumin1002 for host kubestagemaster2003.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by jayme@cumin1002 for host kubestagemaster2003.codfw.wmnet with OS bookworm completed:

kubestagemaster2003 (PASS)
- Disabled Puppet
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via gnt-instance
- Host up (Debian installer)
- Add puppet_version metadata to Debian installer
- Set boot media to disk
- Host up (new fresh bookworm OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202410181502_jayme_1220197_kubestagemaster2003.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by jayme@cumin1002 for host kubestagemaster2004.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by jayme@cumin1002 for host kubestagemaster2004.codfw.wmnet with OS bookworm completed:

kubestagemaster2004 (PASS)
- Disabled Puppet
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via gnt-instance
- Host up (Debian installer)
- Add puppet_version metadata to Debian installer
- Set boot media to disk
- Host up (new fresh bookworm OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202410181546_jayme_1220197_kubestagemaster2004.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by jayme@cumin1002 for host kubestagemaster2005.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by jayme@cumin1002 for host kubestagemaster2005.codfw.wmnet with OS bookworm completed:

kubestagemaster2005 (PASS)
- Disabled Puppet
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via gnt-instance
- Host up (Debian installer)
- Add puppet_version metadata to Debian installer
- Set boot media to disk
- Host up (new fresh bookworm OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202410181632_jayme_1220197_kubestagemaster2005.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB

JMeybohm closed subtask T377132: containerd logs are not properly parsed during ingestion to logstash as Resolved. Fri, Oct 18, 7:47 PM

Change #1081910 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] Migrate wikikube-worker208[5689] to containerd

https://gerrit.wikimedia.org/r/1081910

Cookbook cookbooks.sre.k8s.reimage-stacked-control-plane started by jayme@cumin1002 Reimaging k8s control planes of cluster staging-eqiad: containerd migration completed:

kubestagemaster1003 (PASS)
- Disabled Puppet
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via gnt-instance
- Host up (Debian installer)
- Add puppet_version metadata to Debian installer
- Set boot media to disk
- Host up (new fresh bookworm OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202410210826_jayme_1708770_kubestagemaster1003.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB

kubestagemaster1004 (PASS)
- Disabled Puppet
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via gnt-instance
- Host up (Debian installer)
- Add puppet_version metadata to Debian installer
- Set boot media to disk
- Host up (new fresh bookworm OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202410210906_jayme_1708770_kubestagemaster1004.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB

kubestagemaster1005 (PASS)
- Disabled Puppet
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via gnt-instance
- Host up (Debian installer)
- Add puppet_version metadata to Debian installer
- Set boot media to disk
- Host up (new fresh bookworm OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202410210946_jayme_1708770_kubestagemaster1005.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB

Change #1082191 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/cookbooks@master] k8s.pool-depool-node: Add support for multiple nodes

https://gerrit.wikimedia.org/r/1082191

Change #1081910 merged by JMeybohm:

[operations/puppet@production] Migrate wikikube-worker208[5689] to containerd

https://gerrit.wikimedia.org/r/1081910
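
For readers unfamiliar with the host shorthand in the change title, here is a minimal sketch of how wikikube-worker208[5689] maps to individual hosts. It assumes the bracket is read as a regex-style character class (one character per host); the helper name expand_host_pattern is hypothetical and is not part of Puppet, Cumin or Spicerack, which have their own, richer host selection syntax.

```python
import re


def expand_host_pattern(pattern: str) -> list[str]:
    """Expand one bracketed character class, e.g. 'host208[5689]'.

    Simplification: supports a single [...] group whose members are
    single characters; ranges such as [5-9] are not handled.
    """
    match = re.fullmatch(r"(.*)\[([^\]\[]+)\](.*)", pattern)
    if match is None:
        # No bracket group: the pattern already names a single host.
        return [pattern]
    prefix, choices, suffix = match.groups()
    return [f"{prefix}{char}{suffix}" for char in choices]


hosts = expand_host_pattern("wikikube-worker208[5689]")
for host in hosts:
    print(host)
```

Under that reading, the pattern covers the four codfw workers listed in the task description: wikikube-worker2085, 2086, 2088 and 2089.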

Cookbook cookbooks.sre.hosts.reimage was started by jayme@cumin1002 for host wikikube-worker2085.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage was started by jayme@cumin1002 for host wikikube-worker2086.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage was started by jayme@cumin1002 for host wikikube-worker2088.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage was started by jayme@cumin1002 for host wikikube-worker2089.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by jayme@cumin1002 for host wikikube-worker2088.codfw.wmnet with OS bookworm completed:

wikikube-worker2088 (PASS)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Add puppet_version metadata to Debian installer
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bookworm OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202410221227_jayme_1935630_wikikube-worker2088.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by jayme@cumin1002 for host wikikube-worker2085.codfw.wmnet with OS bookworm completed:

wikikube-worker2085 (PASS)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Add puppet_version metadata to Debian installer
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bookworm OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202410221231_jayme_1935361_wikikube-worker2085.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by jayme@cumin1002 for host wikikube-worker2086.codfw.wmnet with OS bookworm completed:

wikikube-worker2086 (PASS)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Add puppet_version metadata to Debian installer
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bookworm OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202410221234_jayme_1935371_wikikube-worker2086.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by jayme@cumin1002 for host wikikube-worker2089.codfw.wmnet with OS bookworm completed:

wikikube-worker2089 (PASS)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Add puppet_version metadata to Debian installer
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bookworm OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202410221237_jayme_1935895_wikikube-worker2089.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB

JMeybohm updated the task description. Tue, Oct 22, 3:59 PM

BTullis subscribed. Tue, Oct 22, 4:15 PM
