Data-Platform-SRE (2024.09.28 - 2024.10.18) · Milestone
Archived · Public

Members (6)

Details

Description

Milestone for Data Platform SRE work

Recent Activity

Mon, Oct 21

Maintenance_bot removed a project from T376406: Import the upstream ceph-csi-cephfs chart and adapt it to our needs: Patch-For-Review.
Mon, Oct 21, 9:31 AM · Data-Platform-SRE (2024.09.28 - 2024.10.18), Ceph
gerritbot added a comment to T376406: Import the upstream ceph-csi-cephfs chart and adapt it to our needs.

Change #1081911 merged by Brouberol:

[operations/puppet@production] ceph/server: fix typo in caps

https://gerrit.wikimedia.org/r/1081911

Mon, Oct 21, 8:35 AM · Data-Platform-SRE (2024.09.28 - 2024.10.18), Ceph
gerritbot added a project to T376406: Import the upstream ceph-csi-cephfs chart and adapt it to our needs: Patch-For-Review.
Mon, Oct 21, 8:33 AM · Data-Platform-SRE (2024.09.28 - 2024.10.18), Ceph
gerritbot added a comment to T376406: Import the upstream ceph-csi-cephfs chart and adapt it to our needs.

Change #1081911 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/puppet@production] ceph/server: fix typo in caps

https://gerrit.wikimedia.org/r/1081911

Mon, Oct 21, 8:33 AM · Data-Platform-SRE (2024.09.28 - 2024.10.18), Ceph
Maintenance_bot removed a project from T376406: Import the upstream ceph-csi-cephfs chart and adapt it to our needs: Patch-For-Review.
Mon, Oct 21, 8:30 AM · Data-Platform-SRE (2024.09.28 - 2024.10.18), Ceph
gerritbot added a comment to T376406: Import the upstream ceph-csi-cephfs chart and adapt it to our needs.

Change #1081905 merged by Brouberol:

[operations/puppet@production] ceph/server: fix the dse-k8s-csi-cephfs according to the CSI doc

https://gerrit.wikimedia.org/r/1081905

Mon, Oct 21, 8:25 AM · Data-Platform-SRE (2024.09.28 - 2024.10.18), Ceph
gerritbot added a comment to T376406: Import the upstream ceph-csi-cephfs chart and adapt it to our needs.

Change #1081903 merged by Brouberol:

[operations/deployment-charts@master] ceph-csi-cephs: fix RBAC by granting cluster-wide permisions on PVC and storageclasses

https://gerrit.wikimedia.org/r/1081903

Mon, Oct 21, 8:20 AM · Data-Platform-SRE (2024.09.28 - 2024.10.18), Ceph
gerritbot added a comment to T376406: Import the upstream ceph-csi-cephfs chart and adapt it to our needs.

Change #1081905 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/puppet@production] ceph/server: fix the dse-k8s-csi-cephfs according to the CSI doc

https://gerrit.wikimedia.org/r/1081905

Mon, Oct 21, 8:04 AM · Data-Platform-SRE (2024.09.28 - 2024.10.18), Ceph
gerritbot added a project to T376406: Import the upstream ceph-csi-cephfs chart and adapt it to our needs: Patch-For-Review.
Mon, Oct 21, 7:52 AM · Data-Platform-SRE (2024.09.28 - 2024.10.18), Ceph
gerritbot added a comment to T376406: Import the upstream ceph-csi-cephfs chart and adapt it to our needs.

Change #1081903 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/deployment-charts@master] ceph-csi-cephs: fix RBAC by granting cluster-wide permisions on PVC and storageclasses

https://gerrit.wikimedia.org/r/1081903

Mon, Oct 21, 7:52 AM · Data-Platform-SRE (2024.09.28 - 2024.10.18), Ceph

Fri, Oct 18

BTullis reopened T353788: Add kafka-stretch100[1-2] to the hadoop cluster as "Open".

Ah, there is a slight problem with an-worker1176. It won't take long to sort out.

image.png (421×819 px, 92 KB)

The partition table on /dev/sda looks like it was created for the operating system disk, which makes sense.
We can see here that /dev/sda1 is only about 1 GB in size, with 3.6 TB unused in /dev/sda2.

btullis@an-worker1176:~$ lsblk /dev/sda
NAME   MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
sda      8:0    0  3.6T  0 disk 
├─sda1   8:1    0  953M  0 part /var/lib/hadoop/data/m
└─sda2   8:2    0  3.6T  0 part

I think that we will have to modify the partition table and resize the file system.
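
As an illustration only (not part of the original comment), the check described above can be sketched in Python: parse the `lsblk` output shown and list the partitions that carry no mount point. The helper names `parse_size`, `parse_lsblk`, and `unmounted_partitions` are hypothetical, and the sketch assumes the default `lsblk` column layout seen in the listing.

```python
# Sizes printed by lsblk use binary multiples (K = 1024 bytes, and so on).
UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}

# The listing quoted in the comment above, copied verbatim.
LSBLK_OUTPUT = """\
NAME   MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
sda      8:0    0  3.6T  0 disk
├─sda1   8:1    0  953M  0 part /var/lib/hadoop/data/m
└─sda2   8:2    0  3.6T  0 part
"""


def parse_size(text):
    """Convert an lsblk size such as '953M' or '3.6T' to bytes."""
    if text[-1] in UNITS:
        return int(float(text[:-1]) * UNITS[text[-1]])
    return int(text)


def parse_lsblk(output):
    """Parse default lsblk output into a list of dicts, one per device."""
    rows = []
    for line in output.strip().splitlines()[1:]:  # skip the header row
        # Drop the tree-drawing characters that prefix child devices.
        fields = line.lstrip("├─└│ ").split()
        name, _majmin, _rm, size, _ro, dev_type = fields[:6]
        rows.append({
            "name": name,
            "size": parse_size(size),
            "type": dev_type,
            # The mount point column is empty for unmounted devices.
            "mountpoint": fields[6] if len(fields) > 6 else None,
        })
    return rows


def unmounted_partitions(rows):
    """Return partitions that have no mount point."""
    return [r for r in rows if r["type"] == "part" and r["mountpoint"] is None]


rows = parse_lsblk(LSBLK_OUTPUT)
for part in unmounted_partitions(rows):
    print(f"{part['name']}: {part['size']} bytes, not mounted")
```

Run against the listing above, this reports sda2 as the large partition with no mount point, while the mounted data directory sits on the much smaller sda1.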

Fri, Oct 18, 4:56 PM · Data-Platform-SRE (2024.09.28 - 2024.10.18)
BTullis archived Data-Platform-SRE (2024.09.28 - 2024.10.18).
Fri, Oct 18, 3:24 PM
BTullis moved T351388: Make Airflow SparkSQL operator set fileoutputcommitter.algorithm.version=2 to avoid concurrent write issues from Blocked/Waiting to Needs Review on the Data-Platform-SRE (2024.09.28 - 2024.10.18) board.
Fri, Oct 18, 2:47 PM · Data-Platform-SRE (2024.10.19 - 2024.11.08), Patch-For-Review
xcollazo added a comment to T351388: Make Airflow SparkSQL operator set fileoutputcommitter.algorithm.version=2 to avoid concurrent write issues.

...
The difference is for users: if we put it in the global spark-conf, users can end up with corrupted (partial) data in the result folder they're writing to, and we prefer to avoid this.
It's less of a problem for production jobs, as failed Spark jobs will be automatically retried or raise an error.
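
As an illustrative sketch (not taken from the linked merge request), the per-job approach can be contrasted with a global default as follows. The sketch assumes the full Hadoop property name is `mapreduce.fileoutputcommitter.algorithm.version` and that it is passed to Spark with the `spark.hadoop.` prefix; `GLOBAL_DEFAULTS`, `job_conf`, and `to_submit_args` are hypothetical names, and the default shown is made up for the example.

```python
# Hadoop properties are handed to Spark by prefixing them with "spark.hadoop.".
COMMITTER_KEY = "spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version"

# Hypothetical global defaults: the committer setting is deliberately absent,
# so ad hoc user jobs keep the default committer behaviour.
GLOBAL_DEFAULTS = {"spark.master": "yarn"}


def job_conf(overrides=None):
    """Merge per-job overrides on top of the global defaults."""
    conf = dict(GLOBAL_DEFAULTS)
    conf.update(overrides or {})
    return conf


def to_submit_args(conf):
    """Render a conf dict as spark-submit style '--conf key=value' arguments."""
    args = []
    for key in sorted(conf):
        args += ["--conf", f"{key}={conf[key]}"]
    return args


# A production job opts in explicitly; a user job inherits only the defaults.
production_job = job_conf({COMMITTER_KEY: "2"})
user_job = job_conf()

print(to_submit_args(production_job))
print(to_submit_args(user_job))
```

Keeping the override at the job level means only jobs that are retried or fail loudly take on the risk of partial output described above.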

Fri, Oct 18, 2:47 PM · Data-Platform-SRE (2024.10.19 - 2024.11.08), Patch-For-Review
BTullis renamed T351388: Make Airflow SparkSQL operator set fileoutputcommitter.algorithm.version=2 to avoid concurrent write issues from Make Airflow SaprkSQL operator set fileoutputcommitter.algorithm.version=2 to avoid concurrent write issues to Make Airflow SparkSQL operator set fileoutputcommitter.algorithm.version=2 to avoid concurrent write issues.
Fri, Oct 18, 2:47 PM · Data-Platform-SRE (2024.10.19 - 2024.11.08), Patch-For-Review
CodeReviewBot added a comment to T351388: Make Airflow SparkSQL operator set fileoutputcommitter.algorithm.version=2 to avoid concurrent write issues.

btullis opened https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/879

Fri, Oct 18, 2:42 PM · Data-Platform-SRE (2024.10.19 - 2024.11.08), Patch-For-Review
Maintenance_bot removed a project from T375714: Cleanup obsolete DAGs from Datahub: Patch-For-Review.
Fri, Oct 18, 2:29 PM · Data-Platform-SRE (2024.10.19 - 2024.11.08)
CodeReviewBot added a comment to T375714: Cleanup obsolete DAGs from Datahub.

stevemunene merged https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/872

Fri, Oct 18, 2:24 PM · Data-Platform-SRE (2024.10.19 - 2024.11.08)
gerritbot added a comment to T374948: Migrate airflow webservers to Kubernetes.

Change #1081261 merged by Bking:

[labs/private@master] analytics_test_cluster: add secret

https://gerrit.wikimedia.org/r/1081261

Fri, Oct 18, 1:55 PM · Data-Platform-SRE (2024.10.19 - 2024.11.08)
Jelto added a comment to T350793: move commons-query.wikimedia.org and query.wikidata.org to kubernetes.

The custom-config.json has been added and is now available on Wikikube:

Fri, Oct 18, 12:29 PM · Data-Platform-SRE (2024.10.19 - 2024.11.08), User-ItamarWMDE, Patch-For-Review, Wikidata, wmde-wikidata-tech, Wikidata Query UI, GitLab (Pipeline Services Migration🐤), collaboration-services
gerritbot added a comment to T351388: Make Airflow SparkSQL operator set fileoutputcommitter.algorithm.version=2 to avoid concurrent write issues.

Change #975006 abandoned by Btullis:

[operations/puppet@production] Set a non-default mapreduce file committer algorithm for spark

Reason:

See: https://phabricator.wikimedia.org/T351388#10237936

https://gerrit.wikimedia.org/r/975006

Fri, Oct 18, 12:03 PM · Data-Platform-SRE (2024.10.19 - 2024.11.08), Patch-For-Review
BTullis created T377571: dumps-exception-checker.service on dumpsdata1006 is repeatedly failing.
Fri, Oct 18, 11:48 AM · Data-Platform-SRE (2024.10.19 - 2024.11.08), Dumps-Generation, Data-Platform
BTullis closed T376977: Alert in need of triage: PrometheusMysqldExporterFailed (instance dbstore1009:13350) as Resolved.
Fri, Oct 18, 11:32 AM · Data-Platform-SRE (2024.09.28 - 2024.10.18), sre-alert-triage
gerritbot added a comment to T350793: move commons-query.wikimedia.org and query.wikidata.org to kubernetes.

Change #1081382 merged by jenkins-bot:

[operations/deployment-charts@master] wikidata-query-gui: fix volumeMount with subPath

https://gerrit.wikimedia.org/r/1081382

Fri, Oct 18, 10:58 AM · Data-Platform-SRE (2024.10.19 - 2024.11.08), User-ItamarWMDE, Patch-For-Review, Wikidata, wmde-wikidata-tech, Wikidata Query UI, GitLab (Pipeline Services Migration🐤), collaboration-services
BTullis closed T372284: Create PostgreSQL monitors as Resolved.
Fri, Oct 18, 10:56 AM · Data-Platform-SRE (2024.09.28 - 2024.10.18)
gerritbot added a comment to T350793: move commons-query.wikimedia.org and query.wikidata.org to kubernetes.

Change #1081382 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/deployment-charts@master] wikidata-query-gui: fix volumeMount with subPath

https://gerrit.wikimedia.org/r/1081382

Fri, Oct 18, 10:40 AM · Data-Platform-SRE (2024.10.19 - 2024.11.08), User-ItamarWMDE, Patch-For-Review, Wikidata, wmde-wikidata-tech, Wikidata Query UI, GitLab (Pipeline Services Migration🐤), collaboration-services
gerritbot added a comment to T350793: move commons-query.wikimedia.org and query.wikidata.org to kubernetes.

Change #1079466 merged by jenkins-bot:

[operations/deployment-charts@master] wikidata-query-gui: mount custom-config.json into pod

https://gerrit.wikimedia.org/r/1079466

Fri, Oct 18, 10:22 AM · Data-Platform-SRE (2024.10.19 - 2024.11.08), User-ItamarWMDE, Patch-For-Review, Wikidata, wmde-wikidata-tech, Wikidata Query UI, GitLab (Pipeline Services Migration🐤), collaboration-services
gerritbot added a comment to T350793: move commons-query.wikimedia.org and query.wikidata.org to kubernetes.

Change #1079465 merged by jenkins-bot:

[operations/deployment-charts@master] miscweb: add support to mount add confimaps

https://gerrit.wikimedia.org/r/1079465

Fri, Oct 18, 10:22 AM · Data-Platform-SRE (2024.10.19 - 2024.11.08), User-ItamarWMDE, Patch-For-Review, Wikidata, wmde-wikidata-tech, Wikidata Query UI, GitLab (Pipeline Services Migration🐤), collaboration-services
Stevemunene closed T353788: Add kafka-stretch100[1-2] to the hadoop cluster as Resolved.
Fri, Oct 18, 10:08 AM · Data-Platform-SRE (2024.09.28 - 2024.10.18)
Stevemunene updated the task description for T353788: Add kafka-stretch100[1-2] to the hadoop cluster.
Fri, Oct 18, 10:08 AM · Data-Platform-SRE (2024.09.28 - 2024.10.18)
Stevemunene closed T369582: Enable prometheus metrics on the cephosd cluster as Resolved.
Fri, Oct 18, 10:07 AM · Ceph, Data-Platform-SRE (2024.09.28 - 2024.10.18)
BTullis added a comment to T376657: Unable to find ingested tables in datahub.

When I do a search for discovery in DataHub I can see 17 hive tables returned.

image.png (682×867 px, 66 KB)

I exported the results as a CSV and the entity URLs are here:

Fri, Oct 18, 10:05 AM · Data-Platform-SRE (2024.10.19 - 2024.11.08), Discovery-Search (Current work), Data-Engineering
BTullis added a comment to T375716: Ensure the Airflow API can be reached out to from within Kubernetes and is authenticated.

How would Kerberos authentication play out when it comes to different Airflow instances communicating with each other? Would each instance get a Kerberos ticket automatically, and then use that to communicate to other instances?

Yes. Each Airflow instance will have a Kerberos principal associated with it anyway. The workers will need this principal in order to access HDFS, YARN, and Hive.
We then supply each instance with a keytab, which lets services authenticate automatically as their given Kerberos principal.

Fri, Oct 18, 9:16 AM · Data-Platform-SRE (2024.10.19 - 2024.11.08), Patch-For-Review
Stevemunene moved T353788: Add kafka-stretch100[1-2] to the hadoop cluster from In Progress to Done on the Data-Platform-SRE (2024.09.28 - 2024.10.18) board.

The hosts are now set back to Active on netbox and are part of the hadoop cluster, where they are catching up to the production hosts. Keeping an eye on this.

image.png (112 KB)

Fri, Oct 18, 8:37 AM · Data-Platform-SRE (2024.09.28 - 2024.10.18)
Stevemunene moved T369582: Enable prometheus metrics on the cephosd cluster from In Progress to Done on the Data-Platform-SRE (2024.09.28 - 2024.10.18) board.
Fri, Oct 18, 8:37 AM · Ceph, Data-Platform-SRE (2024.09.28 - 2024.10.18)

Thu, Oct 17

gerritbot added a project to T377490: Requesting access to airflow-analytics-product-admins for jebe: Patch-For-Review.
Thu, Oct 17, 11:05 PM · Data-Platform-SRE (2024.10.19 - 2024.11.08), SRE, SRE-Access-Requests
gerritbot added a comment to T377490: Requesting access to airflow-analytics-product-admins for jebe.

Change #1081285 had a related patch set uploaded (by Eevans; author: Eevans):

[operations/puppet@production] Add jebe to airflow-analytics-product-admins per access request

https://gerrit.wikimedia.org/r/1081285

Thu, Oct 17, 11:05 PM · Data-Platform-SRE (2024.10.19 - 2024.11.08), SRE, SRE-Access-Requests
Eevans updated the task description for T377490: Requesting access to airflow-analytics-product-admins for jebe.
Thu, Oct 17, 11:00 PM · Data-Platform-SRE (2024.10.19 - 2024.11.08), SRE, SRE-Access-Requests
odimitrijevic added a comment to T377490: Requesting access to airflow-analytics-product-admins for jebe.

Approved

Thu, Oct 17, 10:03 PM · Data-Platform-SRE (2024.10.19 - 2024.11.08), SRE, SRE-Access-Requests
gerritbot added a comment to T374948: Migrate airflow webservers to Kubernetes.

Change #1081268 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] airflow: make 'secret_key' configurable

https://gerrit.wikimedia.org/r/1081268

Thu, Oct 17, 9:55 PM · Data-Platform-SRE (2024.10.19 - 2024.11.08)
Eevans added a comment to T377490: Requesting access to airflow-analytics-product-admins for jebe.

I approve membership in airflow-analytics-product-admins

(Olja will need to approve as @JEbe-WMF's manager.)

Thu, Oct 17, 9:04 PM · Data-Platform-SRE (2024.10.19 - 2024.11.08), SRE, SRE-Access-Requests
Eevans updated the task description for T377490: Requesting access to airflow-analytics-product-admins for jebe.
Thu, Oct 17, 9:02 PM · Data-Platform-SRE (2024.10.19 - 2024.11.08), SRE, SRE-Access-Requests
bking updated the task description for T374948: Migrate airflow webservers to Kubernetes.
Thu, Oct 17, 8:45 PM · Data-Platform-SRE (2024.10.19 - 2024.11.08)
gerritbot added a project to T374948: Migrate airflow webservers to Kubernetes: Patch-For-Review.
Thu, Oct 17, 8:39 PM · Data-Platform-SRE (2024.10.19 - 2024.11.08)
gerritbot added a comment to T374948: Migrate airflow webservers to Kubernetes.

Change #1081261 had a related patch set uploaded (by Bking; author: Bking):

[labs/private@master] analytics_test_cluster: add secret

https://gerrit.wikimedia.org/r/1081261

Thu, Oct 17, 8:39 PM · Data-Platform-SRE (2024.10.19 - 2024.11.08)
Ottomata added a comment to T376882: 2024-10-10 Data Loss Incident - webrequest Hive table.

FYI, just updated Ops week page with docs on using Airflow CLI to rerun tasks:

Thu, Oct 17, 8:01 PM · Data-Platform-SRE (2024.10.19 - 2024.11.08), Movement-Insights, Data-Engineering (Q2 2024 October 1st - December 31th)
bking updated the task description for T374948: Migrate airflow webservers to Kubernetes.
Thu, Oct 17, 7:07 PM · Data-Platform-SRE (2024.10.19 - 2024.11.08)
mpopov added a comment to T377490: Requesting access to airflow-analytics-product-admins for jebe.

I approve membership in airflow-analytics-product-admins

Thu, Oct 17, 6:58 PM · Data-Platform-SRE (2024.10.19 - 2024.11.08), SRE, SRE-Access-Requests
Maintenance_bot removed a project from T374948: Migrate airflow webservers to Kubernetes: Patch-For-Review.
Thu, Oct 17, 6:31 PM · Data-Platform-SRE (2024.10.19 - 2024.11.08)
gerritbot added a comment to T374948: Migrate airflow webservers to Kubernetes.

Change #1081230 merged by jenkins-bot:

[operations/deployment-charts@master] airflow-analytic-test: disable remote logging

https://gerrit.wikimedia.org/r/1081230

Thu, Oct 17, 6:25 PM · Data-Platform-SRE (2024.10.19 - 2024.11.08)