
JAllemandou (joal)
Data Engineer

User Details

User Since
Feb 11 2015, 6:02 PM (505 w, 6 d)
Availability
Available
IRC Nick
joal
LDAP User
Unknown
MediaWiki User
JAllemandou (WMF) [ Global Accounts ]

Recent Activity

Thu, Oct 17

JAllemandou added a comment to T370665: Handle Late-Arrived Events from Gobblin into Airflow triggered Refine.

While talking with @Ottomata, I realized that we can relatively easily monitor late-arrived events in Gobblin when writing _IMPORTED flags: if there is already a flag in the folder to be flagged, report late events instead of overwriting the flag.
As Gobblin has Prometheus metrics integration, it'll be possible to report those using Prometheus :)
This feels like a nice and easy approach.
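
The check described above can be sketched in a few lines. This is only an illustration under loud assumptions: a local directory stands in for the HDFS folder, a plain dict stands in for a Prometheus counter, and `flag_folder` is a hypothetical name (the real Gobblin code is Java and is not shown in this thread).

```python
from pathlib import Path

FLAG_NAME = "_IMPORTED"


def flag_folder(folder: Path, late_arrivals: dict) -> bool:
    """Write the _IMPORTED flag, or record a late arrival if it already exists.

    Returns True when the flag was newly written, and False when the folder
    was already flagged (data landed after the folder was declared done).
    `late_arrivals` is a stand-in for a metrics counter (assumption).
    """
    flag = folder / FLAG_NAME
    if flag.exists():
        # Already flagged: count the late arrival and leave the flag untouched.
        key = str(folder)
        late_arrivals[key] = late_arrivals.get(key, 0) + 1
        return False
    folder.mkdir(parents=True, exist_ok=True)
    flag.touch()
    return True
```

The point of returning early is that the existing flag is never rewritten, so downstream consumers see no change while the counter makes the late data visible.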

Thu, Oct 17, 2:11 PMData-Engineering (Q2 2024 October 1st - December 31th)
JAllemandou added a comment to T376144: Some Gobblin folders don't have `_IMPORTED` flags.

After a talk with @Antoine_Quhen and @Ottomata we've decided to simplify the Gobblin algorithm: we'll use what would have been the low-volume algorithm for all topics instead of just the low-volume ones.
This lets us simplify the code and cover cases such as volume-changing topics and manually partitioned topics, which would otherwise have been corner cases.
The downside of the chosen solution is that it puts us more at risk of late events (not explaining here, too complicated - please ask if you wish me to explain :).
We had in our plans to monitor and alert about late events anyhow, so we think it's worth going in this direction.

Thu, Oct 17, 2:04 PMData-Engineering (Q2 2024 October 1st - December 31th), Patch-For-Review
JAllemandou updated the task description for T351388: Make Airflow SparkSQL operator set fileoutputcommitter.algorithm.version=2 to avoid concurrent write issues.
Thu, Oct 17, 1:56 PMData-Platform-SRE (2024.10.19 - 2024.11.08), Patch-For-Review
JAllemandou renamed T351388: Make Airflow SparkSQL operator set fileoutputcommitter.algorithm.version=2 to avoid concurrent write issues from Globally configure spark to use fileoutputcommitter.algorithm.version=2 to avoid concurrent write issues to Make Airflow SparkSQL operator set fileoutputcommitter.algorithm.version=2 to avoid concurrent write issues.
Thu, Oct 17, 1:56 PMData-Platform-SRE (2024.10.19 - 2024.11.08), Patch-For-Review
JAllemandou added a comment to T351388: Make Airflow SparkSQL operator set fileoutputcommitter.algorithm.version=2 to avoid concurrent write issues.

News on this front: after talking with @Antoine_Quhen and @Ottomata, the preferred solution is to have our Airflow SparkSQL operator pass this flag instead of hard-coding it in the Spark config.
The difference is for users: if we put it in the global spark-conf, users can end up with corrupted (partial) data in the result folder they're writing to, and we prefer to avoid this.
It's less of a problem for production jobs as failed spark jobs will be automatically retried, or raise an error.
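
As a rough sketch of what "passed by the operator" could look like: the helper below merges an operator-level default with per-job settings and renders them as `--conf key=value` arguments. The function name and the merge policy (per-job settings win) are assumptions for illustration, not the actual operator code; only the property name comes from this task.

```python
# Operator-level default discussed in this task.
COMMITTER_V2 = {
    "spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version": "2",
}


def build_spark_conf_args(job_conf=None, defaults=COMMITTER_V2):
    """Render merged Spark settings as a flat list of --conf arguments.

    Hypothetical helper: defaults are applied first, then per-job settings
    override them, so nothing has to live in the global spark-conf.
    """
    merged = dict(defaults)
    merged.update(job_conf or {})
    args = []
    for key in sorted(merged):
        args += ["--conf", f"{key}={merged[key]}"]
    return args
```

With this shape, jobs launched through the operator get the committer setting, while users running Spark by hand keep the cluster defaults.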

Thu, Oct 17, 1:55 PMData-Platform-SRE (2024.10.19 - 2024.11.08), Patch-For-Review

Tue, Oct 15

JAllemandou added a comment to T374656: Figure whether we can lower the query planning time for wmf_dumps.wikitext_raw.

@JAllemandou - Sorry I just saw this ping. We can definitely look into this. Shall we create a new ticket?

Tue, Oct 15, 5:51 PMDumps 2.0 (Kanban Board)

Mon, Oct 14

JAllemandou added a comment to T351388: Make Airflow SparkSQL operator set fileoutputcommitter.algorithm.version=2 to avoid concurrent write issues.

actually we should have set the parameter by default on every job.

Do you mean always? Or just when backfilling or writing in parallel?

Mon, Oct 14, 7:43 AMData-Platform-SRE (2024.10.19 - 2024.11.08), Patch-For-Review
JAllemandou added a comment to T376882: 2024-10-10 Data Loss Incident - webrequest Hive table .

Thanks a lot for fixing the data-deletion checksum @xcollazo !

Mon, Oct 14, 7:20 AMData-Platform-SRE (2024.10.19 - 2024.11.08), Movement-Insights, Data-Engineering (Q2 2024 October 1st - December 31th)

Fri, Oct 11

JAllemandou updated subscribers of T374656: Figure whether we can lower the query planning time for wmf_dumps.wikitext_raw.

About the RPC storm: I have found some documentation about new settings we should use for our HDFS namenode: https://community.cloudera.com/t5/Community-Articles/Scaling-the-HDFS-NameNode-part-3-RPC-scalability-features/ta-p/246719

Fri, Oct 11, 8:51 AMDumps 2.0 (Kanban Board)
JAllemandou added a comment to T376882: 2024-10-10 Data Loss Incident - webrequest Hive table .

This morning I started to backfill hours with missing data and monitored jobs.
Since jobs were running in parallel, they experienced the "temporary folder deletion" issue.
I had 2 jobs running in parallel, and the one that finished later had failed tasks. However, both result folders have 257 files.
This means our solution of setting spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2 for the jobs works.

Fri, Oct 11, 8:32 AMData-Platform-SRE (2024.10.19 - 2024.11.08), Movement-Insights, Data-Engineering (Q2 2024 October 1st - December 31th)
JAllemandou added a comment to T373630: Investigate recent rise in Unique Devices.

Thank you for the great analysis and explanation @Hghani.

Fri, Oct 11, 8:17 AMData-Platform-SRE (2024.10.19 - 2024.11.08), Movement-Insights
JAllemandou added a comment to T351388: Make Airflow SparkSQL operator set fileoutputcommitter.algorithm.version=2 to avoid concurrent write issues.

In the discussion above I made a mistake: I stated that the jobs fail, but they don't. Instead they generate corrupted data, meaning only a portion of the full dataset is written (there is no data mix between jobs).
This wikitech doc explains the problem and says that spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version = 2 should be used if the job fails during backfill, but actually we should have set the parameter by default on every job.

Fri, Oct 11, 7:39 AMData-Platform-SRE (2024.10.19 - 2024.11.08), Patch-For-Review

Thu, Oct 10

JAllemandou added a comment to T376882: 2024-10-10 Data Loss Incident - webrequest Hive table .

No such pattern on raw data:

hdfs dfs -du -s -h /wmf/data/raw/webrequest/webrequest_text/year=2024/month=10/day=09/*
Thu, Oct 10, 10:46 AMData-Platform-SRE (2024.10.19 - 2024.11.08), Movement-Insights, Data-Engineering (Q2 2024 October 1st - December 31th)
JAllemandou added a comment to T376882: 2024-10-10 Data Loss Incident - webrequest Hive table .

The problem is visible on HDFS:

hdfs dfs -du -s -h /wmf/data/wmf/webrequest/webrequest_source=text/year=2024/month=10/day=9/*
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
47.6 G  /wmf/data/wmf/webrequest/webrequest_source=text/year=2024/month=10/day=9/hour=0
48.4 G  /wmf/data/wmf/webrequest/webrequest_source=text/year=2024/month=10/day=9/hour=1
61.7 G  /wmf/data/wmf/webrequest/webrequest_source=text/year=2024/month=10/day=9/hour=10
68.7 G  /wmf/data/wmf/webrequest/webrequest_source=text/year=2024/month=10/day=9/hour=11
72.6 G  /wmf/data/wmf/webrequest/webrequest_source=text/year=2024/month=10/day=9/hour=12
75.7 G  /wmf/data/wmf/webrequest/webrequest_source=text/year=2024/month=10/day=9/hour=13
75.0 G  /wmf/data/wmf/webrequest/webrequest_source=text/year=2024/month=10/day=9/hour=14
74.3 G  /wmf/data/wmf/webrequest/webrequest_source=text/year=2024/month=10/day=9/hour=15
72.1 G  /wmf/data/wmf/webrequest/webrequest_source=text/year=2024/month=10/day=9/hour=16
34.6 G  /wmf/data/wmf/webrequest/webrequest_source=text/year=2024/month=10/day=9/hour=17     <------
73.2 G  /wmf/data/wmf/webrequest/webrequest_source=text/year=2024/month=10/day=9/hour=18
73.2 G  /wmf/data/wmf/webrequest/webrequest_source=text/year=2024/month=10/day=9/hour=19
48.5 G  /wmf/data/wmf/webrequest/webrequest_source=text/year=2024/month=10/day=9/hour=2
68.8 G  /wmf/data/wmf/webrequest/webrequest_source=text/year=2024/month=10/day=9/hour=20
58.9 G  /wmf/data/wmf/webrequest/webrequest_source=text/year=2024/month=10/day=9/hour=21
50.4 G  /wmf/data/wmf/webrequest/webrequest_source=text/year=2024/month=10/day=9/hour=22
11.2 G  /wmf/data/wmf/webrequest/webrequest_source=text/year=2024/month=10/day=9/hour=23     <------
48.6 G  /wmf/data/wmf/webrequest/webrequest_source=text/year=2024/month=10/day=9/hour=3
48.1 G  /wmf/data/wmf/webrequest/webrequest_source=text/year=2024/month=10/day=9/hour=4
48.9 G  /wmf/data/wmf/webrequest/webrequest_source=text/year=2024/month=10/day=9/hour=5
29.0 G  /wmf/data/wmf/webrequest/webrequest_source=text/year=2024/month=10/day=9/hour=6      <------
56.6 G  /wmf/data/wmf/webrequest/webrequest_source=text/year=2024/month=10/day=9/hour=7
57.9 G  /wmf/data/wmf/webrequest/webrequest_source=text/year=2024/month=10/day=9/hour=8
59.7 G  /wmf/data/wmf/webrequest/webrequest_source=text/year=2024/month=10/day=9/hour=9
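
The undersized hours were spotted by eye here; a small script could do the same comparison. The sketch below is an assumption-laden illustration: it only understands lines shaped like the listing above (a size in G followed by a path ending in `hour=N`), and the 0.7 threshold against the mean of the adjacent hours is an arbitrary choice, not a value used by the team.

```python
import re

# Matches lines such as "47.6 G  /some/path/hour=0" (sizes in G only).
LINE_RE = re.compile(r"^\s*([\d.]+)\s+G\s+\S*hour=(\d+)")


def parse_sizes(du_output: str) -> dict:
    """Map hour -> size in GB, ignoring lines that do not look like du rows."""
    sizes = {}
    for line in du_output.splitlines():
        match = LINE_RE.match(line)
        if match:
            sizes[int(match.group(2))] = float(match.group(1))
    return sizes


def suspicious_hours(sizes: dict, ratio: float = 0.7) -> list:
    """Hours whose size is below `ratio` times the mean of adjacent hours."""
    flagged = []
    for hour in sorted(sizes):
        neighbours = [sizes[h] for h in (hour - 1, hour + 1) if h in sizes]
        if neighbours and sizes[hour] < ratio * sum(neighbours) / len(neighbours):
            flagged.append(hour)
    return flagged
```

Applied to the listing above, this flags hours 6, 17 and 23, the same three marked with arrows.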

My assumption is that the spark job has issues writing some files (we expect the job to write 256 data files plus one _SUCCESS file):

for h in $(seq 0 23); do
  hdfs dfs -count "/wmf/data/wmf/webrequest/webrequest_source=text/year=2024/month=10/day=9/hour=${h}"
done
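
To turn the loop's output into a list of incomplete folders, something like the sketch below could be used. It assumes the standard four-column `hdfs dfs -count` output (directory count, file count, content size, path); the function name is hypothetical, and the 257 comes from the expectation stated above.

```python
EXPECTED_FILES = 257  # 256 data files plus one _SUCCESS file


def incomplete_folders(count_output: str, expected: int = EXPECTED_FILES) -> list:
    """Return (path, file_count) for folders whose file count is not `expected`."""
    bad = []
    for line in count_output.splitlines():
        parts = line.split()
        # Keep only rows with three numeric columns followed by a path; this
        # also skips noise such as the "Picked up JAVA_TOOL_OPTIONS" banner.
        if len(parts) != 4 or not all(p.isdigit() for p in parts[:3]):
            continue
        _dirs, files, _size, path = parts
        if int(files) != expected:
            bad.append((path, int(files)))
    return bad
```

Feeding it the concatenated output of the loop gives the hours worth re-running.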
Thu, Oct 10, 10:31 AMData-Platform-SRE (2024.10.19 - 2024.11.08), Movement-Insights, Data-Engineering (Q2 2024 October 1st - December 31th)
JAllemandou updated the task description for T376882: 2024-10-10 Data Loss Incident - webrequest Hive table .
Thu, Oct 10, 10:14 AMData-Platform-SRE (2024.10.19 - 2024.11.08), Movement-Insights, Data-Engineering (Q2 2024 October 1st - December 31th)
JAllemandou triaged T376882: 2024-10-10 Data Loss Incident - webrequest Hive table as Unbreak Now! priority.
Thu, Oct 10, 9:55 AMData-Platform-SRE (2024.10.19 - 2024.11.08), Movement-Insights, Data-Engineering (Q2 2024 October 1st - December 31th)
JAllemandou created T376882: 2024-10-10 Data Loss Incident - webrequest Hive table .
Thu, Oct 10, 9:55 AMData-Platform-SRE (2024.10.19 - 2024.11.08), Movement-Insights, Data-Engineering (Q2 2024 October 1st - December 31th)
JAllemandou added a comment to T373630: Investigate recent rise in Unique Devices.

I think this task can be closed, the investigation is done. We're now in the "solving problem" state :)

Thu, Oct 10, 7:23 AMData-Platform-SRE (2024.10.19 - 2024.11.08), Movement-Insights
JAllemandou added a comment to T376118: Update druid config to automatically drop unused segments.

Should I set this value to true so that we delete all unused segments, or should we reduce the durationToRetain value to something lower than 90D?

Thu, Oct 10, 6:58 AMData-Platform-SRE (2024.10.19 - 2024.11.08), Data-Engineering (Q2 2024 October 1st - December 31th), Patch-For-Review

Tue, Oct 8

JAllemandou added a comment to T376144: Some Gobblin folders don't have `_IMPORTED` flags.

Ah! I had forgotten :) But this stream is not ingested in hadoop.
Anyhow, if you agree with the approach, I can implement it.

Tue, Oct 8, 4:41 PMData-Engineering (Q2 2024 October 1st - December 31th), Patch-For-Review
JAllemandou added a comment to T376144: Some Gobblin folders don't have `_IMPORTED` flags.

Our algorithm about flagging folders is working at topic-partition-level for high-volume topics, and at topic-level for low-volume topics. It doesn't really make sense to mix them.
My suggested solution would be to discard low-volume topic-partitions from high-volume topics. I think this would cover the issue.

Tue, Oct 8, 4:26 PMData-Engineering (Q2 2024 October 1st - December 31th), Patch-For-Review
JAllemandou created T376713: Frequent HdfsRpcQueueLength alerts.
Tue, Oct 8, 12:59 PMData-Platform-SRE (2024.10.19 - 2024.11.08)
JAllemandou added a comment to T376144: Some Gobblin folders don't have `_IMPORTED` flags.

I'm also taking this as an opportunity to refactor tests for this algorithm.

Tue, Oct 8, 8:50 AMData-Engineering (Q2 2024 October 1st - December 31th), Patch-For-Review
JAllemandou added a comment to T376144: Some Gobblin folders don't have `_IMPORTED` flags.

I've decided to implement solution number 2: updating the Gobblin algorithm that flags folders. This keeps the solution internal to Gobblin, instead of delegating to Airflow the responsibility of deciding whether a folder is done or not.

Tue, Oct 8, 8:49 AMData-Engineering (Q2 2024 October 1st - December 31th), Patch-For-Review

Wed, Oct 2

JAllemandou merged T296207: Review druid deep-storage making sure that old segments having been reindexed are deleted into T376118: Update druid config to automatically drop unused segments.
Wed, Oct 2, 9:41 AMData-Platform-SRE (2024.10.19 - 2024.11.08), Data-Engineering (Q2 2024 October 1st - December 31th), Patch-For-Review
JAllemandou merged task T296207: Review druid deep-storage making sure that old segments having been reindexed are deleted into T376118: Update druid config to automatically drop unused segments.
Wed, Oct 2, 9:41 AMData-Platform-SRE
JAllemandou added a comment to T376118: Update druid config to automatically drop unused segments.
Wed, Oct 2, 9:40 AMData-Platform-SRE (2024.10.19 - 2024.11.08), Data-Engineering (Q2 2024 October 1st - December 31th), Patch-For-Review

Tue, Oct 1

Antoine_Quhen awarded T376144: Some Gobblin folders don't have `_IMPORTED` flags a Like token.
Tue, Oct 1, 2:14 PMData-Engineering (Q2 2024 October 1st - December 31th), Patch-For-Review
JAllemandou created T376144: Some Gobblin folders don't have `_IMPORTED` flags.
Tue, Oct 1, 1:48 PMData-Engineering (Q2 2024 October 1st - December 31th), Patch-For-Review
JAllemandou added a comment to T376118: Update druid config to automatically drop unused segments.

We can certainly test the setting on the test cluster.
I think there is little benefit in having the same parameter for the druid-public cluster: there is no reindexing there, so there are no unused segments that should be deleted (we delete entire datasources in there).

Tue, Oct 1, 11:44 AMData-Platform-SRE (2024.10.19 - 2024.11.08), Data-Engineering (Q2 2024 October 1st - December 31th), Patch-For-Review
JAllemandou created T376118: Update druid config to automatically drop unused segments.
Tue, Oct 1, 10:08 AMData-Platform-SRE (2024.10.19 - 2024.11.08), Data-Engineering (Q2 2024 October 1st - December 31th), Patch-For-Review

Mon, Sep 30

JAllemandou created T376026: Update event-producing tools to overwrite `meta.dt`.
Mon, Sep 30, 11:57 AMEvent-Platform, Data-Engineering

Thu, Sep 26

JAllemandou added a comment to T370665: Handle Late-Arrived Events from Gobblin into Airflow triggered Refine.

Exploration on specific datasets:

  • mediawiki_page_content_change_v1 --> No timestamp-late-events, only gobblin/kafka hiccups
  • resource_purge --> Uses meta.dt, and there are late timestamps. We don't refine this table (see here) but not because of late events. I wonder if changing the behavior of event-gate to overwrite meta.dt would change anything for other downstream consumers.
  • mediawiki_revision-create --> Regular small number of late events (less than 10 per hour), with one spike at ~5k per hour for 4 hours on 2024-08-30 at 8am UTC. I think this is due to the fact that meta.dt is set to the revision timestamp, and some revision changes can occur for past revisions.
  • mediawiki_recentchange --> Regular number of late events (usually less than 10, with some bumps). I don't really know why this happens, probably for the same reason as revision-create.
Thu, Sep 26, 8:58 AMData-Engineering (Q2 2024 October 1st - December 31th)

Tue, Sep 24

JAllemandou added a comment to T370665: Handle Late-Arrived Events from Gobblin into Airflow triggered Refine.

After explaining my findings to the team yesterday, here are the next steps: I'm going to look at late events in a few streams to try to understand why we have them, and at the same time we are going to change the data event-publishing contract: event-publishing libraries should set the meta.dt field and use this field as the kafka-timestamp field, overriding the field value if it is already set by the producer.
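
A minimal sketch of that contract, under assumptions: the function name is hypothetical, the event is a plain dict, `meta.dt` is written as an ISO-8601 UTC string with millisecond precision, and the caller is expected to hand the returned value to its Kafka client as the record timestamp. None of this is taken from the real event-publishing libraries.

```python
from datetime import datetime, timezone
from typing import Optional


def stamp_event(event: dict, now: Optional[datetime] = None) -> int:
    """Overwrite meta.dt with the publish time and return it as epoch millis.

    Any meta.dt already set by the producer is replaced, so the event field
    and the Kafka timestamp always agree. `now` must be timezone-aware UTC.
    """
    now = now or datetime.now(timezone.utc)
    millis = now.microsecond // 1000
    meta = event.setdefault("meta", {})
    meta["dt"] = now.strftime("%Y-%m-%dT%H:%M:%S") + f".{millis:03d}Z"
    # Integer arithmetic avoids float rounding in the millisecond part.
    return int(now.timestamp()) * 1000 + millis
```

Because both values derive from the same `now`, an event can no longer carry a `meta.dt` that is hours older than the time Kafka recorded for it.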

Tue, Sep 24, 7:39 AMData-Engineering (Q2 2024 October 1st - December 31th)
JAllemandou moved T370665: Handle Late-Arrived Events from Gobblin into Airflow triggered Refine from In Review to In progress on the Data-Engineering (Q1 2024 July 1st - September 30th) board.
Tue, Sep 24, 7:36 AMData-Engineering (Q2 2024 October 1st - December 31th)
JAllemandou moved T368788: 8 new wikis missing from mediawiki_history from In Review to Ready to Deploy on the Data-Engineering (Q1 2024 July 1st - September 30th) board.
Tue, Sep 24, 7:36 AMData-Engineering (Q2 2024 October 1st - December 31th), Data-Platform

Sep 20 2024

JAllemandou added a comment to T370665: Handle Late-Arrived Events from Gobblin into Airflow triggered Refine.

Second round of analysis: https://phabricator.wikimedia.org/P69381

Sep 20 2024, 1:59 PMData-Engineering (Q2 2024 October 1st - December 31th)
JAllemandou created P69381 Gobblin late event analysis - Second round.
Sep 20 2024, 1:56 PMData-Engineering

Sep 18 2024

JAllemandou updated subscribers of T348774: [Maintenance] Add a deletion job for `hdfs_usage` data.

Ping @lbowmaker , we should prioritize this if we can :)

Sep 18 2024, 3:48 PMData-Engineering

Sep 17 2024

JAllemandou added a comment to T374656: Figure whether we can lower the query planning time for wmf_dumps.wikitext_raw.

Great finding @xcollazo :)

Sep 17 2024, 6:30 AMDumps 2.0 (Kanban Board)
JAllemandou added a comment to T368788: 8 new wikis missing from mediawiki_history.

Patch sent (see above). One wiki was already added to the list (btmwiki), and another was not yet present in the databases (cywikivoyage).

Apologies, it should say cswikivoyage rather than cywikivoyage. I have corrected my typo in the ticket description.

Sep 17 2024, 6:29 AMData-Engineering (Q2 2024 October 1st - December 31th), Data-Platform

Sep 16 2024

JAllemandou added a comment to T368788: 8 new wikis missing from mediawiki_history.

Patch sent (see above). One wiki was already added to the list (btmwiki), and another was not yet present in the databases (cywikivoyage).

Sep 16 2024, 1:32 PMData-Engineering (Q2 2024 October 1st - December 31th), Data-Platform
JAllemandou moved T368788: 8 new wikis missing from mediawiki_history from Next Up to In Review on the Data-Engineering (Q1 2024 July 1st - September 30th) board.
Sep 16 2024, 1:31 PMData-Engineering (Q2 2024 October 1st - December 31th), Data-Platform
JAllemandou claimed T368788: 8 new wikis missing from mediawiki_history.
Sep 16 2024, 1:31 PMData-Engineering (Q2 2024 October 1st - December 31th), Data-Platform
JAllemandou moved T370665: Handle Late-Arrived Events from Gobblin into Airflow triggered Refine from In progress to In Review on the Data-Engineering (Q1 2024 July 1st - September 30th) board.
Sep 16 2024, 1:31 PMData-Engineering (Q2 2024 October 1st - December 31th)
JAllemandou added a comment to T370665: Handle Late-Arrived Events from Gobblin into Airflow triggered Refine.

I ran an analysis over the /wmf/data/raw/event folder (event-platform events not yet refined) on Friday 13th of September.
The code and walk-through results can be found in this phab paste.

Sep 16 2024, 8:10 AMData-Engineering (Q2 2024 October 1st - December 31th)
JAllemandou edited P69127 Late gobblin events analysis.
Sep 16 2024, 8:03 AMData-Engineering
JAllemandou edited P69127 Late gobblin events analysis.
Sep 16 2024, 8:03 AMData-Engineering
JAllemandou created P69127 Late gobblin events analysis.
Sep 16 2024, 7:40 AMData-Engineering

Sep 13 2024

JAllemandou awarded T374656: Figure whether we can lower the query planning time for wmf_dumps.wikitext_raw a Hungry Hippo token.
Sep 13 2024, 10:05 AMDumps 2.0 (Kanban Board)
JAllemandou added a comment to T374637: Decide how to make datasets owned by analytics-search-users also readable by analytics-privatedata-users.

The reason we originally started using different users/groups was to silo permissions.
I think it's still good to have different write and read abilities: only search-user can write to these datasets, but we wish all private-data-users to be able to read.

Sep 13 2024, 10:04 AMDiscovery-Search (Current work), Data-Engineering
JAllemandou added a comment to T306896: Integrate Spark with DataHub with lineage.

We had set the name with date information on purpose, to facilitate identifying tasks in Yarn (there is a bug, hourly tasks are named with day data, and this should be corrected).
I understand it would make things better for lineage to name them without the date-moving part. But looking in the Datahub spark doc, I found that we can set the flow name explicitly using spark.datahub.flow_name.
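
To illustrate the idea of keeping both names, here is a small sketch. The naming scheme (`dag__task__date`) and the helper are hypothetical; `spark.app.name` is the standard Spark property and `spark.datahub.flow_name` is the setting mentioned above.

```python
def spark_conf_for_lineage(dag_id: str, task_id: str, run_label: str) -> dict:
    """Keep a dated app name for Yarn while giving DataHub a stable flow name."""
    return {
        # Dated name, handy when looking for one specific run in Yarn.
        "spark.app.name": f"{dag_id}__{task_id}__{run_label}",
        # Date-free name, so lineage is not split into one flow per run.
        "spark.datahub.flow_name": f"{dag_id}__{task_id}",
    }
```

Two runs of the same task then differ in `spark.app.name` but share a single `spark.datahub.flow_name`.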

Sep 13 2024, 8:17 AMData-Engineering (Q2 2024 October 1st - December 31th), Patch-For-Review, Data-Catalog, Data Pipelines

Sep 12 2024

JAllemandou moved T370665: Handle Late-Arrived Events from Gobblin into Airflow triggered Refine from Next Up to In progress on the Data-Engineering (Q1 2024 July 1st - September 30th) board.
Sep 12 2024, 4:52 PMData-Engineering (Q2 2024 October 1st - December 31th)
JAllemandou claimed T370665: Handle Late-Arrived Events from Gobblin into Airflow triggered Refine.
Sep 12 2024, 4:51 PMData-Engineering (Q2 2024 October 1st - December 31th)

Jun 20 2024

JAllemandou updated the task description for T346611: [JVM Stewardship] To be discussed: SDK Man.
Jun 20 2024, 11:33 AMPatch-For-Review, Java-Scala-Standardization

Jun 17 2024

JAllemandou added a comment to T365594: Include geocoded subdivision ISO code in webrequest table.

Here we are! We have a new subdivision_code map key in the geocoded_data field of our wmf.webrequest table :)

Jun 17 2024, 3:42 PMprobenet, Data-Engineering
JAllemandou added a comment to T355920: DISCUSS: Relocate Generated Data Platform (née AQS) test/dev tables?.

How is the aqs_testing keyspace used? How often is it used?

Jun 17 2024, 9:12 AMCassandra

Jun 12 2024

JAllemandou awarded T358373: [Dumps 2] Reconciliation mechanism to detect and fetch missing/mismatched revisions a Unicorn! token.
Jun 12 2024, 3:22 PMPatch-For-Review, Dumps 2.0 (Kanban Board)

Jun 11 2024

JAllemandou added a comment to T366869: Add cx_translators to sqoop.

That's actually better! There is no specific need for us to sqoop into Data Lake if we can setup a pipeline directly. Parallelly, for another task, I am working on setting up an Airflow pipeline that loads the data from MySQL and does the necessary calculations (T362615; slack thread). If you think that's better, I can wait to figure that out and use the same for this as well. I believe an Airflow job will also give more control over the frequency.

Jun 11 2024, 5:15 PMProduct-Analytics (Kanban), Language-analytics
JAllemandou added a comment to T366869: Add cx_translators to sqoop.
Jun 11 2024, 8:46 AMProduct-Analytics (Kanban), Language-analytics

Jun 7 2024

JAllemandou added a comment to T366869: Add cx_translators to sqoop.

Hi @KCVelaga_WMF , I have quickly looked at the patch you sent and I have questions:

Jun 7 2024, 10:28 AMProduct-Analytics (Kanban), Language-analytics
JAllemandou merged T363451: Add job to create Wikidata partition to wmf.mediawiki_wikitext_history into T364045: [Bug?] Can't find wikidatawiki on wmf.mediawiki_wikitext_history.
Jun 7 2024, 9:58 AMWikidata, Wikidata Analytics, Data-Engineering (Q4 2024 April 1st - June 30th)
JAllemandou merged task T363451: Add job to create Wikidata partition to wmf.mediawiki_wikitext_history into T364045: [Bug?] Can't find wikidatawiki on wmf.mediawiki_wikitext_history.
Jun 7 2024, 9:58 AMData-Engineering
JAllemandou added a comment to T364045: [Bug?] Can't find wikidatawiki on wmf.mediawiki_wikitext_history.

The spark job finished, we have data (from superset):

SELECT
    revision_id,
    revision_timestamp
FROM wmf.mediawiki_wikitext_history
WHERE snapshot = '2024-04'
    AND wiki_db = 'wikidatawiki'
LIMIT 10
Jun 7 2024, 9:55 AMWikidata, Wikidata Analytics, Data-Engineering (Q4 2024 April 1st - June 30th)
JAllemandou placed T259924: HiveExtensions.convertToSchema does not properly convert arrays of structs up for grabs.
Jun 7 2024, 7:01 AMPatch-Needs-Improvement, Data-Engineering
JAllemandou moved T364045: [Bug?] Can't find wikidatawiki on wmf.mediawiki_wikitext_history from In Review to Done on the Data-Engineering (Q4 2024 April 1st - June 30th) board.
Jun 7 2024, 7:00 AMWikidata, Wikidata Analytics, Data-Engineering (Q4 2024 April 1st - June 30th)
JAllemandou added a comment to T364045: [Bug?] Can't find wikidatawiki on wmf.mediawiki_wikitext_history.

Hi Folks, I've been late in delivering this but it's landing as I write.
The spark job transforming wikidata-xml-history for snapshot 2024-04 is currently running. I expect it to finish either today or tomorrow.
The spark job is scheduled by a separate Airflow DAG and only computes Wikidata-xml-history, while the other DAG keeps running without wikidata to get data faster on other dumps.
Since the dumps release process has seen errors this month, it's bad timing for tests, but hopefully things will settle and we'll have a proper run next month.

Jun 7 2024, 6:59 AMWikidata, Wikidata Analytics, Data-Engineering (Q4 2024 April 1st - June 30th)

Jun 6 2024

JAllemandou awarded T364583: Consider what mechanism to use to make date deletion possible a Stroopwafel token.
Jun 6 2024, 10:20 AMData Products (Data Products Sprint 14), Commons-Impact-Metrics

Jun 5 2024

JAllemandou awarded T360922: [Status Store] [SPIKE] Investigate and document approach for Iceberg Sensors a Love token.
Jun 5 2024, 3:17 PMData-Engineering (Q2 2024 October 1st - December 31th), Dumps 2.0 (Kanban Board), Spike

Jun 3 2024

JAllemandou moved T366042: Clickstream datasets only reference 'other' link type, no 'link' from Radar (External Teams) to Done on the Data-Engineering (Q4 2024 April 1st - June 30th) board.
Jun 3 2024, 9:08 AMData-Engineering (Q4 2024 April 1st - June 30th)
JAllemandou moved T364045: [Bug?] Can't find wikidatawiki on wmf.mediawiki_wikitext_history from Done to In Review on the Data-Engineering (Q4 2024 April 1st - June 30th) board.
Jun 3 2024, 9:08 AMWikidata, Wikidata Analytics, Data-Engineering (Q4 2024 April 1st - June 30th)
JAllemandou moved T364045: [Bug?] Can't find wikidatawiki on wmf.mediawiki_wikitext_history from In Review to Done on the Data-Engineering (Q4 2024 April 1st - June 30th) board.
Jun 3 2024, 9:08 AMWikidata, Wikidata Analytics, Data-Engineering (Q4 2024 April 1st - June 30th)
JAllemandou moved T366042: Clickstream datasets only reference 'other' link type, no 'link' from In Review to Radar (External Teams) on the Data-Engineering (Q4 2024 April 1st - June 30th) board.
Jun 3 2024, 9:08 AMData-Engineering (Q4 2024 April 1st - June 30th)
JAllemandou added a comment to T365197: ISPDatabaseReader null pointer exception.

Task description updated with latest stack trace.

Jun 3 2024, 7:09 AMData-Platform-SRE (2024.05.27 - 2024.06.16), Patch-For-Review, Data-Engineering

May 30 2024

JAllemandou reopened T365197: ISPDatabaseReader null pointer exception as "Open".
May 30 2024, 4:54 PMData-Platform-SRE (2024.05.27 - 2024.06.16), Patch-For-Review, Data-Engineering

May 28 2024

JAllemandou moved T366042: Clickstream datasets only reference 'other' link type, no 'link' from Next Up to In Review on the Data-Engineering (Q4 2024 April 1st - June 30th) board.
May 28 2024, 9:47 AMData-Engineering (Q4 2024 April 1st - June 30th)
JAllemandou moved T355588: Modify ClickStreamBuilder pipeline to cope with pagelinks schema changes from In Review to Done on the Data-Engineering (Q4 2024 April 1st - June 30th) board.
May 28 2024, 9:47 AMData-Engineering (Q4 2024 April 1st - June 30th), Data Products
JAllemandou edited projects for T366042: Clickstream datasets only reference 'other' link type, no 'link', added: Data-Engineering (Q4 2024 April 1st - June 30th); removed Data-Engineering.
May 28 2024, 7:46 AMData-Engineering (Q4 2024 April 1st - June 30th)
JAllemandou claimed T366042: Clickstream datasets only reference 'other' link type, no 'link'.
May 28 2024, 7:46 AMData-Engineering (Q4 2024 April 1st - June 30th)
JAllemandou created T366042: Clickstream datasets only reference 'other' link type, no 'link'.
May 28 2024, 7:43 AMData-Engineering (Q4 2024 April 1st - June 30th)

May 27 2024

JAllemandou created T366004: Add page-title to the x_analytics header.
May 27 2024, 3:13 PMData-Engineering

May 22 2024

JAllemandou added a comment to T365321: Pageviews complete dumps have lots of rows with article name = '-'.

Hi @Audiodude, thank you for reporting your issue.

May 22 2024, 7:43 PMData Products, Data-Engineering, Analytics, Data-Engineering-Wikistats
JAllemandou added a comment to T361017: [SPIKE] Can we express Event Platform configs in Datasets Config?.

we had programmatic ways to clean airflow (values?) caches. My understanding is that the current approach would facilitate that.

@JAllemandou can you explain this one? I don't know what this means or what the problem is. Thank you!

May 22 2024, 6:35 PMData-Engineering (Q4 2024 April 1st - June 30th), Spike, Event-Platform

May 21 2024

JAllemandou added a comment to T364583: Consider what mechanism to use to make date deletion possible.

Super interesting that cassandra can now do a delete via Range! This could help us apply a deletion-policy if we ever decide to use one!

May 21 2024, 3:56 PMData Products (Data Products Sprint 14), Commons-Impact-Metrics
JAllemandou added a project to T365223: Fix generation of _IMPORTED flags by Gobblin: Data-Engineering.
May 21 2024, 9:17 AMData-Engineering, Data Pipelines, Patch-For-Review

May 15 2024

JAllemandou added a comment to T364893: an-worker1165.eqiad.wmnet and increased network activity resulting in page on May 13 2024.

Quick note on what the traffic this task refers to is.
Port 50010 on an-worker nodes is the Hadoop HDFS datanode data-transfer port.
This is the port that serves the HDFS data stored on that worker when it is requested. Most of the traffic on this port is expected to be coming either from same/other an-worker nodes, or from an-presto nodes (both are compute-nodes reading HDFS data).
The amount of data read varies a lot depending on jobs. One can see that the amount of data read from HDFS is very spiky: https://grafana.wikimedia.org/d/000000585/hadoop?orgId=1&var-hadoop_cluster=analytics-hadoop&from=now-7d&to=now&viewPanel=111

May 15 2024, 12:46 PMData-Platform-SRE (2024.05.06 - 2024.05.26), netops, Infrastructure-Foundations

May 13 2024

JAllemandou added a comment to T364045: [Bug?] Can't find wikidatawiki on wmf.mediawiki_wikitext_history.

Hi @XiaoXiao-WMF ,
the impact of this issue is that the Hive table wmf.mediawiki_wikitext_history currently does not contain the wikidatawiki project's data.
I sent a Slack message a while back asking if anyone was using this data, but made the mistake of only calling out Fabian, from the research team.
I estimate this issue will take between one and two weeks to solve once someone starts working on it.

May 13 2024, 1:44 PMWikidata, Wikidata Analytics, Data-Engineering (Q4 2024 April 1st - June 30th)

May 7 2024

JAllemandou added a comment to T364045: [Bug?] Can't find wikidatawiki on wmf.mediawiki_wikitext_history.

I'm sorry, what Luke expressed before can't be done, I messed up when looking at wikidata dumps generation: I looked at a subset of files that were generated the 8th, but forgot the rest of the files /facepalm/.
This means we need to build a parallel pipeline for wikidata dump import as planned originally.
Sorry for the false joy :S

May 7 2024, 2:52 PMWikidata, Wikidata Analytics, Data-Engineering (Q4 2024 April 1st - June 30th)

May 6 2024

JAllemandou added a comment to T349069: Design and implement a WDQS data-reload mechanism that sources its data from HDFS instead of the snapshot servers.

No objection :) I'd have gone for option 1 as it seems the easiest to maintain, but I agree, it means installing some stuff to the blazegraph machines.

May 6 2024, 1:34 PMDiscovery-Search (Current work), Data-Platform-SRE, Wikidata-Query-Service, Wikidata
JAllemandou added a comment to T349069: Design and implement a WDQS data-reload mechanism that sources its data from HDFS instead of the snapshot servers.

I would suggest using the hdfs-rsync tool to do this. It requires some setting up with puppet, but it is helpful because it copies only new files from folders (see https://github.com/wikimedia/operations-puppet/blob/1c4d67ff19372832484f7551dc49836be5806024/modules/hdfs_tools/manifests/hdfs_rsync_job.pp and https://github.com/wikimedia/operations-puppet/blob/1c4d67ff19372832484f7551dc49836be5806024/modules/dumps/manifests/web/fetches/stats.pp)
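
To illustrate the "copy only what is new" idea, here is a minimal sketch in Python working on local folders. It is not how hdfs-rsync is implemented and does not talk to HDFS; the function name `sync_new_files` and the rule "copy when the file is missing or its size differs" are assumptions made for the example.

```python
import shutil
from pathlib import Path


def sync_new_files(src: Path, dst: Path) -> list[str]:
    """Copy files from src to dst, skipping files already present with the same size.

    Illustrative only: a real sync tool would typically also compare
    modification times or checksums, and handle deletions.
    """
    copied = []
    for src_file in sorted(p for p in src.rglob("*") if p.is_file()):
        rel = src_file.relative_to(src)
        dst_file = dst / rel
        # Skip files that already exist at the destination with an identical size.
        if dst_file.exists() and dst_file.stat().st_size == src_file.stat().st_size:
            continue
        dst_file.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src_file, dst_file)
        copied.append(rel.as_posix())
    return copied
```

Running it twice in a row on an unchanged source folder copies nothing the second time, which is the property that makes a scheduled sync job cheap.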

May 6 2024, 9:20 AMDiscovery-Search (Current work), Data-Platform-SRE, Wikidata-Query-Service, Wikidata

Apr 28 2024

JAllemandou added a member for Analytics-Data-Problem: JAllemandou.
Apr 28 2024, 8:10 PM

Apr 26 2024

JAllemandou updated subscribers of T307540: [Airflow Migration] Migrate reportupdater jobs.

Reportupdater jobs have all been either deprecated or migrated to Airflow!
The report-updater jobs have been stopped, and data synchronization has been switched from the report-updater folders to the hadoop folders (updated by airflow jobs).
We can call report-updater deprecated for real, even if we still need to do some code cleaning.
Also, if anything goes wrong with the new system, we still have the data generated by reportupdater stored on HDFS and we can reset the old system.
This is a great step toward not using any scheduler other than airflow - Lots of kudos to @amastilovic for migrating the jobs, and to @BTullis for finalizing the operations on deprecating the tool.

Apr 26 2024, 11:22 AMData-Engineering, Data Pipelines
JAllemandou renamed T307540: [Airflow Migration] Migrate reportupdater jobs from [Airflow Migration] Migrate 1+ reportupdater jobs to [Airflow Migration] Migrate reportupdater jobs.
Apr 26 2024, 10:20 AMData-Engineering, Data Pipelines
JAllemandou moved T362699: Update converted reportupdater DAG queries to correct CSV options from Next Up to Done on the Data-Engineering (Q4 2024 April 1st - June 30th) board.
Apr 26 2024, 9:19 AMData-Engineering (Q4 2024 April 1st - June 30th)
JAllemandou moved T355588: Modify ClickStreamBuilder pipeline to cope with pagelinks schema changes from In progress to In Review on the Data-Engineering (Q4 2024 April 1st - June 30th) board.
Apr 26 2024, 9:19 AMData-Engineering (Q4 2024 April 1st - June 30th), Data Products

Apr 24 2024

JAllemandou awarded T359215: mediawiki_cirrussearch_request data is regularly late a Barnstar token.
Apr 24 2024, 6:26 PMPerformance Issue, Data-Platform

Apr 23 2024

JAllemandou claimed T355588: Modify ClickStreamBuilder pipeline to cope with pagelinks schema changes.
Apr 23 2024, 10:55 AMData-Engineering (Q4 2024 April 1st - June 30th), Data Products
JAllemandou moved T355588: Modify ClickStreamBuilder pipeline to cope with pagelinks schema changes from Next Up to In progress on the Data-Engineering (Q4 2024 April 1st - June 30th) board.
Apr 23 2024, 10:54 AMData-Engineering (Q4 2024 April 1st - June 30th), Data Products
JAllemandou added a comment to T362181: Encrypt Airflow connections to AQS Cassandra.

I'm not sure if the spark-cassandra-connector can read a Java Truststore on HDFS! I'd go for an automated deployment of the truststore on every cluster host. For the moment it'll be enough, as our prod jobs are launched from the cluster (skein). It would probably also be good to have the truststore deployed on stat machines, to allow for manual runs. This should be enough for now, until we move launchers away from skein to use k8s - We'll revisit at that time (ping @BTullis :)
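
As a rough sketch of what a job would need once the truststore sits at a local path on every host, the snippet below builds the SSL settings of the spark-cassandra-connector and turns them into `spark-submit` arguments. The helper names, the truststore path and the password value are hypothetical; only the three `spark.cassandra.connection.ssl.*` property names come from the connector's configuration reference.

```python
def cassandra_ssl_conf(truststore_path: str, truststore_password: str) -> dict[str, str]:
    """Spark settings enabling TLS to Cassandra with a truststore on local disk."""
    return {
        "spark.cassandra.connection.ssl.enabled": "true",
        # Local filesystem path, assumed to be deployed identically on every host.
        "spark.cassandra.connection.ssl.trustStore.path": truststore_path,
        "spark.cassandra.connection.ssl.trustStore.password": truststore_password,
    }


def to_spark_submit_args(conf: dict[str, str]) -> list[str]:
    """Flatten a conf mapping into repeated `--conf key=value` arguments."""
    args = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


# Hypothetical path and password, for illustration only.
example_args = to_spark_submit_args(
    cassandra_ssl_conf("/path/to/truststore.jks", "changeit")
)
```

Because the path is resolved on whichever host runs the driver and executors, the same settings would work for skein-launched jobs and for manual runs from stat machines, provided the file is deployed at the same location everywhere.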

Apr 23 2024, 10:44 AMData-Platform-SRE (2024.05.06 - 2024.05.26), Data-Engineering, Data-Persistence, Cassandra

Apr 18 2024

JAllemandou added a comment to T361499: [Maintenance] Resolve long launch times for canary events on Airflow (30mins in total).

Global execution times have been divided by 3 (10mins for 170 jobs). We are using a new launchers queue to launch small jobs and have scaled the airflow parallelization to 10 tasks. We can replicate this model to other jobs :)

Apr 18 2024, 6:24 PMData-Engineering (Q4 2024 April 1st - June 30th)