If a redlink is the last entry in a ''paragraph'' in German Wiktionary, the text of the next template is concatenated to the redlink's text, and the combination is put into the full-text search index as a single word. As a consequence, the entry is not found in full-text search when the redlink is used as the search text. For example, entering "Transportmedium" in the search field does not find the entry "Träger". Entering "TransportmediumUnterbegriffe" or "BadeanzugträgerBeispiele" as the search text does find it, as do "insource:/Transportmedium/" and "insource:/Badeanzugträger/".
see also: https://de.wiktionary.org/wiki/Wiktionary:Fragen_zum_Wiktionary#Wiktionary-Suche
Subject | Repo | Branch | Lines +/-
---|---|---|---
Preserve whitespace in search index text content | mediawiki/core | master | +80 -11
Status | Subtype | Assigned | Task
---|---|---|---
Resolved | | EBernhardson | T218001 Single name, search failure
Resolved | | EBernhardson | T195389 Text content of wiki page in search index can merge words making them unfindable.
Event Timeline
This looks like a problem in the step that converts the wikitext parser's html output into plain text. Will need to look a bit closer.
Good search: https://de.wiktionary.org/w/index.php?search=insource%3A%2FBadeanzugtr%C3%A4ger%2F&fulltext=1
Bad search: https://de.wiktionary.org/w/index.php?search=Badeanzugtr%C3%A4ger&fulltext=1
Example page: https://de.wiktionary.org/wiki/Tr%C3%A4ger?action=cirrusdump
Content of the 'text' field:
BadeanzugträgerBeispiele
Wikitext in that area:
:[6] [[Bildträger]], [[Datenträger]], [[Instrumententräger]], [[Objektträger]], [[Querträger]], [[Siebträger]], [[Schriftträger]], [[Tonträger]], [[Überweisungsträger]], [[Unterträger]]
:[1, 2] [[Badeanzugträger]]
{{Beispiele}}
:[1] Morgens unterhalb des Mount Everest: „Ruf doch mal den ''Träger!''“
The explanation boils down to this:
<dl> <dd>foo</dd> <dd>bar</dd> </dl> <p>baz</p>
When passed through HtmlFormatter::filterContent along with Sanitizer::stripAllTags, as is done to extract the content for the search index, the above returns foobarbaz, with no delimiters between the words. I'm not sure what a real solution to this problem is, but a hack already exists that adds spaces before <br> tags. My suggestion would be to expand this to a few other tags that imply whitespace between their content and the preceding content.
Adding <dd> and <p> to that list should do the trick, although I have been unable to reproduce this locally and had to use mwrepl on mwdebug1001. I will try to figure out how to apply that on mwdebug to verify before sending a patch to Gerrit.
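To illustrate the idea, here is a minimal Python sketch of the failure and of the proposed workaround. It is not the MediaWiki code: `strip_all_tags` and `strip_tags_with_spacing` are hypothetical helpers that only stand in for the tag-stripping step, the tag list (`br`, `dd`, `p`) is taken from the suggestion above, and the sample markup assumes there is no whitespace between the tags.

```python
import re

# Matches any HTML tag; a rough stand-in for the tag-stripping step.
TAG_RE = re.compile(r"<[^>]*>")

# Opening tags that should imply a word boundary (assumed list: br, dd, p).
# The \b keeps "<p" from matching "<pre".
SPACED_TAG_RE = re.compile(r"(<(?:br|dd|p)\b)", re.IGNORECASE)


def strip_all_tags(html: str) -> str:
    """Remove every tag and keep only the text between them."""
    return TAG_RE.sub("", html)


def strip_tags_with_spacing(html: str) -> str:
    """Insert a space before selected tags, strip tags, collapse whitespace."""
    spaced = SPACED_TAG_RE.sub(r" \1", html)
    return " ".join(strip_all_tags(spaced).split())


if __name__ == "__main__":
    sample = "<dl><dd>foo</dd><dd>bar</dd></dl><p>baz</p>"
    print(strip_all_tags(sample))           # words merge into one token
    print(strip_tags_with_spacing(sample))  # words stay separate
```

With the naive version the three words end up as a single token, which matches the BadeanzugträgerBeispiele entry seen in the 'text' field. Inserting a space before the listed tags keeps them apart, so a plain search for one of the words can match again.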
Change 459657 had a related patch set uploaded (by EBernhardson; owner: EBernhardson):
[mediawiki/core@master] Preserve whitespace in search index text content
Change 459657 merged by jenkins-bot:
[mediawiki/core@master] Preserve whitespace in search index text content