Last modified: 2011-03-13 18:06:32 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T7177, the corresponding Phabricator task for complete and up-to-date bug report information.

Bug 5177 - A small enhancement to Special:Shortpages


Summary:	A small enhancement to Special:Shortpages

Status:	RESOLVED WONTFIX

Product:	MediaWiki
Classification:	Unclassified
Component:	Special pages (Other open bugs)
Version:	1.6.x
Hardware:	All All

Importance:	Lowest enhancement with 1 vote (vote)
Target Milestone:	---
Assigned To:	Shaun Gosse

URL:	http://en.wikipedia.org/wiki/Special:...
Whiteboard:
Keywords:

Depends on:
Blocks:
	Show dependency tree / graph

Reported:	2006-03-06 02:09 UTC by Shaun Gosse
Modified:	2011-03-13 18:06 UTC (History)
CC List:	2 users (show)

See Also:
Web browser:	---
Mobile Platform:	---
Assignee Huggle Beta Tester:	---

Attachments
Diff for patch (1.71 KB, patch) 2006-03-06 02:10 UTC, Shaun Gosse	Details
New patch (1.83 KB, patch) 2006-03-07 15:15 UTC, Shaun Gosse	Details
Show Obsolete (2) Add an attachment (proposed patch, testcase, etc.)

Description Shaun Gosse 2006-03-06 02:09:35 UTC

This is my first time using bugzilla, so if I did something wrong, let me know.
I wrote a small bit of code that I think will make Special:Shortpages more
useful. It checks the text of the pages and excludes those which are redirects,
soft redirects, links to the wiktionary definition (basically soft redirects),
or the template for copyvios. I've tested the code on my sandbox wiki, so I'm
fairly confidant it's stable. The two problems with it that I know of are that
the count at the top is wrong ("Showing below up to 8 results starting with #1."
lists the number before filtering, so will actually be lower) and the related
problem that instead of having 1000 results, it has 1000 - results filtered. I'm
not entirely sure how to fix these yet, but I'll get there.

Comment 1 Shaun Gosse 2006-03-06 02:10:35 UTC

Created attachment 1439 [details]
Diff for patch

I think this is the correct format.

Comment 2 Rob Church 2006-03-06 03:22:19 UTC

Performance murder.

Comment 3 Brion Vibber 2006-03-06 05:40:17 UTC

Excluding based on redirect status, if done, should be done based on the 
page_is_redirect field. Pulling from page text is expensive, and "#REDIRECT" is 
not the only possible form so it would not be correct.

If done, this should probably be done in the query rather than a filter 
afterwards. This may require changes to the indexes on the table to be done 
efficiently.

Excluding based on size, if done, should be done based on a cap to the 
page_len field in the query.

The filters for specific templates look a bit dodgy; also they're very specific to 
one web site and would be incorrect for everybody else in the world. If done, it 
may or may not be cheaper and more accurate to check the templatelinks 
table rather than loading page text out of indirected, compressed external 
storage.

Comment 4 Shaun Gosse 2006-03-06 13:36:56 UTC

(In reply to comment #3)
> Excluding based on redirect status, if done, should be done based on the 
> page_is_redirect field. Pulling from page text is expensive, and "#REDIRECT" is 
> not the only possible form so it would not be correct.
> 
Yeah, that makes sense, I hadn't thought of that.
> If done, this should probably be done in the query rather than a filter 
> afterwards. This may require changes to the indexes on the table to be done 
> efficiently.
Hm, I'll see if I can find out how to do that in SQL, I don't know that well (or
at all).
> Excluding based on size, if done, should be done based on a cap to the 
> page_len field in the query.
Okay, I'll put that, once I look up how to do that in MySQL.
> The filters for specific templates look a bit dodgy; also they're very
specific to 
> one web site and would be incorrect for everybody else in the world. If done, it 
> may or may not be cheaper and more accurate to check the templatelinks 
> table rather than loading page text out of indirected, compressed external 
> storage.
Yeah, that's true, I have been thinking of wikipedia. I've been thinking about
it, and I think it would probably be enough to exclude any page that includes a
template, that would be much more general, and I think that would help find the
right one.

> Performance murder.
Granted, I have no clue how long this would take to run, but since they are
cached for a long time, it would only matter once. I think it would be able to
run during an off-time, I don't think it'd take *too* long. It does take time to
grab the text, but all of these pages will be less than 1k, and the majority
will be around 15 bytes, so I don't think that would be overly time consuming,
at least if only run once in a while.

Comment 5 Rob Church 2006-03-06 13:47:16 UTC

"It does take time to grab the text, but all of these pages will be less than
1k, and the majority will be around 15 bytes"

Er, well, the SQL being run on Special:Shortpages is what's used to cache up the
content on larger installations like the Wikimedia cluster. So the cron job to
cache the things would hit all the pages, thus you're still looking at an
unpleasant load.

Clean implementation in MySQL is probably not too much of a problem, however and
wouldn't be a bad thing; we can justify adding a short time to the execution by
the improved "algorithm" we'd be using and the more accurate result set.

Comment 6 Shaun Gosse 2006-03-07 15:15:14 UTC

Created attachment 1444 [details]
New patch

Okay, I realized that this script already excludes redirects, so I removed that
from my patch. Added limit to size in the SQL, and generalized script so all
pages with "{{" are excluded. Number problem as above is still present.

Comment 7 Shaun Gosse 2006-03-09 22:31:55 UTC

Anyone have any comments on my new version of the patch?

Comment 8 Rob Church 2006-04-01 23:16:44 UTC

(In reply to comment #7)
> Anyone have any comments on my new version of the patch?

Would still be too expensive to look at the text of each like that. Consider
doing something with the templatelinks table to test for transclusions in the
page; this means your query is more efficient and doesn't pull redundant data
off the MySQL server in the first place.

Comment 9 Rob Church 2006-05-21 00:42:45 UTC

Closing due to a number of reasons.

1. Inefficient, performance-killing patch.
2. There is an effective limit on page size provided by the limit on results
when the page is being cached.
3. That a page contains a template doesn't mean it's a "long" page.

Note You need to log in before you can comment on or make changes to this bug.

Wikimedia Bugzilla is closed!

Search

Personal tools

Navigation

Links