Last modified: 2011-03-13 18:06:32 UTC

Bug 5177 - A small enhancement to Special:Shortpages
Status: RESOLVED WONTFIX
Product: MediaWiki
Classification: Unclassified
Component: Special pages
Version: 1.6.x
Hardware: All All
Importance: Lowest enhancement with 1 vote
Target Milestone: ---
Assigned To: Shaun Gosse
URL: http://en.wikipedia.org/wiki/Special:...
Keywords:
Depends on:
Blocks:
Reported: 2006-03-06 02:09 UTC by Shaun Gosse
Modified: 2011-03-13 18:06 UTC (History)
2 users (show)

See Also:
Web browser: ---
Mobile Platform: ---
Assignee Huggle Beta Tester: ---


Attachments
Diff for patch (1.71 KB, patch)
2006-03-06 02:10 UTC, Shaun Gosse
New patch (1.83 KB, patch)
2006-03-07 15:15 UTC, Shaun Gosse

Description Shaun Gosse 2006-03-06 02:09:35 UTC
This is my first time using bugzilla, so if I did something wrong, let me know.
I wrote a small bit of code that I think will make Special:Shortpages more
useful. It checks the text of the pages and excludes those which are redirects,
soft redirects, links to the wiktionary definition (basically soft redirects),
or the template for copyvios. I've tested the code on my sandbox wiki, so I'm
fairly confident it's stable. The two problems with it that I know of are that
the count at the top is wrong ("Showing below up to 8 results starting with #1."
lists the number before filtering, so will actually be lower) and the related
problem that instead of having 1000 results, it has 1000 - results filtered. I'm
not entirely sure how to fix these yet, but I'll get there.
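
A minimal sketch of the two counting problems described above, assuming a toy
in-memory list in place of the real database. The names fetch_shortest and
is_wanted and the sample pages are hypothetical and not taken from the attached
patch. It shows that filtering after the result limit has been applied returns
fewer rows than the header claims, while filtering before the limit does not.

```python
def fetch_shortest(pages, limit):
    """Stand-in for the database query: return the `limit` shortest pages."""
    return sorted(pages, key=lambda p: len(p["text"]))[:limit]


def is_wanted(page):
    """Hypothetical text filter: drop any page that contains a template."""
    return "{{" not in page["text"]


pages = [
    {"title": "A", "text": "abc"},
    {"title": "B", "text": "{{wi}}"},
    {"title": "C", "text": "stub text"},
    {"title": "D", "text": "{{copyvio}}"},
    {"title": "E", "text": "a short page"},
    {"title": "F", "text": "another short page"},
]
LIMIT = 4

# Filter applied after the limit: the header would still say "up to 4 results",
# but only LIMIT minus the number of filtered rows are left.
post_filtered = [p for p in fetch_shortest(pages, LIMIT) if is_wanted(p)]

# Filter applied before the limit: the result set is filled up to LIMIT again.
pre_filtered = fetch_shortest([p for p in pages if is_wanted(p)], LIMIT)

print(len(post_filtered), [p["title"] for p in post_filtered])
print(len(pre_filtered), [p["title"] for p in pre_filtered])
```

In a real installation the "filter before the limit" step would have to live in
the SQL query itself, since loading every page into memory is not an option.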
Comment 1 Shaun Gosse 2006-03-06 02:10:35 UTC
Created attachment 1439 [details]
Diff for patch

I think this is the correct format.
Comment 2 Rob Church 2006-03-06 03:22:19 UTC
Performance murder.
Comment 3 Brion Vibber 2006-03-06 05:40:17 UTC
Excluding based on redirect status, if done, should be done based on the 
page_is_redirect field. Pulling from page text is expensive, and "#REDIRECT" is 
not the only possible form so it would not be correct.

If done, this should probably be done in the query rather than a filter 
afterwards. This may require changes to the indexes on the table to be done 
efficiently.

Excluding based on size, if done, should be done based on a cap to the 
page_len field in the query.

The filters for specific templates look a bit dodgy; also they're very specific to 
one web site and would be incorrect for everybody else in the world. If done, it 
may or may not be cheaper and more accurate to check the templatelinks 
table rather than loading page text out of indirected, compressed external 
storage.
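
A rough sketch of what filtering in the query could look like, using the
page_is_redirect and page_len fields named in this comment. It runs against an
in-memory SQLite table that only imitates a few columns of the MediaWiki page
table; the sample rows, the MAX_LEN cap, and the LIMIT value are assumptions for
illustration, and this is not the query Special:Shortpages actually runs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Simplified stand-in for the MediaWiki `page` table.
    CREATE TABLE page (
        page_id          INTEGER PRIMARY KEY,
        page_namespace   INTEGER NOT NULL,
        page_title       TEXT    NOT NULL,
        page_is_redirect INTEGER NOT NULL DEFAULT 0,
        page_len         INTEGER NOT NULL
    );
    INSERT INTO page VALUES
        (1, 0, 'Stub',         0, 15),
        (2, 0, 'Old_name',     1, 20),
        (3, 0, 'Long_article', 0, 5000),
        (4, 1, 'Stub',         0, 10),
        (5, 0, 'Tiny',         0, 4);
    """
)

MAX_LEN = 1024  # hypothetical size cap in bytes
LIMIT = 1000    # hypothetical result limit

# Redirects and oversized pages are excluded by the database itself,
# so no page text has to be loaded and the row count matches the limit.
rows = conn.execute(
    """
    SELECT page_title, page_len
    FROM page
    WHERE page_namespace = 0
      AND page_is_redirect = 0
      AND page_len <= ?
    ORDER BY page_len ASC
    LIMIT ?
    """,
    (MAX_LEN, LIMIT),
).fetchall()

print(rows)
```

Whether such a query is cheap on a large wiki depends on the indexes available,
which is the concern raised above.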
Comment 4 Shaun Gosse 2006-03-06 13:36:56 UTC
(In reply to comment #3)
> Excluding based on redirect status, if done, should be done based on the 
> page_is_redirect field. Pulling from page text is expensive, and "#REDIRECT" is 
> not the only possible form so it would not be correct.
> 
Yeah, that makes sense, I hadn't thought of that.
> If done, this should probably be done in the query rather than a filter 
> afterwards. This may require changes to the indexes on the table to be done 
> efficiently.
Hm, I'll see if I can find out how to do that in SQL, which I don't know that
well (or at all).
> Excluding based on size, if done, should be done based on a cap to the 
> page_len field in the query.
Okay, I'll put that, once I look up how to do that in MySQL.
> The filters for specific templates look a bit dodgy; also they're very specific to
> one web site and would be incorrect for everybody else in the world. If done, it
> may or may not be cheaper and more accurate to check the templatelinks 
> table rather than loading page text out of indirected, compressed external 
> storage.
Yeah, that's true, I have been thinking of Wikipedia. I've been thinking about
it, and I think it would probably be enough to exclude any page that includes a
template. That would be much more general, and I think that would help find the
right one.

> Performance murder.
Granted, I have no clue how long this would take to run, but since they are
cached for a long time, it would only matter once. I think it would be able to
run during an off-time, I don't think it'd take *too* long. It does take time to
grab the text, but all of these pages will be less than 1k, and the majority
will be around 15 bytes, so I don't think that would be overly time consuming,
at least if only run once in a while.
Comment 5 Rob Church 2006-03-06 13:47:16 UTC
"It does take time to grab the text, but all of these pages will be less than
1k, and the majority will be around 15 bytes"

Er, well, the SQL being run on Special:Shortpages is what's used to cache up the
content on larger installations like the Wikimedia cluster. So the cron job to
cache the things would hit all the pages, thus you're still looking at an
unpleasant load.

Clean implementation in MySQL is probably not too much of a problem, however, and
wouldn't be a bad thing; we can justify adding a short time to the execution by
the improved "algorithm" we'd be using and the more accurate result set.
Comment 6 Shaun Gosse 2006-03-07 15:15:14 UTC
Created attachment 1444 [details]
New patch

Okay, I realized that this script already excludes redirects, so I removed that
from my patch. Added limit to size in the SQL, and generalized script so all
pages with "{{" are excluded. Number problem as above is still present.
Comment 7 Shaun Gosse 2006-03-09 22:31:55 UTC
Anyone have any comments on my new version of the patch?
Comment 8 Rob Church 2006-04-01 23:16:44 UTC
(In reply to comment #7)
> Anyone have any comments on my new version of the patch?

Would still be too expensive to look at the text of each page like that. Consider
doing something with the templatelinks table to test for transclusions in the
page; this means your query is more efficient and doesn't pull redundant data
off the MySQL server in the first place.
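
One possible shape for the templatelinks approach suggested here, sketched
against in-memory SQLite tables that only imitate the page and templatelinks
tables. The sample rows and the LIMIT value are assumptions, and the LEFT JOIN
form is just one way to express "no transclusions"; it is not code from either
attached patch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Simplified stand-ins for the MediaWiki tables.
    CREATE TABLE page (
        page_id          INTEGER PRIMARY KEY,
        page_namespace   INTEGER NOT NULL,
        page_title       TEXT    NOT NULL,
        page_is_redirect INTEGER NOT NULL DEFAULT 0,
        page_len         INTEGER NOT NULL
    );
    CREATE TABLE templatelinks (
        tl_from      INTEGER NOT NULL,  -- page_id of the transcluding page
        tl_namespace INTEGER NOT NULL,
        tl_title     TEXT    NOT NULL
    );
    INSERT INTO page VALUES
        (1, 0, 'Stub',               0, 15),
        (2, 0, 'Wiktionary_pointer', 0, 12),
        (3, 0, 'Copyvio_page',       0, 30),
        (4, 0, 'Tiny',               0, 4);
    INSERT INTO templatelinks VALUES
        (2, 10, 'Wi'),
        (3, 10, 'Copyvio');
    """
)

LIMIT = 1000  # hypothetical result limit

# Anti-join: keep only pages with no matching templatelinks row,
# so page text never has to be fetched to detect a transclusion.
rows = conn.execute(
    """
    SELECT p.page_title, p.page_len
    FROM page AS p
    LEFT JOIN templatelinks AS tl ON tl.tl_from = p.page_id
    WHERE p.page_namespace = 0
      AND p.page_is_redirect = 0
      AND tl.tl_from IS NULL
    ORDER BY p.page_len ASC
    LIMIT ?
    """,
    (LIMIT,),
).fetchall()

print(rows)
```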
Comment 9 Rob Church 2006-05-21 00:42:45 UTC
Closing due to a number of reasons.

1. Inefficient, performance-killing patch.
2. There is an effective limit on page size provided by the limit on results
when the page is being cached.
3. That a page contains a template doesn't mean it's a "long" page.
