Last modified: 2014-09-09 00:34:10 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T59084, the corresponding Phabricator task for complete and up-to-date bug report information.
Bug 57084 - Store the page_id of the moved page in log_page
Store the page_id of the moved page in log_page
Status: RESOLVED FIXED
Product: MediaWiki
Classification: Unclassified
Logging (Other open bugs)
1.23.0
All All
: Low minor (vote)
: ---
Assigned To: Nathan Larson
:
Depends on:
Blocks:
  Show dependency treegraph
 
Reported: 2013-11-14 23:21 UTC by Aaron Halfaker
Modified: 2014-09-09 00:34 UTC (History)
7 users (show)

See Also:
Web browser: ---
Mobile Platform: ---
Assignee Huggle Beta Tester: ---


Attachments

Description Aaron Halfaker 2013-11-14 23:21:23 UTC
When pages are moved (log_action="move" AND log_type="move"), the log_page field in the logging table contains the page_id of the *redirect page* that was created or zero if no redirect was created.  

This is painful and counter-intuitive.  I expected log_page to contain the page_id of the *moved page*. 

Proposed solution:

Store the page_id of the *moved page* in log_page for future moves and store the page_id of the newly created redirect in the serialized array stored in log_params.

Potential issues:

This may cause backwards compatibility issues if someone is relying on the redirect page_id appearing in log_page for a page move.
Comment 1 Matthew Flaschen 2013-11-14 23:29:23 UTC
It is counter-intuitive.

I'm not sure if we should break backwards compatibility.  I would normally want to keep it.  However, the page ID is 0 when suppressing the redirect, which means there's no consistent way to get even the source page ID (though you can get the source on regular moves, and people may be relying on this).

An option that should not break backwards compatibility is to add the old page ID in log_params (and somehow expose this to the move-specific sub-object in logevents).  See https://en.wikipedia.org/w/api.php?action=query&list=logevents&lelimit=200&letype=move&format=jsonfm
Comment 2 Aaron Halfaker 2013-11-15 01:28:24 UTC
I'd like to point out that adding the *moved page* id to log_params is sub-optimal since log_params can't be indexed in such a way that discovering move events for a particular page would be efficient.  

However, if there's a serious concern about backwards compatibility, I find this to be a good compromise.
Comment 3 Dario Taraborelli 2013-12-23 17:46:56 UTC
Have we considered the possibility of server-side instrumentation? We introduced ServerSideAccountCreation precisely for the same reasons: (1) generate a clean dataset for analysis; (2) add fields that are not natively available from MW (like mobile account creations); (3) avoid creating backward compatibility issues.

Obviously EventLogging doesn't address the problem of historical data and would only produce data that is not (yet?) publicly accessible.
Comment 4 Aaron Halfaker 2013-12-23 18:21:19 UTC
This sounds like an orthogonal proposal.  Regardless of whether we instrument an EventLogging capture, log_page should still capture the deleted page's ID.
Comment 5 Aaron Halfaker 2013-12-23 18:29:28 UTC
I mistakenly wrote the comment above thinking that I was commenting on bug #26122.  Yet, I feel like the argument still applies here, though to a lesser extent due to potential backwards compatibility issues.
Comment 6 Matthew Flaschen 2013-12-24 06:22:11 UTC
(In reply to comment #5)
> I mistakenly wrote the comment above thinking that I was commenting on bug
> #26122.  Yet, I feel like the argument still applies here, though to a lesser
> extent due to potential backwards compatibility issues.

I agree.  EventLogging is a useful tool for the WMF, but it is not a substitute for MediaWiki's built-in logging.

Built-in logging is important for third-party MW installs, on-wiki users (though obviously not everything is visible to all users), and the pages-logging.xml.gz dumps (though of course, what to dump requires consideration).

Backwards compatibility is important, but it's not the only thing.  Sometimes we have to break it to move forward.

As noted, I think it can be solved for built-in logging either way (with or without backwards compatibility); they have different advantages.
Comment 7 Aaron Halfaker 2013-12-27 15:53:56 UTC
I just wanted to leave another note because I've run into this issue again.  

I'm working on building a graph of the use of draft namespace[1] in the English Wikipedia over time.  This work is made substantially more difficult by the lack of a persistent identifier of the moved page associated with move events.  Since move events change the non-persistent identifier of a page (namespace & title), and draft articles are intended to be moved, I need to reprocess the entire log in order to keep track of these changing identifiers every time I need to update the graph's data.

1. https://en.wikipedia.org/wiki/Wikipedia:Draft_namespace
Comment 8 Nathan Larson 2014-07-31 16:00:53 UTC
Shall we add another field to the logging table? Here are some possibilities:

*log_from_page
*log_source_page
*log_old_page

*log_to_page
*log_destination_page
*log_target_page
*log_new_page

I was thinking we'd want to avoid a name like log_moved_from because we might want to also use that field for, e.g., merging histories or some other unforeseen use, besides page moves, in which there's both a "from" page and a "to" page.

log_new_page has the disadvantage that we might not always be creating a new page, so it would be misleading.

"source" could have the disadvantage of being confused with rc_source. "old" could have the disadvantage of being confused with the fields in the text table that start with old_; plus it suggests that the page_id perhaps has been supplanted with a new page_id, which is misleading.
Comment 9 Nathan Larson 2014-07-31 16:05:23 UTC
Suppose we go with the log_params option. How would that work? The way it's currently set up is as a serialized array like this:

array(2) {
  ["4::target"]=>
  string(3) "Foo"
  ["5::noredir"]=>
  string(1) "0"
}

Shall we add a "6::pageid" element?
Comment 10 Aaron Halfaker 2014-07-31 17:31:59 UTC
I was actually just about to request something similar.  I've been working on enumerating metadata about public actions in MediaWiki and flagging some data values that are (painfully) currently missing from the logs.  See https://meta.wikimedia.org/wiki/Research:Ideas/MediaWiki_events:_a_generalized_public_event_datasource

Adding a field to params seems like a fine option to me in this circumstance.
Comment 11 Nathan Larson 2014-07-31 23:39:05 UTC
The more I think about it, the more it seems like a good idea to put the "moved from" page ID in log_page, and put the redirect page ID in log_params. How much of a big deal would the backwards compatibility issues be? We might actually be doing the bot owners more of a favor than a disservice by making this change, since their implementations could become more straightforward.
Comment 12 Aaron Halfaker 2014-08-01 00:26:23 UTC
I couldn't agree more.  I can't imagine a use-case where the created redirect page benefits from a position in an indexed column (log_page).

Putting the page_id of the moved page in log_page fits with the principle of least surprise.
Comment 13 Matthew Flaschen 2014-08-01 00:50:37 UTC
(In reply to Aaron Halfaker from comment #12)
> I couldn't agree more.  I can't imagine a use-case where the created
> redirect page benefits from a position in an indexed column (log_page).

Also, as re backwards compatibility, if there is someone using the ID of the created redirect, they may be inconvenienced somewhat by the change.  However, they won't be completely out of luck, since it's still in log_params.

Also importantly, if you need the redirect page ID, it's possible to programatically determine where to find it, even if you have to deal with data sets spanning the transition.  I believe the algorithm would be:

1. If log_page is 0, it's old-style and there is no redirect.
2. If log_page is non-zero:
2a. If there is a log_redirect_page_id (or whatever) in the log_params, it's new-style, and that's obviously the redirect page ID.
2b. If there is no log_redirect_page_id, it's old-style, and log_page is the redirect page ID.
Comment 14 Gerrit Notification Bot 2014-08-01 02:32:38 UTC
Change 150969 had a related patch set uploaded by leucosticte:
Make log_params contain null revision rev_id

https://gerrit.wikimedia.org/r/150969
Comment 15 Nathan Larson 2014-08-01 02:36:45 UTC
This patch adds a "redirpageid" log parameter to the API that only shows up when there's a redirect page created. So, e.g., suppose "Foo" is moved to "Bar" with a redirect. You'd get something like this for list=recentchanges:

      <rc type="log" ns="0" title="Foo" pageid="15" revid="0" old_revid="0" rcid="15" logid="14" logtype="move" logaction="move">
        <move new_ns="0" new_title="Bar" nullrevid="28" redirpageid="16" />

But suppose the redirect is suppressed; then you'd get something like this:

      <rc type="log" ns="0" title="Foo" pageid="15" revid="0" old_revid="0" rcid="15" logid="14" logtype="move" logaction="move">
        <move new_ns="0" new_title="Bar" suppressedredirect="" nullrevid="28" />

Is that what you want, or do you want redirpageid to show up as, e.g., 0 if the redirect is suppressed?
Comment 16 Nathan Larson 2014-08-01 02:49:31 UTC
This patch also, incidentally, fixes bug 68950.
Comment 17 Aaron Halfaker 2014-08-01 12:39:16 UTC
When considering this change, I like to consider the EventLogging schema that we set up for tracking page moves on Wikimedia wikis given the lack of useful data in the logging table.  See https://meta.wikimedia.org/wiki/Schema:PageMove.  

In matching that schema to the proposed data structure:

* userId -- rc.userid
* userText -- rc.user
* pageId -- rc.pageid
* oldNamespace -- rc.ns
* oldTitle -- parsed(rc.title) (to remove namespace prefix)
* newNamespace -- rc.move.new_ns
* newTitle -- parsed(rc.move.new_title) (to remove namespace prefix)
* redirectId -- rc.move.redirpageid
* comment -- rc.comment

Is that mapping right?  If so, it makes sense to me.  

Also, re. zero for rc.move.redirpageid, we decided to go that way in the EventLogging schema since it follows the pattern of zero'd ID fields that is relatively consistent across the database (e.g. rev_user and rev_parent_id in revision).
Comment 18 Nathan Larson 2014-08-01 13:34:58 UTC
(In reply to Aaron Halfaker from comment #17)
Actually, the way the API is set up now, neither rc.title nor rc.move.new_title are parsed to remove the namespace prefix. This means that bots have to do that parsing. We could introduce a new result property for the title without namespace prefix. I'm not sure what we'd call it, though.

This latest patch uses zero for rc.move.redirpageid if the redirect is suppressed. So it would appear in the API as:

      <rc type="log" ns="0" title="Foo" pageid="15" revid="0" old_revid="0" rcid="15" logid="14" logtype="move" logaction="move">
        <move new_ns="0" new_title="Bar" suppressedredirect="" nullrevid="28" redirpageid="0"/>
Comment 19 Nathan Larson 2014-08-01 13:58:05 UTC
Brian Wolff says, "Im opposed to suddenly changing the definition of log_page in a breaking way without a compelling reason." Is there a compelling reason, and is the breakage a big deal? Should we ask at wikitech-l?
Comment 20 Aaron Halfaker 2014-08-01 14:21:30 UTC
Well, I think we do have a compelling reason.  I am also a fan of asking wikitech-l to help us imagine what might be broken by the change.
Comment 21 Aaron Halfaker 2014-08-01 14:27:10 UTC
(In reply to Nathan Larson from comment #18)
> We could introduce a new result property for the title without 
> namespace prefix. 

This would be useful.  I end up doing a lot of work to extract namespaces from full page names.  Every time I realize I'm going to need to do it, I feel a deep sadness as parsing titles tends to be error prone.  Sadly, splitting on ":" is not a sufficient strategy.  I've actually written libraries to help minimize the pain for others.  See https://pythonhosted.org/mediawiki-utilities/lib/title.html#mw-lib-title
Comment 22 Nathan Larson 2014-08-01 14:45:01 UTC
How would we handle existing log items? Per what was written in comment #13 above, I was thinking we could write a script that would look for rows lacking the redirpageid log parameter. Those would be the ones before the change was made.

Then it would look at the log_page and see where that redirect points to. It would need to look in the revision history, or maybe even the archive table (if the redirect page was later deleted), to find the revision that was created with the same rev_timestamp or ar_timestamp as the log_timestamp.

Then it could look up the page title from that redirect revision in the page table to get the page_id to put in log_page.

That page could have been meanwhile moved to a different page title, or deleted and undeleted, either of which would make the page title have a different page ID than what it was when the page move occurred. So it might be necessary for the script to follow the trail in the logging table till it gets to the correct page ID. That will be an interesting engineering challenge.

Alternatively, we can take the easy way out and simply fix bug 68930. Then we wouldn't be making any breaking changes.
Comment 23 Aaron Halfaker 2014-08-01 15:08:04 UTC
I'm a fan of the non-easy way and I'm willing to put into the work to build a script that would handle the engineering challenge of back-filling old data.  

Doing this work will take some time (a couple of weeks?) and I'd prefer if it was done after Wikimania.  I've already done some related work, so I'm familiar with the complexities that would be involved.  See https://meta.wikimedia.org/wiki/Research:Wikipedia_article_creation#Datasets

FWIW, I'm planning to do this work for other reasons.  See https://meta.wikimedia.org/wiki/Research:Ideas/MediaWiki_events:_a_generalized_public_event_datasource
Comment 24 Nathan Larson 2014-08-01 15:27:41 UTC
(In reply to Aaron Halfaker from comment #23)
> I'm a fan of the non-easy way and I'm willing to put into the work to build
> a script that would handle the engineering challenge of back-filling old
> data.

I've posted some discussion at [[mw:log_page migration script]].
Comment 25 Matthew Flaschen 2014-08-01 15:57:00 UTC
(In reply to Nathan Larson from comment #18)
> (In reply to Aaron Halfaker from comment #17)
> Actually, the way the API is set up now, neither rc.title nor
> rc.move.new_title are parsed to remove the namespace prefix. This means that
> bots have to do that parsing. We could introduce a new result property for
> the title without namespace prefix. I'm not sure what we'd call it, though.

Not sure it's necessary.  It may be that most people who need to do it already have a library that does so.

Anyway, it would be outside the scope of this change.
Comment 26 Matthew Flaschen 2014-08-01 15:59:37 UTC
(In reply to Nathan Larson from comment #22)
> That page could have been meanwhile moved to a different page title, or
> deleted and undeleted, either of which would make the page title have a
> different page ID than what it was when the page move occurred.

Or been suppressed/oversighted, etc.

> So it might
> be necessary for the script to follow the trail in the logging table till it
> gets to the correct page ID. That will be an interesting engineering
> challenge.

Indeed, definitely interesting, possibly feasible.
Comment 27 Bawolff (Brian Wolff) 2014-08-01 16:48:51 UTC
(In reply to Aaron Halfaker from comment #20)
> Well, I think we do have a compelling reason.  I am also a fan of asking
> wikitech-l to help us imagine what might be broken by the change.

Well people should include said compelling reason in the commit summary. Right now there is no reason mentioned in the commit summary compelling or otherwise.

Keep in mind that changing this value would break every existing usage of this value, including not only api users, but possibly tool labs folks who use the db. It would also change how namespace filtering works on special:recentchanges. To me that seems quite a breaking change.
Comment 28 Nathan Larson 2014-08-01 17:04:50 UTC
(In reply to Bawolff (Brian Wolff) from comment #27)
> Keep in mind that changing this value would break every existing usage of
> this value, including not only api users, but possibly tool labs folks who
> use the db. It would also change how namespace filtering works on
> special:recentchanges. To me that seems quite a breaking change.

I wonder how many people are using this value? This is kind of a catch-22, but the fact that people weren't clamoring for this change to be made earlier suggests to me that not many people are using the value, or they would have noticed how counterintuitive the situation was and said something. I'm not very familiar with tool labs; if there a way we can do a keyword search to see what uses they're making of log_page?

How does it change namespace filtering on Special:RecentChanges? It seems like with or without the change in effect, when you filter by namespace, results show up or don't based on the value of rc_namespace, i.e. the namespace you're moving the page from (rather than to).
Comment 29 Bawolff (Brian Wolff) 2014-08-01 17:23:29 UTC
I assumed you wanted to change that value too (to be consistent).


----

A less backwards breaking change could be to add a target_page_id entry to the log_search table.
Comment 30 Nathan Larson 2014-08-01 19:20:18 UTC
(In reply to Bawolff (Brian Wolff) from comment #29)
> A less backwards breaking change could be to add a target_page_id entry to
> the log_search table.

This is implemented in the patch set 5, Gerrit change #150969. Should we WONTFIX this bug, then?
Comment 31 Matthew Flaschen 2014-08-01 22:18:11 UTC
(In reply to Nathan Larson from comment #30)
> (In reply to Bawolff (Brian Wolff) from comment #29)
> > A less backwards breaking change could be to add a target_page_id entry to
> > the log_search table.
> 
> This is implemented in the patch set 5, Gerrit change #150969. Should we
> WONTFIX this bug, then?

The log_search approach seems like a good one to me.
Comment 32 Aaron Schulz 2014-09-02 19:59:55 UTC
Ran into this in bug 69311
Comment 33 Gerrit Notification Bot 2014-09-02 20:03:24 UTC
Change 157872 had a related patch set uploaded by Aaron Schulz:
Move log log_page entries are now that of the moved page

https://gerrit.wikimedia.org/r/157872
Comment 34 Gerrit Notification Bot 2014-09-05 15:22:34 UTC
Change 157872 merged by jenkins-bot:
Move log log_page entries are now that of the moved page

https://gerrit.wikimedia.org/r/157872
Comment 35 Nemo 2014-09-05 15:23:32 UTC
This is a quite drastic change, needs communication. (At least it has release notes.) I suggest to describe it as "Log entries for the move log are now stored for the target title rather than the original title", as this is effectively what users see isn't it.

Note You need to log in before you can comment on or make changes to this bug.


Navigation
Links