Last modified: 2014-09-16 17:27:47 UTC
cf. an email from Sebastian Hellman in February 2013:

"We built *a lot* of infrastructure which depends on the updates. So if the OAI-PMH stream would suddenly not work anymore, it would jeopardize three or four open-source projects and would cause a lot of problems further down the data chain (i.e. people who get the data from us). So yes, we are still using the OAI-PMH stream and we even plan to extend the usage to more language versions of Wikipedia and many language versions of Wiktionary. Of course, we are willing to change to the MediaWiki API, if necessary (and we also have the manpower to achieve this within several months).

There were two major reasons why we didn't switch yet:
1. we have a running system, there is no real incentive to switch unless you tell us to.
2. we didn't have a contact from Wikimedia. I wrote one or two emails in the past, but didn't get a response.
3. We did not find any good documentation on how to get *all* updates from Wikipedia. Query RC and then do Special:Export requests?
4. We were afraid to get blocked, since we would be over the 1 request per second limit.

We would be happy if we could get into contact and settle this matter to be compatible with the future. We are in contact with WikiData already (Anja Jentzsch worked on DBpedia before)."

I re-enabled OAI auditing earlier today, and it would seem, at the time of writing this email, that dbpedia are the only user of the OAI interface...
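For reference, the "query RC and then do Special:Export requests" approach from point 3 of the email could look roughly like the sketch below. This is only an illustration, not what DBpedia runs: the wiki URLs, the function names and the choice of `rcprop` fields are assumptions, and the exact Special:Export parameters should be checked against the MediaWiki manual before relying on them. It only builds the request URLs and parses an already-fetched recentchanges response; fetching, continuation handling and throttling are left out.

```python
from urllib.parse import urlencode

# Assumed target wiki; swap in whichever wiki is being mirrored.
API_URL = "https://en.wikipedia.org/w/api.php"
INDEX_URL = "https://en.wikipedia.org/w/index.php"


def recent_changes_url(since, limit=50):
    """Build an API URL listing changes made at or after `since` (ISO 8601)."""
    params = {
        "action": "query",
        "list": "recentchanges",
        "rcprop": "title|ids|timestamp",
        "rcdir": "newer",      # oldest first, starting from rcstart
        "rcstart": since,
        "rclimit": limit,
        "format": "json",
    }
    return API_URL + "?" + urlencode(params)


def changed_titles(rc_response):
    """Return the distinct page titles in a decoded recentchanges response."""
    titles = []
    for change in rc_response.get("query", {}).get("recentchanges", []):
        title = change["title"]
        if title not in titles:  # a page edited twice is exported once
            titles.append(title)
    return titles


def export_url(titles):
    """Build one Special:Export URL for the current revision of each title."""
    params = {
        "title": "Special:Export",
        "pages": "\n".join(titles),  # one title per line
        "curonly": "1",               # current revision only
    }
    return INDEX_URL + "?" + urlencode(params)
```

Batching all changed titles into a single export request, rather than one request per page, is one way to stay closer to the request-rate concern in point 4.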
Note that the old search used OAI for internal updates, at least on some wikis, but this should be gone soon with full CirrusSearch deployment. What's the situation with the new rcstream and related feeds: can these be adapted to send page text as well, or do we have a better way for them to do that kind of data fetch?
Since 20140724215329:

mysql:wikiadmin@db1038 [oai]> select oa_client, ou_name, count(oa_client) from oaiaudit left join oaiuser on oa_client = ou_id group by oa_client;
+-----------+--------------+------------------+
| oa_client | ou_name      | count(oa_client) |
+-----------+--------------+------------------+
|         0 | NULL         |             1055 |
|         6 | lsearch2     |           126808 |
|        12 | fresheye.com |             5967 |
|        13 | dbpedia      |            38854 |
+-----------+--------------+------------------+
4 rows in set (0.37 sec)