Last modified: 2014-09-23 22:36:29 UTC
When a user is editing a page, would there be a way to have the system automatically check for related articles? For example, a user adds the word "physics" but doesn't link it because he doesn't know that there is an article about physics. With automatic linking, the system would check the newly edited part of the article, see if any of the words match article names, and change each matching word into a link to that article.
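To make the request concrete, here is a minimal sketch of the idea in Python. It is not the extension discussed in this report; the function name `wikify` and the in-memory list of titles are assumptions for illustration. It deliberately ignores existing links and phrases, so it only shows the core matching step.

```python
import re

def wikify(text, titles):
    """Wrap every word that matches an article title in wiki link syntax.

    `titles` is a hypothetical in-memory list of article names; a real
    wiki would look them up in its database instead.
    """
    lookup = {t.lower(): t for t in titles}

    def link(match):
        word = match.group(0)
        title = lookup.get(word.lower())
        if title is None:
            return word  # not an article name, leave the word alone
        # Piped link: point at the article but keep the author's casing.
        return "[[" + title + "|" + word + "]]"

    return re.sub(r"[A-Za-z]+", link, text)


print(wikify("I study physics daily.", ["Physics"]))
```

Matching is case-insensitive here, which is exactly what makes common words risky on a wiki that happens to have an article with the same name.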
changed subject from "automatic linking" to "automatic wikification"
Changed the product to MediaWiki extensions.
Created attachment 741: First draft of extension. First draft of automatic wikification extension. It needs some work in the regular expression arena. It is designed to work with the 1.5 db layout.
Wow. Someone decided to make this extension. Thanks. I'm not that knowledgeable about coding, unfortunately, but I will try to help as much as I can. I've already read the code and, thanks to your comments, have been able to understand the majority of it. I already have a couple of comments that I hope will help. Does the extension allow custom namespaces? And how many SQL queries will it generate? I know my website provider limits the number of SQL queries per user per hour. Also, if you would like to test on a wiki (other than the CWRU wiki that you run), you can use my wiki if you'd like, which should be up soon.
The good news about this extension is that it generates only one SQL query. The bad news is that the query can be massive, depending on the size of the article being saved. However, this giant query is executed only when a page is actually saved. There are currently some limitations to the extension. The primary limitation is the poorly written regular expressions, and the replacement regular expression is the worst of them: it will replace text, but will mess up formatting in the process. In addition, the script does not yet support namespaces other than the main namespace. This change should be trivial, however. Functions for generating links to internal topics can be found in 'includes/Title.php' (I believe). Also, automatic wikification, although it sounds cool, has some drawbacks. When I ran it on some test articles on http://wiki.case.edu, it would convert common words like "case" to links because "Case" is the shorthand name of my university. Unless I am mistaken, the MediaWiki hook system does not allow you to return the text from the pre-save hook (which this extension is) and have the user verify it. If this extension is to be used in production environments, it will need attention from people with more experience with regular expressions than I have. Once those problems are fixed, I will attend to the other issues.
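The "one query per save" approach can be sketched as follows. This is an illustration in Python against an in-memory SQLite database, not the attached PHP code. The `page` table with `page_namespace` and `page_title` columns is assumed to mirror the 1.5 layout, and `existing_titles` is a hypothetical helper name.

```python
import re
import sqlite3

def existing_titles(conn, text):
    """Return the main-namespace titles that match a word in `text`.

    All candidate words go into one parameterized IN (...) clause, so the
    database is hit once per save no matter how long the article is. The
    query grows with the number of distinct words, which is the
    "massive query" trade-off.
    """
    # Uppercase only the first letter, as wiki titles usually do.
    words = sorted({w[:1].upper() + w[1:] for w in re.findall(r"[A-Za-z]+", text)})
    if not words:
        return set()
    placeholders = ",".join(["?"] * len(words))
    sql = ("SELECT page_title FROM page "
           "WHERE page_namespace = 0 AND page_title IN (" + placeholders + ")")
    return {row[0] for row in conn.execute(sql, words)}


# Usage with a throwaway database (table contents are made up).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page (page_namespace INTEGER, page_title TEXT)")
conn.executemany("INSERT INTO page VALUES (?, ?)",
                 [(0, "Physics"), (0, "Case"), (12, "Help")])
print(existing_titles(conn, "The case for physics and help"))
```

Using bound parameters rather than pasting the words into the SQL string also avoids quoting problems with unusual article text. Databases cap the number of bound parameters, so a very long article might need the word list split into batches.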
Sorry, but what exactly do you mean by "regular expressions"? Also, to reduce the size of the query, would it be helpful to have the extension determine what is different between the new and old versions before scanning for links? That won't add links to articles created since the last time the entire article was scanned, so maybe not. And you're right, it might be best to have readers check it before it adds the links. If that's not possible, something else would have to be done.
Regular expressions are a method for matching text patterns, and they are a very powerful tool. See http://en.wikipedia.org/wiki/Regular_expression for more info. Finding a diff between versions and then doing the substitution would be very difficult. You would have to extract the old contents, run a search on the new terms, and somehow do a string replace on the autowikified links only in the new text. The last part seems a bit challenging. In all honesty, I think it would be more beneficial to spend time writing thorough documentation on creating links than working on this extension. When it comes to creating content, humans will always be able to do a better job than computers. Automatic wikification, although cool, will not always be perfect. An alternative to investigate would be a tool run by experienced wiki users that scans articles for possible links and prompts whether to change the text into a link.
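For what it's worth, the first half of the diff idea (finding which words are new) is the easy part. A rough sketch in Python using the standard `difflib` module follows; `added_words` is a hypothetical name, and the sketch does not attempt the hard part described above, which is putting links back into the right place in the new text.

```python
import difflib

def added_words(old_text, new_text):
    """Return the words that appear in inserted or replaced spans of the new revision."""
    old, new = old_text.split(), new_text.split()
    matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    added = []
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        # "insert" and "replace" spans carry text that the old revision lacked.
        if tag in ("insert", "replace"):
            added.extend(new[j1:j2])
    return added


print(added_words("The sky is blue",
                  "The sky is blue because of physics"))
```

Only these words would need to be checked against article titles, which would shrink the query, at the cost of missing articles created after the unchanged text was last scanned.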
I don't know why I didn't think of this before, but could an exclude list (or key, attribute, column, whatever) be the solution to at least one problem? For example, make the "case" article exempt from automatic wikification. This can be done whatever way makes it easiest to code. It would eliminate one major problem: words with multiple meanings being turned into links when they shouldn't be.
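An exclude list could be as simple as a filter applied to the candidate titles before any linking happens. A minimal sketch in Python, where the name `filter_titles` and the sample lists are assumptions:

```python
def filter_titles(titles, exclude):
    """Drop every title that appears in the exclude list (case-insensitive)."""
    blocked = {entry.lower() for entry in exclude}
    return [title for title in titles if title.lower() not in blocked]


# "Case" and "About" are real articles on this imaginary wiki, but they
# are too common as ordinary words to be linked automatically.
print(filter_titles(["Case", "Physics", "About"], ["case", "about"]))
```

Filtering the titles rather than the article text keeps the matching code unchanged, so the list can grow without touching the regular expressions.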
I feel that if someone who knows more coding could work on this, it could be made much better.
I will help with the regexes.
Not sure how the regex stuff is going, but I have another question. Is it possible to run this just once through the database by running the file on the internet, or does it have to be done when pages are saved? What would I have to change to get it to work that way?
I've been working on getting this to run for all pages in the main namespace at one time, and it has become very confusing and frustrating. If ANYONE can help out that knows MediaWiki and PHP, their help would be greatly appreciated. Thanks.
Created attachment 1266: Second draft of extension. The second draft fixed a bug in the first draft that would take out the space before the word that is linked. Also, the extension does not seem to be linking phrases, although it should. Hopefully I will figure out how to make an exclude list soon.
Didn't you guys think about the possibility that this autowikification tool links to too many articles? Let's face it, en: wikipedia is a big one and there are a lot of articles about lots of different stuff. The result can be almost totally blue text, which is somewhat unwanted. On the other hand, small wikis have few articles, so this would barely help. All in all, I think it's a good idea, but it needs human control, IMO. After all, let's not forget that red links aren't bad in small wikis - on the contrary, they are helpful and good for the project. But red links aren't a part of this extension, so I'll shut up. :)
Yes, I never expected this to be used on en: wikipedia. It would be used on small to medium wikis, mainly to add links to things that the author didn't know about. I do agree, however, that there should be some human control. Maybe instead of automatically adding the links, it could just suggest them and allow the user to choose which to include. Soon I will upload a new version that includes a way to exclude pages. For example, my wiki's "about" page kept getting linked. With the exclude list, you can add the word "about" as a word to exclude. You can also use this to keep the number of links down. Lastly, you're right, red links are good, but as you said, this extension doesn't do anything with them.
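The "suggest, don't apply" variant could look roughly like this in Python. The names `suggest_links` and `apply_accepted` are hypothetical, and how the suggestions would be shown to the editor is left open.

```python
import re

def suggest_links(text, titles):
    """List possible links as (offset, word, title) without changing the text."""
    lookup = {t.lower(): t for t in titles}
    suggestions = []
    for match in re.finditer(r"[A-Za-z]+", text):
        title = lookup.get(match.group(0).lower())
        if title is not None:
            suggestions.append((match.start(), match.group(0), title))
    return suggestions

def apply_accepted(text, accepted):
    """Insert links only for the suggestions the user approved."""
    # Work from the end of the text so earlier offsets stay valid.
    for start, word, title in sorted(accepted, reverse=True):
        link = "[[" + title + "|" + word + "]]"
        text = text[:start] + link + text[start + len(word):]
    return text


text = "A case about physics"
found = suggest_links(text, ["Case", "Physics"])
# Pretend the user rejected "case" and accepted "physics".
print(apply_accepted(text, [found[1]]))
```

Keeping the offsets in each suggestion means the user's choices can be applied to exactly the occurrences they saw, rather than to every occurrence of the word.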
Created attachment 1270: Version 0.3. This new version includes a way to exclude words by modifying the $excludelist array. Eventually, you will be able to set this in LocalSettings.php. Also, linking phrases now works correctly. Unfortunately, new bugs have been discovered. The extension will not link the last word in the article, and words followed by a period or comma (e.g. "home,") are not handled correctly.
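Both remaining bugs (the last word, and words followed by punctuation) are the kind of thing word-boundary anchors tend to handle. Below is a sketch in Python rather than the extension's PHP. `link_terms` is a hypothetical name, and the approach of splitting out existing links first is one option, not necessarily what the attachment does.

```python
import re

# Existing [[...]] links, captured so re.split keeps them in the result.
EXISTING_LINK = re.compile(r"(\[\[.*?\]\])")

def link_terms(text, titles):
    """Link titles and phrases, leaving punctuation and existing links untouched."""
    if not titles:
        return text
    # Longest first, so "main page" is tried before "page".
    ordered = sorted(titles, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(t) for t in ordered) + r")\b",
        re.IGNORECASE,
    )
    parts = EXISTING_LINK.split(text)
    # Even positions hold plain text, odd positions hold existing links.
    for i in range(0, len(parts), 2):
        parts[i] = pattern.sub(r"[[\1]]", parts[i])
    return "".join(parts)


print(link_terms("Go home, then visit the main page", ["home", "main page"]))
```

Because `\b` matches at the end of the string as well as before a comma or period, the final word and "home," are both linked without swallowing the punctuation or the preceding space.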
For the German Wikipedia there is a wikifier at http://217.160.138.71/development/wikipedia/wikify/. It works fine and could serve as an example for this request.
Unfortunately I don't speak German, so if there is anyone that could translate this to English or provide the code (with English comments) here, that would be great.
*** Bug 4886 has been marked as a duplicate of this bug. ***
A note on running this on all pages once; if well-written, then a wrapper around the code could be provided in the form of a custom maintenance script which could rip through all article pages.
*** Bug 7015 has been marked as a duplicate of this bug. ***
Is there a possibility of adding an option to wikify only from a "whitelist" (nothing else would be considered)? I mean the counterpart to posting #16: an $includelist. Thanks for your support!
Created attachment 3826: Version 0.45
Adding "need-review" keyword to indicate extension awaits review. jediarchives11, you might want to check whether an extension like this already exists (look on mediawiki.org) - if it doesn't, you should probably update your extension to work with MediaWiki as it is now, and then follow these instructions: https://www.mediawiki.org/wiki/Writing_an_extension_for_deployment
Rm patch-needs-review - never going to be deployed on WMF for non-technical reasons. Technically, works only with $wgDBprefix = 'wiki', uses raw SQL, pegs master with requests perfectly suitable for slaves.