= resources =

* [http://www.mediawiki.org/wiki/Manual:%24wgSpamRegex mediawiki: spamblocking regex]
* WikiSpamExamples
* http://wiki.tcl.tk/12559
* [http://en.wikipedia.org/wiki/Spamdexing Wikipedia: Spamdexing]
* [http://www.usemod.com/cgi-bin/mb.pl?WikiSpam Meatball Wiki: WikiSpam]
* http://wiki.chongqed.org/
* Some interesting and relevant discussion at [http://www.usemod.com/cgi-bin/mb.pl?TheSeptemberThatNeverEnded Meatball Wiki: The September That Never Ended].
* [http://opensource.mit.edu/papers/neus.pdf A scholarly paper (PDF)]

= spam logging, Oct. 2006 =

<[[Brennen]]> So, out of curiosity, since things were still slipping past the simple measures I had in place, I started doing some cheap spam logging here, dumping page names and spam text to a file for anonymous users who try to post URLs. Over a couple of weeks, I collected several megs of stuff without any false positives. All but a few instances of it seem to be from two sources, each hitting a specific single page over and over again. One bot dumps random text into the "add your response" textarea on the page; the other follows the "edit" link and tries to overwrite the entire page.

Yesterday it occurred to me that I ought to be collecting IPs, so I added that and started the log fresh.

Since most of the spambots here seem so unsophisticated (these wiki & comment spammers clearly haven't had to find workarounds for the same level of countermeasures as with e-mail, yet), and appear to keep grinding right along without checking whether their input is accepted (this is one reason I assume these are *bots* at work, rather than human data peons), I'm wondering how hard it would be to go on the offensive, or if it'd just be better to blacklist the IPs, or what... (Let them keep blissfully posting spam that goes nowhere, so as to avoid a general escalation and the eventual necessity of authenticated logins?)

= actions taken =

<[[Brennen]]> Added the line "Disallow: /wala.cgi?action=edit" to my [http://p1k3.com/robots.txt robots.txt file]. We'll see if it makes a difference.

<[[Brennen]]> I'll note that maybe a month later (I don't remember exactly when I did this), the spam seems to have dropped off.

= January 2006 =

Some bot has been repeatedly writing drug spam to [[linkdumparchive]] (a lowercased version of LinkDumpArchive), with text along the lines of "Your site is amaizing. Can I share some resources with you?" & a bunch o' links. I finally just did a chmod a-w on the file; I'll think about a real fix.

= February 2006 =

[[Brent]] added some rudimentary stuff to knock out this last spambot; anonymous users now can't post links through the add box.

----

= older local discussion =

<[[Brennen]]> (from [[linkdump]]) Clearly, Wala.pm needs to be modified to implement basic spam-blocking. Current WikiSpamming techniques are ridiculously crude and easy to roll back, but their frequency is going up and they will get more sophisticated.

What is Wiki Spamming? The pages I see about Wiki Spamming talk about how it's made and how to stop it, but not what it is. Examples?

<[[Brennen]]> The wiki-based discussion I've linked to is reasonably deep; in summary, the basic features of a wiki (transparent editing, generally without access controls) make it extraordinarily vulnerable to spam. This is mostly a problem because Google-style search engines (theoretically) pay attention to all of the links they find when ranking search results.
What I've seen here, on a Wala.pm based system, has been relatively limited in scope: a slew of URLs in various formats dumped on a single page through either the edit form or the "add your response" box. It's happened maybe 6 or 7 times by now, and to the best of my recollection only on pages which are within a single click of the p1k3.com/ root (linkdump and HomePage, mostly). Wala.pm is perhaps slightly more vulnerable than many wiki systems just because of the IRC-style add box at the bottom of each page.

I haven't taken the time to track down offending IPs or clients, and haven't a clue whether this is being done manually or via a script which just runs around pumping text into any form it finds. The sheer randomness of it leads me to assume that either way, the perpetrator (code or wetware) is not particularly bright, but I guess brute force makes sense. What worries me more, at this frequency of attack, is the possibility of a script clever enough to, say, rewrite all of the existing URLs on a page. Wiki syntax is relatively consistent and easy to parse; this would be anything but difficult.

I have been toying with different ideas for a simple technical fix; ideally it would be something transparent to users without making the wala any harder to update. As usual, the perfect is the enemy of the good. Possibilities:

* No anonymous post/edit - require a login for everything. (One step further is password control for logins, but I'd rather avoid it.)
* Some sort of key for every post/edit - a lot of weblog comment systems, forums, and registration forms require users to read a number or string and input it. This is annoying and time-consuming.
* In a variation on the above, require a key only for anonymous changes and logins. This is slightly better but still annoying.
* A key could be incorporated in the edit/post itself - reject every post that doesn't contain a certain string, or maybe require something in the Summary field for every change. This is a little more interesting, but also sounds annoying.
* An "I am not a spambot." checkbox next to every submit button. Annoying and would probably hurt usability a lot, but there might be ways to make it seem more natural.
* Disable editing on certain pages. HomePage, etc., and I can maintain these with a server-side text editor. Lame.
* Disable external links altogether. Supremely lame.
* Keep Google & co. from indexing these pages using robots.txt. Loses good functionality.
* Only show the add box and "edit this page" to certain clients. Also lossy.
* Blacklists and content-based blocking. This gets us into arms-race territory; I am disinclined to spend the time or mental energy. That said, a very simple blacklist mechanism would be easy to build.
* Whitelisting: This actually appeals to me a lot more than a blacklist. (Or more than ''just'' a blacklist.) What I have in mind is a special wala page containing a list of allowed nicks. In order to post or edit, you've got to be on the list. This has all the problems of the other access control schemes, but the idea of making it completely user-maintained is really appealing. There might be ways to make it less noxious, such as automatically adding new logins to the list.

<[[Brent]]> If you could corral a few examples so I can inspect them, I'd appreciate that.
I'd prefer to prevent spam-like content; if I had some examples, I would probably add a feature that works as follows: if an edit is flagged as spam-like, the edit is re-displayed to the user, who must check an "I am not a spambot"-type checkbox before the edit is actually posted. So, for example, any edit that is only a link, or contains at least two links, would be flagged and would require an additional step to be posted.

I could add whitelisting as follows: if a page named WhiteList exists, each line on that page is used as a nick in the list of allowed nicks. If the page doesn't exist (the default), no whitelisting occurs. A similar scheme would work for a page named BlackList.

<[[Brennen]]> I'm at work at the moment, but I'll put up WikiSpamExamples at some point today. WhiteList seems like a good idea, if it was simply based on the principle that whitelisted nicks don't go through spam-checking. BlackList, I think, would have to be IP-based to do any good at all. A decent heuristic for spam content might be anything that's more than 50% links.

<[[Brent]]> Good points, all. It's difficult for me to tell when an edit contains a certain percentage of links, since Wala.pm uses regular expressions to convert links. I can count the number of URLs in the edit, though, so that might be the way to go. In my experience, it's unusual for me to add more than one URL to a wiki or wala page in a single edit.

<[[Brennen]]> The way I've been using this one lately, that's not so much true - I'm basically thinking of this as a more practical bookmark list or repository of interesting search terms - but since a login is so straightforward, that wouldn't be a problem.

Just as an exercise, it seems to me that the simplest way to look at the percentage of links would be to do it by word:

 @words = split / /, $pagetext;
 for (@words) {
   if ( m/(http:\/\/)/i ) {
     # is a URL
     $link_count++;
   } else {
     # not a URL
     $plainword_count++;
   }
 }

Or something like that.

<[[Brennen]]> Added WikiSpamExamples.

<[[James]]> I think a lot of the spam comes from bots. From my logs earlier this month:

 212.109.211.118 - - [02/Nov/2004:05:05:21 +0000] "GET /cgi-bin/wala.pl?action=edit&id=HomePage HTTP/1.1" 200 741 "http://www.google.ru/search?q=inurl:%3Faction%3Dedit+homepage&num=20&hl=ru&lr=&start=40&sa=N" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; dial; .NET CLR 1.1.4322)"

I imagine changing the action variable name or value will cut out most of the automated spam that is coming in via search engines. Obviously this approach is very naive and doesn't really address the problem of determining whether a post is spam or not.

<[[Brennen]]> Good idea, though. Using robots.txt, can we tell Google and other bots not to index edit pages? This ought to be done anyway if they're just cluttering up search results.
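On that robots.txt question: a minimal sketch, assuming the edit links look like the ones already quoted on this page (the /wala.cgi?action=edit entry under "actions taken" and the /cgi-bin/wala.pl URL in James's log line). Plain Disallow lines are simple prefix matches under the original robots.txt convention; the wildcard form is an extension that Googlebot and some other major crawlers honor.

 User-agent: *
 Disallow: /wala.cgi?action=edit
 Disallow: /cgi-bin/wala.pl?action=edit
 # Googlebot also accepts wildcards, which catch edit links
 # regardless of query-parameter order:
 # Disallow: /*action=edit

This does nothing against a bot that ignores robots.txt, but since the bot in James's log found its target through a Google search for action=edit URLs, keeping edit pages out of the index should cut off that route (and stop them cluttering search results anyway).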
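Separately, a rough Perl sketch of the WhiteList and count-the-URLs ideas from the discussion above. This is hypothetical, not code from Wala.pm; the sub names, the two-URL threshold, and the pages/WhiteList path are all made up for illustration.

 #!/usr/bin/perl
 # Hypothetical sketch; names and paths are illustrative, not Wala.pm's.
 use strict;
 use warnings;
 
 # Read allowed nicks from a stored WhiteList page, one nick per line.
 # If the page doesn't exist, return an empty list and no whitelisting
 # happens -- the default described above.
 sub read_whitelist {
     my ($page_file) = @_;
     return () unless -e $page_file;
     open my $fh, '<', $page_file or return ();
     chomp( my @nicks = <$fh> );
     close $fh;
     return grep { length } @nicks;
 }
 
 # Count URLs in an edit; counting is simpler than computing a
 # percentage of links once wala markup is involved.
 sub count_urls {
     my ($text) = @_;
     my @urls = $text =~ m{(https?://\S+)}gi;
     return scalar @urls;
 }
 
 # An edit is "spam-like" if the nick isn't whitelisted and the text
 # contains two or more URLs.
 sub looks_spammy {
     my ( $nick, $text, @whitelist ) = @_;
     return 0 if grep { $_ eq $nick } @whitelist;
     return count_urls($text) >= 2 ? 1 : 0;
 }
 
 # Example:
 my @whitelist = read_whitelist('pages/WhiteList');
 my $text = 'check out http://example.com/pills http://example.com/casino';
 my $verdict = looks_spammy( 'AnonymousCoward', $text, @whitelist );
 print $verdict
     ? "re-display with the 'I am not a spambot' checkbox\n"
     : "save the edit normally\n";

A real version inside Wala.pm would presumably run a check like this on the submitted text before saving, re-display the edit with the confirmation checkbox when it's flagged, and skip the check entirely for whitelisted nicks.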