Mediawiki Despam

Previous Step:Translate
Next Step: Load

If you've been running a closed wiki and you're sure you have no spam, just skip this page.

On the other hand, if half of your edits were made by spambots, it's time to dump that dreck permanently. Removing all the spam and anti-spam edits reduced my site's total size from 60MB to 20MB and from 2006 edits down to 1272.. My site was almost 1/2 spam if you count edits and, if you count by total volume of each revision, it was 2/3 spam! Spam edits are usually quite large, usually adding around 10K per edit. Defacements, on the other hand, tend to remove more than they add.

Find the Spam

First, make sure you're working with an exploded directory of revisions. If not, go back to the Translate step.

Now, how can we find spammy edits? A bunch of ways.

Grep for it

Just look for spam.

   $ grep -r 'v[1il]aga?ra' *

Hunt it and delete it in one step:

   $ grep -lZir pu55y destdir | xargs -0 git rm
   $ git commit -m "Remove lame pu55y spam."

Check the file sizes

You can see it in the file sizes. For example, the Car page is 4.3K. Spam edits cause it to grow dramatically, revert edits return the page to its original size:

   -rw-r--r-- 1 bronson bronson 4.3K 2007-07-14 09:10 Car-20070714-161051
   -rw-r--r-- 1 bronson bronson  13K 2007-08-17 07:15 Car-20070817-141555
   -rw-r--r-- 1 bronson bronson  27K 2007-08-19 09:29 Car-20070819-162947
   -rw-r--r-- 1 bronson bronson  39K 2007-08-21 09:25 Car-20070821-162517
   -rw-r--r-- 1 bronson bronson 4.3K 2007-08-21 15:47 Car-20070821-224749
   -rw-r--r-- 1 bronson bronson  16K 2007-08-23 16:06 Car-20070823-230623
   -rw-r--r-- 1 bronson bronson  30K 2007-08-25 21:20 Car-20070826-042054
   -rw-r--r-- 1 bronson bronson 4.3K 2007-08-26 17:04 Car-20070827-000450

All of these edits except for the first one can be removed. Without the spam edits, the revert edits turn into no-ops.

etc

Please edit this page to add techniques that you find useful.

I played with a few more techniques, like 'find -newer' to bulk delete every revision after a known last-good revision.

Typical Spam

Here are some of the spammy things I saw.

Tiny Defacings

As mentioned in the description of the iki-diff-next command, some defacings are tiny:

   ########## changes made by nodes/Git-20071104-065732
   @@ -1,3 +1,4 @@
   +ergetorelta
   This page contains some notes that I took while I was learning git.
   
   == Support ==

I also got hit with "varacda", "pastaro" and other nonsense words. Is this just tagging?

Delete a Lot, Add a Little

You need to also be wary of changesets that delete a lot and add a little.

   -=== Install RadRails ===
   -
   -There have been significant improvements since the last RadRails beta so we'll tell Aptana to install the most recent milestone.
   -
   -* Window -> Preferences -> Install/Update
   -* Paste this URL into the Policy URL field: http://update.aptana.com/update/rails/beta/3.2/policy.xml
   -* Click OK
   -
   -Now, close Eclipse's Welcome tab.  You should see the Aptana Start Page (if not, Help -> Aptana Start Page).  At the bottom of the Plugins view you should see "Aptana RadRails".  Click on "install" and let Eclipse restart again.
   +fhexo mvpz tzvhxjusb jwxslzbm nepjyioz pcdmnt kxwrvqn

I missed this one in the first scan through the file. Except for the last line, it looks like a decent enough edit. If it wasn't accompanied by the comment "aeosxvzlf vszl", I probably would have missed it entirely.

Spammy Words

Here are some words that were heavily used in spam and hardly appeared at all on my site. Beware, NSFW!

   egrep -ir "interracial|porn|shemale|sex|adderall|nudity| \
       lviv|cock|adult-|russian|bankruptcy|personals|bondage| \
       cock|pussy|whore|loacal|cognac|pharma|webcam|spycam| \
       CoffeeCup" . | less

Also,for some reaon a lot of pages were randomly defaced with "1000" always appearing at the beginning of the line.

   grep -r "^1000"

That showed one or two legitimate pages of course, but it also showed a lot of defacings too.

Spam Links

If you're worried about spam, it's probably worth looking at all instances of http:// on your site.

   egrep -r "https?://" .

You can instantly recognize most links as good