Today, this blog got its first ever spam, via the trackback interface. How annoying. Here's how I've stopped it (yes, the regexes could be better, and the parse_url() call eliminated, but it's late and this is a quick hack):
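<?php
// Returns true if $host is NOT on any block list, false if it is listed.
function ne_rbl_check($host) {
    static $lists = array('.sbl-xbl.spamhaus.org');

    // gethostbyname() returns its input unchanged on failure, so skip
    // anything that did not resolve to an IPv4 address.
    $ip = gethostbyname($host);
    if (filter_var($ip, FILTER_VALIDATE_IP, FILTER_FLAG_IPV4) === false) {
        return true;
    }

    // 1.2.3.4 is looked up as 4.3.2.1.<list>
    $reversed = implode('.', array_reverse(explode('.', $ip)));
    foreach ($lists as $bl) {
        $h = $reversed . $bl;
        // Any successful resolution means the address is listed.
        if (gethostbyname($h) != $h) {
            return false;
        }
    }
    return true;
}

// Checks bare IP addresses directly, and the host of every URL found in
// free text. Returns false as soon as anything is listed.
function ne_surbl_checks() {
    foreach (func_get_args() as $thing) {
        $thing = (string) $thing;
        if (preg_match('/^\d+\.\d+\.\d+\.\d+$/', $thing)) {
            if (!ne_rbl_check($thing)) return false;
        }
        $m = array(); // must be a real variable; it is filled by reference
        if (preg_match_all('~(http|https|ftp|news|gopher)://([^ ]+)~si', $thing, $m, PREG_SET_ORDER)) {
            foreach ($m as $match) {
                $url = parse_url($match[0]);
                // parse_url() can fail or find no host in odd input
                if (empty($url['host'])) continue;
                if (!ne_rbl_check($url['host'])) return false;
            }
        }
    }
    return true;
}
?>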
These two functions implement RBL and SURBL checks. RBLs, as you probably already know, are real-time block lists; you can look up an IP address in a block list using DNS, and if you get a record back, that address is in the block list. The first of the two functions implements this, in a bit of a lame hackish way.
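To make the DNS lookup a little more concrete, here is a minimal sketch of how the query name is put together: the octets of the address are reversed and the list's zone is appended. The helper name rbl_query_name() is made up for this illustration and is not part of the functions above; it only builds the name and does not touch the network.

```php
<?php
// Hypothetical helper: builds the DNS name to look up for an IPv4 address.
// Returns null when $ip is not a valid IPv4 address.
function rbl_query_name($ip, $zone) {
    if (filter_var($ip, FILTER_VALIDATE_IP, FILTER_FLAG_IPV4) === false) {
        return null;
    }
    // 192.0.2.1 becomes 1.2.0.192
    $reversed = implode('.', array_reverse(explode('.', $ip)));
    // Accept the zone with or without a leading dot.
    return $reversed . '.' . ltrim($zone, '.');
}

// 192.0.2.1 is a documentation address, used here purely as an example.
echo rbl_query_name('192.0.2.1', 'sbl-xbl.spamhaus.org'), "\n";
// prints 1.2.0.192.sbl-xbl.spamhaus.org
```

If that name resolves, the address is listed; if the lookup fails, it is not.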
The second function implements content-based checks, commonly known as SURBL; the text is scanned for things that look like IP addresses or URLs; those IP addresses or host names are extracted from the content and then looked up in the RBL using the first function.
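The extraction step on its own might look something like the sketch below. It is not the code this blog runs; extract_url_hosts() is a hypothetical helper, and its regex is a slightly stricter variant that stops at whitespace, quotes and angle brackets so that trailing HTML does not end up in the host name.

```php
<?php
// Hypothetical helper: returns the unique, lower-cased hosts (names or IP
// literals) of every http, https or ftp URL found in $text.
function extract_url_hosts($text) {
    $hosts = array();
    $m = array();
    if (preg_match_all('~\b(?:https?|ftp)://[^\s<>"\']+~i', (string) $text, $m)) {
        foreach ($m[0] as $candidate) {
            // parse_url() yields null or false when no usable host is present.
            $host = parse_url($candidate, PHP_URL_HOST);
            if (is_string($host) && $host !== '') {
                $hosts[] = strtolower($host);
            }
        }
    }
    return array_values(array_unique($hosts));
}

print_r(extract_url_hosts('See http://Example.com/a and <a href="https://192.0.2.7/x">this</a>'));
```

Each host returned this way could then be handed to ne_rbl_check() in turn.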
Why is this good? A comment spammer will typically want to inject a link to their site onto your blog, and you can be fairly sure that their site is listed in a good RBL. The RBL used in my sample above is an aggregation of the SBL and XBL lists which contain known spammers and known zombie/exploited machines, so it should do the job perfectly.
Now to hook it up to the blog; this snippet is taken from my trackback interface:
<?php
if (!ne_surbl_checks(get_ip(), $_REQUEST['excerpt'], $_REQUEST['url'], $_REQUEST['blog_name'])) {
    respond('you appear to be on SBL/XBL, or referring to content that is', 1);
}
?>
get_ip() is a function that determines the IP address of the client submitting the request. I haven't included it here for the sake of brevity; it's fairly simple to write one, but keep in mind that it needs to be aware of HTTP proxies. respond() returns an appropriate error message to whoever is making the trackback and exits the script.
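For the curious, one way such a function could look is sketched below. This is not the get_ip() used here; get_ip_sketch() and its $trustedProxies parameter are assumptions made for illustration. The idea is that the X-Forwarded-For header is supplied by the client and is trivial to fake, so it is only believed when the connection itself comes from a proxy you run.

```php
<?php
// Hypothetical sketch of a proxy-aware get_ip(). $server is normally $_SERVER;
// $trustedProxies lists the addresses of reverse proxies you operate yourself.
function get_ip_sketch(array $server, array $trustedProxies = array()) {
    $remote = isset($server['REMOTE_ADDR']) ? $server['REMOTE_ADDR'] : '';

    // Only honour X-Forwarded-For when the peer is one of our own proxies.
    if (in_array($remote, $trustedProxies, true) && !empty($server['HTTP_X_FORWARDED_FOR'])) {
        // The header is a comma-separated chain; the rightmost entries were
        // added last, so walk it from right to left.
        $chain = array_map('trim', explode(',', $server['HTTP_X_FORWARDED_FOR']));
        foreach (array_reverse($chain) as $candidate) {
            if (in_array($candidate, $trustedProxies, true)) {
                continue; // another of our own proxies, keep walking
            }
            if (filter_var($candidate, FILTER_VALIDATE_IP) !== false) {
                return $candidate;
            }
            break; // malformed entry: stop and fall back to the peer address
        }
    }
    return $remote;
}
```

In a real script you would call it as get_ip_sketch($_SERVER, array(...)) with your own proxy addresses filled in.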
And that's all there is to it; you can do similar things with your comments submission and pingback interfaces.
Enjoy.
Completely off-topic (apologies in advance)
On the UAE homepage I noticed that you (I think) wrote a patch to give that emulator MMU support. I wish to run NetBSD on UAE to use the Sun3 compatibility layer (long story) to execute an ancient Ada cross-compiler.
Do you still have the binaries from when you were working on this project? I briefly tried to compile it last night, and ran into problems. I will try to resolve those tonight, but I decided to try the easy approach as well by contacting you in the meantime.
Also, might you have any other ideas for running NetBSD on virtual 68k hardware?
Thank you, Toby
The MMU emulation wasn't complete enough to get to init on Linux though, and was really, really slow. IIRC, there was an Atari m68k emulator that might have had MMU support. You might be better off just trying to find someone actually running NetBSD on m68k hardware and asking them for a shell.
As you should already be aware, using an RBL (or any combination of RBLs) is hardly an effective solution. I had 17 attempts, each from a unique IP address, within a 3-minute time frame. Only half of them showed up in a multi-RBL check.
