@internet -- Boris the Spider

I used to be a lot more tolerant of spiders than I am now. Time was, I thought of them as creepy -- but essentially harmless -- animals that serve a useful purpose as pest control agents. I left them alone and I expected them to return the favor.

Then I was bitten by a brown recluse. Twice. The first one apparently crawled into my bathrobe while I was sleeping. When I awoke, I put the robe on and sat down to have a cup of coffee and -- WHAM! WHAM! WHAM! WHAM! -- the little sucker nailed me four times, right in a line. It felt like someone had jammed a lit cigar into my fanny -- four times, really fast. I squished him, naturally, but it was small recompense for having to forgo sitting down for the next six weeks.

The second time -- about a year later -- happened much the same way, except that it was my jeans in which the little so-and-so had taken refuge. That one bit me -- three times -- on the inside of my left thigh, instead of on my tuchus. At least I was able to sit down afterward.

San Francisco Chronicle and Salon Magazine columnist Jon Carroll, who has had the same pleasure, described the lesion that formed as "the pimple from Hell". In both my own incidents, the wounds, which formed in a line, ranged from half-dollar size down to dime size. They were deep and painful and -- after the "pimples" popped -- left behind raw, nasty holes that took months to fully heal.

A few months later, I got to experience the daylong, viselike migraine headache and muscle rigidity that come free with every black widow bite. That put thirty to any lingering tolerance I felt toward arachnidae. That did it. Now, I ruthlessly squash most spiders I see -- but I still make exceptions for spindly ones like harvestmen and daddy longlegs, because they're completely harmless to humans.
Likewise, among the spiders that live on the Internet, there are those that are beneficial -- like the long-legged one that lives behind the wastebasket in my bathroom and eats the biting Argentine ants that make my ablutions so much fun -- and others that are actively harmful.

Look, He's Crawling Up My Wall

Search engines such as Alta Vista, HotBot, WebCrawler and the like rely on Web spiders -- robots that traverse HTML document trees and report the text or meta tag contents -- to create indices for their users to search. So do the meta-spammers, who use them to harvest lists of email addresses that they then sell to the jerks who so thoughtfully cram our inboxes with junk email.

Most ISPs know about ROBOTS.TXT, a file that well-behaved spiders consult for guidance on whether and how to catalogue a site they've contacted. It's an ASCII text file that lives at the document tree root and defines those documents and/or directories that conforming spiders are forbidden to index.

A ROBOTS.TXT file includes at least one record: a User-agent definition, followed by one or more Disallow statements. Records are separated from one another by a blank line. The file may also begin with one or more comment lines, each of which must be preceded by the # character, and it should end with a blank line.

Each User-agent definition identifies the search engine or search engines to which the Disallow statement or statements that follow it apply. That allows you to tailor a ROBOTS.TXT file to forbid some spiders access to certain files or directories while permitting some or all other spiders to access those same files or directories.

Here's an example of a ROBOTS.TXT file that prohibits all spiders from accessing all files in a document tree:

    User-agent: *
    Disallow: /

To impose different restrictions on different spiders, create a different User-agent definition for each spider and follow it with one or more Disallow statements.
For instance:

    User-agent: Slurp
    Disallow: /

will disallow all access by any Inktomi-based search spider. Search Engine Watch maintains a webmasters' SpiderSpotting Chart that lists the most common spiders and their agent names.

Keep in mind, though, that rogue spiders will simply ignore your ROBOTS.TXT file -- which means that it will deny access only to useful spiders (i.e., those from search engines) while doing nothing to stop the parasitic ones.

Then why should you care about creating an appropriate ROBOTS.TXT file? Because a properly crafted ROBOTS.TXT file will enable you to reduce the load that legitimate robots place on your web server, thereby increasing the resources it has available to service requests from real, live human beings.

How? Well, to begin with, you should disallow all spider access to your CGI-BIN directory and to any and all other directories that contain programs, rather than pages. You should do likewise for all your users -- including those for whom you host virtual domains -- who have individual CGI-BIN directories. (Be careful to set the ownership and permissions of any ROBOTS.TXT files you create in your users' PUBLIC_HTML directories so that your users can modify them, if they so choose -- remember that one-size-fits-all solutions rarely do. Remember, too, that spiders look for ROBOTS.TXT only at the document tree root, so a copy in PUBLIC_HTML takes effect only where that directory serves as the root of the user's own virtual domain.)

The easy way to create a ROBOTS.TXT file is to download Rietta Software's RoboGen Visual Exclusion File Editor for Windows 9.x/NT. It's freeware (although you have to register before you can download it) and it includes a pick list of over 180 known User-agents that makes customizing your ROBOTS.TXT file a snap.

It'd also be a good idea to create a tutorial page for your users on how to use both ROBOTS.TXT and the ROBOTS HTML meta-tag. That meta-tag takes the form:

    <META NAME="ROBOTS" CONTENT="[argument]">

where [argument] can be any of NONE, ALL, INDEX, NOINDEX, FOLLOW or NOFOLLOW.
(NONE instructs spiders to ignore the page and all links it contains; INDEX allows indexing the page; FOLLOW allows following links; NOINDEX and NOFOLLOW impose the obvious restrictions; and ALL gives permission both to index the page and to follow its links.) The ROBOTS meta-tag can contain multiple comma-separated arguments, so:

    <META NAME="ROBOTS" CONTENT="INDEX, NOFOLLOW">

gives well-behaved spiders permission to index the page, but not to follow links it contains. The advantage of the ROBOTS meta-tag is, of course, that it gives you and your users a much finer degree of control over the behavior of mainstream spiders on your server -- but, again, rogues will just ignore it.

Creepy, Crawly, Creepy, Crawly...

Spiders don't look for FAVICON.ICO, but MSIE 5 does. In mid-April, Wired "discovered" that MSIE 5 requests FAVICON.ICO when a user bookmarks a site, even though Microsoft documented that on its Developer Network back on March 8. The problem is that, since webserver logs track both successful and unsuccessful file requests by the IP address of the requester, this new "feature" in MSIE 5 creates potential invasion of privacy issues for users with static IP address assignments.

For users who connect via modem, that's no big deal, since they usually get their addresses dynamically, via DHCP, and there's no persistent relationship between a particular user and a specific address. That model is changing fast, though, especially in urban areas, as DSL and cable modems -- both of which commonly come with static, or at least long-lived, IP address assignments -- proliferate like jackrabbits. Those folks can be tracked via requests for FAVICON.ICO and that does create a potentially major privacy problem -- a particularly ironic one, in light of Microsoft COO Bob Herbold's keynote address at PC Expo '99 trumpeting both the Redmond technopoly's own privacy policy and its commitment to -- eventually -- require sites on which it advertises to post privacy policies of their own.
But wait, because, in the words of the immortal Ron Popeil, "there's MORE!" Not only does MSIE 5's automatic request for FAVICON.ICO -- which can't be disabled -- create potential privacy issues, it can also cause the browser to crash. In April, Flavio Veloso discovered the bug and twice reported it to Microsoft -- which, so far, has simply ignored the problem. All it takes is a corrupted FAVICON.ICO file or one that's simply not a Windows icon file -- a text file, for instance. When MSIE 5 retrieves a bad FAVICON.ICO file, it dies in a pile. That's not good. And Microsoft's approach to the problem appears to consist of sticking its corporate head in the sand and hoping that it goes away of its own accord.

If you're interested in developing spiders of your own -- or you just want to keep tabs on the field -- Multimedia Computing Corporation hosts a robots mailing list. Send mail to listserv@mccmedia.com with "subscribe robots [your name]" in the body, followed by a blank line. (I know you know better, but let your users know that they shouldn't include the quotation marks -- they confuse MCC's listserv -- or a signature, because the listserv will try to interpret it as additional instructions and return a flood of error messages.) There's also an archive of a previous incarnation of the list at its former home at WebCrawler.
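
If you do decide to write a spider of your own, here's a minimal sketch, in Python and using only its standard library, of how a well-behaved one might honor both ROBOTS.TXT and the ROBOTS meta-tag described above. The sample file contents, the "ExampleBot" agent name and the helper names may_fetch and meta_permissions are my own inventions for illustration. Treating a page that has no ROBOTS meta-tag as fully open is likewise an assumption on my part.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser

# Hypothetical ROBOTS.TXT contents, invented for this example.
SAMPLE_ROBOTS_TXT = """\
# Keep every spider out of the program directory
User-agent: *
Disallow: /cgi-bin/

# Shut out Inktomi-based spiders entirely
User-agent: Slurp
Disallow: /
"""


def may_fetch(robots_txt: str, agent: str, path: str) -> bool:
    """Return True if robots_txt lets the named agent request path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)


class _RobotsMetaParser(HTMLParser):
    """Collects the arguments of every ROBOTS meta-tag in a page."""

    def __init__(self) -> None:
        super().__init__()
        self.arguments = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser hands us tag and attribute names already lowercased.
        attributes = dict(attrs)
        if tag == "meta" and (attributes.get("name") or "").upper() == "ROBOTS":
            content = attributes.get("content") or ""
            self.arguments.update(
                part.strip().upper() for part in content.split(",") if part.strip()
            )


def meta_permissions(html: str):
    """Return the pair (may_index, may_follow) for a page's HTML.

    Assumption: a page with no ROBOTS meta-tag is treated like ALL.
    """
    parser = _RobotsMetaParser()
    parser.feed(html)
    may_index = not parser.arguments & {"NOINDEX", "NONE"}
    may_follow = not parser.arguments & {"NOFOLLOW", "NONE"}
    return may_index, may_follow


if __name__ == "__main__":
    print(may_fetch(SAMPLE_ROBOTS_TXT, "Slurp", "/index.html"))
    print(may_fetch(SAMPLE_ROBOTS_TXT, "ExampleBot", "/index.html"))
    print(may_fetch(SAMPLE_ROBOTS_TXT, "ExampleBot", "/cgi-bin/search.pl"))
    print(meta_permissions('<META NAME="ROBOTS" CONTENT="INDEX, NOFOLLOW">'))
```

With the sample file, the Slurp spider is turned away everywhere, while the made-up ExampleBot may fetch ordinary pages but not anything under /cgi-bin/. The last line reports that a page tagged INDEX, NOFOLLOW may be indexed but its links may not be followed. A real spider would fetch the live ROBOTS.TXT from the root of each host it visits rather than use a canned string, and neither check does anything about the rogues.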

As for me, I now make it a point to shake out my clothing and footwear before I put them on. After all, there might be spiders crawling in them.