-
04-05-05 #1
Registered User
- Join Date
- Aug 2003
- Location
- HUDDERSFIELD, UK
- Posts
- 750
- Thanks
- 0
- Thanked 0 Times in 0 Posts
Help, how do I stop Looksmart from spidering my site?
Hi,
Just checked stats for one of my sites to find that
sv-crawl . looksmart . com and cougar . dnsmaster . net
have spidered my site and used bandwidth to the tune of
3GB in 2 days!!!
What do I have to put in my robots.txt file to ban these
critters from my site?
Help, please!
(My overage starts at 2GB for the month!)
Steve
-
04-05-05 #2
Registered User
- Join Date
- Dec 2004
- Posts
- 34
- Thanks
- 0
- Thanked 0 Times in 0 Posts
You could try (if you haven't already) denying access to all spiders and then overriding that rule for the spiders you do want to access your site.
Something like
User-agent: *
Disallow: /

User-agent: Googlebot
Disallow:
would allow Googlebot to spider your whole site and should in theory keep all other spiders out (of course, that depends on whether the spiders obey robots.txt or not).
hope this helps
mxp
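For anyone wanting to sanity-check rules like these before uploading them, here is a minimal sketch using Python's standard urllib.robotparser. The page path and the name SomeOtherBot are only illustrative. It shows how a well-behaved parser reads the file; it says nothing about spiders that ignore robots.txt.

```python
from urllib.robotparser import RobotFileParser

# The rules suggested above: shut everyone out, then let Googlebot back in.
ROBOTS_TXT = """\
User-agent: *
Disallow: /

User-agent: Googlebot
Disallow:
"""


def allowed(user_agent, path, robots_txt=ROBOTS_TXT):
    """Return True if a compliant crawler with this name may fetch the path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)


if __name__ == "__main__":
    # "SomeOtherBot" is a made-up name standing in for any other spider.
    for agent in ("Googlebot", "SomeOtherBot"):
        print(agent, allowed(agent, "/index.html"))
```

An empty Disallow: value means "nothing is disallowed", which is why the Googlebot record opens the whole site back up for that one agent.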
-
05-05-05 #3
I'm not 100% sure on this, but I don't think LookSmart sends a UA (user agent).
If you're on an Apache server then you could also add this to your .htaccess file, which bans all requests that arrive without a user agent.
SetEnvIf User-Agent ^$ keep_out
order allow,deny
allow from all
deny from env=keep_out
Have you contacted your host for advice about this? They should be able to help, as I'm sure you're not the first.
Paul Wright | Affiliate Marketing Director | Mediaedge:cia
e: paul.wright@mecglobal.com | t: 0207 803 2976 | msn: paulwright@me.com
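To see which requests a rule like that would catch, here is a rough Python sketch of the same test. The pattern ^$ only matches an empty string, so it is aimed at requests whose User-Agent is blank. Treating a missing header the same as an empty one is an assumption on my part, and would_be_denied is a made-up helper name, not anything Apache provides.

```python
import re

# Same pattern as in the SetEnvIf line: matches only an empty value.
EMPTY_UA = re.compile(r"^$")


def would_be_denied(headers):
    """Return True if the request headers carry a blank User-Agent.

    Assumption: a header that is missing altogether is treated as empty.
    """
    user_agent = headers.get("User-Agent", "")
    return EMPTY_UA.search(user_agent) is not None


if __name__ == "__main__":
    print(would_be_denied({}))                             # no header at all
    print(would_be_denied({"User-Agent": ""}))             # blank header
    print(would_be_denied({"User-Agent": "Mozilla/5.0"}))  # normal browser
```

The flip side is that anything which does send a user agent string, including a spider that identifies itself, gets straight past this check.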
-
05-05-05 #4
Registered User
- Join Date
- Aug 2003
- Location
- HUDDERSFIELD, UK
- Posts
- 750
- Thanks
- 0
- Thanked 0 Times in 0 Posts
Hi,
I do want to disallow (if that will work) and tried
User-agent: MantraAgent
and it seems to have worked.
I will also contact Skymarket directly - tis costing
me a bomb, right now!
Cheers, guys!
Steve
-
22-10-05 #5
Registered User
- Join Date
- Aug 2003
- Location
- HUDDERSFIELD, UK
- Posts
- 750
- Thanks
- 0
- Thanked 0 Times in 0 Posts
Hi
Paul, or anyone who knows, what's the worst that could happen
if I do what you suggest, namely put this in my .htaccess file:
SetEnvIf User-Agent ^$ keep_out
order allow,deny
allow from all
deny from env=keep_out
My bandwidth is too high again, and I think it's because of
spiders not obeying the robots.txt file I've got.
This whole spider business is starting to become a big problem
- one that could do with me getting to grips with how it all
works (or maybe I should just get a dedicated host and
not bother about high bandwidth usage).
Cheers,
Steve
-
22-10-05 #6
Registered User
- Join Date
- Jun 2005
- Posts
- 972
- Thanks
- 0
- Thanked 0 Times in 0 Posts
Try a big mallet, they squish 'em good... Failing that, use Google to find a bot list, then add it to your robots.txt as well as your .htaccess file; that should stop 'em. But robots do not have to listen to what you've said, so they can still go to all links etc. if they choose to.
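If the bot list gets long, typing the records by hand gets tedious. A small sketch of generating them in Python follows; build_robots_txt is a hypothetical helper, and the names in the example list are placeholders (MantraAgent is the one mentioned earlier in the thread, AnotherBadBot is made up).

```python
def build_robots_txt(blocked_agents):
    """Build robots.txt text with one 'keep out' record per agent name."""
    records = [
        "User-agent: {}\nDisallow: /".format(agent)
        for agent in blocked_agents
    ]
    # Records are separated by a blank line.
    return "\n\n".join(records) + "\n"


if __name__ == "__main__":
    # Placeholder list; swap in whatever names turn up in your own logs.
    print(build_robots_txt(["MantraAgent", "AnotherBadBot"]), end="")
```

This only helps with spiders that read and obey robots.txt, as noted above.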
-
22-10-05 #7
Registered User
- Join Date
- Jul 2005
- Location
- North Devon
- Posts
- 862
- Thanks
- 11
- Thanked 18 Times in 18 Posts
you could also
drop a small bit of code into the top of your page
Code:
<%
' Declared at page level so the check below the function can read it
Dim agent

Function isSpider()
    ' Well, normally the browser isn't a spider...
    isSpider = 0

    ' Only here to force the isSpider behaviour for testing purposes
    If Request("spider") = "1" Then isSpider = 1

    ' Take the name of the UserAgent currently used and put it
    ' into lower case for comparison
    agent = LCase(Request.ServerVariables("HTTP_USER_AGENT"))

    ' Most bots refer to themselves as libwww, java, perl, crawl or bot,
    ' so any agent containing one of these is treated as a spider
    If InStr(agent, "bot") > 0 Then isSpider = 1
    If InStr(agent, "perl") > 0 Then isSpider = 1
    If InStr(agent, "java") > 0 Then isSpider = 1
    If InStr(agent, "libw") > 0 Then isSpider = 1
    If InStr(agent, "crawl") > 0 Then isSpider = 1
End Function

If isSpider() = 1 Then
    ' agent has been lower-cased, so the name to match must be lower case too
    If agent = "what ever the bot is called" Then
        ' redirect away; should keep it busy
        Response.Redirect("http://www.google.com")
    End If
End If
%>
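The substring check is easy to try out away from the server. Here is a rough Python port of the same heuristic, useful for running over user agent strings pulled from your logs. The function name is_spider and the sample strings are my own illustration; the markers are the ones from the ASP code.

```python
# Substrings the ASP code looks for in the lower-cased user agent.
SPIDER_MARKERS = ("bot", "perl", "java", "libw", "crawl")


def is_spider(user_agent):
    """Return True if the user agent contains any of the spider markers."""
    agent = user_agent.lower()
    return any(marker in agent for marker in SPIDER_MARKERS)


if __name__ == "__main__":
    # Sample strings for illustration only.
    for ua in ("Googlebot/2.1", "Java/1.4", "Mozilla/5.0 (Windows NT 10.0)"):
        print(ua, is_spider(ua))
```

Bear in mind this is a blunt check: any agent whose name happens to contain one of those strings is flagged, and a spider that sends a browser-like user agent is not.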