1. #1
    smn2 (Registered User)
    Join Date
    Aug 2003
    Location
    HUDDERSFIELD, UK
    Posts
    750
    Thanks
    0
    Thanked 0 Times in 0 Posts

    Help, how do I stop Looksmart from spidering my site?

    Hi,

    Just checked stats for one of my sites to find that
    sv-crawl . looksmart . com and cougar . dnsmaster . net
    have spidered my site and used bandwidth to the tune of
    3GB in 2 days!!!

    What do I have to put in my robots.txt file to ban these
    critters from my site?

    Help, please!

    (My overage starts at 2GB for the month!)

    Steve

  2. #2
    mxp (Registered User)
    Join Date
    Dec 2004
    Posts
    34
    Thanks
    0
    Thanked 0 Times in 0 Posts
    You could try (if you haven't already) denying access to all spiders and then overriding that rule for the spiders you do want to access your site.

    Something like

    User-agent: *
    Disallow: /

    User-agent: Googlebot
    Disallow:

    would allow Googlebot to spider your whole site and should, in theory, keep all other spiders out (of course, that depends on whether the spider obeys robots.txt or not).
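
One way to sanity-check rules like these before uploading them is Python's standard-library robots.txt parser. This is only a minimal sketch: `is_allowed` and the example.com URL are made-up names for illustration, and it only shows how a well-behaved parser reads the rules, not what any particular spider will actually do.

```python
from urllib.robotparser import RobotFileParser

# The rules from the post above, unchanged.
RULES = """\
User-agent: *
Disallow: /

User-agent: Googlebot
Disallow:
"""


def is_allowed(user_agent, url, rules=RULES):
    """Return True if `rules` let `user_agent` fetch `url` (hypothetical helper)."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # example.com is a placeholder; substitute your own site.
    for agent in ("Googlebot", "MantraAgent"):
        print(agent, is_allowed(agent, "http://www.example.com/index.html"))
```

A spider that ignores robots.txt is unaffected by any of this, which is why the server-side options further down the thread exist.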

    hope this helps
    mxp

  3. #3
    Paul Wright (Fishboy)
    Join Date
    Jan 2005
    Location
    London
    Posts
    1,716
    Thanks
    28
    Thanked 17 Times in 12 Posts
    I'm not 100% sure on this, but I don't think LookSmart sends a UA (User-Agent).

    If you're on an Apache server, you can also add this to your .htaccess file, which blocks every request that arrives without a User-Agent.

    SetEnvIf User-Agent ^$ keep_out
    order allow,deny
    allow from all
    deny from env=keep_out
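
For anyone not on Apache, the same idea (refuse requests whose User-Agent is missing or empty) can be sketched as a small WSGI wrapper in Python. This is an illustration under assumptions, not a drop-in replacement: `block_empty_user_agent` and `hello_app` are hypothetical names, and the 403 body is arbitrary.

```python
def block_empty_user_agent(app):
    """Wrap a WSGI app so requests with no User-Agent get a 403."""
    def middleware(environ, start_response):
        # WSGI exposes the User-Agent header as HTTP_USER_AGENT;
        # a missing header and an empty one are treated the same here.
        if not environ.get("HTTP_USER_AGENT", ""):
            body = b"Forbidden"
            start_response("403 Forbidden", [
                ("Content-Type", "text/plain"),
                ("Content-Length", str(len(body))),
            ])
            return [body]
        return app(environ, start_response)
    return middleware


def hello_app(environ, start_response):
    """Stand-in for the real site (hypothetical)."""
    body = b"Hello"
    start_response("200 OK", [
        ("Content-Type", "text/plain"),
        ("Content-Length", str(len(body))),
    ])
    return [body]


application = block_empty_user_agent(hello_app)
```

Either way, the block applies to any client that sends no User-Agent, spider or not, so it is worth checking your logs for such requests before switching it on.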

    Have you contacted your host for advice about this? They should be able to help, as I'm sure you're not the first.

  4. #4
    smn2 (Registered User)
    Hi,

    I do want to disallow (if that will work) and tried
    User-agent: MantraAgent

    and it seems to have worked.

    I will also contact Skymarket directly - tis costing
    me a bomb, right now!

    Cheers, guys!

    Steve

  5. #5
    smn2 (Registered User)
    Hi

    Paul, or anyone who knows, what's the worst that could happen
    if I do what you suggest, namely: put
    SetEnvIf User-Agent ^$ keep_out
    order allow,deny
    allow from all
    deny from env=keep_out
    in my .htaccess file.

    My bandwidth is too high again, and I think it's because of
    spiders not obeying the robots.txt file I've got.

    This whole spider business is starting to become a big problem,
    one that I could do with getting to grips with (or maybe
    I should just get a dedicated host and not bother about
    high bandwidth usage).
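
A quick way to confirm which spiders are eating the bandwidth is to total the bytes served per User-Agent from the raw access log. The sketch below assumes the log is in Apache's "combined" format (an assumption, check with your host); `bytes_by_agent` is a made-up helper name and the sample lines are invented purely to show the shape of the data.

```python
import re
from collections import Counter

# host ident user [time] "request" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[[^\]]+\] "[^"]*" \d{3} '
    r'(?P<bytes>\d+|-) "[^"]*" "(?P<agent>[^"]*)"'
)


def bytes_by_agent(lines):
    """Sum response sizes per User-Agent; lines that do not parse are skipped."""
    totals = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if not match:
            continue
        size = match.group("bytes")
        agent = match.group("agent") or "(empty)"
        # A size of "-" means no body was sent (e.g. a 304 response).
        totals[agent] += 0 if size == "-" else int(size)
    return totals


if __name__ == "__main__":
    # Invented sample lines, for illustration only.
    sample = [
        '192.0.2.1 - - [10/Oct/2005:13:55:36 +0000] "GET /a.html HTTP/1.0" 200 2326 "-" "MantraAgent"',
        '192.0.2.1 - - [10/Oct/2005:13:55:37 +0000] "GET /b.html HTTP/1.0" 200 1000 "-" "MantraAgent"',
        '192.0.2.2 - - [10/Oct/2005:13:56:00 +0000] "GET / HTTP/1.1" 304 - "-" "Mozilla/4.0"',
    ]
    for agent, total in bytes_by_agent(sample).most_common():
        print(total, agent)
```

To run it against a real log, pass an open file instead of the sample list, for example `bytes_by_agent(open("access.log"))`, where the file name is a placeholder. Log lines containing escaped quotes will not match this simple pattern and are skipped.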

    Cheers,
    Steve

  6. #6
    Itchy (Registered User)
    Join Date
    Jun 2005
    Posts
    972
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Try a big mallet, they squish 'em good... Failing that, use Google to find a bot list, then add it to your robots.txt as well as your .htaccess file; that should stop 'em. But robots don't have to listen to what you've said, so they can still go to all links etc. if they choose to.

  7. #7
    futureweb (Registered User)
    Join Date
    Jul 2005
    Location
    North Devon
    Posts
    862
    Thanks
    11
    Thanked 18 Times in 18 Posts
    You could also drop a small bit of code into the top of your page (classic ASP):
    Code:
    <%
    ' Declared at script level so both isSpider() and the check
    ' below can see it. Lower-cased for comparison.
    Dim agent
    agent = LCase(Request.ServerVariables("HTTP_USER_AGENT"))
    
    Function isSpider()
    	' Well, normally the browser isn't a spider...
    	isSpider = 0
    	' No other meaning than forcing the isSpider behaviour
    	' for testing purposes (request the page with spider=1)
    	If Request("spider") = "1" Then isSpider = 1
    	' Most bots refer to themselves as libwww, java, perl,
    	' crawl or bot, so flag any agent containing one of those
    	If InStr(agent, "bot") > 0 Then isSpider = 1
    	If InStr(agent, "perl") > 0 Then isSpider = 1
    	If InStr(agent, "java") > 0 Then isSpider = 1
    	If InStr(agent, "libw") > 0 Then isSpider = 1
    	If InStr(agent, "crawl") > 0 Then isSpider = 1
    End Function
    
    If isSpider() = 1 Then
    	' agent is lower case, so keep the name below in lower case too
    	If agent = "what ever the bot is called" Then
    		' redirect away
    		Response.Redirect("http://www.google.com") ' should keep it busy
    	End If
    End If
    %>
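
For readers not on ASP, the substring check in isSpider can be sketched in Python as well. This is only an illustration of the same heuristic: `is_spider` and `SPIDER_MARKERS` are made-up names, and the marker list is simply the one from the post above.

```python
# Substrings taken from the ASP snippet above.
SPIDER_MARKERS = ("bot", "perl", "java", "libw", "crawl")


def is_spider(user_agent):
    """Guess whether a User-Agent string belongs to a spider (heuristic only)."""
    agent = (user_agent or "").lower()
    return any(marker in agent for marker in SPIDER_MARKERS)


if __name__ == "__main__":
    for ua in (
        "Googlebot/2.1 (+http://www.google.com/bot.html)",
        "libwww-perl/5.803",
        "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)",
    ):
        print(is_spider(ua), ua)
```

Like the original, this only catches spiders that identify themselves; one that sends a browser-like User-Agent, or none at all, passes straight through.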
