Thread: Help, how do I stop Looksmart from spidering my site?

  1. #1
    Registered User

    Status
    Offline
    Join Date
    Aug 2003
    Location
    HUDDERSFIELD, UK
    Posts
    750
    Thanks
    0
    Thanked 0 Times in 0 Posts


    Hi,

    Just checked stats for one of my sites to find that
    sv-crawl . looksmart . com and cougar . dnsmaster . net
    have spidered my site and used bandwidth to the tune of
    3GB in 2 days!!!

    What do I have to put in my robots.txt file to ban these
    critters from my site?

    Help, please!

    (My overage starts at 2GB for the month!)

    Steve

  2. #2
    mxp
    Registered User

    Status
    Offline
    Join Date
    Dec 2004
    Posts
    34
    Thanks
    0
    Thanked 0 Times in 0 Posts
    You could try (if you haven't already) denying access to all spiders and then overriding that rule for the spiders you do want to access your site.

    Something like

    User-agent: *
    Disallow: /

    User-agent: Googlebot
    Disallow:

    This would allow Googlebot to spider your whole site and should, in theory, keep all other spiders out. (Of course, that depends on whether the spider obeys robots.txt or not.)
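
    If you want to sanity-check rules like these before uploading them, here is a minimal sketch using Python's standard urllib.robotparser module. The crawler name "ExampleCrawler" and the path "/index.html" are made up for illustration, and this only shows how a well-behaved parser reads the file, not what any particular spider will actually do.

```python
from urllib.robotparser import RobotFileParser

# The same rules as in the post above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /

User-agent: Googlebot
Disallow:
"""


def allowed(user_agent, path="/index.html"):
    """Return True if these rules let user_agent fetch path."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, path)


if __name__ == "__main__":
    # "ExampleCrawler" is a hypothetical bot name, not a real spider.
    for name in ("Googlebot", "ExampleCrawler"):
        print(name, "allowed" if allowed(name) else "blocked")
```

    An empty Disallow: line means "nothing is disallowed", which is why Googlebot gets through while everything else falls under the catch-all group.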

    hope this helps
    mxp

  3. #3
    Paul Wright
    Fishboy

    Status
    Offline
    Join Date
    Jan 2005
    Location
    London
    Posts
    1,735
    Thanks
    32
    Thanked 20 Times in 14 Posts
    I'm not 100% sure on this, but I don't think LookSmart sends a user agent (UA).

    If you're on an Apache server, you can also add this to your .htaccess file, which bans all visitors that don't send a user agent.

    SetEnvIf User-Agent ^$ keep_out
    order allow,deny
    allow from all
    deny from env=keep_out
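
    To make it clearer what that ^$ pattern catches, here is a rough Python sketch of the same test. It assumes a missing User-Agent header is treated the same as an empty one (that is my understanding of how SetEnvIf behaves); the function name and the sample strings are made up for illustration.

```python
import re

# Same pattern as the SetEnvIf line: start of string followed
# immediately by end of string, i.e. an empty value.
EMPTY_UA = re.compile(r"^$")


def is_kept_out(user_agent):
    """Return True if a request with this User-Agent would be denied."""
    # Assumption: a missing header (None here) counts as empty.
    value = user_agent or ""
    return EMPTY_UA.search(value) is not None


if __name__ == "__main__":
    # Sample values are hypothetical.
    for ua in (None, "", "Mozilla/4.0 (compatible; MSIE 6.0)", "ExampleCrawler/1.0"):
        print(repr(ua), "denied" if is_kept_out(ua) else "allowed")
```

    So the rule only stops clients that send no user agent at all. Any spider that identifies itself, even with a made-up name, still gets in.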

    Have you contacted your host for advice about this? They should be able to help, as I'm sure you're not the first.
    Agency Services Director | e: paul.wright@tradedoubler.com | t: 0207 798 5825


  4. #4
    Registered User

    Status
    Offline
    Join Date
    Aug 2003
    Location
    HUDDERSFIELD, UK
    Posts
    750
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Hi,

    I do want to disallow (if that will work) and tried
    User-agent: MantraAgent

    and it seems to have worked.

    I will also contact Skymarket directly - tis costing
    me a bomb, right now!

    Cheers, guys!

    Steve

  5. #5
    Registered User

    Status
    Offline
    Join Date
    Aug 2003
    Location
    HUDDERSFIELD, UK
    Posts
    750
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Hi

    Paul, or anyone who knows, what's the worst that could happen
    if I do what you suggest, namely: put
    SetEnvIf User-Agent ^$ keep_out
    order allow,deny
    allow from all
    deny from env=keep_out
    in my .htaccess file.

    My bandwidth is too high again, and I think it's because of
    spiders not obeying the robots.txt file I've got.

    This whole spider business is starting to become a big problem,
    one where I could do with getting to grips with how it all
    works (or maybe I should just get a dedicated host and
    not bother about high bandwidth usage).

    Cheers,
    Steve

  6. #6
    Registered User

    Status
    Offline
    Join Date
    Jun 2005
    Posts
    971
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Try a big mallet, they squish 'em good... Failing that, use Google to find a bot list, then add it to your robots.txt as well as your .htaccess file. That should stop them, but robots do not have to listen to what you've said, so they can still go to all your links if they choose to.
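
    If the bot list gets long, typing the robots.txt entries by hand gets tedious. A small sketch of generating them with Python: the function name is made up, "MantraAgent" is the name mentioned earlier in this thread, and "ExampleCrawler" is just a placeholder.

```python
def build_robots_txt(bot_names):
    """Build one 'block everything' robots.txt group per bot name."""
    groups = []
    for name in bot_names:
        groups.append(f"User-agent: {name}\nDisallow: /\n")
    # A blank line separates the groups.
    return "\n".join(groups)


if __name__ == "__main__":
    # "ExampleCrawler" is a hypothetical name; swap in your own list.
    print(build_robots_txt(["MantraAgent", "ExampleCrawler"]))
```

    As said above, this only helps with spiders that actually read and obey robots.txt.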

  7. #7
    Registered User

    Status
    Offline
    Join Date
    Jul 2005
    Location
    North Devon
    Posts
    899
    Thanks
    13
    Thanked 21 Times in 21 Posts
    You could also drop a small bit of code into the top of your page.
    Code:
    <%
    ' Lower-cased user agent, declared at page level so that both
    ' the function and the check further down can read it
    Dim agent
    agent = LCase(Request.ServerVariables("HTTP_USER_AGENT"))

    Function isSpider()
    	' Well, normally the browser isn't a spider...
    	isSpider = 0
    	' Only here to force the isSpider behaviour for testing
    	' purposes (add ?spider=1 to the URL)
    	If Request("spider") = "1" Then isSpider = 1
    	' Most bots refer to themselves as libwww, java, perl,
    	' crawl or bot, so flag any agent containing one of those
    	If InStr(agent, "bot") > 0 Then isSpider = 1
    	If InStr(agent, "perl") > 0 Then isSpider = 1
    	If InStr(agent, "java") > 0 Then isSpider = 1
    	If InStr(agent, "libw") > 0 Then isSpider = 1
    	If InStr(agent, "crawl") > 0 Then isSpider = 1
    End Function

    If isSpider() = 1 Then
    	' Replace the placeholder with the bot's full user agent,
    	' in lower case, because agent has been lower-cased above
    	If agent = "what ever the bot is called" Then
    		' redirect away
    		Response.Redirect("http://www.google.com") ' should keep it busy
    	End If
    End If
    %>

