Hi all,
I have been an affiliate for a long time, and I always got frustrated when an advertiser I wanted to promote did not supply a product catalog in a format that was convenient to load into a web site automatically.
To deal with the issue, I started developing web crawlers to do the job for me. When an advertiser had no product catalog, I created a dedicated crawler for their web site to collect all the data I wanted to post on my site.
After creating numerous site-dedicated crawlers, I decided to take the time to create a dynamic crawler that will (hopefully...) deal with any new or existing advertiser web site I want to crawl and publish.
For those of you who have run into the same problem and are looking for a solution, I made the project open source and published it on Google Code.
Link: regexspider - Project Hosting on Google Code
On the project page you can find a download package of an Alpha version that demonstrates the concept of crawling and extracting information based on regular expressions.
Please remember, this is a very early stage of development and there are a lot of features yet to be implemented.
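To give a rough idea of what extracting information based on regular expressions can look like, here is a minimal Python sketch. It is not taken from the regexspider code; the sample HTML, the pattern, and the function name are all made up for illustration, and a real advertiser page would need its own pattern.

```python
import re

# Hypothetical sample of an advertiser's product listing page.
SAMPLE_HTML = """
<div class="product"><h2>Red Mug</h2><span class="price">$7.50</span></div>
<div class="product"><h2>Blue Plate</h2><span class="price">$12.00</span></div>
"""

# One pattern per site (an assumption of this sketch): named groups mark
# the fields we want to pull out of each product block.
PRODUCT_PATTERN = re.compile(
    r'<div class="product">\s*<h2>(?P<name>[^<]+)</h2>\s*'
    r'<span class="price">\$(?P<price>[\d.]+)</span>',
    re.IGNORECASE,
)


def extract_products(html, pattern=PRODUCT_PATTERN):
    """Return one dict per match, keyed by the pattern's named groups."""
    return [match.groupdict() for match in pattern.finditer(html)]


products = extract_products(SAMPLE_HTML)
for product in products:
    print(product["name"], product["price"])
```

With this approach, supporting a new advertiser would mostly mean writing a new pattern rather than a new crawler, which is the general idea behind a dynamic, regex-driven crawler.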
I'd love to get some feedback and ideas that will take this project to the next level, serving us best in our data extraction quest.
Also, developers who are interested in helping out with this project are more than welcome to contact me.
Thanks!