Robots Exclusion Protocol
Pooja Jha1, Soni Goyal2, Tanya Kumari3, Neha Gupta4
1Pooja Jha, Department of Information Technology, Bharati Vidyapeeth’s College Of Engineering, New Delhi, India.
2Soni Goyal, Department of Information Technology, Bharati Vidyapeeth’s College Of Engineering, New Delhi, India.
3Tanya Kumari, Department of Information Technology, Bharati Vidyapeeth’s College Of Engineering, New Delhi, India.
4Ms. Neha Gupta, Asst. Professor, Department of Information Technology, Bharati Vidyapeeth’s College Of Engineering, New Delhi, India.
Manuscript received on March 11, 2014. | Revised Manuscript received on March 15, 2014. | Manuscript published on March 25, 2014. | PP:52-55 | Volume-2 Issue-5, March 2014. | Retrieval Number: E0688032514/2014©BEIESP
© The Authors. Published By: Blue Eyes Intelligence Engineering and Sciences Publication (BEIESP). This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/)
Abstract: The World Wide Web (WWW) is a large, dynamic network and a repository of interconnected documents and other resources, linked by hyperlinks and URLs. Web crawlers recursively traverse and download web pages so that search engines can create and maintain their web indices. Because indexed pages must be kept up to date, crawlers traverse websites repeatedly. As a result, resources such as CPU cycles, disk space, and network bandwidth can become overloaded, which may crash a website and increase web traffic. However, websites can limit crawlers through the Robots Exclusion Protocol, a mechanism by which WWW servers indicate to crawlers which parts of the server should not be accessed. To implement this protocol, a plain text file called robots.txt is created and placed in the root directory of the web server. This approach was chosen because a crawler can find the access policy with a single document retrieval. The file also supports auto-discovery of XML sitemaps. Thus, the protocol helps control crawler activity.
Keywords: Robots Exclusion Protocol, robots.txt, Robots Meta tags, web crawler
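The mechanism summarized in the abstract can be illustrated with a minimal sketch, assuming Python 3.8 or later and its standard `urllib.robotparser` module. The robots.txt content, the `www.example.com` host, the `ExampleBot` user-agent name, and the `load_policy` helper are all hypothetical and chosen only for illustration; they do not come from the paper. The sketch shows how a crawler might read the access policy from a single document and discover a sitemap from the same file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, as it might be served from the
# root of a site (e.g. https://www.example.com/robots.txt).
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Sitemap: https://www.example.com/sitemap.xml
"""


def load_policy(text):
    """Parse robots.txt text into a policy object (illustrative helper)."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed here.
    parser.parse(text.splitlines())
    return parser


policy = load_policy(ROBOTS_TXT)

# A well-behaved crawler checks each URL against the policy before fetching it.
allowed = policy.can_fetch("ExampleBot", "https://www.example.com/index.html")
blocked = policy.can_fetch("ExampleBot", "https://www.example.com/private/data.html")

# Sitemap lines in the same file provide auto-discovery of XML sitemaps.
sitemaps = policy.site_maps()

print(allowed, blocked, sitemaps)
```

In a real crawler, the file would typically be fetched once per host (for example with `RobotFileParser.set_url` followed by `read`) and the parsed policy reused for every URL on that host. Note that compliance is voluntary: the protocol only advises crawlers and does not enforce access control.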