E0688032514 - International Journal of Emerging Science and Engineering (IJESE)

Robots Exclusion Protocol
Pooja Jha¹, Soni Goyal², Tanya Kumari³, Neha Gupta⁴
¹Pooja Jha, Department of Information Technology, Bharati Vidyapeeth’s College Of Engineering, New Delhi, India.
²Soni Goyal, Department of Information Technology, Bharati Vidyapeeth’s College Of Engineering, New Delhi, India.
³Tanya Kumari, Department of Information Technology, Bharati Vidyapeeth’s College Of Engineering, New Delhi, India.
⁴Ms. Neha Gupta, Asst. Professor, Department of Information Technology, Bharati Vidyapeeth’s College Of Engineering, New Delhi, India
Manuscript received on March 11, 2014. | Revised Manuscript received on March 15, 2014. | Manuscript published on March 25, 2014. | PP:52-55 | Volume-2 Issue-5, March 2014. | Retrieval Number: E0688032514/2014©BEIESP
© The Authors. Published By: Blue Eyes Intelligence Engineering and Sciences Publication (BEIESP). This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/)

Abstract: The World Wide Web (WWW) is a large, dynamic network and a repository of interconnected documents and other resources, linked by hyperlinks and URLs. Web crawlers recursively traverse and download web pages so that search engines can create and maintain their web indices. Because those indices must be kept up to date, crawlers traverse websites repeatedly. This repeated traversal can overload resources such as CPU cycles, disk space, and network bandwidth, which increases web traffic and may crash the website. Websites can, however, limit crawlers through the Robots Exclusion Protocol, a mechanism by which WWW servers indicate to crawlers which parts of the server should not be accessed. To implement the protocol, a plain text file called robots.txt is created and placed in the root directory of the web server. This approach was chosen because a crawler can find the access policy with a single document retrieval. The file also supports auto-discovery of XML sitemaps. The protocol thus aids in controlling crawler activity.
Keywords: Robots Exclusion Protocol, robots.txt, Robots Meta tags, web crawler
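
The mechanism described in the abstract can be illustrated with a minimal sketch, assuming Python's standard urllib.robotparser module (site_maps() needs Python 3.8 or later). The robots.txt contents, the example.com host, the paths, and the "ExampleBot" user agent are hypothetical and are not taken from the paper. The policy is parsed from a string so that the sketch runs without network access; a real crawler would fetch /robots.txt from the server root first.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents (illustrative only, not from the paper).
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Sitemap: https://example.com/sitemap.xml
"""


def load_policy(text: str) -> RobotFileParser:
    """Parse robots.txt text into a policy object a crawler can query."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


policy = load_policy(ROBOTS_TXT)

# A well-behaved crawler checks the policy before requesting each URL.
allowed_public = policy.can_fetch("ExampleBot", "https://example.com/index.html")
allowed_private = policy.can_fetch("ExampleBot", "https://example.com/private/data.html")

# Sitemap lines in the same file support auto-discovery of XML sitemaps.
sitemaps = policy.site_maps()

print(allowed_public)   # True: no rule matches this path
print(allowed_private)  # False: the path falls under Disallow: /private/
print(sitemaps)         # ['https://example.com/sitemap.xml']
```

Compliance is voluntary on the crawler's side: the file only states the policy, and nothing in this sketch enforces it on the server.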
