Web Crawling: How does a crawler index dynamic content?

No.114/4, Ambagahawaththa, Madapatha, Piliyandala

info@ceyloncreations.com

+94 70 25 68 356 | +94 71 61 62 111 | +94 75 40 68 377

Anonymous asked 2 years ago

Web Crawling: How does a crawler index dynamic content?

1 Answer

Editor answered 2 years ago

It’s not necessarily about forms, as other responders suggest. I think the approach should be almost the same as with other URLs (crawl a seed page or get a site map, extract links, follow them, repeat), with two exceptions:
1. Special logic for identifying the canonical URL. Some poorly written websites include randomly generated query parameters or session IDs, which need to be filtered out. This is done by a URL classifier that splits the URL into parts and decides which of them are important and which can be dropped without changing the content. Such a classifier is based on manually written heuristics (covering typical cases like page numbers) and a content fingerprint.
2. Prioritization based on the detected website engine. I think modern search engines detect the URL schemes of WordPress, phpBB, vBulletin, Joomla, etc. For example, it doesn’t make much sense to crawl deep into a long forum thread before crawling the first pages of the majority of the threads.
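
The canonical-URL idea above could be sketched roughly like this in Python. This is only an illustration of "heuristics plus content fingerprint", not how any real search engine does it. The names ALWAYS_DROP, ALWAYS_KEEP, fingerprint, with_query and canonicalize are all made up for the example, and fetch is assumed to be any function you supply that returns the page body for a URL.

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical heuristic lists; a real classifier would tune or learn these.
ALWAYS_DROP = {"sid", "sessionid", "phpsessid", "utm_source", "utm_medium"}
ALWAYS_KEEP = {"page", "p", "id"}


def fingerprint(body: str) -> str:
    """Hash the whitespace-normalized body as a crude content fingerprint."""
    return hashlib.sha256(" ".join(body.split()).encode("utf-8")).hexdigest()


def with_query(url: str, params) -> str:
    """Rebuild the URL with the given (name, value) pairs and no fragment."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(params), ""))


def canonicalize(url: str, fetch) -> str:
    """Drop query parameters whose removal leaves the fingerprint unchanged.

    `fetch` is an assumed callable: URL in, page body (str) out.
    """
    # Step 1: heuristics. Throw away parameters that are known noise.
    params = [
        (k, v)
        for k, v in parse_qsl(urlsplit(url).query, keep_blank_values=True)
        if k.lower() not in ALWAYS_DROP
    ]
    baseline = fingerprint(fetch(with_query(url, params)))

    # Step 2: fingerprint test. Try removing each remaining parameter in turn.
    kept = list(params)
    for pair in params:
        if pair[0].lower() in ALWAYS_KEEP:
            continue  # heuristic says this one matters (e.g. page numbers)
        trial = [p for p in kept if p != pair]
        if fingerprint(fetch(with_query(url, trial))) == baseline:
            kept = trial  # content did not change, so the parameter is noise

    # Sort so that equivalent URLs map to one canonical string.
    return with_query(url, sorted(kept))
```

With a fetch function whose output depends only on t and page, a URL like http://forum.example/view.php?t=42&sid=abc123&rnd=9981&page=2 would collapse to http://forum.example/view.php?page=2&t=42. In practice an exact hash is too strict for pages with ads or timestamps, so a near-duplicate fingerprint would be a better fit; the exact hash just keeps the sketch short.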

Google, and probably other search engines, use an iterative probing approach to identify candidate keywords for text boxes in forms.
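
As I understand it, iterative probing means: start from a few seed words, submit them through the form, harvest new words from whatever comes back, and repeat, keeping the keywords that surface content you have not seen yet. Below is a minimal Python sketch under that assumption. It is not Google's actual algorithm. The names tokens and iterative_probe are invented for the example, and submit is assumed to be a function you provide that sends one keyword through the form and returns the result pages as strings.

```python
import re
from collections import Counter


def tokens(text: str):
    """Lowercase words of three or more letters."""
    return re.findall(r"[a-z]{3,}", text.lower())


def iterative_probe(seed_text: str, submit, rounds: int = 3, per_round: int = 5):
    """Return the keywords that surfaced result pages not seen before.

    `submit` is an assumed callable: keyword in, list of result pages out.
    """
    # Seed candidates: the most frequent words on the page holding the form.
    candidates = [w for w, _ in Counter(tokens(seed_text)).most_common(per_round)]
    tried, useful, seen_pages = set(), [], set()

    for _ in range(rounds):
        harvested = Counter()
        for word in candidates:
            if word in tried:
                continue
            tried.add(word)
            fresh = [p for p in submit(word) if p not in seen_pages]
            if fresh:
                useful.append(word)  # this keyword uncovered new content
            for page in fresh:
                seen_pages.add(page)
                harvested.update(tokens(page))  # mine results for new words
        # Next round: the most frequent untried words from the new pages.
        candidates = [w for w, _ in harvested.most_common() if w not in tried]
        candidates = candidates[:per_round]
        if not candidates:
            break
    return useful
```

The rounds and per_round limits are there because each probe is a real request to the site, so the number of submissions has to stay bounded.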

Keep in mind that large-scale search engines have access to external sources of URLs: browsing history logs, URL shortener archives, and social networks. This data can be used for efficient crawling and discovery of dynamic content.
