Anonymous asked 10 months ago

Web Crawling: How does a crawler index dynamic content?

1 Answer
Editor answered 10 months ago

It’s not necessarily about forms, as other responders see it. I think the approach should be almost the same as with other URLs (crawl a seed page or get a site map, extract the links, follow them, repeat), except for two things:
1. Special logic for identifying the canonical URL. Some poorly written websites include randomly generated query parameters or session IDs, which need to be filtered out. This is done by a URL classifier, which splits the URL into parts and decides which of them are important and which can be dropped without changing the content. Such a classifier is based on manually written heuristics (covering typical cases like page numbers) and on content fingerprints.
2. Prioritization based on the detected website engine. I think modern search engines detect the URL schemes of WordPress, phpBB, vBulletin, Joomla, etc. For example, it doesn’t make much sense to crawl deep inside a long forum thread before crawling the first pages of the majority of threads.
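
To make the canonical URL idea more concrete, here is a rough sketch in Python. It is only an illustration under my own assumptions, not how any real search engine does it: the parameter lists NOISE_PARAMS and KEEP_PARAMS, the fingerprint function and the fetch callable are all hypothetical. The idea is to drop known junk parameters by heuristic, then remove each remaining parameter in turn and check whether the content fingerprint stays the same.

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical heuristic lists; a real classifier would be far richer.
NOISE_PARAMS = {"sid", "sessionid", "phpsessid", "utm_source", "utm_medium"}
KEEP_PARAMS = {"page", "p", "id"}  # e.g. page numbers are always significant


def fingerprint(body):
    """Hash of whitespace-normalized text; a crude stand-in for a real fingerprint."""
    return hashlib.sha256(" ".join(body.split()).encode("utf-8")).hexdigest()


def canonicalize(url, fetch):
    """Return url without the query parameters that do not affect the content.

    fetch is an assumed callable taking a URL and returning the page body as str.
    """
    parts = urlsplit(url)
    params = parse_qsl(parts.query, keep_blank_values=True)
    # Step 1: manual heuristics remove well-known session/tracking parameters.
    params = [(k, v) for k, v in params if k.lower() not in NOISE_PARAMS]

    def build(pairs):
        # Sort parameters so equivalent URLs get one spelling; drop the fragment.
        query = urlencode(sorted(pairs))
        return urlunsplit((parts.scheme, parts.netloc, parts.path, query, ""))

    # Step 2: fingerprint probing for the parameters the heuristics do not cover.
    baseline = fingerprint(fetch(build(params)))
    kept = list(params)
    for pair in params:
        if pair[0].lower() in KEEP_PARAMS:
            continue
        trial = [p for p in kept if p != pair]
        if fingerprint(fetch(build(trial))) == baseline:
            kept = trial  # content unchanged, so the parameter can be truncated
    return build(kept)
```

Note that this costs one extra request per unknown parameter, so in practice you would learn the result once per URL pattern on a site rather than repeat it for every URL. Pages with ads or timestamps would also need a fuzzier fingerprint than an exact hash.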

Google, and probably other search engines, use an iterative probing approach to identify candidate keywords for text boxes in forms.
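
A minimal sketch of what iterative probing could look like, in Python. This is my own simplified reading of the idea, not Google's actual algorithm: the submit callable, the tokenize helper and the rounds / per_round limits are all assumptions. You start with words taken from the form page itself, submit them, and harvest new candidate words from any result pages you have not seen before.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase words of 4+ letters; a crude keyword extractor."""
    return re.findall(r"[a-z]{4,}", text.lower())


def iterative_probe(seed_text, submit, rounds=3, per_round=5):
    """Find keywords that make a form's text box return distinct results.

    submit is an assumed callable: keyword -> text of the result page
    (empty string when the search finds nothing).
    """
    tried = set()
    useful = []
    seen_results = set()
    candidates = Counter(tokenize(seed_text))
    for _ in range(rounds):
        # Probe the most frequent candidate words that have not been tried yet.
        batch = [w for w, _ in candidates.most_common() if w not in tried]
        batch = batch[:per_round]
        if not batch:
            break
        for word in batch:
            tried.add(word)
            result = submit(word)
            if not result or result in seen_results:
                continue  # nothing new behind this keyword
            seen_results.add(result)
            useful.append(word)
            # Words on the result page become candidates for the next round.
            candidates.update(tokenize(result))
    return useful
```

Each keyword in the returned list corresponds to a form submission URL worth adding to the crawl queue. A real implementation would compare result pages by fingerprint instead of exact text and cap the number of submissions per site.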

Keep in mind that large-scale search engines have access to external sources of URLs: browsing history logs, URL shortener archives, and social networks. This data can be used for efficient crawling and discovery of dynamic content.
