Program algorithm


When a parsing campaign is launched, the Queue is filled according to the Input data tab or Navigation->Starting URLs tab settings. The Queue can also be pre-filled with URLs from the Queue dump of a previous parsing run; similarly, the History can be pre-filled with URLs from a previous run's History dump. Datacol threads then work in parallel, processing URLs from the Queue. When a URL is taken from the Queue for processing, it is immediately added to the History. Each URL is processed according to the algorithm described below:
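The Queue/History interplay can be sketched as follows. This is an illustrative model only, not Datacol's actual implementation; all names (`seed`, `worker`, `process_url`) are hypothetical:

```python
import queue
import threading

url_queue = queue.Queue()   # URLs waiting to be processed
history = set()             # URLs that have already been taken for processing
history_lock = threading.Lock()

def seed(start_urls, previous_queue=(), previous_history=()):
    """Fill the Queue from starting URLs (and optional dumps of a previous run)."""
    history.update(previous_history)
    for url in list(start_urls) + list(previous_queue):
        url_queue.put(url)

def worker(process_url):
    """Take URLs from the Queue; each URL enters the History as soon as it is taken."""
    while True:
        try:
            url = url_queue.get(timeout=0.5)
        except queue.Empty:
            return  # Queue drained: this thread is done
        with history_lock:
            history.add(url)  # added to the History immediately on removal
        process_url(url)
        url_queue.task_done()

# Example run with two parallel worker threads
seed(["http://example.com/page1", "http://example.com/page2"])
processed = []
threads = [threading.Thread(target=worker, args=(processed.append,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sorted(processed))  # ['http://example.com/page1', 'http://example.com/page2']
```

The point of moving a URL into the History the moment it leaves the Queue is that no other thread can pick up the same URL twice.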


1. The URL is checked to see whether it is suitable for collecting data or links. The check is performed according to the Data Collecting->Basic and Navigation->Basic Navigation tab settings.


2. Datacol loads the URL (webpage) to obtain its source code.


3. The source code of the loaded webpage is checked to see whether it is suitable for collecting data or links. The check is performed according to the Data Collecting->Basic and Navigation->Basic Navigation tab settings.


4. If the webpage is suitable for data collecting (as determined earlier by its URL and source code), the data from this page is extracted according to the Data collecting tab settings.
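Step 4 can be sketched in outline like this; Datacol's actual extraction rules are configured in the Data collecting tab, and the regex patterns and function name here are purely illustrative:

```python
import re

def extract_data(source_code, patterns):
    """Apply each configured pattern to the page source and collect all captures."""
    return {name: re.findall(pattern, source_code)
            for name, pattern in patterns.items()}

# Hypothetical page source and patterns standing in for the tab settings
html = '<h1>Sample product</h1><span class="price">9.99</span>'
data = extract_data(html, {
    "title": r"<h1>(.*?)</h1>",
    "price": r'<span class="price">(.*?)</span>',
})
print(data)  # {'title': ['Sample product'], 'price': ['9.99']}
```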


5. If the webpage is suitable for collecting links (as determined earlier by its URL and source code), the links from this page are collected according to the Navigation->Link collecting tab settings. Collected links are added to the Queue, except for the following:

- links whose URLs are suitable neither for data collecting (Data Collecting->Basic tab) nor for link collecting (Navigation->Basic Navigation tab);

- links that already exist in the History.
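The two exclusion rules of step 5 could be sketched like this, with hypothetical predicates standing in for the tab settings:

```python
def collect_links(page_links, url_queue, history,
                  suitable_for_data, suitable_for_links):
    """Add collected links to the Queue, skipping the two excluded cases."""
    for link in page_links:
        # Skip links suitable neither for data collecting nor for link collecting
        if not suitable_for_data(link) and not suitable_for_links(link):
            continue
        # Skip links already present in the History
        if link in history:
            continue
        url_queue.append(link)

# Example: only /item/ pages are suitable for data, only /list/ pages for links
q, hist = [], {"http://site/item/1"}
collect_links(
    ["http://site/item/1", "http://site/item/2", "http://site/about"],
    q, hist,
    suitable_for_data=lambda u: "/item/" in u,
    suitable_for_links=lambda u: "/list/" in u,
)
print(q)  # ['http://site/item/2']
```

Here `/item/1` is dropped because it is in the History, and `/about` because it passes neither suitability check.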


6. The results of data extraction are exported. Depending on the Export tab settings, results can be saved in a custom format (usually a CSV or TXT file), to Excel, MySQL, WordPress, or via an export plugin.
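A minimal sketch of the custom-format (CSV) export in step 6, assuming the extracted data arrives as a list of field dictionaries (the function name and field names are hypothetical):

```python
import csv
import io

def export_csv(rows, fieldnames, out):
    """Write extracted rows to CSV, one row per parsed item, with a header line."""
    writer = csv.DictWriter(out, fieldnames=fieldnames)
    writer.writeheader()
    for row in rows:
        writer.writerow(row)

# Example: export two extracted items into an in-memory CSV file
buf = io.StringIO()
export_csv(
    [{"title": "Item 1", "price": "9.99"}, {"title": "Item 2", "price": "19.99"}],
    ["title", "price"],
    buf,
)
print(buf.getvalue())
```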
