CN108984568A - A design method for a fully automatic Internet darknet search engine - Google Patents

A design method for a fully automatic Internet darknet search engine

Info

Publication number
CN108984568A
CN108984568A
Authority
CN
China
Prior art keywords
link
link library
page
url
webpage
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710410888.XA
Other languages
Chinese (zh)
Inventor
刘挺
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Consulting Data Technology Co Ltd
Original Assignee
Beijing Consulting Data Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Consulting Data Technology Co Ltd filed Critical Beijing Consulting Data Technology Co Ltd
Priority to CN201710410888.XA priority Critical patent/CN108984568A/en
Publication of CN108984568A publication Critical patent/CN108984568A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 - Arrangements for software engineering
    • G06F8/30 - Creation or generation of source code
    • G06F8/31 - Programming languages or programming paradigms
    • G06F8/315 - Object-oriented languages
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F2216/00 - Indexing scheme relating to additional aspects of information retrieval not explicitly covered by G06F16/00 and subgroups
    • G06F2216/03 - Data mining

Abstract

A design method for a fully automatic Internet darknet search engine. The invention proposes a design method for a deep search engine: existing search engines are first used to look up information and build a link library; each link in the link library is then analyzed, the required pages are obtained, and they are saved.

Description

A design method for a fully automatic Internet darknet search engine
Technical field
The invention belongs to the field of Internet deep-web search technology.
Background technique
The background art includes Internet programming, web crawler technology, and web page scraping technology.
Summary of the invention
The invention proposes a design method for a new Internet search system. The main approach is to use existing search engines to find website entry information, perform multi-level simulated browsing and mining, and build a link library; each link in the link library is then analyzed against keywords, and any web page containing a keyword is saved.
The first step is to search with an existing search engine: the browser is started, a keyword is input, website links are found, and the links are saved; this collection is called the first-level link library. The "next page" control is clicked automatically so that the first-level link library keeps growing. The second step is, on the basis of the first-level link library obtained before, to perform deep mining on each link and scan for relevant information. This can be a two-level mining method, a three-level mining method, or, in general, an n-level mining method. The two-level mining method mainly takes each link in the first-level link library, inputs it into the browser, scans all the links in that web page and saves them, building the second-level link library; every link in the second-level link library is then analyzed: the link is input into the browser, the web page is obtained and analyzed, and if the page contains the keyword, it is saved. Similarly, the three-level mining method scans all the links in the first-level link library; in the first step, the links obtained by scanning are saved together with the original website links, which may be called the second-level link library; the saved second-level link library is then processed: each link in the second-level link library is input into the browser, all the links in the page are scanned and stored into a new link library, which may be called the third-level link library; after this step is completed, every link in the third-level link library is analyzed as a web page, and if the page contains the keyword, it is saved. The n-level mining method follows the same principle as the three-level mining method: the second-level link library is obtained by scanning, every link in the second-level link library is then scanned to obtain the third-level link library, and so on until the n-th-level link library is obtained; finally, every link in the n-th-level link library is analyzed as a web page, and the pages containing the keyword are saved.
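A minimal sketch of this n-level mining loop is given below. It assumes the link libraries are kept as Python sets and uses illustrative helper names (extract_links, mine_n_levels, keep_pages_with_keyword) that are not taken from the original, built on a Selenium webdriver and BeautifulSoup.
```python
# Minimal sketch of the n-level mining method described above, assuming the
# link libraries are Python sets.  extract_links() is a simplified stand-in
# for the browser-driven scanning detailed later in the description.
from bs4 import BeautifulSoup


def extract_links(driver, url):
    """Open one link in the browser and return every URL found on that page."""
    driver.get(url)
    soup = BeautifulSoup(driver.page_source, "html.parser")
    return {a["href"] for a in soup.find_all("a", href=True)}


def mine_n_levels(driver, first_level_links, n):
    """Expand the first-level link library (n - 1) more times."""
    url_dict = set(first_level_links)       # level 1
    frontier = set(first_level_links)
    for _ in range(n - 1):                  # build levels 2 .. n
        next_frontier = set()
        for url in frontier:
            try:
                found = extract_links(driver, url)
            except Exception:
                continue                    # skip pages that fail to load
            next_frontier |= (found - url_dict)
            url_dict |= found
        frontier = next_frontier
    return url_dict


def keep_pages_with_keyword(driver, url_dict, keyword):
    """Final pass: keep only the pages whose source contains the keyword."""
    saved = {}
    for url in url_dict:
        try:
            driver.get(url)
        except Exception:
            continue
        if keyword in driver.page_source:
            saved[url] = driver.page_source
    return saved
```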
The system is currently implemented in Python, but other languages could also be used. The whole system is divided into two parts; one part consists of control files, mainly XML documents, used for saving and setting the parameters. These fall into two classes: one class is the per-engine configuration files of the config_auto_search.xml type, one for each search engine (for example, Baidu search has config_auto_baidu.xml); the other class is the keyword lists, keywordlist.xml and keywordlist_second.xml.
In the first step, the browser is started and existing search engines, including Google, Baidu, Bing, Yahoo, 360, Sogou and so on, are used to search for web pages and build the link library. Data are mainly read from the config_auto_search.xml files, such as config_auto_baidu.xml. The config_auto_search.xml file is parsed with the BeautifulSoup method into a BeautifulSoup object called config_soup. From config_soup the website URL to be crawled is read, for example www.baidu.com, together with the page-turning method click_more/click_method; these methods mainly include class_name, css_selector, id, link_text, name, partial_link_text, tag_name, and xpath. The number of page-turning clicks click_times and the click content click_content are also read; for example, the click content for partial_link_text is "next page". There is also the XPath of the input search box, named input_xpath, which is the XPath of the text box on the search engine website, and the XPath of the submit button, named submit_xpath.
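For illustration, a sketch of how such a configuration file might be parsed with BeautifulSoup follows; the XML layout and tag names here are assumed, not taken from the actual config_auto_baidu.xml.
```python
from bs4 import BeautifulSoup

# Hypothetical layout for a per-engine configuration file; the real
# config_auto_baidu.xml may use different tag names.
config_xml = """
<config>
  <url>https://www.baidu.com</url>
  <click_method>partial_link_text</click_method>
  <click_times>5</click_times>
  <click_content>下一页</click_content>  <!-- "next page" -->
  <input_xpath>//input[@id='kw']</input_xpath>
  <submit_xpath>//input[@id='su']</submit_xpath>
</config>
"""

config_soup = BeautifulSoup(config_xml, "xml")   # the "xml" parser requires lxml
url = config_soup.find("url").text
click_method = config_soup.find("click_method").text
click_times = int(config_soup.find("click_times").text)
click_content = config_soup.find("click_content").text
input_xpath = config_soup.find("input_xpath").text
submit_xpath = config_soup.find("submit_xpath").text
```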
Data are read from keywordlist.xml with the BeautifulSoup method, building a second BeautifulSoup object named keyword_soup.
The page-turning method click_method is read from config_soup; it is one of xpath, partial_link_text, class_name, css_selector, id, link_text, name, or tag_name. The corresponding click_content is then looked up; for example, for the partial_link_text method it is "next page", and for the xpath method it is an expression such as //submit[@id='click'].
The browser is started and the parameters in config_soup are read; an empty url_dict is generated for storing the second-level link library. Depending on the start mode, different methods of launching the browser are used. The main methods are as follows. The chrome mode starts with webdriver.Chrome(path_to_chrom_driver), where path_to_chrom_driver points to the storage path of the chromedriver. The chrome_binary mode starts with webdriver.Remote(service.service_url, desired_capabilities=capabilities), where service is service.Service(path_to_chrom_driver) and path_to_chrom_driver is the storage address of the chromedriver. The chrome_binary_2 mode uses webdriver.Chrome(chrome_options=opts), where opts is an Options instance containing the various option parameters, such as binary_location. The firefox_binary mode mainly starts with webdriver.Firefox(firefox_binary=binary), where binary is a FirefoxBinary instance. The ie mode mainly uses webdriver.Ie(webdriver_path), where webdriver_path is the address of the IE driver. If the Safari browser is used, it is started with webdriver.Safari(); if the Opera browser is used, it is started with webdriver.Opera(); for Microsoft Edge, webdriver.Edge() is used; for HtmlUnit, webdriver.Remote(desired_capabilities=webdriver.DesiredCapabilities.HTMLUNIT) is used. Two browsers are started, one called webdriver_local and the other webdriver_local_default. The built-in page-load wait time of each is set with the set_page_load_timeout() method. The keywords in keyword_soup are read out, the search website url to be queried is read, and the search website is then opened with driver_local.get(url). Using the find_element_by_xpath method, driver_local finds the input text box element, defined as element_input, and the search keyword is typed into this element with the send_keys method. Using the find_element_by_xpath method, driver_local finds the submit button element, named element_submit, and clicks it with the click method.
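A condensed sketch of this start-up and keyword-submission step follows, using the Selenium 3 style API (find_element_by_xpath) referred to above; the driver path, URL, keyword, and XPath values are placeholders rather than values from the original configuration files, and only the Chrome branch is shown.
```python
from selenium import webdriver

# Placeholder values; in the system these come from config_soup and keyword_soup.
path_to_chrom_driver = "/path/to/chromedriver"
url = "https://www.baidu.com"
keyword = "example keyword"
input_xpath = "//input[@id='kw']"     # illustrative XPath, not from the original config
submit_xpath = "//input[@id='su']"    # illustrative XPath, not from the original config

# Start the two browsers (Chrome shown; Firefox, Ie, Safari, Opera, Edge and
# Remote/HTMLUNIT follow the same pattern with their respective constructors).
driver_local = webdriver.Chrome(path_to_chrom_driver)
driver_local_page = webdriver.Chrome(path_to_chrom_driver)

# Bound how long each page may take to load.
driver_local.set_page_load_timeout(30)
driver_local_page.set_page_load_timeout(30)

# Open the search site and submit the keyword (Selenium 3 style locators).
driver_local.get(url)
element_input = driver_local.find_element_by_xpath(input_xpath)
element_input.send_keys(keyword)
element_submit = driver_local.find_element_by_xpath(submit_xpath)
element_submit.click()
```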
For page turning, the value of click_times is checked. If it is 0, the page is analyzed directly; if it is not 0, the page is analyzed first and then, according to click_method, the page-turning element is located and clicked to turn the page. For example, if the page-turning method is xpath, the element is found with the find_element_by_xpath method and clicked with the click method.
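A sketch of this page-turning logic, under the same assumptions; only the xpath and partial_link_text locator branches are shown, and analyze_page stands for the analysis step described in the next paragraph.
```python
import time


def turn_pages(driver, click_method, click_content, click_times, analyze_page):
    """Analyze the current result page, then click 'next page' click_times times."""
    analyze_page(driver)                 # when click_times is 0 this is the only pass
    for _ in range(click_times):
        try:
            if click_method == "xpath":
                element = driver.find_element_by_xpath(click_content)
            elif click_method == "partial_link_text":
                element = driver.find_element_by_partial_link_text(click_content)
            else:
                break                    # other locator strategies omitted in this sketch
            element.click()
        except Exception:
            break                        # no further result pages
        time.sleep(1)                    # give the next result page time to render
        analyze_page(driver)
```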
For analysis, the page_source of the current page is first saved via driver_local.page_source, and a parse_soup is generated with the BeautifulSoup method. The find method is then used to locate the elements in the page where links sit and to extract them, and the find_all(tag_name, {attribute_name: attribute_content}) method is used to extract all web page links url_local from parse_soup as the first-level link library. Each url_local is handed to driver_local_page: in driver_local_page, the URL is first opened with driver_local_page.get(url), then driver_local_page.current_url is used to obtain the URL that really exists, and this is stored into url_dict. If the URL already exists in url_dict it is simply ignored; if not, it is added to url_dict. The second-level link library is built as follows: in driver_local_page, the URL currently in use is opened with driver_local_page.get(url), the page_source of that page is obtained, all links are extracted with the find_all method and merged into url_dict; links that already exist are skipped, and links that do not yet exist are added to url_dict. The third-level link library is built in the same way as the second-level link library. The second-level link library is built, on the basis of the first-level link library, by scanning all links in the web page corresponding to each link into url_dict; the third-level link library is built, on the basis of the second-level link library, by scanning the page represented by each link of the second-level link library and saving all of that page's link information into url_dict. And so on: the n-th-level link library is built, on the basis of the (n-1)-th-level link library, by scanning all links on the page represented by each link and saving them into the n-th-level link library.
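A sketch of this link-resolution and de-duplication step, assuming url_dict is a Python dict keyed by the real (post-redirect) URL; collect_links is an illustrative name, not from the original.
```python
from bs4 import BeautifulSoup

url_dict = {}   # link library: real URL -> level at which it was first found


def collect_links(driver_local_page, source_urls, level):
    """Resolve each candidate URL and record all links found on its page."""
    for url in source_urls:
        try:
            driver_local_page.get(url)
        except Exception:
            continue                                   # unreachable page, skip it
        real_url = driver_local_page.current_url       # URL of the page that really exists
        if real_url not in url_dict:
            url_dict[real_url] = level
        parse_soup = BeautifulSoup(driver_local_page.page_source, "html.parser")
        for a in parse_soup.find_all("a", href=True):
            href = a["href"]
            if href not in url_dict:                   # ignore links already recorded
                url_dict[href] = level + 1             # candidate for the next level
```
Calling collect_links repeatedly on the URLs recorded at level k yields the level k+1 link library, which is the recursion described above.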
Once the different search engines have been traversed by the page-turning method and the n-th-level link library url_dict has been built, the next stage analyzes and compares the links in the whole url_dict.
This stage takes several input elements: config_file, keyword_file, and url_dict. config_file is the configuration file of this stage; for example, the configuration file for Baidu search at this stage is config_auto_second_level_dig_baidu.xml. keyword_file is the keyword configuration file for this stage, such as keywordlist_second.xml. url_dict is the n-level link library saved in the previous stage.
Each link in url_dict is analyzed first: it is extracted and input into the module, labelled url_input. In the module start-up stage, the configuration file config_file is parsed into config_soup with the BeautifulSoup method, and keyword_file into keyword_soup.
For starting browsers, two browsers, driver_local and driver_local_page, are started; the start method is the same as described above. The browsing-time limit of each browser is set with the set_page_load_timeout method. Each link is fetched with driver_local.get(url); the page's source code is obtained via driver_local.page_source and a parse_soup is generated with BeautifulSoup. The title content is found in parse_soup with the find method, and the whole parse_soup is analyzed and compared: if it contains the keyword keyword, the page is confirmed to contain the keyword and is saved, with the title found above as the filename and html as the file type.
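A sketch of this final analysis and save step; the filename sanitization is an added detail, not part of the original description, and the keyword match is done against the whole page source.
```python
import re

from bs4 import BeautifulSoup


def analyze_and_save(driver_local, url, keyword):
    """Fetch one link; if its source contains the keyword, save it as <title>.html."""
    try:
        driver_local.get(url)
    except Exception:
        return False
    page_source = driver_local.page_source
    if keyword not in page_source:                      # compare against the whole page
        return False
    parse_soup = BeautifulSoup(page_source, "html.parser")
    title_tag = parse_soup.find("title")
    title = title_tag.get_text(strip=True) if title_tag else "untitled"
    safe_title = re.sub(r'[\\/:*?"<>|]', "_", title)    # make the title filesystem-safe
    with open(safe_title + ".html", "w", encoding="utf-8") as f:
        f.write(page_source)
    return True
```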

Claims (1)

1. A design method for the system, whose framework is as follows: first, keyword searches are run against the existing major search engines and the results compared, building the first-level, second-level, up to n-th-level link libraries; then the web page represented by each link in the link library is analyzed to find the pages that contain the keyword information, thereby generating the required web page information. For example, the system is implemented in Python, but versions in other languages that use this method may also be implemented.
CN201710410888.XA 2017-06-04 2017-06-04 A kind of design method of full-automatic internet darknet search engine Pending CN108984568A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710410888.XA CN108984568A (en) 2017-06-04 2017-06-04 A kind of design method of full-automatic internet darknet search engine

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710410888.XA CN108984568A (en) 2017-06-04 2017-06-04 A kind of design method of full-automatic internet darknet search engine

Publications (1)

Publication Number Publication Date
CN108984568A true CN108984568A (en) 2018-12-11

Family

ID=64501790

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710410888.XA Pending CN108984568A (en) 2017-06-04 2017-06-04 A kind of design method of full-automatic internet darknet search engine

Country Status (1)

Country Link
CN (1) CN108984568A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112925970A (en) * 2019-12-05 2021-06-08 天津挺哥网络科技有限公司 Design method of novel hidden net full-network excavating robot
CN112925966A (en) * 2019-12-05 2021-06-08 天津挺哥网络科技有限公司 Design method of novel hidden net excavating robot
CN113392298A (en) * 2020-03-14 2021-09-14 天津挺哥网络科技有限公司 Design method of fixed-point darknet information mining analysis system based on xml database

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101452463A (en) * 2007-12-05 2009-06-10 浙江大学 Method and apparatus for directionally grabbing page resource
CN102930059A (en) * 2012-11-26 2013-02-13 电子科技大学 Method for designing focused crawler
CN103838785A (en) * 2012-11-27 2014-06-04 大连灵动科技发展有限公司 Vertical search engine in patent field
CN105955984A (en) * 2016-04-19 2016-09-21 中国银联股份有限公司 Network data searching method based on crawler mode
CN106709052A (en) * 2017-01-06 2017-05-24 电子科技大学 Keyword based topic-focused web crawler design method



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: Room 262, Zhongguancun Science and Technology City Exhibition Center, intersection of Zhongguancun Avenue and Xihuan North Road, Baodi District, Tianjin

Applicant after: Tianjin Xunda Data Technology Co.,Ltd.

Address before: 100080 room 107-22, 1st floor, 127-1 Zhongguancun North Street, Haidian District, Beijing

Applicant before: BEIJING XUNDATA TECHNOLOGY Co.,Ltd.

WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20181211