CN108984568A - A design method for a fully automatic Internet darknet search engine - Google Patents

A design method for a fully automatic Internet darknet search engine

Info

Publication number
CN108984568A
CN108984568A
Authority
CN
China
Prior art keywords
link
link library
page
url
webpage
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710410888.XA
Other languages
Chinese (zh)
Inventor
刘挺
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Consulting Data Technology Co Ltd
Original Assignee
Beijing Consulting Data Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Consulting Data Technology Co Ltd filed Critical Beijing Consulting Data Technology Co Ltd
Priority to CN201710410888.XA priority Critical patent/CN108984568A/en
Publication of CN108984568A publication Critical patent/CN108984568A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 - Arrangements for software engineering
    • G06F8/30 - Creation or generation of source code
    • G06F8/31 - Programming languages or programming paradigms
    • G06F8/315 - Object-oriented languages
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F2216/00 - Indexing scheme relating to additional aspects of information retrieval not explicitly covered by G06F16/00 and subgroups
    • G06F2216/03 - Data mining

Abstract

A design method for a fully automatic Internet darknet search engine. The invention proposes a design method for a deep search engine: existing search engines are first used to look up information and build a link library; each link in the link library is then analyzed, the required pages are obtained, and they are saved.

Description

A design method for a fully automatic Internet darknet search engine
Technical field
The invention belongs to the field of Internet deep-web search technology.
Background technique
The background art includes Internet programming, web crawler technology, and web page scraping technology.
Summary of the invention
The invention proposes a design method for a new Internet search system. The main approach is to use existing search engines to find website entry information, perform multi-level simulated browsing and mining, and build a link library; each link in the link library is then analyzed against keywords, and any web page containing a keyword is saved.
The first step is to search with an existing search engine: the browser is started, a keyword is input, website links are found, and the links are saved; this collection is called the first-level link library. The "next page" control is clicked automatically so that the first-level link library keeps growing. The second step is, on the basis of the first-level link library obtained before, to perform deep mining on each link and scan for relevant information. This can be a two-level mining method, a three-level mining method, or, in general, an n-level mining method. The two-level mining method mainly takes each link in the first-level link library, inputs it into the browser, scans all the links in that web page and saves them, building the second-level link library; every link in the second-level link library is then analyzed: the link is input into the browser, the web page is obtained and analyzed, and if the page contains the keyword, it is saved. Similarly, the three-level mining method scans all the links in the first-level link library; in the first step, the links obtained by scanning are saved together with the original website links, which may be called the second-level link library; the saved second-level link library is then processed: each link in the second-level link library is input into the browser, all the links in the page are scanned and stored into a new link library, which may be called the third-level link library; after this step is completed, every link in the third-level link library is analyzed as a web page, and if the page contains the keyword, it is saved. The n-level mining method follows the same principle as the three-level mining method: the second-level link library is obtained by scanning, every link in the second-level link library is then scanned to obtain the third-level link library, and so on until the n-th-level link library is obtained; finally, every link in the n-th-level link library is analyzed as a web page, and the pages containing the keyword are saved.
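A minimal sketch of this n-level mining loop is given below. It assumes the link libraries are kept as Python sets and uses illustrative helper names (extract_links, mine_n_levels, keep_pages_with_keyword) that are not taken from the original, built on a Selenium webdriver and BeautifulSoup.
```python
# Minimal sketch of the n-level mining method described above, assuming the
# link libraries are Python sets.  extract_links() is a simplified stand-in
# for the browser-driven scanning detailed later in the description.
from bs4 import BeautifulSoup


def extract_links(driver, url):
    """Open one link in the browser and return every URL found on that page."""
    driver.get(url)
    soup = BeautifulSoup(driver.page_source, "html.parser")
    return {a["href"] for a in soup.find_all("a", href=True)}


def mine_n_levels(driver, first_level_links, n):
    """Expand the first-level link library (n - 1) more times."""
    url_dict = set(first_level_links)       # level 1
    frontier = set(first_level_links)
    for _ in range(n - 1):                  # build levels 2 .. n
        next_frontier = set()
        for url in frontier:
            try:
                found = extract_links(driver, url)
            except Exception:
                continue                    # skip pages that fail to load
            next_frontier |= (found - url_dict)
            url_dict |= found
        frontier = next_frontier
    return url_dict


def keep_pages_with_keyword(driver, url_dict, keyword):
    """Final pass: keep only the pages whose source contains the keyword."""
    saved = {}
    for url in url_dict:
        try:
            driver.get(url)
        except Exception:
            continue
        if keyword in driver.page_source:
            saved[url] = driver.page_source
    return saved
```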
The system is currently implemented in Python, but other languages could also be used. The whole system is divided into two parts; one part consists of control files, mainly XML documents, used for saving and setting the parameters. These fall into two classes: one class is the per-engine configuration files of the config_auto_search.xml type, one for each search engine (for example, Baidu search has config_auto_baidu.xml); the other class is the keyword lists, keywordlist.xml and keywordlist_second.xml.
In the first step, the browser is started and existing search engines, including Google, Baidu, Bing, Yahoo, 360, Sogou and so on, are used to search for web pages and build the link library. Data are mainly read from the config_auto_search.xml files, such as config_auto_baidu.xml. The config_auto_search.xml file is parsed with the BeautifulSoup method into a BeautifulSoup object called config_soup. From config_soup the website URL to be crawled is read, for example www.baidu.com, together with the page-turning method click_more/click_method; these methods mainly include class_name, css_selector, id, link_text, name, partial_link_text, tag_name, and xpath. The number of page-turning clicks click_times and the click content click_content are also read; for example, the click content for partial_link_text is "next page". There is also the XPath of the input search box, named input_xpath, which is the XPath of the text box on the search engine website, and the XPath of the submit button, named submit_xpath.
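For illustration, a sketch of how such a configuration file might be parsed with BeautifulSoup follows; the XML layout and tag names here are assumed, not taken from the actual config_auto_baidu.xml.
```python
from bs4 import BeautifulSoup

# Hypothetical layout for a per-engine configuration file; the real
# config_auto_baidu.xml may use different tag names.
config_xml = """
<config>
  <url>https://www.baidu.com</url>
  <click_method>partial_link_text</click_method>
  <click_times>5</click_times>
  <click_content>下一页</click_content>  <!-- "next page" -->
  <input_xpath>//input[@id='kw']</input_xpath>
  <submit_xpath>//input[@id='su']</submit_xpath>
</config>
"""

config_soup = BeautifulSoup(config_xml, "xml")   # the "xml" parser requires lxml
url = config_soup.find("url").text
click_method = config_soup.find("click_method").text
click_times = int(config_soup.find("click_times").text)
click_content = config_soup.find("click_content").text
input_xpath = config_soup.find("input_xpath").text
submit_xpath = config_soup.find("submit_xpath").text
```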
Data are read from keywordlist.xml with the BeautifulSoup method, building a second BeautifulSoup object named keyword_soup.
The page-turning method click_method is read from config_soup; it is one of xpath, partial_link_text, class_name, css_selector, id, link_text, name, or tag_name. The corresponding click_content is then looked up; for example, for the partial_link_text method it is "next page", and for the xpath method it is an expression such as //submit[@id='click'].
The browser is started and the parameters in config_soup are read; an empty url_dict is generated for storing the second-level link library. Depending on the start mode, different methods of launching the browser are used. The main methods are as follows. The chrome mode starts with webdriver.Chrome(path_to_chrom_driver), where path_to_chrom_driver points to the storage path of the chromedriver. The chrome_binary mode starts with webdriver.Remote(service.service_url, desired_capabilities=capabilities), where service is service.Service(path_to_chrom_driver) and path_to_chrom_driver is the storage address of the chromedriver. The chrome_binary_2 mode uses webdriver.Chrome(chrome_options=opts), where opts is an Options instance containing the various option parameters, such as binary_location. The firefox_binary mode mainly starts with webdriver.Firefox(firefox_binary=binary), where binary is a FirefoxBinary instance. The ie mode mainly uses webdriver.Ie(webdriver_path), where webdriver_path is the address of the IE driver. If the Safari browser is used, it is started with webdriver.Safari(); if the Opera browser is used, it is started with webdriver.Opera(); for Microsoft Edge, webdriver.Edge() is used; for HtmlUnit, webdriver.Remote(desired_capabilities=webdriver.DesiredCapabilities.HTMLUNIT) is used. Two browsers are started, one called webdriver_local and the other webdriver_local_default. The built-in page-load wait time of each is set with the set_page_load_timeout() method. The keywords in keyword_soup are read out, the search website url to be queried is read, and the search website is then opened with driver_local.get(url). Using the find_element_by_xpath method, driver_local finds the input text box element, defined as element_input, and the search keyword is typed into this element with the send_keys method. Using the find_element_by_xpath method, driver_local finds the submit button element, named element_submit, and clicks it with the click method.
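A condensed sketch of this start-up and keyword-submission step follows, using the Selenium 3 style API (find_element_by_xpath) referred to above; the driver path, URL, keyword, and XPath values are placeholders rather than values from the original configuration files, and only the Chrome branch is shown.
```python
from selenium import webdriver

# Placeholder values; in the system these come from config_soup and keyword_soup.
path_to_chrom_driver = "/path/to/chromedriver"
url = "https://www.baidu.com"
keyword = "example keyword"
input_xpath = "//input[@id='kw']"     # illustrative XPath, not from the original config
submit_xpath = "//input[@id='su']"    # illustrative XPath, not from the original config

# Start the two browsers (Chrome shown; Firefox, Ie, Safari, Opera, Edge and
# Remote/HTMLUNIT follow the same pattern with their respective constructors).
driver_local = webdriver.Chrome(path_to_chrom_driver)
driver_local_page = webdriver.Chrome(path_to_chrom_driver)

# Bound how long each page may take to load.
driver_local.set_page_load_timeout(30)
driver_local_page.set_page_load_timeout(30)

# Open the search site and submit the keyword (Selenium 3 style locators).
driver_local.get(url)
element_input = driver_local.find_element_by_xpath(input_xpath)
element_input.send_keys(keyword)
element_submit = driver_local.find_element_by_xpath(submit_xpath)
element_submit.click()
```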
For page turning, the value of click_times is checked. If it is 0, the page is analyzed directly; if it is not 0, the page is analyzed first and then, according to click_method, the page-turning element is located and clicked to turn the page. For example, if the page-turning method is xpath, the element is found with the find_element_by_xpath method and clicked with the click method.
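A sketch of this page-turning logic, under the same assumptions; only the xpath and partial_link_text locator branches are shown, and analyze_page stands for the analysis step described in the next paragraph.
```python
import time


def turn_pages(driver, click_method, click_content, click_times, analyze_page):
    """Analyze the current result page, then click 'next page' click_times times."""
    analyze_page(driver)                 # when click_times is 0 this is the only pass
    for _ in range(click_times):
        try:
            if click_method == "xpath":
                element = driver.find_element_by_xpath(click_content)
            elif click_method == "partial_link_text":
                element = driver.find_element_by_partial_link_text(click_content)
            else:
                break                    # other locator strategies omitted in this sketch
            element.click()
        except Exception:
            break                        # no further result pages
        time.sleep(1)                    # give the next result page time to render
        analyze_page(driver)
```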
For analysis, the page_source of the current page is first saved via driver_local.page_source, and a parse_soup is generated with the BeautifulSoup method. The find method is then used to locate the elements in the page where links sit and to extract them, and the find_all(tag_name, {attribute_name: attribute_content}) method is used to extract all web page links url_local from parse_soup as the first-level link library. Each url_local is handed to driver_local_page: in driver_local_page, the URL is first opened with driver_local_page.get(url), then driver_local_page.current_url is used to obtain the URL that really exists, and this is stored into url_dict. If the URL already exists in url_dict it is simply ignored; if not, it is added to url_dict. The second-level link library is built as follows: in driver_local_page, the URL currently in use is opened with driver_local_page.get(url), the page_source of that page is obtained, all links are extracted with the find_all method and merged into url_dict; links that already exist are skipped, and links that do not yet exist are added to url_dict. The third-level link library is built in the same way as the second-level link library. The second-level link library is built, on the basis of the first-level link library, by scanning all links in the web page corresponding to each link into url_dict; the third-level link library is built, on the basis of the second-level link library, by scanning the page represented by each link of the second-level link library and saving all of that page's link information into url_dict. And so on: the n-th-level link library is built, on the basis of the (n-1)-th-level link library, by scanning all links on the page represented by each link and saving them into the n-th-level link library.
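A sketch of this link-resolution and de-duplication step, assuming url_dict is a Python dict keyed by the real (post-redirect) URL; collect_links is an illustrative name, not from the original.
```python
from bs4 import BeautifulSoup

url_dict = {}   # link library: real URL -> level at which it was first found


def collect_links(driver_local_page, source_urls, level):
    """Resolve each candidate URL and record all links found on its page."""
    for url in source_urls:
        try:
            driver_local_page.get(url)
        except Exception:
            continue                                   # unreachable page, skip it
        real_url = driver_local_page.current_url       # URL of the page that really exists
        if real_url not in url_dict:
            url_dict[real_url] = level
        parse_soup = BeautifulSoup(driver_local_page.page_source, "html.parser")
        for a in parse_soup.find_all("a", href=True):
            href = a["href"]
            if href not in url_dict:                   # ignore links already recorded
                url_dict[href] = level + 1             # candidate for the next level
```
Calling collect_links repeatedly on the URLs recorded at level k yields the level k+1 link library, which is the recursion described above.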
Once the different search engines have been traversed by the page-turning method and the n-th-level link library url_dict has been built, the next stage analyzes and compares the links in the whole url_dict.
This stage takes several input elements: config_file, keyword_file, and url_dict. config_file is the configuration file of this stage; for example, the configuration file for Baidu search at this stage is config_auto_second_level_dig_baidu.xml. keyword_file is the keyword configuration file for this stage, such as keywordlist_second.xml. url_dict is the n-level link library saved in the previous stage.
Each link in url_dict is analyzed first: it is extracted and input into the module, labelled url_input. In the module start-up stage, the configuration file config_file is parsed into config_soup with the BeautifulSoup method, and keyword_file into keyword_soup.
For starting browsers, two browsers, driver_local and driver_local_page, are started; the start method is the same as described above. The browsing-time limit of each browser is set with the set_page_load_timeout method. Each link is fetched with driver_local.get(url); the page's source code is obtained via driver_local.page_source and a parse_soup is generated with BeautifulSoup. The title content is found in parse_soup with the find method, and the whole parse_soup is analyzed and compared: if it contains the keyword keyword, the page is confirmed to contain the keyword and is saved, with the title found above as the filename and html as the file type.
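A sketch of this final analysis and save step; the filename sanitization is an added detail, not part of the original description, and the keyword match is done against the whole page source.
```python
import re

from bs4 import BeautifulSoup


def analyze_and_save(driver_local, url, keyword):
    """Fetch one link; if its source contains the keyword, save it as <title>.html."""
    try:
        driver_local.get(url)
    except Exception:
        return False
    page_source = driver_local.page_source
    if keyword not in page_source:                      # compare against the whole page
        return False
    parse_soup = BeautifulSoup(page_source, "html.parser")
    title_tag = parse_soup.find("title")
    title = title_tag.get_text(strip=True) if title_tag else "untitled"
    safe_title = re.sub(r'[\\/:*?"<>|]', "_", title)    # make the title filesystem-safe
    with open(safe_title + ".html", "w", encoding="utf-8") as f:
        f.write(page_source)
    return True
```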

Claims (1)

1. A design method for the system, whose framework is as follows: first, keyword searches are run against the existing major search engines and the results compared, building the first-level, second-level, up to n-th-level link libraries; then the web page represented by each link in the link library is analyzed to find the pages that contain the keyword information, thereby generating the required web page information. For example, the system is implemented in Python, but versions in other languages that use this method may also be implemented.
CN201710410888.XA 2017-06-04 2017-06-04 A kind of design method of full-automatic internet darknet search engine Pending CN108984568A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710410888.XA CN108984568A (en) 2017-06-04 2017-06-04 A kind of design method of full-automatic internet darknet search engine

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710410888.XA CN108984568A (en) 2017-06-04 2017-06-04 A kind of design method of full-automatic internet darknet search engine

Publications (1)

Publication Number Publication Date
CN108984568A true CN108984568A (en) 2018-12-11

Family

ID=64501790

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710410888.XA Pending CN108984568A (en) 2017-06-04 2017-06-04 A kind of design method of full-automatic internet darknet search engine

Country Status (1)

Country Link
CN (1) CN108984568A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112925970A (en) * 2019-12-05 2021-06-08 天津挺哥网络科技有限公司 Design method of novel hidden net full-network excavating robot
CN112925966A (en) * 2019-12-05 2021-06-08 天津挺哥网络科技有限公司 Design method of novel hidden net excavating robot
CN113392298A (en) * 2020-03-14 2021-09-14 天津挺哥网络科技有限公司 Design method of fixed-point darknet information mining analysis system based on xml database

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101452463A (en) * 2007-12-05 2009-06-10 浙江大学 Method and apparatus for directionally grabbing page resource
CN102930059A (en) * 2012-11-26 2013-02-13 电子科技大学 Method for designing focused crawler
CN103838785A (en) * 2012-11-27 2014-06-04 大连灵动科技发展有限公司 Vertical search engine in patent field
CN105955984A (en) * 2016-04-19 2016-09-21 中国银联股份有限公司 Network data searching method based on crawler mode
CN106709052A (en) * 2017-01-06 2017-05-24 电子科技大学 Keyword based topic-focused web crawler design method



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: Room 262, Zhongguancun Science and Technology City Exhibition Center, intersection of Zhongguancun Avenue and Xihuan North Road, Baodi District, Tianjin

Applicant after: Tianjin Xunda Data Technology Co.,Ltd.

Address before: 100080 room 107-22, 1st floor, 127-1 Zhongguancun North Street, Haidian District, Beijing

Applicant before: BEIJING XUNDATA TECHNOLOGY Co.,Ltd.

WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20181211