CN108052632A - Network information acquisition method and system, and enterprise information search system - Google Patents

Info

Publication number
CN108052632A
Authority
CN
China
Prior art keywords
information
data
webpage
data message
page
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201711381367.2A
Other languages
Chinese (zh)
Other versions
CN108052632B (en)
Inventor
彭帅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu Law Cloud Technology Co Ltd
Original Assignee
Chengdu Law Cloud Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu Law Cloud Technology Co Ltd
Priority to CN201711381367.2A
Publication of CN108052632A
Application granted
Publication of CN108052632B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/23 Updating
    • G06F16/2379 Updates performed during online database operations; commit processing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Transfer Between Computers (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention relates to a method and system for obtaining network information and to an enterprise information search system. The method includes obtaining webpage information associated with specified information, obtaining the target pages of the associated webpages according to a selected search strategy, and extracting the data information in the target pages. By combining crawler technology with a targeted search strategy, the invention mines data from the deep web, so that users can obtain a large amount of valid data in a short time without querying each independent website one by one. It provides users with a one-stop information service and improves the efficiency of data collection.

Description

Network information acquisition method and system, and enterprise information search system
Technical field
The present invention relates to a method and system for obtaining network information, and in particular to a webpage information acquisition method and system based on a crawler system.
Background
In the current era of big data, the vast resources on the network can overwhelm users, and large amounts of distributed, readily available information have emerged. For example, information about an enterprise can be looked up directly on official websites such as the National Enterprise Credit Information Publicity System, China Judgments Online, the China Enforcement Information Disclosure Network, the official website of the State Intellectual Property Office, the official website of the Trademark Office of the State Administration for Industry and Commerce, and the official website of the National Copyright Administration, as well as on recruitment websites. However, the enterprise information covered by these websites differs. The National Enterprise Credit Information Publicity System includes business license information, key personnel, and similar data; China Judgments Online mainly covers judgment information; government websites generally contain enterprise credit data and bid-winning data; and recruitment websites mostly involve positions, salaries, and the like. Different information therefore resides on different network platforms, and the data on each platform are usually independent and not shared. To obtain targeted information about one or more enterprises, a user has to query several platforms, which is cumbersome.
On the other hand, an enterprise's business registration information, recruitment information, related judgment documents, and intellectual property information have the character of the deep web. The deep web is defined in contrast to the surface web and refers to content that cannot be obtained through ordinary search operations. To obtain the required network information and resources effectively and conveniently, users rely on search engines, a common information retrieval tool, as their entrance to the Internet. General-purpose search engines have limitations, however: deep-web content is often hard to obtain, and they return many pages that the user does not care about, which reduces the efficiency of obtaining useful information. Moreover, platforms that do provide enterprise information often suffer from out-of-date data.
Therefore, how to obtain comprehensive enterprise information conveniently is an open problem in current network information acquisition. A method and system for effectively collecting Internet data is needed, so that the latest information on the enterprises a user needs can be acquired in a targeted way.
Summary of the invention
The methods by which traditional enterprise information websites obtain data have several limitations. First, traditional search engines cannot completely, accurately, and effectively obtain the large amount of high-quality data hidden in the deep web. Second, searching by traversing information item by item wastes substantial system resources, so that information acquisition takes too long and is inefficient. In view of these problems, the present invention discloses a method and system for obtaining network information, in particular a webpage information acquisition method and system based on a crawler system, for obtaining the required enterprise-related information.
To address the problems in the information acquisition process described above, the present invention proposes a data information acquisition method for obtaining data information associated with information specified (by a user). The method includes: obtaining corresponding webpage information according to the specified information; determining a search strategy according to the page layout of the webpage; obtaining target pages according to the search strategy; and extracting the data information in the target pages.
Further, obtaining the corresponding webpage information according to the specified information includes: requesting the corresponding webpage over the HTTP protocol and receiving the returned webpage information.
Further, the search strategy includes depth-first search, breadth-first search, and/or a combination of the two.
Further, obtaining the target pages according to the search strategy includes: obtaining the URLs of one or more target pages with a multithreaded web crawler and downloading the target pages.
Further, extracting the data information in the page includes: obtaining a URL address from the URL queue, performing DNS resolution on the URL address, establishing a socket connection to the server corresponding to the URL, and sending a request for the HTML data file of the page, wherein the HTML data file contains the data information.
Further, the method also includes obtaining update information for the webpage, which includes periodically revisiting crawled webpages, detecting whether a webpage has changed, removing dead links, and/or updating the database.
Here the specified information is an enterprise name, and the data information is data information related to that enterprise.
In another aspect, corresponding to the data information acquisition method proposed above, the invention also proposes a data information acquisition system. The system includes a retrieval device, a selection device, an acquisition device, and a processing device. The retrieval device further includes an information unit and obtains corresponding webpage information according to the specified information from the information unit. The selection device selects a search strategy according to the page layout contained in the webpage information obtained by the retrieval device. The acquisition device obtains the target pages of the corresponding webpage obtained by the retrieval device. The processing device extracts the data information in the pages.
Further, the retrieval device obtains the corresponding webpage over the HTTP protocol and further includes a receiving unit for receiving the returned webpage information.
Further, the search strategy includes depth-first search, breadth-first search, and/or a combination of the two.
Further, the acquisition device also includes a web crawler unit, which uses a multithreaded web crawler to obtain the URLs of one or more corresponding pages and download those pages.
Further, the processing device also includes: an address processing unit for obtaining a URL address from the URL queue and performing DNS resolution on it; a connection unit for establishing a socket connection to the server corresponding to the URL; and an acquiring unit for sending a request for the HTML data file of the page, wherein the HTML data file contains the data information.
Further, the system also includes an updating device for obtaining update information for the webpage, which includes periodically revisiting crawled webpages, detecting whether a webpage has changed, removing dead links, and/or updating the database.
Here the specified information is an enterprise name, and the data information is data information related to that enterprise.
In summary, by combining crawler technology with a targeted search strategy, the data information acquisition method and system disclosed by the invention can mine data from the deep web, so that users obtain a large amount of valid data in a short time without querying each independent website one by one. The invention provides users with a one-stop information service and improves the efficiency of data collection.
Brief description of the drawings
Fig. 1 shows the information acquisition method provided by one embodiment of the invention;
Fig. 2 shows the method for obtaining target pages provided by another embodiment of the invention;
Fig. 3 shows the method for structured extraction of page data information provided by another embodiment of the invention;
Fig. 4 is a schematic diagram of the DOM tree provided by an embodiment of the invention;
Fig. 5 shows the information acquisition system provided by an embodiment of the invention;
Fig. 6 shows the enterprise information search system provided by another embodiment of the invention.
Detailed description of the embodiments
To help those skilled in the art better understand the technical solution of the invention, it is described clearly and completely below with reference to the accompanying drawings. The embodiments described below are only some embodiments of the invention. Other embodiments, or combinations of embodiments, that those skilled in the art obtain without creative effort on the basis of the following embodiments fall within the technical concept and protection scope of the invention.
One embodiment of the invention provides a data information acquisition method which, as shown in Fig. 1, includes the following steps:
S1: obtain corresponding webpage information according to the specified information.
The data information acquisition method obtains data information associated with the specified information. For example, given an enterprise name specified by the user, it obtains the enterprise information about that enterprise that the user needs. The corresponding webpages can be webpages that contain information related to the enterprise, for example the National Enterprise Credit Information Publicity System, China Judgments Online, the China Enforcement Information Disclosure Network, the official website of the State Intellectual Property Office, the official website of the Trademark Office of the State Administration for Industry and Commerce, the official website of the National Copyright Administration, and recruitment websites, where each website holds a different, specific kind of enterprise information. Specifically, the webpage information can be obtained by requesting the corresponding webpage over the HTTP protocol and receiving the returned webpage information.
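As a minimal sketch of step S1, the snippet below builds one HTTP GET request per information source for a specified enterprise name. The endpoint URLs, the `q` query parameter, and the function name `build_requests` are assumptions made for illustration; each real site has its own search URL scheme, which the source does not give.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical search endpoints; placeholders, not the real sites' URLs.
SOURCES = {
    "credit": "https://credit.example.com/search",
    "judgments": "https://judgments.example.com/search",
}


def build_requests(enterprise_name, sources=SOURCES):
    """Build one HTTP GET request per source for the specified enterprise name."""
    requests = {}
    for label, base_url in sources.items():
        # urlencode escapes spaces and non-ASCII characters in the name.
        query = urlencode({"q": enterprise_name})
        requests[label] = Request(f"{base_url}?{query}", method="GET")
    return requests


reqs = build_requests("Huawei Technologies")
```

Each request could then be sent with `urllib.request.urlopen(req, timeout=10)`, whose response body is the returned webpage information.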
S2: determine the search strategy according to the page layout.
After the corresponding webpage information has been obtained, the target pages in the webpage that contain the information the user needs must be crawled. A web crawler's crawling work on the Web follows a search strategy algorithm, and search strategies generally include the following four traversal algorithms: depth-first, breadth-first, heuristic search, and automatic classification search. For common enterprise information websites, the page layout generally has the following characteristic: the first layer has a search entry, and the layer after it is an enterprise list. For example, when searching for "Huawei Technologies" in the National Enterprise Credit Information Publicity System through the search entry on the first layer, the next layer shows a list of enterprises whose names contain "Huawei Technologies". Therefore, for webpages with this layout, webpage URLs are searched with a combination of depth-first search and breadth-first search, producing a URL queue from which the crawler obtains page links as quickly as possible.
Specifically, depth-first search traverses a graph as deeply as possible: starting from some vertex, it follows each branch to its end before backtracking, and it visits every vertex in the graph exactly once. This process is called graph traversal. Breadth-first search instead traverses the graph layer by layer. Unlike depth-first search, it does not risk descending indefinitely along a single branch.
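The source does not spell out how the two strategies are combined. One plausible reading, sketched below under that assumption, is to expand the shallow layers (search entry, enterprise list) breadth-first and then follow each list entry depth-first. The `links` dictionary is a hypothetical in-memory site map standing in for real pages, and `combined_traversal` and `bfs_levels` are illustrative names.

```python
from collections import deque


def combined_traversal(links, start, bfs_levels=1):
    """Visit pages breadth-first for the first `bfs_levels` layers, then depth-first.

    `links` maps a URL to the list of URLs it links to.
    Every page is visited at most once.
    """
    visited = {start}
    order = [start]
    frontier = deque([(start, 0)])
    deep_roots = []

    # Breadth-first phase: expand the shallow layers level by level.
    while frontier:
        url, depth = frontier.popleft()
        for child in links.get(url, []):
            if child in visited:
                continue
            visited.add(child)
            order.append(child)
            if depth + 1 < bfs_levels:
                frontier.append((child, depth + 1))
            else:
                deep_roots.append(child)

    # Depth-first phase: follow each remaining branch to the bottom.
    for root in deep_roots:
        stack = list(reversed(links.get(root, [])))
        while stack:
            url = stack.pop()
            if url in visited:
                continue
            visited.add(url)
            order.append(url)
            stack.extend(reversed(links.get(url, [])))
    return order


site = {
    "search": ["list1", "list2"],
    "list1": ["a", "b"],
    "list2": ["c"],
    "a": ["a1"],
    "c": ["a"],
}
order = combined_traversal(site, "search", bfs_levels=1)
```

The `visited` set is what keeps either phase from looping when pages link back to each other.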
S3: obtain the target pages according to the search strategy.
Once the search strategy has been determined, the target pages containing the required information are obtained on the basis of that strategy. Depending on the type of data to be obtained, a directed search starting from given URLs improves retrieval efficiency. Because information on the Internet is highly varied and only industry-related information is needed, there is no need to traverse the entire Internet. Specifically, the target pages are obtained with a multithreaded web crawler that searches webpage URLs with a combination of depth-first search and breadth-first search, crawls the URLs of one or more target pages, and downloads the target pages from those URLs. The web crawler is a topical web crawler, also called a focused crawler, that is, a crawler that selectively crawls only pages related to predefined topics. The most important difference between a topical crawler and a general crawler is that the topical crawler selects only pages relevant to the configured topic, which reduces crawling time and the number of pages that must be traversed, and so improves retrieval efficiency.
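A minimal sketch of the multithreaded, topic-focused download step might look as follows. It assumes a very crude relevance test (a keyword appearing in the URL), whereas real focused crawlers usually score page content; the `fetch` callable is injected so that any downloader can be plugged in. All names here are illustrative and not taken from the source.

```python
from concurrent.futures import ThreadPoolExecutor


def is_relevant(url, topic_keywords):
    """Crude topical filter: keep a URL only if it contains a topic keyword."""
    return any(keyword in url for keyword in topic_keywords)


def download_all(urls, fetch, topic_keywords, max_workers=4):
    """Download the topic-relevant URLs concurrently.

    `fetch` is any callable mapping a URL to its page content.
    """
    targets = [u for u in urls if is_relevant(u, topic_keywords)]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        pages = list(pool.map(fetch, targets))  # map preserves input order
    return dict(zip(targets, pages))


# Stand-in downloader so the sketch runs without network access.
def fake_fetch(url):
    return f"<html>{url}</html>"


candidate_urls = [
    "http://x.example/company/1",
    "http://x.example/ads",
    "http://x.example/company/2",
]
pages = download_all(candidate_urls, fake_fetch, ["company"])
```

Threads suit this step because downloading is I/O-bound: while one thread waits on a server, others can make progress.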
In another embodiment of the invention, obtaining the target pages according to the search strategy further includes the following steps, as shown in Fig. 2:
S301: using a web crawler, crawl and traverse starting from the initial URL page as the entry point.
The web crawler can collect pages with multiple threads, which lets it capture web content more effectively and more quickly. Preferably, a topical web crawler is used, which selectively crawls pages related to a predefined type of required information. The crawler's work on the Web follows a strategy algorithm such as depth-first, breadth-first, heuristic search, or automatic classification search. Depth-first search traverses a graph as deeply as possible: starting from some vertex, it follows each branch to its end before backtracking, and it visits every vertex exactly once. Breadth-first search traverses layer by layer and, unlike depth-first search, does not risk descending indefinitely along a single branch. Given the layout of enterprise information pages on the Internet, where the first layer has a search entry and the layer after it is an enterprise list, webpage URLs can be searched with a combination of depth-first search and breadth-first search, so that the crawler obtains page links as quickly as possible.
S302: analyze the crawled URL pages and filter out duplicates.
Information on the Internet is vast and disorderly. While crawling, the crawler may repeatedly enqueue URLs that are already in the to-crawl queue, which reduces its efficiency. Deduplicating URLs while the crawler analyzes pages and extracts URLs is therefore increasingly important. The tension is that crawling already places high demands on storage space and speed, and deduplication inevitably costs some crawler efficiency as well. The deduplication method should be efficient and use little space while still being accurate. Available methods include database-based deduplication, memory-based deduplication, and Bloom-filter-based deduplication.
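Of the three deduplication options, the Bloom filter is the least self-explanatory, so here is a small sketch of one. It trades a small false-positive rate (a new URL is occasionally mistaken for a seen one) for a fixed, compact memory footprint. The bit-array size, the number of hashes, and the use of salted SHA-256 are illustrative choices, not parameters from the source.

```python
import hashlib


class BloomFilter:
    """Compact set membership with possible false positives, never false negatives."""

    def __init__(self, size_bits=1 << 20, num_hashes=5):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, item):
        # Derive k bit positions by salting a single hash function with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


def dedupe(urls, seen=None):
    """Return the URLs not seen before, in order, recording them in `seen`."""
    seen = seen if seen is not None else BloomFilter()
    fresh = []
    for url in urls:
        if url not in seen:
            seen.add(url)
            fresh.append(url)
    return fresh


crawled = ["http://a.example/1", "http://a.example/2", "http://a.example/1"]
queue = dedupe(crawled)
```

Passing the same `BloomFilter` instance to successive `dedupe` calls keeps the "already seen" state across crawl batches.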
S303: obtain the optimized URL queue.
The URLs obtained by the web crawler and filtered for duplicates are stored in the URL queue.
S4: extract the data information in the page.
After the target pages have been obtained, the crawler extracts the page content. For text extraction, the crawler first takes a URL address from the URL queue of target pages and performs DNS resolution on it, resolving the address of the web server in the URL so that the target server can be reached. It then establishes a socket connection between client and server and sends an HTTP request to obtain the HTML data of the content page. After the HTML of the content page has been obtained, the page needs preprocessing, namely character-encoding conversion and noise removal.
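The DNS, socket, and request steps above can be sketched with Python's standard library as follows. This is a simplified illustration that assumes plain HTTP on port 80; HTTPS would additionally need the `ssl` module, and a chunked response body is returned undecoded. The function names are hypothetical.

```python
import socket
from urllib.parse import urlsplit


def build_get_request(url):
    """Return (host, port, request_bytes) for a plain-HTTP GET of `url`."""
    parts = urlsplit(url)
    host = parts.hostname
    port = parts.port or 80
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    request = (
        f"GET {path} HTTP/1.1\r\n"
        f"Host: {host}\r\n"
        "Connection: close\r\n"
        "\r\n"
    )
    return host, port, request.encode("ascii")


def fetch_html(url, timeout=10.0):
    """Resolve the host, open a TCP socket, send the request, return the raw response."""
    host, port, request = build_get_request(url)
    ip_address = socket.gethostbyname(host)  # DNS resolution step
    with socket.create_connection((ip_address, port), timeout=timeout) as conn:
        conn.sendall(request)
        chunks = []
        while True:
            data = conn.recv(4096)
            if not data:  # server closed the connection
                break
            chunks.append(data)
    # Status line, headers, and body, still undecoded bytes.
    return b"".join(chunks)


host, port, raw_request = build_get_request("http://example.com/list?page=2")
```

In practice a higher-level client such as `urllib.request` performs the same steps and also handles redirects, TLS, and transfer encodings.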
Because an HTML file is self-describing semi-structured data, and semi-structured data is difficult for applications to use directly, the file must be given structure before useful information can be extracted from it. The structured information extraction method further includes the following steps, as shown in Fig. 3:
S401: parse the page and generate a DOM tree.
The DOM (Document Object Model) tree is shown in Fig. 4. Once the DOM tree has been built, the entire web page code is a tree whose nodes are the tags. During content extraction, nodes unrelated to the body text are removed, and the DOM tree is traversed recursively or with another algorithm to locate the content nodes and extract their content. Comment nodes are deleted from the DOM tree, after which traversal continues. An HTML document can be parsed into a DOM tree with Beautiful Soup, a third-party Python library whose main purpose is extracting data from web pages. It provides functions for navigating, searching, and modifying the parse tree, automatically converts the encoding of input documents, and offers the user a choice of parsing strategies and good speed.
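To illustrate the parse-then-traverse idea without a third-party dependency, the sketch below uses the standard-library `html.parser` in place of Beautiful Soup, which the text names. It builds a simple tag tree, drops comments, and recursively collects text while skipping nodes assumed to be unrelated to the body. The `NOISE_TAGS` set and all class and function names are assumptions for illustration.

```python
from html.parser import HTMLParser

NOISE_TAGS = {"script", "style", "nav", "footer"}  # assumed non-body tags
VOID_TAGS = {"br", "img", "meta", "link", "input", "hr"}  # tags with no closing tag


class Node:
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []  # child Nodes and text strings, in document order


class TreeBuilder(HTMLParser):
    """Builds a minimal DOM-like tree; comments are ignored by default."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Climb to the matching open tag, tolerating unclosed inner tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        if data.strip():
            self.current.children.append(data.strip())


def parse(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def extract_text(node):
    """Recursively collect text, skipping subtrees rooted at noise tags."""
    parts = []
    for child in node.children:
        if isinstance(child, str):
            parts.append(child)
        elif child.tag not in NOISE_TAGS:
            parts.extend(extract_text(child))
    return parts


sample = (
    "<html><head><style>p{}</style></head><body><!-- note -->"
    "<h1>Acme Ltd</h1><p>Registered <b>2010</b></p>"
    "<script>var x=1;</script></body></html>"
)
texts = extract_text(parse(sample))
```

With Beautiful Soup installed, the same result is usually reached far more compactly, and malformed HTML is handled more robustly than in this sketch.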
S402: template-based structured information extraction.
In some embodiments, templates can be customized for content that is explicitly required. The set of custom templates constitutes a page extraction rule library. During extraction, text information is extracted from the webpage with regular expressions according to this rule library, and the extracted information is checked against the template. Regular expressions provide powerful pattern matching and replacement. A pattern is a character string, and a pattern-matching expression is composed of unary and binary operators; spaces and tabs can be used to separate keywords.
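A page extraction rule library of this kind could be sketched as a mapping from field names to regular expressions, plus a check that the extracted record matches the template. The field names, label wording, and sample text below are made up for illustration and are not taken from any real site.

```python
import re

# Hypothetical rule library: field name -> pattern with one capture group.
RULES = {
    "registration_no": re.compile(r"Registration No\.?\s*[:：]\s*(\w+)"),
    "legal_representative": re.compile(r"Legal Representative\s*[:：]\s*([^\n]+)"),
    "established": re.compile(r"Established\s*[:：]\s*(\d{4}-\d{2}-\d{2})"),
}


def extract_fields(text, rules=RULES):
    """Apply every rule to the page text and collect the fields that match."""
    record = {}
    for field, pattern in rules.items():
        match = pattern.search(text)
        if match:
            record[field] = match.group(1).strip()
    return record


def matches_template(record, required=("registration_no", "established")):
    """A record matches the template when all required fields were found."""
    return all(field in record for field in required)


# Made-up sample text standing in for the text of a downloaded page.
page_text = (
    "Registration No: 12345X\n"
    "Legal Representative: Zhang San\n"
    "Established: 2010-01-01\n"
)
record = extract_fields(page_text)
```

Keeping the rules in data rather than code means a new site layout only needs new patterns, not new extraction logic.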
Webpage information extraction approaches include wrapper-based data extraction, machine-learning-based data extraction, HTML-structure-tree-based data extraction, and Web-query-based data extraction.
Wrapper-based data extraction relies on extraction rules or patterns that are written by hand. By analyzing a template to find the position of the body text in the page, it locates the body content precisely, with high extraction accuracy and high speed. However, this approach cannot handle the wide variety of webpages on the Web with a single template, so it is not generally applicable, and hand-written rules make it hard to keep the whole system logically consistent. In addition, extraction rules are written for particular fixed domains, so they are highly domain-specific and poorly portable, costly to create and maintain, and require a great deal of manual intervention.
Machine-learning-based data extraction preprocesses the webpage data, builds and parses the DOM tree, and uses a pre-trained model to classify the webpage, that is, to identify its structure (news, community forum, and so on). It then divides the page into blocks according to features such as text length, text position, and tag name, and extracts the relevant information. This meets basic needs such as title extraction, body text extraction, and page structure classification. However, a machine-learning-based scheme only supports general, relatively coarse-grained extraction and cannot extract specific fields precisely.
HTML-structure-tree-based data extraction first locates the information to be extracted according to structural features. It builds a tree from the structure of the HTML, expresses the extraction rules as regular expressions, and extracts data by operating on the tree.
Web-query-based data extraction treats the Web as an information source and extracts data from semi-structured Web documents, making the data more structured and its semantics clearer, which facilitates users' Web queries. It uses database technology to manage and query data on the Internet, converting the extraction of Web information into queries over Web documents written in a standard Web query language.
The basic strategy of these four approaches is similar: the data is first preprocessed and parsed into a DOM tree, then extracted in structured form according to rules, templates, or trained algorithms, and finally stored in a database. When extraction concentrates on information from one domain, such as specific enterprise information (for example legal information including various cases, judgment documents, and laws and regulations), extraction rules are easier to form than for the numerous and jumbled data of the Web at large, and the four extraction approaches above can be used together or in combination.
S403: store the extracted information in a database in structured form.
The extracted data information needs to be stored in a database so that it can be retrieved and used. The invention is mainly directed at data on information websites, where most data types are relatively uniform, such as the basic information of an enterprise. Candidate databases include:
1. MySQL database
MySQL is an open-source relational database management system and, in general, a very easy-to-use database. Its development environment is Windows, and it supports multiple languages. Because it makes operating on and managing data convenient, offers high performance at low cost, and has a multithreaded core, it has become the preferred choice for enterprises to store data.
2. MongoDB database
MongoDB is the most full-featured of the non-relational databases and the one most similar to a relational database storage system. It is a schema-free, collection-oriented document database. Compared with MySQL, its open-source code is backed by a commercial company, and it is considered more secure.
MongoDB scales data very well and sits between non-relational and relational databases. Incomplete data can be handled with its extension features, and different information can still be stored in separate documents. However, MongoDB lacks maintenance tools as mature as those of MySQL, which deserves attention from both development and IT operations, and MongoDB occupies too much space.
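As a rough sketch of step S403, the snippet below stores extracted fields in a relational table. It uses the standard-library `sqlite3` module purely as a stand-in so that the example runs without a database server; with MySQL the SQL would be similar but the driver and the upsert syntax differ. The table and column names are assumptions.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open the database and make sure the (hypothetical) table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS enterprise_info (
               enterprise TEXT NOT NULL,
               source_url TEXT NOT NULL,
               field      TEXT NOT NULL,
               value      TEXT,
               PRIMARY KEY (enterprise, source_url, field)
           )"""
    )
    return conn


def save_record(conn, enterprise, source_url, record):
    """Insert the extracted fields, overwriting values stored by an earlier crawl."""
    rows = [(enterprise, source_url, k, v) for k, v in record.items()]
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT OR REPLACE INTO enterprise_info VALUES (?, ?, ?, ?)", rows
        )


conn = open_store()
save_record(conn, "Example Co", "http://x.example/1", {"established": "2010-01-01"})
save_record(conn, "Example Co", "http://x.example/1", {"established": "2011-02-02"})
stored = conn.execute("SELECT field, value FROM enterprise_info").fetchall()
```

The one-row-per-field layout tolerates incomplete records; a document store such as MongoDB would instead keep each record as a single document.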
In further embodiments, the information acquisition method of the invention also includes:
S5: obtain update information for the webpage.
Information on the network changes quickly, so the crawler needs to periodically revisit the webpages it has crawled, detect whether they have changed, remove useless dead links, and update the database, so that users can obtain the latest information in time.
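One simple way to detect whether a page has changed is to compare a hash of its current content with the hash stored at the last visit. The sketch below assumes that approach, which the source does not specify, and takes an injected `fetch` callable that returns `None` for a dead link; all names are illustrative.

```python
import hashlib


def refresh(index, fetch):
    """Revisit every URL in `index` (url -> content hash) and report changes.

    `fetch` is any callable returning the page bytes, or None for a dead link.
    Returns (changed_urls, dead_urls) and updates `index` in place.
    """
    changed, dead = [], []
    for url in list(index):  # copy the keys because entries may be deleted
        content = fetch(url)
        if content is None:
            dead.append(url)
            del index[url]  # remove the dead link
            continue
        digest = hashlib.sha256(content).hexdigest()
        if digest != index[url]:
            changed.append(url)
            index[url] = digest  # remember the new version
    return changed, dead


old = hashlib.sha256(b"v1").hexdigest()
index = {"u1": old, "u2": old, "u3": old}
current_pages = {"u1": b"v1", "u2": b"v2", "u3": None}
changed, dead = refresh(index, current_pages.get)
```

Pages listed in `changed` would then be re-extracted and written back to the database, and a scheduler would call `refresh` at whatever interval suits the sites involved.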
The present invention also provides a kind of data messages to obtain system 100, as shown in Figure 1, including:
Device 110 is retrieved, with information unit 111 and receiving unit 112, for obtaining and the specified letter of information unit 111 The associated data message of manner of breathing.
For example, specified information can be company information input by user, retrieval device 110 is according to specified by user Enterprise name obtains and the company information needed for the relevant user of the enterprise.The corresponding webpage can be include with it is described The webpage of company-related information, for example, national credit information of enterprise publicity system, Chinese law court judgement document's net, China's execution Information discloses net, official website of State Intellectual Property Office, official website of trademark office of State Administration for Industry & Commerce, official of National Copyright Administration of the People's Republic of China net It stands and recruitment website etc., wherein, different webpages has different targetedly company informations.Specifically, obtain webpage information Mode can be to obtain the corresponding webpage based on http protocol, and pass through receiving unit 112 and receive the webpage returned Information.
Selection device 120, the selection device 120 are used for the webpage information obtained according to the retrieval device 110 Included in page layout mode, select search strategy.
It is needing further to crawl the object for including user's information needed in webpage after getting corresponding webpage information The page, and crawl need of work of the web crawlers on Web is carried out according to certain search strategy algorithm, search strategy algorithm Generally include following four ergodic algorithm:Depth-priority-searching method, breadth first algorithm, heuristic search algorithm, automatic classification are searched Rope algorithm.For common company information webpage, page layout generally all has the characteristics that:First layer has retrieval Entrance is all Enterprise Lists after, such as in national credit information of enterprise publicity systematic search " Huawei Technologies ", is passed through After the access entry search " Huawei Technologies " of first layer, it will show that enterprise of the name comprising " Huawei Technologies " of enterprise arranges at next layer Table.Therefore, for the webpage with above-mentioned layout characteristics, depth-first retrieval mode and breath first search mode are taken The method being combined scans for webpage URL, provides URL queues so that reptile is most fast to obtain page link.
Acquisition device 130 has web crawlers unit 131 and duplicate removal unit 132, for after search strategy is determined, obtaining Take the object page of the corresponding webpage acquired in the retrieval device 110.
Specifically, the data type obtained as needed, web crawlers unit 131 using given initial URL as entrance, into Row beam search obtains the recall precision of data to improve, since information is multifarious, it is only necessary to industry relevant information, nothing It need to travel through in entire internet.The mode that web crawlers unit 131 obtains the object page is using multi-threaded network reptile, is passed through The mode that depth-first retrieval and breath first search are combined retrieves webpage URL, crawls one or more institutes The URL of the object page is stated, and according to the URL downloaded object pages, wherein, web crawlers uses theme network crawler, is called focusing Web crawlers(Focused Crawler), refer to selectively creep those and the net of the theme related pages pre-defined Network reptile, due to Theme Crawler of Content and general reptile is most important only selects difference lies in Theme Crawler of Content and the setting relevant page of theme Face, it is thus possible to reduce the webpage quantity of time that reptile creeps and required traversal, improve recall precision.
In another embodiment of the invention, the deduplication unit 132 is further used to analyze the URL pages crawled by the web crawler unit 131 and to filter out duplicates. Specifically, because information on the internet is numerous and disorderly, a crawler may repeatedly enqueue URLs that are already in the to-be-crawled queue, which reduces its efficiency; deduplicating URLs while the crawler analyzes and extracts webpage URLs has therefore become increasingly important. The deduplication method should be efficient and should not take up much space, while still ensuring accuracy. Available methods include database-based deduplication, memory-based deduplication, and deduplication based on a Bloom filter.
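Of the three methods, Bloom-filter deduplication can be sketched as follows. This is a minimal illustration, not the patent's implementation: the class BloomFilter, the helper dedupe_urls, the bit-array size, and the use of SHA-256 to derive hash positions are all assumptions chosen for the example.

```python
import hashlib


class BloomFilter:
    """Fixed-size bit array with several hash positions per item."""

    def __init__(self, size_bits=1 << 16, num_hashes=4):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, item):
        # Derive num_hashes positions by salting the item with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


def dedupe_urls(urls, seen=None):
    """Return the URLs not yet recorded in the filter, recording them as it goes."""
    seen = seen if seen is not None else BloomFilter()
    fresh = []
    for url in urls:
        if url not in seen:
            seen.add(url)
            fresh.append(url)
    return fresh
```

A Bloom filter never misses a URL it has already recorded, but it can occasionally report a new URL as seen (a false positive), so a small fraction of pages may be skipped. That is the space-for-accuracy trade-off the paragraph above alludes to.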
Further, the deduplication unit 132 is also used to store in the URL queue the URLs that were obtained by the web crawler unit 131 and have been filtered and deduplicated by the deduplication unit 132.
The processing unit 140 has an address processing unit 141, a connection unit 142, an acquiring unit 143, and a preprocessing unit 144, and is used to extract the data information in the pages.
The address processing unit 141 obtains a URL address from the URL queue of object pages and performs DNS resolution on it, resolving the address in the URL to the service on the Web server so that the target server can be accessed. The connection unit 142 establishes a Socket connection between the client and the server, and the acquiring unit 143 sends an HTTP request to obtain the HTML data of the content page. After the HTML of the content page has been obtained, the preprocessing unit 144 performs encoding conversion and noise removal on the page.
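The sequence of DNS resolution, Socket connection, and HTTP request might look roughly like the sketch below. It rests on several assumptions: the function names build_get_request and fetch_html are hypothetical, only plain HTTP on an ASCII URL is handled, and TLS, redirects, and chunked transfer encoding are deliberately left out.

```python
import socket
from urllib.parse import urlsplit


def build_get_request(url):
    """Split a URL and build a minimal HTTP/1.1 GET request for it."""
    parts = urlsplit(url)
    host = parts.hostname
    port = parts.port or 80
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    host_header = host if port == 80 else f"{host}:{port}"
    request = (
        f"GET {path} HTTP/1.1\r\n"
        f"Host: {host_header}\r\n"
        "Connection: close\r\n"
        "\r\n"
    ).encode("ascii")
    return host, port, request


def fetch_html(url, timeout=10.0):
    """Resolve the host, open a socket, send the request, and return (headers, body)."""
    host, port, request = build_get_request(url)
    ip_address = socket.gethostbyname(host)  # DNS resolution step
    with socket.create_connection((ip_address, port), timeout=timeout) as conn:
        conn.sendall(request)
        chunks = []
        while True:
            data = conn.recv(4096)
            if not data:  # server closed the connection
                break
            chunks.append(data)
    header, _, body = b"".join(chunks).partition(b"\r\n\r\n")
    return header.decode("iso-8859-1"), body
```

The body is returned as raw bytes on purpose: decoding it to text is the encoding conversion that the description assigns to the preprocessing unit.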
Because an HTML file is self-describing semi-structured data, and semi-structured data is difficult for application programs to use directly, the processing unit 140 further includes a structuring unit 145 so that useful information can be extracted from the file. The structuring unit 145 structures the HTML data obtained by the acquiring unit 143 and extracts the required information it contains.
More specifically, the structuring unit 145 parses the HTML data and generates a DOM tree as shown in Figure 4. During content extraction it removes nodes unrelated to the body text, traverses the DOM tree recursively or with another algorithm to obtain the content nodes, and extracts their content. In some embodiments, the structuring unit can also customize templates for content that is explicitly required. The set of customized templates forms a page extraction rule base; when information is extracted, webpage text is extracted with regular expressions according to this rule base, and the information is checked against the templates for a match.
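The parse, prune, traverse, and match steps can be sketched with Python's standard html.parser module. Everything specific here is an assumption made for illustration: the simplified Node tree (not a full DOM), the NOISE_TAGS set, and the field labels in TEMPLATE are hypothetical and do not come from the patent.

```python
import re
from html.parser import HTMLParser

VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}  # elements with no end tag
NOISE_TAGS = {"script", "style", "nav", "footer"}  # assumed to be unrelated to body text


class Node:
    def __init__(self, tag, parent=None):
        self.tag, self.parent, self.children = tag, parent, []


class TreeBuilder(HTMLParser):
    """Builds a simplified DOM-like tree; text is stored as str children."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Climb to the nearest open element with this tag, if there is one.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        if data.strip():
            self.current.children.append(data.strip())


def parse_html(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def extract_text(node):
    """Recursively collect text, skipping subtrees rooted at noise tags."""
    if node.tag in NOISE_TAGS:
        return []
    parts = []
    for child in node.children:
        parts.extend([child] if isinstance(child, str) else extract_text(child))
    return parts


# Hypothetical extraction template: one regular expression per wanted field.
TEMPLATE = {
    "name": re.compile(r"^Company name:\s*(.+)$"),
    "capital": re.compile(r"^Registered capital:\s*(.+)$"),
}


def apply_template(lines, template):
    record = {}
    for line in lines:
        for field, pattern in template.items():
            match = pattern.match(line)
            if match:
                record[field] = match.group(1)
    return record
```

A rule base in the sense of the description would then be a collection of such templates, one per page type, with a page counted as matching a template when all of its fields are found.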
In some embodiments, the processing unit 140 further includes a policy selection unit 146, which selects an appropriate information extraction method when specific information of a particular field is to be extracted in a concentrated manner. The specific information can be enterprise information, including legal information such as various case matters, court judgment documents, and laws and regulations. The information extraction method can be a wrapper-based data extraction method, a machine-learning-based data extraction method, a data extraction method based on the HTML structure tree, a data extraction method based on Web queries, or any combination of these methods.
Further, the Information Acquisition System 100 also needs to store the extracted information in a database in structured form, so that it can be retrieved and used. The present invention is aimed primarily at data on information websites, which are characterized by relatively uniform data types, for example the basic information of enterprises.
In some embodiments, the Information Acquisition System further includes an updating device 150 for obtaining updated information from webpages.
Because information on the network is updated quickly, the crawler must regularly revisit the webpages it has already captured, detect whether they have changed, remove useless dead links, and update the database so that users can obtain the latest information in time.
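A revisit pass of this kind could be organized as in the sketch below. The approach shown, comparing a stored content fingerprint with a fingerprint of the freshly fetched page, is an assumption; the patent does not say how changes are detected. The names content_fingerprint and plan_updates are hypothetical, and fetching is left to the caller.

```python
import hashlib


def content_fingerprint(html_bytes):
    """Hash of the page body, used to tell whether a page has changed."""
    return hashlib.sha256(html_bytes).hexdigest()


def plan_updates(stored, fetched):
    """Compare stored fingerprints with freshly fetched pages.

    stored:  dict mapping URL -> fingerprint saved at the previous crawl
    fetched: dict mapping URL -> page bytes, or None if the link is dead
    Returns (changed_urls, dead_urls).
    """
    changed, dead = [], []
    for url, old_fingerprint in stored.items():
        body = fetched.get(url)
        if body is None:
            dead.append(url)  # candidate for removal from the database
        elif content_fingerprint(body) != old_fingerprint:
            changed.append(url)  # candidate for re-extraction and update
    return changed, dead
```

Hashing the whole body flags any byte-level difference, including rotating advertisements or timestamps, so a practical system would more likely fingerprint the extracted content than the raw HTML.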
In another embodiment, the present invention also provides an enterprise information search system 200 comprising a user interface 210, the Information Acquisition System 100, and a database 220. According to information input by the user, such as an enterprise name, the system looks up the corresponding enterprise information in the database 220 and outputs it according to a preset strategy. The database 220 can be a local database and/or a network database and stores the data information acquired by the Information Acquisition System 100. Further, so that users can obtain the required information more accurately and efficiently, the corresponding enterprise information is output in descending order of relevance.
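The lookup and ordered output can be illustrated with the sketch below. The relevance measure is purely an assumption, since the patent does not define how relevance is computed: here a record matches if its name contains the query, and shorter names (closer to an exact match) score higher. The names relevance and search_companies and the record layout are hypothetical.

```python
def relevance(query, name):
    """Toy relevance score: share of the name covered by the query, 0 if absent."""
    q, n = query.strip().lower(), name.lower()
    if not q or q not in n:
        return 0.0
    return len(q) / len(n)


def search_companies(query, records):
    """Return matching records sorted by relevance, highest first."""
    scored = [(relevance(query, record["name"]), record) for record in records]
    hits = [(score, record) for score, record in scored if score > 0]
    hits.sort(key=lambda pair: pair[0], reverse=True)
    return [record for _, record in hits]
```

In a deployed system the records would come from the database rather than an in-memory list, and the preset output strategy could equally be a database-side ordering.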
It should be noted that the terms "comprise" and "have", and any variations of them, are intended to cover non-exclusive inclusion. For example, a process, method, system, product, or device that contains a series of steps or units is not necessarily limited to the steps or units expressly listed, but may include other steps or units that are not expressly listed or that are inherent to the process, method, product, or device. In addition, those of ordinary skill in the art will understand that all or some of the steps of the above embodiments can be implemented in hardware, or by a program instructing the relevant hardware; the program can be stored in a computer-readable storage medium, such as a read-only memory, a magnetic disk, or an optical disc.
The above are merely preferred embodiments of the present invention and are not intended to limit it; for those of ordinary skill in the art, the invention may be modified and varied in various ways. Any modification, equivalent substitution, or improvement made within the spirit and principles of the invention shall fall within its scope of protection.

Claims (38)

1. A data information acquisition method for acquiring data information associated with specified information, characterized in that the method comprises:
acquiring corresponding webpage information according to the specified information;
determining a search strategy according to the layout of the webpage;
acquiring object pages according to the search strategy; and
extracting the data information from the pages.
2. The data information acquisition method according to claim 1, characterized in that acquiring corresponding webpage information according to the specified information comprises:
acquiring the corresponding webpage based on the HTTP protocol, and receiving the returned webpage information.
3. The data information acquisition method according to any one of claims 1-2, characterized in that the search strategy comprises depth-first retrieval, breadth-first retrieval, and/or a combination of the two.
4. The data information acquisition method according to any one of claims 1-3, characterized in that, in determining the search strategy according to the layout of the webpage:
the webpage layout comprises a first-layer retrieval entry and a second-layer information list.
5. The data information acquisition method according to claim 1, characterized in that acquiring the object pages according to the search strategy comprises:
acquiring the URLs of one or more object pages with a multi-threaded web crawler and downloading the object pages.
6. The data information acquisition method according to claim 5, characterized in that the multi-threaded web crawler is a focused web crawler (Focused Crawler).
7. The data information acquisition method according to claim 5, characterized in that acquiring the URLs of the one or more object pages further comprises performing a deduplication operation on the URLs of the webpages, the deduplication operation being database-based deduplication, memory-based deduplication, and/or Bloom-filter-based deduplication.
8. The data information acquisition method according to claim 1, characterized in that extracting the data information from the pages comprises:
obtaining a URL address from the URL queue, performing DNS resolution on the URL address, establishing a Socket connection with the server corresponding to the URL, and sending a request to obtain the HTML data file of the page, wherein the HTML data file contains the data information.
9. The data information acquisition method according to claim 8, characterized in that, after the HTML data file is obtained, the method further comprises preprocessing the HTML file by encoding conversion and denoising.
10. The data information acquisition method according to claim 9, characterized in that, after the preprocessing by encoding conversion and denoising, the method further comprises: structuring the HTML file.
11. The data information acquisition method according to claim 10, characterized in that structuring the HTML file comprises parsing its content and generating a DOM (Document Object Model) tree, removing irrelevant nodes, traversing the tree to obtain content nodes, and customizing templates for the required content.
12. The data information acquisition method according to claim 11, characterized in that customizing templates for the required content comprises extracting the content to be obtained by pattern matching and replacement, to obtain structured information data.
13. The data information acquisition method according to claim 10, characterized in that the method further comprises: selecting, according to the field to which the specified information relates, a strategy for extracting the data information contained in the HTML file.
14. The data information acquisition method according to claim 13, characterized in that the strategy for extracting the data information contained in the HTML file comprises a wrapper-based data extraction method, a machine-learning-based data extraction method, a data extraction method based on the HTML structure tree, a data extraction method based on Web queries, or any combination of these methods.
15. The data information acquisition method according to claim 1, characterized in that the method further comprises obtaining updated information of the webpage, the step of obtaining the updated information of the webpage comprising regularly revisiting captured webpages, detecting whether the webpages have changed, removing dead links, and/or updating the database.
16. The data information acquisition method according to any one of claims 1-15, characterized in that the specified information is an enterprise name, and the data information is data information related to the enterprise.
17. An information acquisition system for acquiring data information associated with information specified (by a user), characterized in that the system comprises:
a retrieval device, a selection device, an acquisition device, and a processing unit;
wherein
the retrieval device further comprises an information unit and is used to acquire corresponding webpage information according to the specified information of the information unit;
the selection device is used to select a search strategy according to the webpage layout contained in the webpage information acquired by the retrieval device;
the acquisition device is used to acquire the object pages of the corresponding webpage acquired by the retrieval device;
and
the processing unit is used to extract the data information from the pages.
18. The information acquisition system according to claim 17, characterized in that the retrieval device is used to acquire the corresponding webpage based on the HTTP protocol and further comprises a receiving unit for receiving the returned webpage information.
19. The information acquisition system according to claim 17, characterized in that the search strategy comprises depth-first retrieval, breadth-first retrieval, and/or a combination of the two.
20. The information acquisition system according to claim 19, characterized in that the webpage layout comprises a first-layer retrieval entry and a second-layer information list.
21. The information acquisition system according to claim 17, characterized in that the acquisition device further comprises a web crawler unit, the web crawler unit acquiring the URLs of one or more corresponding pages with a multi-threaded web crawler and downloading the corresponding pages.
22. The information acquisition system according to claim 21, characterized in that the multi-threaded web crawler is a focused web crawler (Focused Crawler).
23. The information acquisition system according to claim 21, characterized in that the acquisition device further comprises a deduplication unit for performing a deduplication operation on the URLs of the webpages, the deduplication operation being database-based deduplication, memory-based deduplication, and/or Bloom-filter-based deduplication.
24. The information acquisition system according to claim 17, characterized in that the processing unit comprises:
an address processing unit, for obtaining URL addresses from the URL queue and performing DNS resolution on the URL addresses;
a connection unit, for establishing a Socket connection with the server corresponding to the URL; and
an acquiring unit, for sending a request to obtain the HTML data file of the page, wherein the HTML data file contains the data information.
25. The information acquisition system according to claim 24, characterized in that the processing unit further comprises a preprocessing unit for preprocessing the HTML file by encoding conversion and denoising after the acquiring unit has obtained the HTML data file.
26. The information acquisition system according to claim 25, characterized in that the processing unit further comprises a structuring unit for structuring the HTML file after the preprocessing by encoding conversion and denoising.
27. The information acquisition system according to claim 26, characterized in that structuring the HTML file comprises parsing its content and generating a DOM (Document Object Model) tree, removing irrelevant nodes, traversing the tree to obtain content nodes, and customizing templates for the required content.
28. The information acquisition system according to claim 27, characterized in that customizing templates for the required content comprises extracting the content to be obtained by pattern matching and replacement, to obtain structured information data.
29. The information acquisition system according to claim 24, characterized in that the processing unit further comprises a policy selection unit for selecting, according to the field to which the specified information relates, a strategy for extracting the data information contained in the HTML file.
30. The information acquisition system according to claim 29, characterized in that the strategy comprises a wrapper-based data extraction method, a machine-learning-based data extraction method, a data extraction method based on the HTML structure tree, a data extraction method based on Web queries, or any combination of these methods.
31. The information acquisition system according to claim 17, characterized in that the system further comprises an updating device for obtaining updated information of the webpage, obtaining the updated information of the webpage comprising regularly revisiting captured webpages, detecting whether the webpages have changed, removing dead links, and/or updating the database.
32. The information acquisition system according to any one of claims 17-31, characterized in that the specified information is an enterprise name, and the data information is data information related to the enterprise.
33. A computer-readable medium storing a database, the database being used to store enterprise data information, characterized in that the enterprise data information is information acquired by the method according to any one of claims 1-16 or by the system according to any one of claims 17-32.
34. The computer-readable medium according to claim 33, characterized in that the database is a MySQL database or a MongoDB database.
35. An enterprise information search system comprising a database, characterized in that the database is the database according to any one of claims 33-34, and the system, according to information input by a user, searches the database for enterprise information corresponding to the information input by the user and outputs it according to a preset strategy.
36. The enterprise information search system according to claim 35, characterized in that the preset strategy is output in order of relevance.
37. The enterprise information search system according to any one of claims 35-36, characterized in that the information input by the user is an enterprise name.
38. A computer-readable medium having instructions stored thereon, characterized in that the instructions can be read by a computer to perform the information acquisition method according to any one of claims 1-16.
CN201711381367.2A 2017-12-20 2017-12-20 Network information acquisition method and system and enterprise information search system Active CN108052632B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711381367.2A CN108052632B (en) 2017-12-20 2017-12-20 Network information acquisition method and system and enterprise information search system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711381367.2A CN108052632B (en) 2017-12-20 2017-12-20 Network information acquisition method and system and enterprise information search system

Publications (2)

Publication Number Publication Date
CN108052632A true CN108052632A (en) 2018-05-18
CN108052632B CN108052632B (en) 2022-02-18

Family

ID=62130268

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711381367.2A Active CN108052632B (en) 2017-12-20 2017-12-20 Network information acquisition method and system and enterprise information search system

Country Status (1)

Country Link
CN (1) CN108052632B (en)

Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030200205A1 (en) * 2001-08-23 2003-10-23 Michael Meiresonne Method, process, and system for searching and identifying sources of goods and/or services over the internet
US20040111401A1 (en) * 2002-12-10 2004-06-10 Yuan-Chi Chang Using text search engine for parametric search
CN1920817A (en) * 2006-09-14 2007-02-28 浙江大学 Method for multiple resources pools integral parallel search in open websites
US20080281788A1 (en) * 2007-05-09 2008-11-13 Ophir Frieder Hierarchical structured abstract file system
US20110307479A1 (en) * 2010-06-10 2011-12-15 Microsoft Corporation Automatic Extraction of Structured Web Content
CN102694772A (en) * 2011-03-23 2012-09-26 腾讯科技(深圳)有限公司 Apparatus, system and method for accessing internet web pages
CN102930059A (en) * 2012-11-26 2013-02-13 电子科技大学 Method for designing focused crawler
CN103049542A (en) * 2012-12-27 2013-04-17 北京信息科技大学 Domain-oriented network information search method
US8458227B1 (en) * 2010-06-24 2013-06-04 Amazon Technologies, Inc. URL rescue by identifying information related to an item referenced in an invalid URL
CN104899268A (en) * 2015-05-25 2015-09-09 浪潮集团有限公司 Distributed enterprise information vertical search method
CN104978408A (en) * 2015-08-05 2015-10-14 许昌学院 Berkeley DB database based topic crawler system
CN105868327A (en) * 2016-03-28 2016-08-17 浪潮软件集团有限公司 Distributed web crawler capturing method based on different updating strategies
CN106462645A (en) * 2016-01-07 2017-02-22 马岩 Network information search method and system
CN106709052A (en) * 2017-01-06 2017-05-24 电子科技大学 Keyword based topic-focused web crawler design method

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
R.JEBERSON RETNA RAJ ET AL.: "Implementation of template independent web news extraction approach, noise removal and structured data detection to improve search for location based services", 《2017 INTERNATIONAL CONFERENCE ON POWER AND EMBEDDED DRIVE CONTROL (ICPEDC)》 *
彭冬等: "面向Web论坛的网络信息获取技术及系统实现", 《计算机工程与科学》 *
翟东升 编著: "《专利知识挖掘关键技术研究》", 31 January 2013, 北京:知识产权出版社 *
肖雷: "面向论坛的文本特征提取及分类技术研究", 《中国优秀硕士学位论文全文数据库信息科技辑》 *
贾海蕾等: "企业信息门户中网络蜘蛛的设计与实现", 《软件导刊》 *

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109033203A (en) * 2018-06-29 2018-12-18 大连交通大学 A kind of feature extraction method for parallel processing towards big data
CN109657121A (en) * 2018-12-09 2019-04-19 佛山市金穗数据服务有限公司 A kind of Web page information acquisition method and device based on web crawlers
CN109902217A (en) * 2019-03-20 2019-06-18 江苏科技大学 A kind of crawler software of astronomy data screening and downloading
CN111274217A (en) * 2020-01-10 2020-06-12 深圳前海环融联易信息科技服务有限公司 Data acquisition method and device, computer equipment and storage medium
CN111310012A (en) * 2020-01-21 2020-06-19 国网安徽省电力有限公司滁州供电公司 Automatic monitoring and early warning method for enterprise information loss behavior
TWI764491B (en) * 2020-12-31 2022-05-11 重量科技股份有限公司 Text information automatically mining method and system
CN113157730A (en) * 2021-04-26 2021-07-23 中国人民解放军军事科学院国防科技创新研究院 Civil-military fusion policy information system
CN113343108A (en) * 2021-06-30 2021-09-03 中国平安人寿保险股份有限公司 Recommendation information processing method, device, equipment and storage medium
CN113343108B (en) * 2021-06-30 2023-05-26 中国平安人寿保险股份有限公司 Recommended information processing method, device, equipment and storage medium
CN116361362A (en) * 2023-05-30 2023-06-30 江西顶易科技发展有限公司 User information mining method and system based on webpage content identification
CN116361362B (en) * 2023-05-30 2023-08-11 江西顶易科技发展有限公司 User information mining method and system based on webpage content identification

Also Published As

Publication number Publication date
CN108052632B (en) 2022-02-18

Similar Documents

Publication Publication Date Title
CN108052632A (en) A kind of method for obtaining network information, system and company information search system
US8868621B2 (en) Data extraction from HTML documents into tables for user comparison
CN101971172B (en) Mobile sitemaps
US9613149B2 (en) Automatic mapping of a location identifier pattern of an object to a semantic type using object metadata
DE19718834B4 (en) Navigation in hypermedia using soft hyperlinks
US8380693B1 (en) System and method for automatically identifying classified websites
CN102760151B (en) Implementation method of open source software acquisition and searching system
CN104715064B (en) It is a kind of to realize the method and server that keyword is marked on webpage
CN110704411A (en) Knowledge graph building method and device suitable for art field and electronic equipment
JP2003076715A (en) Method and system for retrieving web pages, program and recording medium
CN103399862B (en) Determine the method and apparatus of search index information corresponding to target query sequence
CN110019616A (en) A kind of POI trend of the times state acquiring method and its equipment, storage medium, server
CN104391978A (en) Method and device for storing and processing web pages of browsers
KR20170073693A (en) Extracting similar group elements
CN110110171A (en) Enterprise information searching method, device and electronic equipment
Devi et al. An efficient approach for web indexing of big data through hyperlinks in web crawling
WO2007057809A2 (en) Method of obtaining a representation of a text
CN106776640A (en) A kind of stock information information displaying method and device
US20200293581A1 (en) Systems and methods for crawling web pages and parsing relevant information stored in web pages
CN116226494B (en) Crawler system and method for information search
CN100357942C (en) Mobile internet intelligent information retrieval engine based on key-word retrieval
Ganguly et al. Performance optimization of focused web crawling using content block segmentation
CN106326353A (en) Method and equipment for providing representation information
KR100496384B1 (en) Search engine, search system, method for making a database in a search system, and recording media
Färber Linked Crunchbase: A linked data API and RDF data set about innovative companies

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant