CN108052632A - Method and system for obtaining network information, and company information search system - Google Patents
- Publication number
- CN108052632A CN201711381367.2A
- Authority
- CN
- China
- Prior art keywords
- information
- data
- webpage
- data message
- page
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/958—Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/23—Updating
- G06F16/2379—Updates performed during online database operations; commit processing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Transfer Between Computers (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The present invention relates to a method and system for obtaining network information and to a company information search system. The method includes obtaining web page information associated with specified information, obtaining the object pages of the associated web pages according to a selected search strategy, and extracting the data information in the object pages. By combining crawler technology with a targeted search strategy, the invention mines data from the deep web, enables users to obtain a large amount of valid data in a short time, avoids querying each independent website one by one, provides users with a one-stop information service, and improves the efficiency of data collection.
Description
Technical field
The present invention relates to a method and system for obtaining network information, and in particular to a crawler-based method and system for obtaining web page information.
Background
In the current era of big data, the vast resources on the network overwhelm users, and widely distributed, easily obtained information has emerged accordingly. For example, information about an enterprise can be looked up directly on relevant official websites, including the National Enterprise Credit Information Publicity System, China Judgments Online, the China Enforcement Information Disclosure website, the official website of the State Intellectual Property Office, the official website of the Trademark Office of the State Administration for Industry and Commerce, the official website of the National Copyright Administration, and recruitment websites. However, the company information covered by these websites differs: the National Enterprise Credit Information Publicity System includes business license information, key personnel, and similar information; judgment document websites mainly carry judgment information; government websites generally contain business credit data and bid-winning data; and recruitment websites mostly concern positions, wages, and the like. Different information sources thus sit on different network platforms, and the data on each platform is typically independent and not shared. Obtaining targeted information about one or more enterprises therefore requires querying several platforms, which is cumbersome for users.
On the other hand, an enterprise's industrial and commercial registration information, recruitment information, related judgment documents, and intellectual property information have the character of the deep web. The deep web is defined in contrast to the surface web and refers to content that cannot be obtained through general search operations. To obtain the required network information and resources effectively and conveniently, users rely on search engines, a common information retrieval tool, as their entrance and platform for accessing the internet. General search engines have certain limitations, however: deep web content is often difficult to obtain, and they return a large number of web pages that users do not care about, which reduces the efficiency of obtaining effective information. Some platforms that provide company information also suffer from untimely updates.
Therefore, how to obtain comprehensive company information conveniently is a problem in current network information acquisition. It is necessary to propose a method and system for effectively acquiring internet data, so that the latest information on the enterprises a user needs can be obtained in a targeted manner.
Summary of the invention
The way traditional company information websites obtain data has several limitations. First, traditional search engines cannot accurately and effectively obtain the complete, high-quality data that is largely hidden in the deep web. Second, searching by traversing information item by item wastes substantial system resources, so that information acquisition takes too long and is inefficient. In view of these problems, the present invention discloses a method and system for obtaining network information, in particular a crawler-based method and system for obtaining web page information, for obtaining the required company-related information.
To address the problems in the above information acquisition process, the present invention proposes a data information acquisition method for obtaining data information associated with information specified by a user. The method includes: obtaining corresponding web page information according to the specified information; determining a search strategy according to the page layout; obtaining an object page according to the search strategy; and extracting the data information in the page.
Further, obtaining the corresponding web page information according to the specified information includes: obtaining the corresponding web page based on the HTTP protocol, and receiving the returned web page information.
Further, the search strategy includes depth-first retrieval, breadth-first retrieval, and/or a combination of the two.
Further, obtaining the object page according to the search strategy includes: obtaining the URLs of one or more object pages with a multi-threaded web crawler and downloading the object pages.
Further, extracting the data information in the page includes: obtaining a URL address from the URL queue, performing DNS domain name resolution on the URL address, establishing a socket connection to the server corresponding to the URL, and sending a request to obtain the HTML data file of the page, wherein the HTML data file contains the data information.
Further, the method also includes obtaining update information of the web page, which includes periodically revisiting crawled web pages, detecting whether a web page has changed, removing dead links, and/or updating the database.
The specified information is an enterprise name, and the data information is data information related to that enterprise.
In another aspect, based on the data information acquisition method proposed above, the present invention also proposes a data information acquisition system. The system includes a retrieval device, a selection device, an acquisition device, and a processing device. The retrieval device further includes an information unit and is used to obtain corresponding web page information according to the specified information of the information unit. The selection device is used to select a search strategy according to the page layout contained in the web page information obtained by the retrieval device. The acquisition device is used to obtain the object page of the corresponding web page obtained by the retrieval device. The processing device is used to extract the data information in the page.
Further, the retrieval device is used to obtain the corresponding web page based on the HTTP protocol, and further includes a receiving unit for receiving the returned web page information.
Further, the search strategy includes depth-first retrieval, breadth-first retrieval, and/or a combination of the two.
Further, the acquisition device also includes a web crawler unit, which obtains the URLs of one or more corresponding pages with a multi-threaded web crawler and downloads the corresponding pages.
Further, the processing device also includes: an address processing unit for obtaining a URL address from the URL queue and performing DNS domain name resolution on it; a connection unit for establishing a socket connection to the server corresponding to the URL; and an acquiring unit for sending a request to obtain the HTML data file of the page, wherein the HTML data file contains the data information.
Further, the system also includes an updating device for obtaining update information of the web page, which includes periodically revisiting crawled web pages, detecting whether a web page has changed, removing dead links, and/or updating the database.
The specified information is an enterprise name, and the data information is data information related to that enterprise.
In conclusion data message acquisition methods disclosed in this invention and system, by crawler technology and with being directed to
The search strategy of property can complete data in the excavation netted deeply, user is made to get substantial amounts of valid data in a short time, is kept away
Exempt to each independent Web to inquire about one by one, provided one-stop information service to the user, improve the efficiency of gathered data.
Description of the drawings
Fig. 1 shows the information acquisition method provided by one embodiment of the invention;
Fig. 2 shows the method for obtaining the object page provided by another embodiment of the invention;
Fig. 3 shows the method for structured extraction of page data information provided by another embodiment of the invention;
Fig. 4 is a schematic diagram of the DOM tree provided by an embodiment of the invention;
Fig. 5 shows the information acquisition system provided by an embodiment of the invention;
Fig. 6 shows the company information search system provided by another embodiment of the invention.
Detailed description
To enable those skilled in the art to better understand the technical solution of the present invention, the technical solution is described clearly and completely below with reference to the accompanying drawings. Obviously, the detailed description below covers only some embodiments of the present invention; other embodiments, or combinations thereof, obtained by those skilled in the art without creative work on the basis of the following embodiments fall within the technical concept and protection scope of the present invention.
One embodiment of the invention provides a data information acquisition method which, as shown in Fig. 1, comprises the following steps:
S1: Obtain corresponding web page information according to the specified information.
The data information acquisition method is used to obtain data information associated with specified information. For example, the company information a user needs about an enterprise can be obtained according to the enterprise name specified by the user. The corresponding web pages may be web pages containing information related to the enterprise, for example the National Enterprise Credit Information Publicity System, China Judgments Online, the China Enforcement Information Disclosure website, the official website of the State Intellectual Property Office, the official website of the Trademark Office of the State Administration for Industry and Commerce, the official website of the National Copyright Administration, and recruitment websites, where different web pages carry different, targeted company information. Specifically, the web page information may be obtained by requesting the corresponding web page over the HTTP protocol and receiving the returned web page information.
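Step S1 can be illustrated with a minimal Python sketch that turns an enterprise name into an HTTP request for one information source. The endpoint URLs, the `keyword` parameter, and the `User-Agent` value are hypothetical placeholders, not the interfaces of the websites named above.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Hypothetical search endpoints; real sites use their own URLs and parameters.
SOURCES = {
    "credit": "https://credit.example.org/search",
    "court": "https://court.example.org/search",
}

def build_request(source, enterprise_name):
    """Build an HTTP GET request that searches one source for the enterprise."""
    query = urlencode({"keyword": enterprise_name})
    return Request(
        f"{SOURCES[source]}?{query}",
        headers={"User-Agent": "example-crawler/0.1"},
    )

def fetch(request, timeout=10.0):
    """Send the request and return the raw body of the returned web page."""
    with urlopen(request, timeout=timeout) as response:
        return response.read()

req = build_request("credit", "Huawei Technologies")
print(req.full_url)
```

Calling `fetch(req)` would perform the actual network access; it is left out of the usage line because the endpoints here are placeholders.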
S2: Determine the search strategy according to the page layout.
After the corresponding web page information has been obtained, the object pages containing the information the user needs must be crawled from the web page. A web crawler's crawling work on the Web follows a search strategy algorithm, and search strategy algorithms generally include the following four traversal algorithms: depth-first search, breadth-first search, heuristic search, and automatic classification search. The page layout of common company information websites generally has the following characteristic: the first layer provides a search entry, and the layers after it are all enterprise lists. For example, when "Huawei Technologies" is searched in the National Enterprise Credit Information Publicity System through the search entry on the first layer, the next layer shows a list of enterprises whose names contain "Huawei Technologies". Therefore, for web pages with this layout characteristic, web page URLs are searched with a combination of depth-first retrieval and breadth-first retrieval, and a URL queue is supplied so that the crawler obtains page links as quickly as possible.
Specifically, the idea of depth-first search is to traverse a graph as deeply as possible: starting from some vertex of the graph, all vertices are visited and each vertex is visited only once, a process called graph traversal. Breadth-first search instead traverses downward layer by layer; unlike depth-first search, it avoids descending endlessly along a single branch.
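One possible way to combine the two strategies is sketched below: the upper layers (search entry and result lists) are expanded breadth first, and each listed page is then followed depth first. The link graph, the page names, and the `bfs_depth` parameter are illustrative assumptions; the patent does not specify how the two strategies are combined.

```python
from collections import deque

# Hypothetical link graph: a search page, its result lists, and detail pages.
LINKS = {
    "search": ["list-1", "list-2"],
    "list-1": ["company-a", "company-b"],
    "list-2": ["company-c"],
    "company-a": ["company-a/patents"],
    "company-b": [],
    "company-c": ["company-a"],  # cross link back to an already visited page
    "company-a/patents": [],
}

def crawl_order(start, bfs_depth=1):
    """Visit the first `bfs_depth` layers breadth first, then go depth first."""
    visited = {start}
    order = [start]
    queue = deque([(start, 0)])
    deep_roots = []
    # Breadth-first phase: expand layer by layer down to bfs_depth.
    while queue:
        url, depth = queue.popleft()
        if depth == bfs_depth:
            deep_roots.append(url)
            continue
        for nxt in LINKS.get(url, []):
            if nxt not in visited:
                visited.add(nxt)
                order.append(nxt)
                queue.append((nxt, depth + 1))
    # Depth-first phase: follow each remaining branch to its end.
    for root in deep_roots:
        stack = list(reversed(LINKS.get(root, [])))
        while stack:
            url = stack.pop()
            if url in visited:
                continue
            visited.add(url)
            order.append(url)
            stack.extend(reversed(LINKS.get(url, [])))
    return order

print(crawl_order("search"))
```

The `visited` set ensures each page is visited only once, which is what keeps either traversal from looping on cross links.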
S3: Obtain the object page according to the search strategy.
After the search strategy has been determined, the object pages containing the required information are obtained based on it. According to the type of data the present invention needs to obtain, a directed search is performed from given URLs to improve retrieval efficiency: because information is highly varied and only industry-related information is needed, there is no need to traverse the entire internet. Specifically, the object pages are obtained with a multi-threaded web crawler that retrieves web page URLs using a combination of depth-first and breadth-first retrieval, crawls the URLs of one or more object pages, and downloads the object pages according to those URLs. The web crawler is a topical web crawler, also called a focused web crawler (Focused Crawler), that is, a web crawler that selectively crawls pages related to a predefined topic. The most important difference between a topical crawler and a general crawler is that the topical crawler selects only pages related to the set topic, which reduces both the crawling time and the number of web pages to traverse and improves retrieval efficiency.
In another embodiment of the invention, obtaining the object page according to the search strategy further includes the steps shown in Fig. 2:
S301: Using a web crawler, crawl and traverse with the initial URL page as the entry point.
The web crawler can collect with multiple threads and thus capture web page content more effectively and quickly. Preferably, a topical web crawler is used, which selectively crawls pages related to a predefined type of required information. The crawler's crawling work on the Web follows a strategy algorithm, such as depth-first search, breadth-first search, heuristic search, or automatic classification search. The idea of depth-first search is to traverse a graph as deeply as possible: starting from some vertex of the graph, all vertices are visited and each vertex is visited only once, a process called graph traversal. Breadth-first search instead traverses downward layer by layer; unlike depth-first search, it avoids descending endlessly along a single branch. Given the layout of internet pages that carry company information data, where the first layer provides a search entry and the layers after it are all enterprise lists, web page URLs can be searched with a combination of depth-first and breadth-first search, so that the crawler obtains page links as quickly as possible.
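The multi-threaded, topic-focused collection described in S301 might look like the following sketch. The `download` function is a stand-in that reads from an in-memory dictionary instead of the network, and the keyword test in `is_on_topic` is a deliberately simple assumption about how topic relevance could be judged.

```python
from concurrent.futures import ThreadPoolExecutor

# In-memory stand-in for the Web so the sketch runs without network access.
FAKE_WEB = {
    "https://example.org/a": "Acme Ltd business license and key personnel",
    "https://example.org/b": "Weather forecast for the weekend",
    "https://example.org/c": "Acme Ltd trademark registration",
}

def download(url):
    """Stand-in for an HTTP download; returns (url, page text)."""
    return url, FAKE_WEB.get(url, "")

def is_on_topic(text, keywords):
    """Keep a page only if it mentions at least one topic keyword."""
    text = text.lower()
    return any(keyword.lower() in text for keyword in keywords)

def crawl(urls, keywords, workers=4):
    """Download pages on several threads and keep the on-topic ones."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        pages = list(pool.map(download, urls))
    return {url: text for url, text in pages if is_on_topic(text, keywords)}

pages = crawl(list(FAKE_WEB), ["business license", "trademark"])
print(sorted(pages))
```

Threads suit this workload because downloads spend most of their time waiting on the network rather than computing.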
S302: Analyze the crawled URL pages and filter out duplicates.
Internet information is vast and complex, and while crawling, a crawler may repeatedly enqueue URLs that already exist in the to-be-crawled queue, which reduces its efficiency. Deduplicating URLs while the crawler analyzes pages and extracts URLs is therefore increasingly important. The tension here is that crawling itself places high demands on storage space and speed, and deduplication inevitably affects the crawler's efficiency as well. A deduplication method should be efficient and compact while remaining accurate; available methods include database-based deduplication, memory-based deduplication, and deduplication based on a Bloom filter.
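Of the three deduplication options, the Bloom filter is the least obvious, so a minimal sketch follows. The bit-array size, the number of hash functions, and the use of seeded SHA-256 digests are illustrative choices, not parameters taken from the patent.

```python
import hashlib

class BloomFilter:
    """Fixed-size bit array probed by several seeded hashes."""

    def __init__(self, size_bits=1 << 16, num_hashes=4):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions from seeded SHA-256 digests.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )

def dedupe(urls):
    """Return the URLs in order, dropping any the filter has already seen."""
    seen = BloomFilter()
    unique = []
    for url in urls:
        if url not in seen:
            seen.add(url)
            unique.append(url)
    return unique

print(dedupe(["http://a.example/1", "http://a.example/2", "http://a.example/1"]))
```

A Bloom filter never misses a URL it has stored, but it can occasionally report an unseen URL as seen, so a small fraction of new pages may be skipped. That trade-off is what buys the compact memory footprint.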
S303: Obtain the optimized URL queue.
The URLs obtained by the web crawler and filtered for duplicates are stored in the URL queue.
S4: Extract the data information in the page.
After the object page has been obtained, its content must be extracted by the crawler. For text extraction, the web crawler first takes a URL address from the URL queue of object pages and performs DNS domain name resolution on it, resolving the address in the URL to the Web server that hosts the target service. It then establishes a socket connection between client and server and sends an HTTP request to obtain the HTML data of the content page. After the HTML of the content page has been obtained, the page must be preprocessed with encoding conversion and web page denoising.
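The sequence in S4 (take a URL, resolve the host name, open a socket, send the request) can be sketched with Python's standard `socket` module. This is a simplified illustration that assumes plain HTTP on port 80 by default; HTTPS would additionally require wrapping the socket with TLS, and a production crawler would normally use a higher-level HTTP client.

```python
import socket
from urllib.parse import urlsplit

def build_get_request(url):
    """Split a URL and build the raw HTTP/1.1 GET request for it."""
    parts = urlsplit(url)
    host = parts.hostname
    port = parts.port or 80
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    host_header = host if port == 80 else f"{host}:{port}"
    request = (
        f"GET {path} HTTP/1.1\r\n"
        f"Host: {host_header}\r\n"
        "Connection: close\r\n"
        "\r\n"
    ).encode("ascii")
    return host, port, request

def fetch_html(url, timeout=10.0):
    """Resolve the host, open a TCP socket, send the request, read the reply."""
    host, port, request = build_get_request(url)
    # DNS resolution: host name -> socket address of the server.
    family, socktype, proto, _, address = socket.getaddrinfo(
        host, port, type=socket.SOCK_STREAM
    )[0]
    with socket.socket(family, socktype, proto) as sock:
        sock.settimeout(timeout)
        sock.connect(address)
        sock.sendall(request)
        chunks = []
        while True:
            chunk = sock.recv(4096)
            if not chunk:
                break
            chunks.append(chunk)
    return b"".join(chunks)  # status line, headers, and HTML body

print(build_get_request("http://example.com/company?id=1"))
```

`fetch_html` returns the full response, so the caller still has to split the headers from the HTML body before preprocessing.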
Because an HTML file is self-describing semi-structured data, and semi-structured data is difficult for application programs to use directly, it must be structured before useful information can be extracted from it. The structured information extraction method further includes the steps shown in Fig. 3:
S401: Parse the page and generate a DOM tree.
A DOM (Document Object Model) tree is shown in Fig. 4. Once the DOM tree is built, the entire web page code is a tree whose nodes are the tags. During content extraction, nodes unrelated to the body text are removed, and the DOM tree is traversed recursively or with another algorithm to find the content nodes and extract their content; comment content is handled by deleting the node from the DOM tree and then continuing the traversal. An HTML document can be parsed into a DOM tree with Beautiful Soup, a third-party Python library. The library's main function is scraping data from web pages; it provides functions for navigating, searching, and modifying the parse tree, automatically converts the encoding of input documents, and offers users different parsing strategies and speeds.
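The tree building and recursive traversal of S401 can be sketched without third-party dependencies. The example below deliberately swaps Beautiful Soup for the standard-library `html.parser` module so that it runs anywhere; the `Node` class, the `SKIP_TAGS` set, and the `VOID_TAGS` set are assumptions made for illustration.

```python
from html.parser import HTMLParser

SKIP_TAGS = {"script", "style"}  # nodes unrelated to the body text
VOID_TAGS = {"br", "img", "meta", "link", "input", "hr"}  # tags without an end tag

class Node:
    def __init__(self, tag, parent=None):
        self.tag, self.parent, self.children = tag, parent, []

class TreeBuilder(HTMLParser):
    """Build a simple tree of Node objects; text is stored as str children."""

    def __init__(self):
        super().__init__()
        self.root = Node("document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Climb to the nearest open element with this tag, if there is one.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        if data.strip():
            self.current.children.append(data.strip())
    # HTML comments are dropped because handle_comment is not overridden.

def parse(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root

def extract_text(node):
    """Recursively collect text, skipping subtrees unrelated to the content."""
    if isinstance(node, str):
        return [node]
    if node.tag in SKIP_TAGS:
        return []
    texts = []
    for child in node.children:
        texts.extend(extract_text(child))
    return texts

page = ("<html><head><title>T</title><script>var x=1;</script></head>"
        "<body><!-- note --><h1>Acme Ltd</h1><p>Founded <b>1999</b></p></body></html>")
print(extract_text(parse(page)))
```

With Beautiful Soup the same steps are typically much shorter, since the library builds the tree and exposes navigation and search directly.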
S402: Template-based structured information extraction.
In some embodiments, templates can be customized for content that is explicitly required, and the set of customized templates forms a page extraction rule base. During extraction, the text information of the web page is extracted with regular expressions according to the page extraction rule base, and the information is checked against the template. Regular expressions provide powerful pattern matching and replacement. A pattern is a character string, and a pattern matching expression is composed of unary and binary operators; spaces and tabs can be used to separate keywords.
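A template in the sense of S402 can be pictured as a small rule base that maps field names to regular expressions. The field names, label strings, and sample text below are hypothetical and serve only to show the extract-then-check flow.

```python
import re

# Hypothetical extraction rule base: field name -> regular expression.
TEMPLATE = {
    "name": re.compile(r"Company name:\s*(?P<value>[^\n]+)"),
    "credit_code": re.compile(
        r"Unified social credit code:\s*(?P<value>[0-9A-Z]{18})"
    ),
    "legal_rep": re.compile(r"Legal representative:\s*(?P<value>[^\n]+)"),
}

def extract_fields(text, template=TEMPLATE):
    """Apply every rule in the template and collect the fields that match."""
    record = {}
    for field, pattern in template.items():
        match = pattern.search(text)
        if match:
            record[field] = match.group("value").strip()
    return record

def matches_template(record, template=TEMPLATE):
    """A page matches the template only if every field was found."""
    return set(record) == set(template)

sample = (
    "Company name: Acme Ltd\n"
    "Unified social credit code: 91440300EXAMPLE001\n"
    "Legal representative: Zhang San\n"
)
record = extract_fields(sample)
print(record, matches_template(record))
```

Pages that fail `matches_template` can be routed to a different template or flagged for manual rule writing.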
Information extraction approaches for web pages include wrapper-based data extraction, machine-learning-based data extraction, data extraction based on the HTML structure tree, and data extraction based on Web query.
Wrapper-based data extraction relies on extraction rules or patterns established by hand. By analyzing a template to find the position of the body text in the page, it locates the body content precisely, with high accuracy and fast extraction. However, this method cannot handle the wide variety of web pages on the Web with a unified template and so lacks general applicability, and manually established rules can hardly guarantee overall systematic and logical consistency. In addition, some extraction rules are fixed for particular domains, so they are highly domain-specific and poorly portable, costly to generate and maintain, and require extensive manual intervention.
Machine-learning-based data extraction preprocesses the web page data, builds and parses the DOM tree, and uses a pre-trained model to classify the web page (identifying its structure: news, community forum, and so on). The page is then segmented into blocks according to features such as text length, text position, and tag name to extract the relevant information, which can satisfy basic extraction needs such as title extraction, body text extraction, and page structure classification. However, machine-learning-based schemes can satisfy only general, relatively coarse-grained information extraction and cannot extract fields precisely.
Data extraction based on the HTML structure tree first locates the information to be extracted according to structural features. It builds a tree from the properties of the HTML, expresses extraction rules in the form of regular expressions, and extracts data by operating on the tree.
Data extraction based on Web query is a form of information extraction that takes the Web as its information source. It extracts data from semi-structured Web documents so that the data becomes more structured and its semantics clearer, which facilitates users' Web queries. It manages and queries data on the internet with database technology, converting Web information extraction into querying Web documents with a standard Web query language.
The basic strategies of these four kinds of information extraction are similar: each first preprocesses the data and parses it into a DOM tree, then extracts structured data according to rules, templates, or trained algorithms, and stores the extracted data in a database. When information in a particular field is extracted in a concentrated way, for example specific company information (such as legal information including various case events, legal judgment documents, and laws and regulations), it is relatively easy to form extraction rules even for fairly complicated network data, and the four information extraction methods above and their combinations can be used together.
S403: Store the extracted information in a database in structured form.
The extracted data information needs to be stored in a database for convenient retrieval and use. The present invention mainly targets data information on information websites, where most data types are relatively uniform, such as basic enterprise information. Optional databases include:
1. MySQL database
MySQL is an open-source relational database management system and, in general, a very easy-to-use database. The development environment is Windows, and MySQL supports multiple languages. Because it makes operating on and managing data convenient, offers high performance at low cost, and has a multi-threaded kernel, it has become a preferred choice for enterprise data storage.
2. MongoDB database
MongoDB is the most fully featured of the non-relational databases and the one most like a relational database storage system. It is a schema-free, collection-oriented document database. Compared with MySQL, this open-source database has the backing of a commercial company and is considered more secure.
Because MongoDB has very good data scalability and sits between non-relational and relational databases, incomplete data can be handled with its extension capabilities, and different information can still be stored in separate documents. However, MongoDB lacks maintenance tools as mature as MySQL's, which deserves attention from both development and IT operations, and MongoDB takes up a large amount of storage space.
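The storage step S403 can be sketched as follows. To keep the example self-contained it uses SQLite from the Python standard library as a stand-in for the MySQL or MongoDB options discussed above; the table name `enterprise_info` and its columns are illustrative assumptions.

```python
import sqlite3

def open_store(path=":memory:"):
    """Open the database and create the table on first use."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS enterprise_info (
               enterprise TEXT NOT NULL,
               source     TEXT NOT NULL,
               field      TEXT NOT NULL,
               value      TEXT,
               PRIMARY KEY (enterprise, source, field)
           )"""
    )
    return conn

def save_record(conn, enterprise, source, record):
    """Insert or overwrite one extracted record, one row per field."""
    rows = [(enterprise, source, field, value) for field, value in record.items()]
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT OR REPLACE INTO enterprise_info VALUES (?, ?, ?, ?)", rows
        )

def load_record(conn, enterprise):
    """Return all stored (source, field, value) rows for one enterprise."""
    cursor = conn.execute(
        "SELECT source, field, value FROM enterprise_info "
        "WHERE enterprise = ? ORDER BY source, field",
        (enterprise,),
    )
    return cursor.fetchall()

conn = open_store()
save_record(conn, "Acme Ltd", "credit", {"legal_rep": "Zhang San"})
save_record(conn, "Acme Ltd", "credit", {"legal_rep": "Li Si", "status": "active"})
print(load_record(conn, "Acme Ltd"))
```

Storing one row per field tolerates incomplete records, which is the situation the text handles with MongoDB's flexible documents.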
In further embodiments, the information acquisition method of the invention also includes:
S5: Obtain update information of the web page.
Network information is updated very quickly, so the crawler must periodically revisit the web pages it has crawled, detect whether each web page has changed, remove useless dead links, and update the database, so that users can obtain the latest information in time.
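A revisit pass for S5 could compare a stored content fingerprint with a freshly computed one. In the sketch below, `stored` maps each URL to the fingerprint from the previous crawl and `fetch` is any callable returning the page HTML, or `None` for a dead link; both are assumptions made for illustration, as is the use of SHA-256 for change detection.

```python
import hashlib

def fingerprint(html):
    """Content hash used to detect whether a page has changed."""
    return hashlib.sha256(html.encode("utf-8")).hexdigest()

def revisit(stored, fetch):
    """Re-fetch every stored URL; report changed pages and drop dead links."""
    changed, dead = [], []
    for url in list(stored):
        html = fetch(url)
        if html is None:  # dead link: remove it from the store
            dead.append(url)
            del stored[url]
            continue
        digest = fingerprint(html)
        if digest != stored[url]:  # page content changed since last crawl
            changed.append(url)
            stored[url] = digest
    return changed, dead

stored = {
    "u1": fingerprint("old content"),
    "u2": fingerprint("same content"),
    "u3": fingerprint("gone"),
}
live_pages = {"u1": "new content", "u2": "same content"}
changed, dead = revisit(stored, live_pages.get)
print(changed, dead)
```

The changed URLs are the ones whose extracted data would need to be refreshed in the database.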
The present invention also provides a data information acquisition system 100 which, as shown in Fig. 5, includes:
A retrieval device 110, having an information unit 111 and a receiving unit 112, for obtaining data information associated with the specified information of the information unit 111.
For example, the specified information may be company information entered by the user, and the retrieval device 110 obtains the company information the user needs about the enterprise according to the enterprise name specified by the user. The corresponding web pages may be web pages containing information related to the enterprise, for example the National Enterprise Credit Information Publicity System, China Judgments Online, the China Enforcement Information Disclosure website, the official website of the State Intellectual Property Office, the official website of the Trademark Office of the State Administration for Industry and Commerce, the official website of the National Copyright Administration, and recruitment websites, where different web pages carry different, targeted company information. Specifically, the web page information may be obtained by requesting the corresponding web page over the HTTP protocol and receiving the returned web page information through the receiving unit 112.
A selection device 120, used to select a search strategy according to the page layout contained in the web page information obtained by the retrieval device 110.
After the corresponding web page information has been obtained, the object pages containing the information the user needs must be crawled from the web page. A web crawler's crawling work on the Web follows a search strategy algorithm, and search strategy algorithms generally include the following four traversal algorithms: depth-first search, breadth-first search, heuristic search, and automatic classification search. The page layout of common company information websites generally has the following characteristic: the first layer provides a search entry, and the layers after it are all enterprise lists. For example, when "Huawei Technologies" is searched in the National Enterprise Credit Information Publicity System through the search entry on the first layer, the next layer shows a list of enterprises whose names contain "Huawei Technologies". Therefore, for web pages with this layout characteristic, web page URLs are searched with a combination of depth-first retrieval and breadth-first retrieval, and a URL queue is supplied so that the crawler obtains page links as quickly as possible.
An acquisition device 130, having a web crawler unit 131 and a deduplication unit 132, for obtaining, after the search strategy has been determined, the object page of the corresponding web page obtained by the retrieval device 110.
Specifically, according to the type of data to be obtained, the web crawler unit 131 performs a directed search with a given initial URL as the entry point to improve retrieval efficiency: because information is highly varied and only industry-related information is needed, there is no need to traverse the entire internet. The web crawler unit 131 obtains the object pages with a multi-threaded web crawler that retrieves web page URLs using a combination of depth-first and breadth-first retrieval, crawls the URLs of one or more object pages, and downloads the object pages according to those URLs. The web crawler is a topical web crawler, also called a focused web crawler (Focused Crawler), that is, a web crawler that selectively crawls pages related to a predefined topic. The most important difference between a topical crawler and a general crawler is that the topical crawler selects only pages related to the set topic, which reduces both the crawling time and the number of web pages to traverse and improves retrieval efficiency.
In another embodiment of the invention, the deduplication unit 132 is further configured to analyze the URL pages crawled by the web crawler unit 131 and to filter out duplicates. Specifically, because Internet information is vast and disorderly, a crawler at work may repeatedly enqueue URLs that are already in the to-crawl queue, which lowers its efficiency; deduplicating URLs while the crawler analyzes pages and extracts URLs is therefore increasingly important. The deduplication method should be efficient and use little space while still ensuring accuracy; available methods include database-based deduplication, memory-based deduplication, and deduplication based on a Bloom filter.
Further, the deduplication unit 132 is also configured to store the URLs obtained by the web crawler unit 131, after they have been filtered and deduplicated by the deduplication unit 132, in the URL queue.
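Of the deduplication options listed above, the Bloom filter is the least self-explanatory, so a minimal sketch follows. The bit-array size, the number of hash functions, the use of SHA-256 and the helper `dedup_into_queue` are all assumptions chosen for illustration. Note that a Bloom filter can report false positives, so a small fraction of new URLs may be skipped; it never reports a seen URL as new.

```python
import hashlib


class BloomFilter:
    """Small Bloom filter over strings (illustrative parameters)."""

    def __init__(self, size_bits=1 << 16, num_hashes=4):
        self.size = size_bits
        self.k = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, item):
        # Derive k bit positions by salting the item with the hash index.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


def dedup_into_queue(urls, seen, queue):
    """Append to the queue only URLs the filter has not seen yet."""
    for url in urls:
        if url not in seen:
            seen.add(url)
            queue.append(url)
    return queue
```

The filter stores only bits, not the URLs themselves, which is why it uses far less space than a database or an in-memory set at the cost of occasional false positives.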
The processing unit 140 has an address processing unit 141, a connection unit 142, an acquisition unit 143 and a preprocessing unit 144, and is used to extract the data information in the pages.
The address processing unit 141 obtains a URL address from the URL queue of target pages and performs DNS domain name resolution on it, resolving the address in the URL in order to reach the service on the web server of the target server. The connection unit 142 establishes a Socket connection between the client and the server, and the acquisition unit 143 sends an HTTP request to obtain the HTML data of the content page. After the HTML of the content page has been obtained, the preprocessing unit 144 performs encoding conversion and web page denoising on the page.
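The sequence of DNS resolution, Socket connection, HTTP request and encoding conversion can be sketched with the Python standard library as follows. This is a simplified illustration under several assumptions: plain HTTP only (no TLS), an HTTP/1.0 request so that the response is not chunked, ASCII-only paths, and hypothetical function names (`build_request`, `fetch_raw`, `split_response`, `to_text`).

```python
import socket
from urllib.parse import urlsplit


def build_request(url):
    """Split a URL and build a minimal HTTP/1.0 GET request for it."""
    parts = urlsplit(url)
    host = parts.hostname
    port = parts.port or 80
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    host_header = host if port == 80 else f"{host}:{port}"
    request = (
        f"GET {path} HTTP/1.0\r\n"
        f"Host: {host_header}\r\n"
        "Connection: close\r\n"
        "\r\n"
    ).encode("ascii")
    return host, port, request


def fetch_raw(url, timeout=10.0):
    """Resolve the host, open a socket, send the request, read until close."""
    host, port, request = build_request(url)
    ip = socket.gethostbyname(host)  # DNS domain name resolution
    with socket.create_connection((ip, port), timeout=timeout) as sock:
        sock.sendall(request)
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks)


def split_response(raw):
    """Separate the header block (as text) from the body (as bytes)."""
    head, _, body = raw.partition(b"\r\n\r\n")
    return head.decode("iso-8859-1"), body


def to_text(body, charset="utf-8"):
    """Encoding conversion: decode with the declared charset, else fall back."""
    try:
        return body.decode(charset)
    except (UnicodeDecodeError, LookupError):
        return body.decode("utf-8", errors="replace")
```

A typical call chain would be `head, body = split_response(fetch_raw(url))` followed by `to_text(body, charset)`, where the charset is taken from the response headers or the page's meta tag.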
Because an HTML file is self-describing semi-structured data, and semi-structured data is difficult for applications to use directly, the processing unit 140 further includes a structuring unit 145 that, in order to extract useful information from the file, structures the HTML data obtained by the acquisition unit 143 and extracts the required information contained in it.
More specifically, the structuring unit 145 parses the HTML data and generates a DOM tree as shown in Figure 4. While extracting content, it removes nodes unrelated to the body text, traverses the DOM tree recursively or with another algorithm to obtain the content nodes, and extracts their content. In some embodiments, the structuring unit can also customize templates for content that is explicitly required; the set of customized templates forms a page extraction rule base. During information extraction, web page text is extracted with regular expressions according to the page extraction rule base, and the extracted information is checked against the templates.
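The structuring step can be sketched as: build a simple tree from the HTML, skip noise nodes, walk the tree recursively to collect text, and apply regular-expression templates. The sketch below uses Python's `html.parser`; the `Node` class, the set of noise tags, and the field names and patterns in `TEMPLATE` are hypothetical examples, not the rule base of the original system.

```python
import re
from html.parser import HTMLParser

NOISE_TAGS = {"script", "style", "nav", "footer"}      # assumed noise nodes
VOID_TAGS = {"br", "img", "meta", "link", "input", "hr"}


class Node:
    def __init__(self, tag, parent=None):
        self.tag, self.parent, self.children, self.text = tag, parent, [], ""


class DomBuilder(HTMLParser):
    """Build a minimal DOM-like tree from HTML."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Climb to the matching open element; ignore unmatched end tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        if data.strip():
            text_node = Node("#text", self.current)
            text_node.text = data.strip()
            self.current.children.append(text_node)


def content_texts(node):
    """Recursively collect text, skipping subtrees rooted at noise tags."""
    if node.tag in NOISE_TAGS:
        return []
    if node.tag == "#text":
        return [node.text]
    out = []
    for child in node.children:
        out.extend(content_texts(child))
    return out


# Hypothetical extraction template: field name -> regular expression.
TEMPLATE = {
    "name": re.compile(r"Company name:\s*(.+)"),
    "legal_rep": re.compile(r"Legal representative:\s*(.+)"),
}


def extract(html, template=TEMPLATE):
    builder = DomBuilder()
    builder.feed(html)
    builder.close()
    record = {}
    for line in content_texts(builder.root):
        for field, pattern in template.items():
            match = pattern.search(line)
            if match:
                record[field] = match.group(1).strip()
    return record
```

A production system would more likely use a full HTML parser, but the structure is the same: parse, prune, traverse, then match against the rule base.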
In some embodiments, the processing unit 140 further includes a strategy selection unit 146, which selects an appropriate information extraction method when specific information in a given field is to be extracted in a concentrated manner. The specific information may be enterprise information, including legal information such as various merit events, legal judgment documents, and laws and regulations. The information extraction method may be a wrapper-based data extraction method, a machine-learning-based data extraction method, a data extraction method based on an HTML structure tree, a data extraction method based on Web query, or any combination of these methods.
Further, the information acquisition system 100 also stores the extracted information in a database in structured form for later retrieval and use. The present invention mainly targets data on information websites, which are characterized by relatively uniform data types, such as the basic information of enterprises.
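Storing extracted records in structured form might look like the sketch below. It uses SQLite from the Python standard library purely as a stand-in so that the example is self-contained; the table name, the columns and the function names are assumptions for illustration.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) a store with one table of enterprise records."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS enterprise ("
        " name TEXT PRIMARY KEY,"
        " legal_rep TEXT,"
        " source_url TEXT)"
    )
    return conn


def save_record(conn, record):
    """Insert a record, replacing any existing row with the same name."""
    conn.execute(
        "INSERT OR REPLACE INTO enterprise (name, legal_rep, source_url) "
        "VALUES (:name, :legal_rep, :source_url)",
        record,
    )
    conn.commit()
```

Because the data types are relatively uniform, a fixed schema like this is workable; a document store would suit sources whose fields vary more.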
In some embodiments, the information acquisition system further includes an updating device 150 for obtaining updated information from web pages. Because network information changes quickly, the crawler must periodically revisit the pages it has already crawled, detect whether they have changed, remove dead links, and update the database so that users can obtain the latest information in a timely manner.
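One simple way to detect whether a revisited page has changed is to compare a content digest with the one stored at the previous visit. The sketch below assumes this digest-based approach, which the description does not prescribe; `revisit`, the `index` mapping and the `fetch` callback (returning `None` for a dead link) are hypothetical.

```python
import hashlib


def digest(html_bytes):
    """Fingerprint of a page body."""
    return hashlib.sha256(html_bytes).hexdigest()


def revisit(index, fetch):
    """Revisit every crawled URL and update the index in place.

    index maps url -> digest stored at the last visit.
    fetch(url) returns the page body as bytes, or None if the link is dead.
    Returns (changed_urls, dead_urls).
    """
    changed, dead = [], []
    for url, old in list(index.items()):
        body = fetch(url)
        if body is None:
            dead.append(url)
            del index[url]  # remove the dead link
            continue
        new = digest(body)
        if new != old:
            changed.append(url)
            index[url] = new  # record the new version
    return changed, dead
```

The changed URLs would then be passed through extraction again so that the database reflects the latest page content.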
In another embodiment, the present invention also provides a company information search system 200, which includes a user interface 210, the information acquisition system 100 and a database 220. According to information entered by the user, such as an enterprise name, the system looks up the corresponding company information in the database 220 and outputs it according to a preset strategy. The database 220 may be a local database and/or a network database and stores the data information acquired by the information acquisition system 100. Further, so that the user can obtain the required information more accurately and efficiently, the corresponding company information is output in descending order of relevance.
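The description does not define how relevance is computed. As a minimal sketch, the example below assumes a simple measure: the fraction of the enterprise name covered by the query, so that an exact match scores highest. The `relevance` and `search` functions and the record layout are illustrative assumptions.

```python
def relevance(query, name):
    """Assumed relevance: share of the name covered by the query (0.0 to 1.0)."""
    q, n = query.strip().lower(), name.lower()
    if not q or q not in n:
        return 0.0
    return len(q) / len(n)


def search(query, records):
    """Return matching records sorted by relevance, highest first."""
    scored = [(relevance(query, r["name"]), r) for r in records]
    hits = [(score, r) for score, r in scored if score > 0]
    hits.sort(key=lambda pair: pair[0], reverse=True)
    return [r for _, r in hits]
```

For the query "Huawei Technologies", an enterprise named exactly "Huawei Technologies" would be listed before longer names that merely contain it.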
It should be noted that the terms "comprising" and "having", and any variations of them, are intended to cover a non-exclusive inclusion. For example, a process, method, system, product or device that contains a series of steps or units is not necessarily limited to the steps or units expressly listed, but may include other steps or units that are not expressly listed or that are inherent to the process, method, product or device. In addition, a person of ordinary skill in the art will understand that all or some of the steps of the above embodiments may be implemented in hardware, or by a program instructing the relevant hardware; the program may be stored in a computer-readable storage medium, such as a read-only memory, a magnetic disk or an optical disc.
The foregoing is merely a description of preferred embodiments of the present invention and is not intended to limit the invention; for a person of ordinary skill in the art, the invention may be modified and varied in various ways. Any modification, equivalent substitution or improvement made within the spirit and principles of the invention shall fall within the scope of protection of the invention.
Claims (38)
1. A data information acquisition method for obtaining data information associated with specified information, characterized in that the method comprises:
obtaining corresponding web page information according to the specified information;
determining a search strategy according to the layout of the web page;
obtaining target pages according to the search strategy;
extracting the data information in the pages.
2. The data information acquisition method according to claim 1, characterized in that obtaining the corresponding web page information according to the specified information comprises:
obtaining the corresponding web page based on the HTTP protocol, and receiving the returned web page information.
3. The data information acquisition method according to any one of claims 1-2, characterized in that the search strategy comprises depth-first search, breadth-first search and/or a combination of the two.
4. The data information acquisition method according to any one of claims 1-3, characterized in that determining the search strategy according to the layout of the web page comprises:
the page layout comprising a first-layer access entry and a second-layer information list.
5. The data information acquisition method according to claim 1, characterized in that obtaining the target pages according to the search strategy comprises:
obtaining the URLs of one or more target pages with a multi-threaded web crawler and downloading the target pages.
6. The data information acquisition method according to claim 5, characterized in that the multi-threaded web crawler is a focused web crawler (Focused Crawler).
7. The data information acquisition method according to claim 5, characterized in that obtaining the URLs of the one or more target pages further comprises performing a deduplication operation on the URLs of the web pages, the deduplication operation being database-based deduplication, memory-based deduplication and/or Bloom-filter-based deduplication.
8. The data information acquisition method according to claim 1, characterized in that extracting the data information in the pages comprises:
obtaining a URL address from the URL queue, performing DNS domain name resolution on the URL address, establishing a Socket connection with the server corresponding to the URL, and sending a request to obtain the HTML data file of the page, wherein the HTML data file contains the data information.
9. The data information acquisition method according to claim 8, characterized in that, after the HTML data file is obtained, the method further comprises preprocessing the HTML file by encoding conversion and denoising.
10. The data information acquisition method according to claim 9, characterized in that, after the preprocessing by encoding conversion and denoising, the method further comprises: structuring the HTML file.
11. The data information acquisition method according to claim 10, characterized in that structuring the HTML file comprises parsing its content and generating a DOM (Document Object Model) tree, removing unrelated nodes, traversing the tree to obtain content nodes, and customizing templates for the required content.
12. The data information acquisition method according to claim 11, characterized in that customizing templates for the required content comprises extracting information from the content to be obtained by pattern matching and replacement to obtain structured information data.
13. The data information acquisition method according to claim 10, characterized in that the method further comprises: selecting, according to the field to which the specified information relates, a strategy for extracting the data information contained in the HTML file.
14. The data information acquisition method according to claim 13, characterized in that the strategy for extracting the data information contained in the HTML file comprises a wrapper-based data extraction method, a machine-learning-based data extraction method, a data extraction method based on an HTML structure tree, a data extraction method based on Web query, or any combination of these methods.
15. The data information acquisition method according to claim 1, characterized in that the method further comprises obtaining updated information of the web page, wherein obtaining the updated information of the web page comprises periodically revisiting web pages that have been crawled, detecting whether the web pages have changed, removing dead links and/or updating the database.
16. The data information acquisition method according to any one of claims 1-15, characterized in that the specified information is an enterprise name and the data information is data information related to the enterprise.
17. An information acquisition system for obtaining data information associated with information specified (by a user), characterized in that the system comprises:
a retrieval device, a selection device, an acquisition device and a processing unit;
wherein
the retrieval device further comprises an information unit and is used to obtain corresponding web page information according to the specified information of the information unit;
the selection device is used to select a search strategy according to the page layout contained in the web page information obtained by the retrieval device;
the acquisition device is used to obtain the target pages of the corresponding web page acquired by the retrieval unit;
and
the processing unit is used to extract the data information in the pages.
18. The information acquisition system according to claim 17, characterized in that the retrieval device is used to obtain the corresponding web page based on the HTTP protocol and further comprises a receiving unit for receiving the returned web page information.
19. The information acquisition system according to claim 17, characterized in that the search strategy comprises depth-first search, breadth-first search and/or a combination of the two.
20. The information acquisition system according to claim 19, characterized in that the page layout comprises a first-layer retrieval entry and a second-layer information list.
21. The information acquisition system according to claim 17, characterized in that the acquisition device further comprises a web crawler unit, which obtains the URLs of one or more corresponding pages with a multi-threaded web crawler and downloads the corresponding pages.
22. The information acquisition system according to claim 21, characterized in that the multi-threaded web crawler is a focused web crawler (Focused Crawler).
23. The information acquisition system according to claim 21, characterized in that the acquisition device further comprises a deduplication unit for performing a deduplication operation on the URLs of the web pages, the deduplication operation being database-based deduplication, memory-based deduplication and/or Bloom-filter-based deduplication.
24. The information acquisition system according to claim 17, characterized in that the processing unit comprises:
an address processing unit for obtaining a URL address from the URL queue and performing DNS domain name resolution on the URL address;
a connection unit for establishing a Socket connection with the server corresponding to the URL;
an acquisition unit for sending a request to obtain the HTML data file of the page, wherein the HTML data file contains the data information.
25. The information acquisition system according to claim 24, characterized in that the processing unit further comprises a preprocessing unit for preprocessing the HTML file by encoding conversion and denoising after the acquisition unit has obtained the HTML data file.
26. The information acquisition system according to claim 25, characterized in that the processing unit further comprises a structuring unit for structuring the HTML file after the preprocessing by encoding conversion and denoising.
27. The information acquisition system according to claim 26, characterized in that structuring the HTML file comprises parsing its content and generating a DOM (Document Object Model) tree, removing unrelated nodes, traversing the tree to obtain content nodes, and customizing templates for the required content.
28. The information acquisition system according to claim 27, characterized in that customizing templates for the required content comprises extracting information from the content to be obtained by pattern matching and replacement to obtain structured information data.
29. The information acquisition system according to claim 24, characterized in that the processing unit further comprises a strategy selection unit for selecting, according to the field to which the specified information relates, a strategy for extracting the data information contained in the HTML file.
30. The information acquisition system according to claim 29, characterized in that the strategy comprises a wrapper-based data extraction method, a machine-learning-based data extraction method, a data extraction method based on an HTML structure tree, a data extraction method based on Web query, or any combination of these methods.
31. The information acquisition system according to claim 17, characterized in that the system further comprises an updating device for obtaining updated information of the web page, wherein obtaining the updated information of the web page comprises periodically revisiting web pages that have been crawled, detecting whether the web pages have changed, removing dead links and/or updating the database.
32. The information acquisition system according to any one of claims 17-31, characterized in that the specified information is an enterprise name and the data information is data information related to the enterprise.
33. A computer-readable medium storing a database, the database being used to store enterprise data information, characterized in that the enterprise data information is information acquired by the method according to any one of claims 1-16 or by the system according to any one of claims 17-32.
34. The computer-readable medium according to claim 33, characterized in that the database is a MySQL database or a MongoDB database.
35. A company information search system comprising a database, characterized in that the database is the database according to any one of claims 33-34, and the system, according to information entered by a user, looks up in the database the company information corresponding to the information entered by the user and outputs it according to a preset strategy.
36. The company information search system according to claim 35, characterized in that the preset strategy is output in order of relevance.
37. The company information search system according to any one of claims 35-36, characterized in that the information entered by the user is an enterprise name.
38. A computer-readable medium having instructions stored thereon, characterized in that the instructions can be read by a computer to perform the information acquisition method according to any one of claims 1-16.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711381367.2A CN108052632B (en) | 2017-12-20 | 2017-12-20 | Network information acquisition method and system and enterprise information search system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711381367.2A CN108052632B (en) | 2017-12-20 | 2017-12-20 | Network information acquisition method and system and enterprise information search system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108052632A (en) | 2018-05-18 |
CN108052632B (en) | 2022-02-18 |
Family
ID=62130268
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201711381367.2A Active CN108052632B (en) | 2017-12-20 | 2017-12-20 | Network information acquisition method and system and enterprise information search system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108052632B (en) |
Citations (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20030200205A1 (en) * | 2001-08-23 | 2003-10-23 | Michael Meiresonne | Method, process, and system for searching and identifying sources of goods and/or services over the internet |
US20040111401A1 (en) * | 2002-12-10 | 2004-06-10 | Yuan-Chi Chang | Using text search engine for parametric search |
CN1920817A (en) * | 2006-09-14 | 2007-02-28 | 浙江大学 | Method for multiple resources pools integral parallel search in open websites |
US20080281788A1 (en) * | 2007-05-09 | 2008-11-13 | Ophir Frieder | Hierarchical structured abstract file system |
US20110307479A1 (en) * | 2010-06-10 | 2011-12-15 | Microsoft Corporation | Automatic Extraction of Structured Web Content |
CN102694772A (en) * | 2011-03-23 | 2012-09-26 | 腾讯科技(深圳)有限公司 | Apparatus, system and method for accessing internet web pages |
CN102930059A (en) * | 2012-11-26 | 2013-02-13 | 电子科技大学 | Method for designing focused crawler |
CN103049542A (en) * | 2012-12-27 | 2013-04-17 | 北京信息科技大学 | Domain-oriented network information search method |
US8458227B1 (en) * | 2010-06-24 | 2013-06-04 | Amazon Technologies, Inc. | URL rescue by identifying information related to an item referenced in an invalid URL |
CN104899268A (en) * | 2015-05-25 | 2015-09-09 | 浪潮集团有限公司 | Distributed enterprise information vertical search method |
CN104978408A (en) * | 2015-08-05 | 2015-10-14 | 许昌学院 | Berkeley DB database based topic crawler system |
CN105868327A (en) * | 2016-03-28 | 2016-08-17 | 浪潮软件集团有限公司 | Distributed web crawler capturing method based on different updating strategies |
CN106462645A (en) * | 2016-01-07 | 2017-02-22 | 马岩 | Network information search method and system |
CN106709052A (en) * | 2017-01-06 | 2017-05-24 | 电子科技大学 | Keyword based topic-focused web crawler design method |
Non-Patent Citations (5)
Title |
---|
R.JEBERSON RETNA RAJ ET AL.: "Implementation of template independent web news extraction approach, noise removal and structured data detection to improve search for location based services", 《2017 INTERNATIONAL CONFERENCE ON POWER AND EMBEDDED DRIVE CONTROL (ICPEDC)》 * |
彭冬等: "面向Web论坛的网络信息获取技术及系统实现", 《计算机工程与科学》 * |
翟东升 编著: "《专利知识挖掘关键技术研究》", 31 January 2013, 北京:知识产权出版社 * |
肖雷: "面向论坛的文本特征提取及分类技术研究", 《中国优秀硕士学位论文全文数据库信息科技辑》 * |
贾海蕾等: "企业信息门户中网络蜘蛛的设计与实现", 《软件导刊》 * |
Cited By (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109033203A (en) * | 2018-06-29 | 2018-12-18 | 大连交通大学 | A kind of feature extraction method for parallel processing towards big data |
CN109657121A (en) * | 2018-12-09 | 2019-04-19 | 佛山市金穗数据服务有限公司 | A kind of Web page information acquisition method and device based on web crawlers |
CN109902217A (en) * | 2019-03-20 | 2019-06-18 | 江苏科技大学 | A kind of crawler software of astronomy data screening and downloading |
CN111274217A (en) * | 2020-01-10 | 2020-06-12 | 深圳前海环融联易信息科技服务有限公司 | Data acquisition method and device, computer equipment and storage medium |
CN111310012A (en) * | 2020-01-21 | 2020-06-19 | 国网安徽省电力有限公司滁州供电公司 | Automatic monitoring and early warning method for enterprise information loss behavior |
TWI764491B (en) * | 2020-12-31 | 2022-05-11 | 重量科技股份有限公司 | Text information automatically mining method and system |
CN113157730A (en) * | 2021-04-26 | 2021-07-23 | 中国人民解放军军事科学院国防科技创新研究院 | Civil-military fusion policy information system |
CN113343108A (en) * | 2021-06-30 | 2021-09-03 | 中国平安人寿保险股份有限公司 | Recommendation information processing method, device, equipment and storage medium |
CN113343108B (en) * | 2021-06-30 | 2023-05-26 | 中国平安人寿保险股份有限公司 | Recommended information processing method, device, equipment and storage medium |
CN116361362A (en) * | 2023-05-30 | 2023-06-30 | 江西顶易科技发展有限公司 | User information mining method and system based on webpage content identification |
CN116361362B (en) * | 2023-05-30 | 2023-08-11 | 江西顶易科技发展有限公司 | User information mining method and system based on webpage content identification |
Also Published As
Publication number | Publication date |
---|---|
CN108052632B (en) | 2022-02-18 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||