CN108052632B - Network information acquisition method and system and enterprise information search system - Google Patents

Network information acquisition method and system and enterprise information search system

Info

Publication number
CN108052632B
Authority
CN
China
Prior art keywords
information
data
page
retrieval
acquiring
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201711381367.2A
Other languages
Chinese (zh)
Other versions
CN108052632A (en)
Inventor
彭帅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu Lyuyun Technology Co ltd
Original Assignee
Chengdu Lyuyun Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu Lyuyun Technology Co ltd
Priority to CN201711381367.2A
Publication of CN108052632A
Application granted
Publication of CN108052632B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/23 Updating
    • G06F16/2379 Updates performed during online database operations; commit processing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Transfer Between Computers (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a network information acquisition method, a network information acquisition system and an enterprise information search system. Data mining in the deep web is completed through crawler technology and a targeted retrieval strategy, so that a user can obtain a large amount of valid data in a short time without querying each independent website one by one; a one-stop information service is provided for the user, and data acquisition efficiency is improved.

Description

Network information acquisition method and system and enterprise information search system
Technical Field
The invention relates to a method and a system for acquiring network information, and in particular to a method and a system for acquiring web page information based on a crawler system.
Background
In the current big data era, the massive resources on the network give users access to a huge volume of distributed, easily obtained information. For example, information about an enterprise can be searched for directly on the relevant official websites, including the national enterprise credit information publicity system, the national court judgment document network, the China enforcement information disclosure network, the official website of the national intellectual property office, the official website of the trademark office of the national administration for industry and commerce, the official website of the national copyright administration, recruitment websites and the like. However, the enterprise information covered by these websites differs. For example, the national enterprise credit information publicity system contains business license information, key personnel and similar information; the judgment document network mainly covers judgment information; government websites usually contain enterprise credit data and bid-winning data; and recruitment websites mostly concern positions, wages and similar information. Different information thus originates from different network platforms, and the data on these platforms are usually independent and not shared. To obtain relevant information about one or more enterprises in a targeted manner, a user has to query each platform separately, which is cumbersome.
On the other hand, an enterprise's business registration information, recruitment information, related judgment documents, intellectual property information and the like belong to the deep web. The deep web is defined relative to the surface web and refers to content that cannot be reached by ordinary search operations. To acquire the required network information and resources effectively and conveniently, users rely on search engines, which serve as a common information retrieval tool and as the entrance and platform for accessing the internet. However, general search engines have limitations: they often have difficulty acquiring deep web content, and they return a large number of web pages the user is not interested in, which reduces the efficiency of acquiring useful information. In addition, some platforms that provide enterprise information are not updated in a timely manner.
Therefore, how to acquire comprehensive enterprise information conveniently and quickly is an open problem in current network information acquisition, and a method and a system for efficiently acquiring internet data are needed to realize targeted acquisition of the latest enterprise information required by a user.
Disclosure of Invention
Traditional methods for acquiring data from enterprise information websites have several limitations. First, the large amount of complete, high-quality data hidden in the deep web cannot be acquired accurately and effectively through a traditional search engine. Second, searching for information by traversing the websites one by one wastes a large amount of system resources, so that information acquisition takes too long and is inefficient. To solve these problems, the present invention discloses a method and a system for acquiring network information, and in particular a method and a system for acquiring web page information based on a crawler system, which are used for acquiring the required enterprise-related information.
In view of the above problems in the information acquisition process, the present invention provides a data information acquisition method for acquiring data information associated with user-specified information, the method including: acquiring corresponding web page information according to the specified information; determining a retrieval strategy according to the layout mode of the web page; acquiring an object page according to the retrieval strategy; and extracting the data information in the page.
Further, acquiring the corresponding web page information according to the specified information includes: acquiring the corresponding web page based on the HTTP protocol, and receiving the returned web page information.
Further, the retrieval strategy includes depth-first search, breadth-first search and/or a combination thereof.
Further, acquiring the object page according to the retrieval strategy includes: acquiring the URLs of one or more object pages through a multi-threaded web crawler and downloading the object pages.
Further, extracting the data information in the page includes: acquiring a URL address from the URL queue, performing DNS domain name resolution on the URL address, establishing a Socket connection with the server corresponding to the URL, and sending a request to acquire the HTML data file of the page, wherein the HTML data file contains the data information.
Further, the method includes acquiring update information of the web page, which includes periodically revisiting the captured web pages, detecting whether the web pages have changed, removing dead links and/or updating a database.
Further, the specified information is an enterprise name, and the data information is data information related to the enterprise.
On the other hand, corresponding to the data information acquisition method proposed by the present invention, a data information acquisition system is also proposed, the system comprising: a retrieval device, a selection device, an acquisition device and a processing device. The retrieval device comprises an information unit and is used for acquiring corresponding web page information according to the information specified through the information unit; the selection device is used for selecting a retrieval strategy according to the web page layout mode contained in the web page information acquired by the retrieval device; the acquisition device is used for acquiring the object page of the corresponding web page acquired by the retrieval device; and the processing device is used for extracting the data information in the page.
Further, the retrieval device is configured to obtain the corresponding web page based on the HTTP protocol, and further includes a receiving unit configured to receive the returned web page information.
Further, the retrieval strategy includes depth-first search, breadth-first search and/or a combination thereof.
Further, the acquisition device comprises a web crawler unit, which acquires the URLs of one or more corresponding pages through a multi-threaded web crawler and downloads those pages.
Further, the processing device comprises: an address processing unit for acquiring URL addresses from the URL queue and performing DNS domain name resolution on them; a connection unit for establishing a Socket connection with the server corresponding to the URL; and an acquisition unit for sending a request to acquire the HTML data file of the page, wherein the HTML data file contains the data information.
Further, the system comprises an updating device for acquiring update information of the web page, which includes periodically revisiting the captured web pages, detecting whether the web pages have changed, removing dead links and/or updating a database.
Further, the specified information is an enterprise name, and the data information is data information related to the enterprise.
In summary, the data information acquisition method and system disclosed by the invention complete data mining in the deep web through crawler technology and a targeted retrieval strategy, so that a user can acquire a large amount of valid data in a short time without querying each independent website one by one; a one-stop information service is provided for the user, and data acquisition efficiency is improved.
Drawings
FIG. 1 illustrates an information acquisition method according to an embodiment of the present invention;
FIG. 2 illustrates a method for obtaining an object page according to another embodiment of the present invention;
FIG. 3 illustrates a method for structured extraction of page data information according to another embodiment of the present invention;
FIG. 4 is a schematic diagram of a DOM tree provided by an embodiment of the present invention;
FIG. 5 illustrates an information acquisition system provided by an embodiment of the present invention;
FIG. 6 is a diagram of an enterprise information search system according to another embodiment of the present invention.
Detailed Description
To help those skilled in the art better understand the technical solution of the invention, it is described clearly and completely below with reference to the specification and the accompanying drawings. It should be understood that the embodiments described below are only some of the embodiments of the present invention, and that other embodiments or combinations thereof, which can be obtained by those skilled in the art without inventive effort on the basis of the following detailed description, fall within the technical spirit and scope of the present invention.
An embodiment of the present invention provides a data information acquisition method which, as shown in FIG. 1, includes the following steps:
S1, acquiring the corresponding web page information according to the specified information.
The data information acquisition method is used for acquiring data information associated with the specified information; for example, the enterprise information required by a user can be acquired according to the enterprise name specified by the user. The corresponding web pages may be web pages containing information related to the enterprise, such as the national enterprise credit information publicity system, the national court judgment document network, the China enforcement information disclosure network, the official website of the national intellectual property office, the official website of the trademark office of the national administration for industry and commerce, the official website of the national copyright administration, recruitment websites and the like, where different websites carry enterprise information with a different focus. Specifically, the web page information may be acquired by requesting the corresponding web page over the HTTP protocol and receiving the returned web page information.
S2, determining a retrieval strategy according to the layout mode of the web page.
After the corresponding web page information is acquired, the object pages that contain the information required by the user need to be crawled further. A web crawler crawls the Web according to a retrieval strategy algorithm, and the following four traversal algorithms are commonly used: the depth-first algorithm, the breadth-first algorithm, the heuristic search algorithm and the automatic classification search algorithm. The page layout of a typical enterprise information website has the following characteristics: the first layer provides a search entry, and the enterprise list is displayed after a search is entered. For example, when 'Huaqi Technology' is searched for through the first-layer search entry of the national enterprise credit information publicity system, the next layer displays the list of enterprises whose names contain 'Huaqi Technology'. Therefore, for web pages with this layout, the URLs of the web pages are searched using a combination of depth-first and breadth-first retrieval, and a URL queue is provided, so that the crawler can obtain the page links as quickly as possible.
Specifically, visiting all vertices of a graph starting from a certain vertex, such that each vertex is visited only once, is called graph traversal. Depth-first search traverses the graph by going as deep as possible along each path before backtracking. Breadth-first search, by contrast, traverses downward level by level; unlike depth-first search, it avoids descending endlessly along a single path.
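As a minimal sketch of the two traversal orders described above, the following Python code walks a small in-memory link graph. The graph `LINKS`, the page names and the function name `traverse` are hypothetical and stand in for real pages; a single deque plays the role of the URL queue and is popped from the left for breadth-first order or from the right for depth-first order.

```python
from collections import deque

# Hypothetical link graph: page -> pages it links to (stands in for real URLs).
LINKS = {
    "search": ["list_p1", "list_p2"],
    "list_p1": ["company_a", "company_b"],
    "list_p2": ["company_c"],
    "company_a": [],
    "company_b": [],
    "company_c": ["search"],  # a link back to the entry page (cycle)
}

def traverse(start, breadth_first=True):
    """Visit every reachable page exactly once, in BFS or DFS order."""
    frontier = deque([start])
    seen = {start}          # guarantees each page is visited only once
    visited = []
    while frontier:
        # Oldest entry first gives breadth-first; newest first gives depth-first.
        page = frontier.popleft() if breadth_first else frontier.pop()
        visited.append(page)
        for nxt in LINKS.get(page, []):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return visited
```

Breadth-first order reaches both list pages before any company page, while depth-first order follows one list page down to its company pages before returning to the other. How the two orders are combined for a given site is not specified in the text, so the sketch leaves that choice to the caller.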
S3, acquiring the object page according to the retrieval strategy.
After the retrieval strategy is determined, the object page containing the required information is obtained based on it. According to the type of data to be acquired, the invention performs a directional search starting from a given URL to improve retrieval efficiency: because information on the internet varies widely and only industry-related information is needed, the whole internet does not need to be traversed. Specifically, the object page is obtained by using a multi-threaded web crawler that searches the URLs of the web pages with a combination of depth-first and breadth-first search, crawls the URLs of one or more object pages, and downloads the object pages according to those URLs. The web crawler is a topic web crawler, also called a focused crawler, which selectively crawls pages related to a predefined topic. The main difference between a topic crawler and a general crawler is that the topic crawler only selects pages related to the set topic, which reduces the crawling time and the number of web pages that must be traversed and thereby improves retrieval efficiency.
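The page selection step of a focused crawler can be sketched as a simple relevance filter. The topic vocabulary `TOPIC_TERMS`, the scoring rule and the threshold below are assumptions made for illustration; the text does not say how relevance to the predefined topic is measured.

```python
import re

# Hypothetical topic vocabulary for enterprise information (an assumption).
TOPIC_TERMS = {"enterprise", "company", "license", "trademark", "judgment"}

def relevance(text, terms=TOPIC_TERMS):
    """Fraction of topic terms that occur in the page text (0.0 to 1.0)."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return len(words & terms) / len(terms)

def select_relevant(pages, threshold=0.4):
    """Keep only the URLs whose page text scores at or above the threshold."""
    return [url for url, text in pages.items() if relevance(text) >= threshold]
```

A crawler built this way would enqueue links only from pages that pass `select_relevant`, which is what keeps it from traversing unrelated parts of the web.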
In another embodiment of the present invention, obtaining the object page according to the retrieval strategy further includes the steps shown in FIG. 2:
S301, using the web crawler to crawl and traverse with the initial URL page as the entry.
The web crawler can acquire web page content with multiple threads, so it can capture web page content effectively and quickly. Preferably, a topic web crawler is employed, that is, a web crawler that selectively crawls pages associated with a predefined type of required information. Crawling the Web with a web crawler needs to follow a strategy algorithm, such as the depth-first algorithm, the breadth-first algorithm, the heuristic search algorithm or the automatic classification search algorithm. Visiting all vertices of a graph starting from a certain vertex, such that each vertex is visited only once, is called graph traversal. Depth-first search traverses the graph by going as deep as possible along each path, whereas breadth-first search traverses downward level by level and thus avoids descending endlessly along a single path. According to the structural characteristics of how enterprise information is laid out on internet pages (the first layer provides a search entry, and the enterprise list is obtained after a search is entered), the URLs of the web pages can be searched using a combination of depth-first and breadth-first search, so that the crawler can obtain the page links as quickly as possible.
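Multi-threaded downloading can be sketched with a thread pool from the Python standard library. The function names and the `fake_fetch` stand-in are hypothetical; a real crawler would pass a function that performs the HTTP download.

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_all(urls, fetch, max_workers=4):
    """Download every URL concurrently and return {url: content}."""
    urls = list(urls)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the order of `urls`,
        # regardless of which download finishes first.
        return dict(zip(urls, pool.map(fetch, urls)))

def fake_fetch(url):
    """Hypothetical stand-in for a real download (no network access)."""
    return f"<html><title>{url}</title></html>"
```

Passing the download function in as a parameter keeps the threading logic independent of the network code, so the same pool can be reused for list pages and detail pages.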
S302, analyzing the crawled URL pages and filtering out duplicates.
Information on the internet is complex, and while crawling, the crawler may repeatedly enqueue URLs that are already in the to-be-crawled queue, which reduces its working efficiency. Deduplicating URLs while the crawler analyzes pages and extracts URLs is therefore increasingly important. The technical tension in this task is that crawling places high demands on storage space and speed, and deduplication inevitably affects the crawler's working efficiency. To deduplicate efficiently without occupying much space while still ensuring accuracy, the available deduplication methods include database-based deduplication, memory-based deduplication, and Bloom filter-based deduplication.
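Bloom filter-based deduplication can be sketched as follows. The bit-array size, the number of hash functions and the use of salted SHA-256 digests are illustrative assumptions, not values taken from the text.

```python
import hashlib

class BloomFilter:
    """Fixed-size bit array with several hash positions per item."""

    def __init__(self, size_bits=8192, num_hashes=4):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, item):
        # Derive each position by salting SHA-256 with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # True only if every position for this item is already set.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

def dedupe(urls, seen=None):
    """Return the URLs not seen before, preserving their order."""
    seen = seen if seen is not None else BloomFilter()
    fresh = []
    for url in urls:
        if url not in seen:
            seen.add(url)
            fresh.append(url)
    return fresh
```

The filter never reports a seen URL as new, but it can occasionally report a new URL as already seen, so a small number of pages may be skipped in exchange for the fixed memory footprint.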
S303, acquiring the optimized URL queue.
The URLs acquired by the web crawler and filtered for duplicates are stored in a URL queue.
S4, extracting the data information in the page.
A URL address is acquired from the URL queue of object pages, and DNS domain name resolution is performed on it to resolve the address of the Web server named in the URL, so that the service of the target server can be accessed. A Socket connection is then established between the client and the server, and an HTTP request is sent to acquire the HTML data of the content page. After the HTML of the content page is obtained, the page needs to be preprocessed by normalizing its character encoding and removing noise.
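The sequence of DNS resolution, Socket connection and HTTP request can be sketched with the Python standard library. The sketch assumes plain HTTP with an ASCII path and does not handle HTTPS, redirects or chunked responses; the function names are hypothetical.

```python
import socket
from urllib.parse import urlsplit

def build_request(url):
    """Return (host, port, request_bytes) for a plain-HTTP GET of `url`."""
    parts = urlsplit(url)
    host = parts.hostname
    port = parts.port or 80
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    host_header = host if port == 80 else f"{host}:{port}"
    request = (
        f"GET {path} HTTP/1.1\r\n"
        f"Host: {host_header}\r\n"
        "Connection: close\r\n"
        "\r\n"
    )
    return host, port, request.encode("ascii")

def fetch_html(url, timeout=10.0):
    """Resolve the host, open a socket, send the request, return the raw response."""
    host, port, request = build_request(url)
    ip_address = socket.gethostbyname(host)  # DNS domain name resolution
    with socket.create_connection((ip_address, port), timeout=timeout) as conn:
        conn.sendall(request)
        chunks = []
        while True:
            data = conn.recv(4096)
            if not data:  # server closed the connection
                break
            chunks.append(data)
    return b"".join(chunks)  # status line, headers and HTML body
```

The returned bytes still contain the HTTP status line and headers, so a caller would split them from the body and decode the body before the preprocessing step.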
Since HTML documents are self-describing semi-structured data that is difficult for an application to use directly, they must be structured in order to extract useful information from them. The structured information extraction method further includes the steps shown in FIG. 3:
S401, parsing the page and generating a DOM tree.
A DOM (Document Object Model) tree is shown in FIG. 4. After the DOM tree is built, the whole web page source forms a tree in which each tag is a node. During content extraction, nodes irrelevant to the body text are removed, and the DOM tree is traversed recursively or with another algorithm to find the content nodes and extract their content; comment nodes are deleted from the DOM tree before the traversal continues. The HTML document is parsed into a DOM tree with Beautiful Soup, a third-party Python library. Its main function is to scrape data from web pages: it provides functions for navigating, searching and modifying the parse tree, automatically handles the character encoding of input documents, and lets users choose among parsers with different parsing strategies and speeds.
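The tree building and recursive traversal can be sketched with only the Python standard library. This uses `html.parser` as a stand-in for Beautiful Soup so that the example has no third-party dependency; the `Node`, `DomBuilder`, `parse_html` and `extract_text` names are hypothetical, and the builder assumes well-nested tags.

```python
from html.parser import HTMLParser

class Node:
    """One element of the simplified DOM tree."""
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []  # child Nodes and text strings, in document order

class DomBuilder(HTMLParser):
    """Build a simplified DOM tree; HTML comments are dropped by default."""
    VOID = {"br", "img", "meta", "link", "input", "hr"}

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in self.VOID:  # void elements never receive children
            self.current = node

    def handle_endtag(self, tag):
        if self.current.tag == tag and self.current.parent is not None:
            self.current = self.current.parent

    def handle_data(self, data):
        if data.strip():
            self.current.children.append(data.strip())

SKIP = {"script", "style"}  # nodes treated as irrelevant to the body text

def parse_html(html):
    builder = DomBuilder()
    builder.feed(html)
    builder.close()
    return builder.root

def extract_text(node):
    """Recursively collect text in document order, skipping SKIP subtrees."""
    if isinstance(node, str):
        return [node]
    if node.tag in SKIP:
        return []
    parts = []
    for child in node.children:
        parts.extend(extract_text(child))
    return parts
```

With Beautiful Soup the same result would typically come from parsing the document and asking for its text after removing script and style elements; the hand-built tree is shown only to make the traversal explicit.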
S402, extracting the structured information based on templates.
In some embodiments, templates can be customized for content that clearly needs to be acquired, and the set of customized templates forms a page extraction rule base. When information is extracted, the text of the web page is extracted with regular expressions according to the page extraction rule base, and whether the information matches a template is determined. Regular expressions are a powerful tool for pattern matching and substitution: a pattern is written as a string, a pattern-matching expression is composed of unary and binary operators, and spaces and tabs can be used to separate keywords.
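A page extraction rule base of regular-expression templates can be sketched as a dictionary from field name to compiled pattern. The field names, the label wording in the patterns and the sample record are hypothetical and would be customized per target website.

```python
import re

# Hypothetical rule base: field name -> regular-expression template.
RULES = {
    "name": re.compile(r"Company name:\s*(?P<value>[^\n<]+)"),
    "registered_capital": re.compile(
        r"Registered capital:\s*(?P<value>[\d,.]+\s*\w+)"),
    "established": re.compile(
        r"Established:\s*(?P<value>\d{4}-\d{2}-\d{2})"),
}

def extract_fields(text, rules=RULES):
    """Apply every template; keep only the fields whose template matches."""
    record = {}
    for field, pattern in rules.items():
        match = pattern.search(text)
        if match:
            record[field] = match.group("value").strip()
    return record
```

A page that matches none of the templates yields an empty record, which gives a simple test for whether the page fits the rule base at all.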
The ways of extracting information from web pages include wrapper-based data extraction, machine learning-based data extraction, HTML structure tree-based data extraction and Web query-based data extraction.
Wrapper-based data extraction depends on extraction rules or patterns established manually. The position of the body text in the page is obtained by analyzing the template, so the text content can be located precisely, and both the accuracy and the speed of extraction are high. However, this method cannot process the wide variety of pages on the Web with a uniform template and is therefore not generally applicable. It is also difficult to guarantee that manually established rules are systematic and logically consistent as a whole. Moreover, some extraction rules are fixed for particular domains, so they are strongly domain-dependent and poorly portable, they are costly to generate and maintain, and they require a great deal of manual intervention.
Machine learning-based data extraction preprocesses the web page data, builds and analyzes the DOM tree, classifies the page type with a pre-trained model (identifying the structure of the page, such as news or a community forum), and then extracts blocks from the page according to features such as text length, text position and tag name to obtain the relevant information. This satisfies basic extraction needs such as title extraction, body text extraction and page structure classification. However, a machine learning-based scheme can only deliver general and relatively coarse information extraction and cannot extract precise fields.
HTML structure tree-based data extraction locates the information to be extracted according to structural characteristics: a tree is constructed from the features of the HTML, extraction rules are formed as regular expressions, and the tree is operated on to extract the data.
Web query-based data extraction is a kind of information extraction that takes the Web as the information source. Data is extracted from semi-structured Web documents so that it becomes more structured and semantically clearer, which makes Web queries more convenient for the user. Information extraction based on Web query uses database technology to manage and query data on the internet, and converts Web information extraction into queries over web page documents written in a standard Web query language.
The basic strategies of the four information extraction approaches are similar: the data is first preprocessed and parsed into a DOM tree, then extracted in structured form according to rules, templates or trained algorithms, and the extracted data is stored in a database. When information in a particular domain is extracted in a concentrated manner, for example specific enterprise information (such as legal information including various cases, judgment documents, laws and regulations and the like), even relatively complicated network data tends to follow certain extraction rules, and the four information extraction methods and combinations thereof can be used together.
S403, storing the extracted information in a database in structured form.
The extracted data information needs to be stored in a database so that it can be retrieved and used conveniently. The invention mainly targets data on information websites, which is characterized by relatively uniform data types, such as basic enterprise information. The databases that can be selected include:
MySQL database
MySQL is an open-source relational database management system and a very good general-purpose database. It is convenient for operating on and managing data, offers high performance at low cost, can be developed on Windows, supports multiple programming languages, and has a fully multi-threaded core, which makes it a preferred choice for enterprises to store data.
MongoDB database
MongoDB is the most feature-rich of the non-relational databases and the one that most closely resembles a relational database. It is a schema-free, collection-oriented, document-based database; it is open source and backed by a commercial company, and it is considered more secure than MySQL.
Because MongoDB scales very well and sits between non-relational and relational databases, its extensibility can be used to handle incomplete data, and different information can still be stored in different documents. However, MongoDB does not have maintenance tools as mature as those of MySQL, which development and IT operations teams need to take into account, and MongoDB also occupies a relatively large amount of storage space.
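Structured storage of an extracted record can be sketched with the `sqlite3` module from the Python standard library. SQLite is used here only as a self-contained stand-in for MySQL; the table name, the columns and the sample values are hypothetical.

```python
import sqlite3

def open_store(path=":memory:"):
    """Open the database and create the (hypothetical) enterprise table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS enterprise ("
        " name TEXT PRIMARY KEY,"
        " registered_capital TEXT,"
        " established TEXT)"
    )
    return conn

def save_record(conn, record):
    """Insert or replace one extracted record; missing fields become NULL."""
    conn.execute(
        "INSERT OR REPLACE INTO enterprise"
        " (name, registered_capital, established)"
        " VALUES (:name, :registered_capital, :established)",
        {
            "name": record["name"],
            "registered_capital": record.get("registered_capital"),
            "established": record.get("established"),
        },
    )
    conn.commit()
```

Keying the table on the enterprise name means that a later crawl of the same enterprise overwrites the earlier row instead of duplicating it. A document store such as MongoDB would instead keep each record as one document and tolerate missing fields without a fixed schema.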
In other embodiments, the information acquisition method of the present invention further includes:
S5, acquiring the update information of the web page.
Network information is updated very quickly, so the crawler needs to revisit the captured web pages periodically to detect whether they have changed, remove dead links and update the database, so that users can obtain the latest information in time.
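One revisit pass can be sketched by comparing a content fingerprint against the one stored at the previous crawl. The use of SHA-256 fingerprints, the convention that `fetch` returns `None` for a dead link, and the function names are assumptions for illustration; scheduling the periodic runs is left out.

```python
import hashlib

def fingerprint(content):
    """Stable digest of a page's content, used to detect changes."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()

def revisit(store, fetch):
    """Re-fetch every stored URL and return (changed_urls, dead_urls).

    `store` maps URL -> fingerprint from the previous crawl and is
    updated in place; `fetch(url)` returns the page text or None.
    """
    changed, dead = [], []
    for url in list(store):       # copy keys: the dict is modified below
        content = fetch(url)
        if content is None:       # treated here as a dead link
            dead.append(url)
            del store[url]
        elif fingerprint(content) != store[url]:
            changed.append(url)
            store[url] = fingerprint(content)
    return changed, dead
```

Only the URLs returned in the first list need to go through extraction and storage again, which keeps each revisit pass cheap when most pages are unchanged.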
The present invention also provides a data information acquisition system 100 which, as shown in FIG. 5, includes:
A retrieval device 110, which has an information unit 111 and a receiving unit 112 and is used for acquiring data information associated with the information specified through the information unit 111.
For example, the specified information may be enterprise information input by the user, and the retrieval device 110 acquires the enterprise information required by the user according to the enterprise name specified by the user. The corresponding web pages may be web pages containing information related to the enterprise, such as the national enterprise credit information publicity system, the national court judgment document network, the China enforcement information disclosure network, the official website of the national intellectual property office, the official website of the trademark office of the national administration for industry and commerce, the official website of the national copyright administration, recruitment websites and the like, where different websites carry enterprise information with a different focus. Specifically, the web page information may be acquired by requesting the corresponding web page over the HTTP protocol and receiving the returned web page information through the receiving unit 112.
A selection device 120, which is used for selecting a retrieval strategy according to the web page layout mode contained in the web page information acquired by the retrieval device 110.
After the corresponding web page information is acquired, the object pages that contain the information required by the user need to be crawled further. A web crawler crawls the Web according to a retrieval strategy algorithm, and the following four traversal algorithms are commonly used: the depth-first algorithm, the breadth-first algorithm, the heuristic search algorithm and the automatic classification search algorithm. The page layout of a typical enterprise information website has the following characteristics: the first layer provides a search entry, and the enterprise list is displayed after a search is entered. For example, when 'Huaqi Technology' is searched for through the first-layer search entry of the national enterprise credit information publicity system, the next layer displays the list of enterprises whose names contain 'Huaqi Technology'. Therefore, for web pages with this layout, the URLs of the web pages are searched using a combination of depth-first and breadth-first retrieval, and a URL queue is provided, so that the crawler can obtain the page links as quickly as possible.
The obtaining device 130 has a web crawler unit 131 and a deduplication unit 132 and is configured to obtain, after the retrieval strategy has been determined, the object pages of the corresponding web page acquired by the retrieval device 110.
Specifically, according to the type of data to be acquired, the web crawler unit 131 performs a directional search that uses a given initial URL as its entry point, which improves retrieval efficiency: because information on the internet is extremely varied and only industry-related information is needed, the entire internet need not be traversed. The web crawler unit 131 obtains the object pages with a multi-threaded web crawler, searches the URLs of the web pages by combining depth-first and breadth-first retrieval, crawls the URLs of one or more object pages, and downloads the object pages according to those URLs. The web crawler is a topical web crawler, also called a focused web crawler (Focused Crawler), that is, a web crawler that selectively crawls pages related to predefined topics.
In another embodiment of the present invention, the deduplication unit 132 is further configured to analyze the URLs crawled by the web crawler unit 131 and filter out duplicates. Specifically, because internet information is complex, the crawler may repeatedly add URLs that are already present to the queue of URLs waiting to be crawled, which reduces its efficiency; deduplicating URLs while the crawler parses pages and extracts URLs is therefore increasingly important. To deduplicate efficiently without occupying much space while still ensuring accuracy, the available deduplication methods include database-based deduplication, memory-based deduplication, and Bloom-filter-based deduplication.
Further, the deduplication unit 132 is also configured to store in the URL queue the URLs that were obtained by the web crawler unit 131 and remain after deduplication.
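As an illustration of the Bloom-filter-based option, a minimal sketch is given below. The class `BloomFilter`, the helper `enqueue_unique`, the bit-array size, and the number of hash functions are assumptions chosen for the example and are not taken from the patent. A Bloom filter never reports an added URL as absent, but it can occasionally report a new URL as already seen, so a small fraction of pages may be skipped in exchange for the low memory use.

```python
import hashlib

class BloomFilter:
    """Fixed-size bit array probed by several hash positions per item."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        data = item.encode("utf-8")
        for seed in range(self.num_hashes):
            # Derive independent-looking hashes by prefixing a seed byte.
            digest = hashlib.sha256(bytes([seed]) + data).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )

def enqueue_unique(urls, seen, queue):
    """Append to queue only the URLs the filter has not seen yet."""
    for url in urls:
        if url in seen:
            continue
        seen.add(url)
        queue.append(url)
    return queue
```

A typical use would create one filter per crawl and pass every batch of extracted links through `enqueue_unique` before they reach the URL queue.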
The processing device 140 has an address processing unit 141, a connection unit 142, an acquisition unit 143, and a preprocessing unit 144, and is configured to extract the data information in the page.
The address processing unit 141 takes a URL address from the URL queue of object pages, performs DNS domain name resolution on it, and resolves the address of the Web server named in the URL so that the service of the target server can be accessed. The connection unit 142 establishes a Socket connection between the client and the server, and the acquisition unit 143 sends an HTTP request to obtain the HTML data of the content page. After the HTML of the content page is obtained, the preprocessing unit 144 transcodes and denoises the page.
Since an HTML document is self-describing, semi-structured data that is difficult for an application program to use directly, the processing device 140 further includes a structuring unit 145, which structures the HTML data acquired by the acquisition unit 143 and extracts the necessary information contained in it.
More specifically, the structuring unit 145 parses the HTML data and generates a DOM tree as shown in Fig. 4. While extracting the content, it removes nodes that are irrelevant to the body text, traverses the DOM tree recursively or with another algorithm to obtain the content nodes, and extracts their content. In some embodiments, the structuring unit 145 may further customize templates for the content that must be acquired; the set of customized templates forms a page extraction rule base. When information is extracted, the text of the web page is extracted through regular expressions according to the page extraction rule base, and it is determined whether the information matches a template.
In some embodiments, the processing device 140 further includes a policy selecting unit 146, configured to select an appropriate information extraction method when specific information in a certain domain is to be extracted in a concentrated manner. The specific information may be enterprise information, including legal information such as case events, court judgment documents, and laws and regulations. The information extraction method may be a wrapper-based data extraction method, a machine learning-based data extraction method, an HTML construction tree-based data extraction method, a Web query-based data extraction method, or any combination thereof.
Further, the information acquiring system 100 stores the extracted information in a database in structured form so that it can be retrieved and used easily. The invention mainly targets data on information websites, where most data types are relatively uniform, such as the basic information of enterprises.
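The sequence of URL parsing, DNS resolution, Socket connection, and HTTP request can be sketched with the Python standard library as follows. The function names `build_request` and `fetch_raw` are assumptions introduced for illustration, and the sketch handles plain HTTP only (no TLS, redirects, or chunked decoding); it is not presented as the patented implementation.

```python
import socket
from urllib.parse import urlsplit

def build_request(url):
    """Split a URL into host, port, and the bytes of a minimal GET request."""
    parts = urlsplit(url)
    host = parts.hostname
    port = parts.port or 80
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    host_header = host if port == 80 else f"{host}:{port}"
    request = (
        f"GET {path} HTTP/1.1\r\n"
        f"Host: {host_header}\r\n"
        "Connection: close\r\n"
        "\r\n"
    ).encode("ascii")
    return host, port, request

def fetch_raw(url, timeout=10.0):
    """Resolve the host, open a TCP socket, send the request, read the reply."""
    host, port, request = build_request(url)
    ip_address = socket.gethostbyname(host)            # DNS domain name resolution
    with socket.create_connection((ip_address, port), timeout=timeout) as conn:
        conn.sendall(request)                          # HTTP request over the socket
        chunks = []
        while True:
            chunk = conn.recv(4096)
            if not chunk:                              # server closed the connection
                break
            chunks.append(chunk)
    return b"".join(chunks)                            # status line, headers, and HTML
```

The raw bytes returned by `fetch_raw` still contain the HTTP headers; splitting them from the body and decoding the body with the declared character set would belong to the transcoding step that follows.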
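The DOM-based extraction step can be illustrated with a small sketch built on Python's standard `html.parser` module. The set of tags treated as noise (`NOISE_TAGS`), the `Node` and `DomBuilder` classes, and the sample markup are assumptions made for this example; a production system would more likely rely on a full HTML parser.

```python
from html.parser import HTMLParser

NOISE_TAGS = {"script", "style", "nav", "footer"}   # assumed "irrelevant" nodes
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}

class Node:
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []          # child Node objects and text strings

class DomBuilder(HTMLParser):
    """Build a simple tree of Node objects while the HTML is parsed."""

    def __init__(self):
        super().__init__()
        self.root = Node("document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:    # void elements never get children
            self.current = node

    def handle_endtag(self, tag):
        # Climb to the nearest open element with this tag, if there is one.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        if data.strip():
            self.current.children.append(data.strip())

def extract_text(node):
    """Recursively collect text, skipping subtrees rooted at noise tags."""
    pieces = []
    for child in node.children:
        if isinstance(child, str):
            pieces.append(child)
        elif child.tag not in NOISE_TAGS:
            pieces.extend(extract_text(child))
    return pieces

def parse_content(html):
    builder = DomBuilder()
    builder.feed(html)
    builder.close()
    return extract_text(builder.root)
```

Calling `parse_content` on a page returns the text of the content nodes in document order, which can then be matched against the templates of the page extraction rule base.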
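Storing the extracted records in structured form can be sketched as below. The sketch uses Python's built-in `sqlite3` module purely as a stand-in so that it runs without a server; the table name `enterprise` and its columns are hypothetical and are not specified in the source.

```python
import sqlite3

def store_records(conn, records):
    """Insert or update extracted enterprise records keyed by enterprise name."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS enterprise ("
        "name TEXT PRIMARY KEY, legal_representative TEXT, source_url TEXT)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO enterprise "
        "VALUES (:name, :legal_representative, :source_url)",
        records,                     # each record is a dict with these three keys
    )
    conn.commit()

def load_record(conn, name):
    """Return the stored row for one enterprise name, or None."""
    return conn.execute(
        "SELECT name, legal_representative, source_url "
        "FROM enterprise WHERE name = ?",
        (name,),
    ).fetchone()
```

Because the enterprise name is the primary key, re-crawling a page overwrites the earlier row instead of duplicating it, which suits data whose types are relatively uniform.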
In some embodiments, the information acquiring system further comprises an updating device 150 for acquiring the updated information of the web page.
Network information is updated very quickly, so the crawler needs to revisit the captured web pages periodically to detect whether they have changed, remove dead links, and update the database, so that users can obtain the latest information in time.
In another embodiment, the present invention further provides an enterprise information searching system 200, which includes a user interface 210, the information acquiring system 100, and a database 220. The system searches the database 220 for the enterprise information corresponding to information input by the user, such as an enterprise name, and outputs it according to a preset policy. The database 220 may be a local database and/or a network database and stores the data information acquired by the information acquiring system 100. Further, so that the user can obtain the required information more accurately and efficiently, the corresponding enterprise information is output in order of relevance, from high to low.
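One simple way to detect changes on a revisit is to compare a hash of the page body with the hash stored at the previous visit. The following sketch assumes a hypothetical `fetch` callable that returns the page bytes, or `None` when the link is dead; the function name `refresh_index` and the dictionary layout are likewise assumptions.

```python
import hashlib

def refresh_index(index, fetch):
    """Revisit every URL in index (url -> SHA-256 hex digest of last body).

    Returns (changed, dead): URLs whose content differs from the stored
    digest, and URLs that could not be fetched and were removed.
    """
    changed, dead = [], []
    for url in list(index):              # copy keys: index is mutated below
        body = fetch(url)
        if body is None:                 # dead link: drop it from the index
            dead.append(url)
            del index[url]
            continue
        digest = hashlib.sha256(body).hexdigest()
        if digest != index[url]:         # page changed since the last visit
            index[url] = digest
            changed.append(url)
    return changed, dead
```

The pages listed in `changed` would then be passed through extraction again so that the database reflects the latest content.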
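The source does not define how relevance is computed, so the following sketch uses a deliberately simple, assumed measure: the fraction of the stored enterprise name that is covered by the user's query. The functions `relevance` and `search_enterprises` and the record layout are hypothetical.

```python
def relevance(query, name):
    """Score in [0, 1]: share of the name covered by the query, 0 if absent."""
    q, n = query.casefold(), name.casefold()
    if not q or q not in n:
        return 0.0
    return len(q) / len(n)

def search_enterprises(query, records):
    """Return matching records ordered from high to low relevance."""
    scored = [(relevance(query, record["name"]), record) for record in records]
    hits = [(score, record) for score, record in scored if score > 0]
    # Sort by descending score; break ties alphabetically for stable output.
    hits.sort(key=lambda pair: (-pair[0], pair[1]["name"]))
    return [record for _, record in hits]
```

Under this measure an exact name match scores 1.0 and is listed first, while longer names that merely contain the query follow in order of decreasing coverage.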
It should be noted that the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover non-exclusive inclusions, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus. In addition, those skilled in the art will appreciate that all or part of the steps of implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing associated hardware, where the program is stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (22)

1. A data information acquisition method for acquiring data information associated with specified information, the method comprising:
acquiring corresponding webpage information according to the specified information;
determining a retrieval strategy according to the layout mode of the webpage;
acquiring an object page according to the retrieval strategy;
extracting the data information in the page,
wherein,
extracting the data information in the page further comprises: acquiring URL addresses in a URL queue, performing DNS domain name resolution on the URL addresses, establishing Socket connection with a server corresponding to the URL, and sending a request to acquire an HTML data file of the page, wherein the HTML data file contains the data information;
the extracting of the data information in the page further comprises obtaining an HTML data file of the page, analyzing the content of the HTML data file, generating a DOM (Document Object Model) tree, removing irrelevant nodes, and traversing the obtained content nodes;
the obtaining of the object page according to the retrieval strategy includes: obtaining URLs of one or more object pages through a multi-thread web crawler and downloading the object pages;
after the HTML data file is obtained, preprocessing of code conversion and denoising is carried out on the HTML file;
after traversing the obtained content nodes, customizing a template for the required content;
customizing the template for the required content comprises extracting the content to be obtained through pattern matching and replacement to obtain structured information data;
the method further comprises the following steps: selecting a policy for extracting the data information included in the HTML file for a domain to which the designated information relates;
the strategy for extracting the data information contained in the HTML file comprises a data extraction method based on a wrapper, a data extraction method based on machine learning, a data extraction method based on an HTML construction tree, a data extraction method based on a Web query, or any combination thereof.
2. The data information acquisition method according to claim 1, wherein said acquiring corresponding web page information according to the specified information comprises:
acquiring the corresponding web page based on the HTTP protocol, and receiving the returned web page information.
3. The data information acquisition method according to any one of claims 1-2, wherein the retrieval policy includes depth-first retrieval, breadth-first retrieval, and/or a combination thereof.
4. The data information acquisition method of claim 3, wherein, in determining a retrieval policy according to the layout manner of the web page, the web page layout comprises a first-layer retrieval entry and a second-layer information list.
5. The data information acquisition method of claim 1, wherein the multithreaded web crawler is a focused web crawler (Focused Crawler).
6. The method according to claim 1, wherein the obtaining the URLs of the one or more object pages further includes performing a deduplication operation on the URLs of the web pages, where the deduplication operation is database-based deduplication, memory-based deduplication, and/or bloom filter-based deduplication.
7. The method according to claim 1, further comprising obtaining update information of the web page, wherein the step of obtaining update information of the web page comprises periodically revisiting the crawled web page, detecting a change in the web page, removing dead links, and/or updating a database.
8. The data-information obtaining method according to claim 7, wherein the specified information is a name of a business, and the data information is data information related to the business.
9. An information acquisition system for acquiring data information associated with specified information, the system comprising:
a retrieval device, a selection device, an acquisition device and a processing device;
wherein,
the retrieval device also comprises an information unit, which is used for acquiring corresponding webpage information according to the specified information of the information unit;
the selection device is used for selecting a retrieval strategy according to a webpage layout mode contained in the webpage information acquired by the retrieval device;
the acquisition device is used for acquiring the object page of the corresponding webpage acquired by the retrieval device;
and,
the processing device is used for extracting the data information in the page,
wherein the processing device further comprises an acquisition unit and a structuring unit,
an acquisition unit for acquiring an HTML data file of the page,
the structuring unit is used for analyzing the content of the HTML data file, generating a DOM (Document Object Model) tree, removing irrelevant nodes and traversing the acquired content nodes;
the acquisition device also comprises a web crawler unit, wherein the web crawler unit acquires the URLs of one or more corresponding pages through a multi-thread web crawler and downloads the corresponding pages;
the processing apparatus further comprises:
the address processing unit is used for acquiring URL addresses in the URL queue and performing DNS domain name resolution on the URL addresses;
the connection unit is used for establishing Socket connection of the server corresponding to the URL;
wherein,
the acquisition unit is also used for sending a request to the server and acquiring an HTML data file of the page, wherein the HTML data file contains the data information;
the processing device also comprises a preprocessing unit which is used for carrying out preprocessing of code conversion and denoising on the HTML file after the HTML data file is acquired by the acquisition unit;
the structuring unit is also used for customizing a template for the required content;
customizing the template for the required content comprises extracting the content to be obtained through pattern matching and replacement to obtain structured information data;
the processing device further comprises a policy selection unit, which is used for selecting, according to the field to which the specified information relates, a policy for extracting the data information contained in the HTML file;
the strategies include wrapper-based data extraction methods, machine learning-based data extraction methods, HTML construction tree-based data extraction methods, Web query-based data extraction methods, or any combination thereof.
10. The information acquisition system according to claim 9, wherein the retrieval means is configured to acquire the corresponding web page based on an HTTP protocol, and further comprising a receiving unit configured to receive the web page information returned.
11. The information acquisition system according to any one of claims 9 to 10, wherein the retrieval policy includes depth-first retrieval, breadth-first retrieval, and/or a combination thereof.
12. The information acquisition system according to claim 11, wherein the layout of the web page includes a first-tier search entry and a second-tier information list.
13. The information acquisition system according to claim 9, wherein the multithreaded web crawler is a focused web crawler (Focused Crawler).
14. The information acquisition system according to claim 9, wherein the acquisition apparatus further comprises a deduplication unit configured to perform deduplication operations on the URL of the web page, the deduplication operations being database-based deduplication, memory-based deduplication, and/or bloom filter-based deduplication.
15. The information acquisition system according to claim 9, further comprising an updating means for acquiring the updated information of the web page, wherein the acquiring the updated information of the web page comprises periodically revisiting the crawled web page, detecting whether the web page has changed, removing dead links, and/or updating a database.
16. The information acquisition system according to claim 15, wherein the specified information is a name of a business, and the data information is data information related to the business.
17. A computer readable medium having stored thereon a database for storing enterprise data information, wherein the enterprise data information is information obtained by the method of any one of claims 1-8 or by the system of any one of claims 9-16.
18. The computer-readable medium of claim 17, wherein the database is a MySQL database or a MongoDB database.
19. An enterprise information search system, comprising a database, wherein the database is the database according to any one of claims 17-18, and the system searches the enterprise information corresponding to the information input by the user in the database according to the information input by the user and outputs the enterprise information according to a preset strategy.
20. The system of claim 19, wherein the predetermined policy is output in order of relevance.
21. The enterprise information search system of any one of claims 19-20, wherein the user-entered information is an enterprise name.
22. A computer-readable medium having stored thereon instructions that are readable by a computer to perform the information acquisition method of any one of claims 1-8.
CN201711381367.2A 2017-12-20 2017-12-20 Network information acquisition method and system and enterprise information search system Active CN108052632B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711381367.2A CN108052632B (en) 2017-12-20 2017-12-20 Network information acquisition method and system and enterprise information search system

Publications (2)

Publication Number Publication Date
CN108052632A CN108052632A (en) 2018-05-18
CN108052632B true CN108052632B (en) 2022-02-18

Family

ID=62130268

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711381367.2A Active CN108052632B (en) 2017-12-20 2017-12-20 Network information acquisition method and system and enterprise information search system

Country Status (1)

Country Link
CN (1) CN108052632B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant