CN106126688B - Intelligent network information acquisition system and method based on WEB content and structure mining - Google Patents

Info

Publication number
CN106126688B
Authority
CN
China
Prior art keywords
url
acquisition
webpage
accessed
new
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201610499521.5A
Other languages
Chinese (zh)
Other versions
CN106126688A (en
Inventor
黄杨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xiamen Bokastong Information Technology Co ltd
Original Assignee
Xiamen Fun Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xiamen Fun Network Technology Co Ltd
Priority to CN201610499521.5A
Publication of CN106126688A
Application granted
Publication of CN106126688B
Legal status: Expired - Fee Related
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/955: Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Transfer Between Computers (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses an intelligent network information acquisition system based on WEB content and structure mining. The system comprises a protocol processor, a web page tag extractor connected to the protocol processor, a URL processor, a frontier analyzer connected to the URL processor, a URL database connected to the web page tag extractor, and an acquisition monitor connected to the URL database. The system judges the relevance between a webpage and the leisure travel field by analyzing Web content and hyperlink structure, thereby determining the acquisition order and realizing intelligent network information acquisition. The invention also discloses an acquisition method in which the metadata in a webpage is extracted; when a new URL link is detected, the relevance between the new URL detected in the webpage and the collection topic is analyzed, and a list of URLs to be accessed is generated; during acquisition, the multi-threaded acquisition process is monitored and acquisition is optimized by evaluating the acquisition process, so that the recognition rate of relevant webpages is greatly improved and the whole acquisition process is optimized.

Description

Intelligent network information acquisition system and method based on WEB content and structure mining
Technical Field
The invention relates to the field of data acquisition and processing, in particular to an intelligent network information acquisition method based on WEB content and structure mining.
Background
In the era of the network information explosion, the amount of information has become extremely large, and finding valuable information in this vast ocean of information is increasingly difficult. Many machine learning methods have been proposed to address this problem, such as web page ranking methods that make predictions based on a user's request. However, even with a very complex ranking algorithm, a good information crawling tool may fail to retrieve the useful information in a web page if no topic index is set.
With the rapid expansion of WEB information, various WEB-based services have flourished. As the basis and an important component of these information services, WEB information collection is applied in many applications and studies, such as search engines, site structure analysis, page validity analysis, WEB graph evolution, user interest mining, and personalized information acquisition. However, as people demand more and more of these information services, traditional information collection based on the whole WEB is increasingly inadequate: it can neither collect enough WEB information in time nor meet people's growing personalized requirements.
Disclosure of Invention
The technical problem to be solved by the invention is to judge the relevance between a webpage and the leisure travel field by analyzing Web content and hyperlink structure, so as to determine the acquisition order and realize intelligent network information acquisition.
To solve the above technical problem, the invention provides an intelligent network information acquisition system, which comprises:
a protocol processor for acquiring data in a webpage according to a WEB protocol;
a web page tag extractor connected to the protocol processor for extracting metadata from the webpage;
a URL processor for detecting a new URL, analyzing the relevance between the new URL detected in the webpage and the collection topic, filtering and classifying the new URL according to the relevance analysis result, and then storing it in the frontier analyzer as a URL to be accessed;
a frontier analyzer connected to the URL processor for storing a list of URLs to be accessed;
a URL database connected to the web page tag extractor for storing the metadata and the URL links in the frontier analyzer; and
an acquisition monitor connected to the URL database for scheduling and monitoring multi-threaded acquisition and optimizing acquisition by evaluating the acquisition process.
Further, the frontier analyzer is further configured to:
first, initialize a URL as the seed URL list;
second, extract a URL to be acquired from the URL list in each acquisition cycle;
extract the target page corresponding to the URL to be acquired via HTTP (Hypertext Transfer Protocol);
parse the target page and extract all URL links and information in it;
finally, add the URL links that have not yet been visited to the frontier analyzer.
Further, in the protocol processor, the webpage data is acquired through one or more of the HTTP, FTP, Gopher and BBS protocols.
Further, the URL processor is based on a neural network model and collects URL links and page text together as target information.
Further, the URL processor performs a parallel search of WEB pages based on a Hopfield network.
The invention also provides an intelligent network information acquisition method based on the above system, which comprises the following steps:
acquiring data in a webpage according to a WEB protocol, extracting metadata in the webpage, and storing the metadata;
when a new URL link is detected, analyzing the relevance between the new URL detected in the webpage and the collection topic, filtering and classifying the new URL according to the relevance analysis result, then taking the new URL as a URL to be accessed, generating a list of URLs to be accessed, and storing the list;
during acquisition, monitoring the multi-threaded acquisition process and optimizing acquisition by evaluating the acquisition process.
Further, the specific method for monitoring the multi-threaded acquisition process comprises the following steps:
each thread in the multi-threaded acquisition process first locks the list of URLs to be accessed and extracts the next URL from it; after the web address has been extracted, the thread unlocks the list;
if a new URL is to be added to the list of URLs to be accessed, the list is locked again and is unlocked once the new URL has been added successfully.
Further, during multi-threaded acquisition, the captured webpages are backed up as a log.
Further, the method for analyzing the relevance between the new URL detected in the webpage and the collection topic comprises the following steps:
based on a reinforcement learning model, expressing the weight of the relevance as a reward value, and learning and optimizing subsequent choices according to the reward value;
in the multi-threaded acquisition process, each action of an acquisition thread receives corresponding reward value feedback, and the behaviour of the thread is determined so as to maximize the reward value.
Further, the reward values are obtained as follows:
initializing a seed URL set and setting the initial weight of every seed to 1;
entering the next iteration to obtain the weight of each node;
pruning and sorting the nodes according to their weights; and
repeating the above steps until the set threshold number of Web pages has been collected.
The invention has the beneficial effects that:
1) Compared with a traditional information acquisition system based on the whole WEB, the topic-based WEB information acquisition system aims to provide, for a specific topic, relevant pages of better quality and in fuller quantity. Specifically, the URL processor detects a new URL and analyzes the relevance between the new URL detected in the webpage and the collection topic; it then filters and classifies the new URL according to the relevance analysis result and stores it in the frontier analyzer as a URL to be accessed.
2) The intelligent network acquisition system makes full use of the advantages of neural networks and parallel computation. The protocol processor acquires data in a webpage according to a WEB protocol; the web page tag extractor, connected to the protocol processor, extracts metadata from the webpage; the URL processor uses reinforcement learning to detect a new URL and analyze its relevance to the collection topic, filters and classifies the new URL according to the relevance analysis result, and stores it in the frontier analyzer as a URL to be accessed, thereby calculating the relevance between the captured webpage and the topic. Because the full text does not need to be analyzed during crawling, and the acquisition priority is obtained using only the hyperlink structure and the meta tag information of the page, the efficiency and accuracy of information acquisition can be significantly improved.
3) The intelligent network information acquisition method comprises the following steps: acquiring data in a webpage according to a WEB protocol, extracting metadata in the webpage, and storing the metadata; when a new URL link is detected, analyzing the relevance between the new URL detected in the webpage and the collection topic, filtering and classifying the new URL according to the relevance analysis result, then taking the new URL as a URL to be accessed, generating a list of URLs to be accessed, and storing the list; during acquisition, monitoring the multi-threaded acquisition process and optimizing acquisition by evaluating the acquisition process. Practice shows that analyzing the frequency of keywords in the webpage text, weighting the keywords that appear in the title, keywords and description fields, and performing hyperlink analysis can greatly improve the recognition rate of relevant webpages, so that crawling the whole Web is avoided and more documents relevant to a specific topic can be found in a limited time.
Drawings
Fig. 1 is a schematic structural diagram of an intelligent network information collection system in an embodiment of the present invention.
Fig. 2 is a schematic process flow diagram of the frontier analyzer of Fig. 1.
Fig. 3 is a schematic flow chart of an intelligent network information collection method according to an embodiment of the present invention.
Fig. 4 is a flowchart of a specific method for monitoring the multi-threaded collection process in Fig. 3.
Fig. 5 is a flowchart of the relevance analysis method in Fig. 3.
Fig. 6 is a flowchart of the reward value obtaining method in Fig. 5.
Detailed Description
Fig. 1 is a schematic structural diagram of an intelligent network information collection system in an embodiment of the present invention.
The intelligent network information acquisition system in the embodiment comprises the following structures:
A protocol processor 1 for acquiring data in a webpage according to a WEB protocol.
In some embodiments, the protocol processor handles protocol processing for all supported Web protocols.
In some embodiments, Web page data is obtained through Web protocols such as HTTP, FTP, Gopher, and BBS.
HTTP, the Hypertext Transfer Protocol, defines how a browser requests a web document from a web server and how the server transfers the document to the browser. From a layering point of view, HTTP is a transaction-oriented application-layer protocol, and it is an important basis for reliably exchanging files (including text, sound, images, and other multimedia files) on the World Wide Web.
FTP, the File Transfer Protocol, is one of the protocols in the TCP/IP protocol suite. It involves two components, an FTP server and an FTP client. The FTP server stores files, and a user can use the FTP client to access resources on the FTP server through the FTP protocol. By default FTP uses two TCP ports, 20 and 21, where port 20 carries data and port 21 carries control information.
Gopher is a well-known information search system on the Internet. It organizes files on the Internet into an index that conveniently takes users from one place on the Internet to another, and it lets users discover and retrieve information through hierarchical menus and files. A Gopher client connects to a Gopher server and can use the menu structure to display and index other menus, documents, or files, while other applications may be accessed remotely through Telnet. The Gopher protocol enables all Gopher clients on the Internet to talk to all "registered" Gopher servers on the Internet.
A BBS, or Bulletin Board System, runs service software on a computer and allows a user to connect over the Internet with a terminal program in order to download data or programs, upload data, read news, exchange messages with other users, and so on.
Specifically, the basic steps of the protocol processing in the protocol processor 1 include:
1) Extract the address and port number of the target site from the page URL, and establish a network connection to that address and port.
2) Assemble an HTTP request header from the page URL and send it to the target site. If no response is received within a certain time, stop fetching the page and discard it; otherwise, continue to step 3).
3) Parse the response. If the returned status code is 2xx, the page was returned correctly; continue to step 4). If the status code is 301 or 302, the page has been redirected; extract the new target URL from the response header and return to step 2). If any other status code is returned, the page connection has failed; stop fetching the page and discard it.
In some embodiments, the status code indicates success (2xx). Status codes of this type indicate that the request was successfully received, understood, and accepted by the server, and specifically include:
200 OK
The request has succeeded, and the response headers or data body expected for the request are returned with the response.
201 Created
The request has been fulfilled, a new resource has been created as requested, and its URI is returned in the Location header. If the resource cannot be created in time, '202 Accepted' should be returned instead.
202 Accepted
The server has accepted the request but has not yet processed it. The request may or may not eventually be executed, since it may be rejected during processing. For asynchronous operations, there is no more convenient approach than sending this status code. The purpose of a 202 response is to allow the server to accept requests for other processes (e.g., a batch operation that runs only once per day) without requiring the client to stay connected to the server until the batch operation completes. A response that accepts the request for processing and returns a 202 status code should include in the returned entity some indication of the current processing status, together with a pointer to a status monitor or an estimate of when processing will finish, so that the user can estimate whether the operation has completed.
203 Non-Authoritative Information
The server has successfully processed the request, but the entity header meta-information returned is not the definitive set available from the origin server; it is a copy obtained from a local or third-party source. The information presented may be a subset or superset of the original version. For example, including metadata about the resource may result in a superset of the meta-information known to the origin server. Use of this status code is not required, and it is only appropriate when the response would otherwise be 200 OK.
204 No Content
The server has successfully processed the request and does not need to return any entity content, but may wish to return updated meta-information. The response may carry new or updated meta-information in the form of entity headers. If such header information is present, it should be associated with the requested variant.
If the client is a browser, it should keep displaying the page from which the request was sent, without any change to the document view, although any new or updated meta-information should be applied to the document in the browser's active view as specified.
Because a 204 response must not contain a message body, it always ends with the first empty line after the message headers.
205 Reset Content
The server has successfully processed the request and returns no content. Unlike a 204 response, however, a response with this status code requires the requester to reset the document view. It is mainly used to reset a form immediately after user input has been accepted, so that the user can easily begin another entry.
As with the 204 response, this response must not contain a message body and ends with the first empty line after the message headers.
206 Partial Content
The server has successfully processed a partial GET request. HTTP download tools such as FlashGet or Xunlei (Thunder) use this kind of response to resume interrupted downloads or to split a large document into several segments that are downloaded simultaneously.
207 Multi-Status
This status code, an extension defined by WebDAV (RFC 2518), indicates that the message body that follows is an XML message and may contain a series of separate response codes, depending on the number of sub-requests made.
Status code 301
When a user or a search engine sends a browsing request to a web server, this status code in the header of the HTTP response returned by the server indicates that the webpage has been permanently moved to another address.
Status code 302
The requested resource temporarily responds to the request from a different URI. Because the redirection is temporary, the client should continue to send subsequent requests to the original address. This response is cacheable only if a Cache-Control or Expires header says so.
4) Extract page information such as the date, length, and page type from the response header.
5) Read the webpage content; for longer webpages, read the content in blocks and splice the blocks together to ensure that the content is complete.
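Steps 1) and 3) above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the patented implementation: the function names `target_address` and `next_action`, the default-port table, and the use of a plain dictionary for response headers are all hypothetical choices made for the example, and no network connection is opened.

```python
from urllib.parse import urlsplit, urljoin

# Assumed default ports for URLs that do not name a port explicitly.
DEFAULT_PORTS = {"http": 80, "https": 443, "ftp": 21, "gopher": 70}


def target_address(page_url):
    """Step 1: derive the (host, port) of the target site from the page URL."""
    parts = urlsplit(page_url)
    port = parts.port or DEFAULT_PORTS.get(parts.scheme, 80)
    return parts.hostname, port


def next_action(page_url, status, headers):
    """Step 3: decide what to do with a response.

    Returns ("read", url) for a 2xx status, ("redirect", new_url) for a
    301/302 status that carries a Location header, and ("discard", url)
    for every other case.
    """
    if 200 <= status < 300:
        return "read", page_url
    if status in (301, 302) and "Location" in headers:
        # Location may be relative, so resolve it against the current URL.
        return "redirect", urljoin(page_url, headers["Location"])
    return "discard", page_url
```

A caller would loop on `next_action`, re-issuing the request while the result is `"redirect"`; a real crawler would also cap the number of redirects it follows.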
A web page tag extractor 2, connected to the protocol processor, extracts metadata from the web page. Its main task is to extract the metadata on the web page (e.g., title, summary) and store it in the URL database 5.
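As an illustration of what such an extractor might do, the following sketch collects the page title and the named meta tags using Python's standard `html.parser` module. The class name `MetaExtractor`, the helper `extract_metadata`, and the choice of a dictionary as the output format are assumptions made for this example; the patent does not specify them.

```python
from html.parser import HTMLParser


class MetaExtractor(HTMLParser):
    """Collects the <title> text and the named <meta> tags of one page."""

    def __init__(self):
        super().__init__()
        self.metadata = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attrs.get("name") and attrs.get("content"):
            # Store e.g. <meta name="keywords" content="..."> under "keywords".
            self.metadata[attrs["name"].lower()] = attrs["content"]

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.metadata["title"] = self.metadata.get("title", "") + data.strip()


def extract_metadata(html):
    """Return a dict with the title and named meta tags found in html."""
    parser = MetaExtractor()
    parser.feed(html)
    return parser.metadata
```

The resulting dictionary is the kind of record that could then be written to the URL database 5.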
A URL processor 3 detects a new URL and analyzes the relevance between the new URL detected in the web page and the collection topic; it filters and classifies the new URL according to the relevance analysis result and then stores it in the frontier analyzer as a URL to be accessed. The new URL is URL information generated according to time.
In some embodiments, the relevance to the collection topic is determined from the parent page: if the content of the parent page is highly relevant to the topic, the links it contains are likely to be relevant as well.
In some embodiments, the relevance to the collection topic is determined from the URL address and related attributes. The URL of a page about a topic generally contains topic-related words that distinguish it from other pages, and attributes such as title and name in the link tag are also important for identifying the topic of the link.
In some embodiments, the relevance to the collection topic is determined from sibling links. Sibling links are links that appear in the same webpage and in the same content block; if the page pointed to by one sibling link is related to the topic, the page pointed to by the URL in question may be related to the topic as well.
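One way to combine the three cues described above (parent page, URL and link attributes, sibling links) is a weighted sum. The sketch below is only an illustration: the function name `link_relevance`, the weights 0.4/0.4/0.2, and the assumption that parent and sibling relevance are already available as numbers between 0 and 1 are all hypothetical and are not taken from the patent.

```python
def link_relevance(url, anchor_text, parent_score, sibling_scores,
                   topic_words, weights=(0.4, 0.4, 0.2)):
    """Combine the parent, URL/attribute and sibling cues into one score.

    parent_score and sibling_scores are relevance values in [0, 1] that the
    caller has already computed; weights is an assumed (parent, url, sibling)
    weighting that sums to 1.
    """
    w_parent, w_url, w_sibling = weights
    # URL cue: fraction of topic words found in the URL or the link text.
    text = (url + " " + anchor_text).lower()
    hits = sum(1 for word in topic_words if word.lower() in text)
    url_score = hits / len(topic_words) if topic_words else 0.0
    # Sibling cue: mean relevance of the links in the same content block.
    sibling_score = (sum(sibling_scores) / len(sibling_scores)
                     if sibling_scores else 0.0)
    return w_parent * parent_score + w_url * url_score + w_sibling * sibling_score
```

Links whose score falls below a chosen cut-off could be filtered out, and the rest classified by topic before being queued.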
In some embodiments, new URLs are filtered and categorized by topic, and by keyword.
In some embodiments, storing the URL to be visited in the frontier analyzer means storing the URL in the to-be-visited list.
In some embodiments, the URLs to be accessed are stored in a first-in-first-out (FIFO) queue in the frontier analyzer.
A frontier analyzer 4, connected to the URL processor, stores the list of URLs to be accessed. When the frontier analyzer (frontier) is used to store this list, a URL provided by a user or by another program is first initialized and placed in the frontier analyzer as the seed URL list. In each acquisition cycle, the next URL is taken from the list in the frontier analyzer, the target page corresponding to that URL is fetched over HTTP, and the target page is parsed to extract all of its URL links and the specific information it contains. Finally, the URL links that have not yet been visited are added to the frontier analyzer. The frontier analyzer is thus the to-do list of the information gathering system and contains all URL links that remain to be visited.
In some embodiments, the stored list of URLs to be visited includes, but is not limited to, seed URLs.
In some embodiments, a seed URL is a URL that the user designates for access, such as the URL of a web portal.
A URL database 5, connected to the web page tag extractor, stores the metadata and the URL links in the frontier analyzer. While searching the Web, the information acquisition system in this embodiment stores in the URL database 5 the page data that has passed duplicate-content detection, such as the subject, abstract, URL, and extracted meta information of each document, for use by other applications.
In some embodiments, the data is compressed before being stored in the URL database 5 due to the large amount of data.
In some embodiments, the compression means include, but are not limited to, the MD5 algorithm, the ShortUrl algorithm, and a URL Chinese-parameter compression algorithm.
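As a hedged illustration of how MD5 can shorten stored URLs, the sketch below maps a URL of any length to a fixed 32-character hexadecimal key using Python's standard `hashlib` module. Note that MD5 is a one-way digest rather than reversible compression, so the original URL would still have to be stored elsewhere if it needs to be recovered; the function name `url_key` is an assumption made for this example.

```python
import hashlib


def url_key(url):
    """Return a fixed-length, 32-character hexadecimal key for a URL."""
    return hashlib.md5(url.encode("utf-8")).hexdigest()
```

Because the key has a fixed length, it is convenient as a database index or as the member of a "seen URLs" set, at the cost of a very small probability that two different URLs share a key.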
An acquisition monitor 6, connected to the URL database, schedules and monitors multi-threaded acquisition and optimizes acquisition by evaluating the acquisition process. The most prominent problem for an information collection process that uses machine learning is keeping information up to date, because the number and content of internet sites and pages change dynamically at any time. Establishing an efficient content update mechanism and change control mechanism is therefore a very practical and significant problem. Another fact that cannot be ignored is that the information collection system occupies a large amount of bandwidth and uses as much of the server's resources as it can, so it needs to be controlled and monitored during operation; the acquisition monitor 6 performs this role as well.
Compared with a traditional information acquisition system based on the whole WEB, the topic-based WEB information acquisition system aims to provide, for a specific topic, relevant pages of better quality and in fuller quantity. Specifically, the URL processor detects a new URL and analyzes the relevance between the new URL detected in the webpage and the collection topic; it then filters and classifies the new URL according to the relevance analysis result and stores it in the frontier analyzer as a URL to be accessed.
Fig. 2 is a schematic process flow diagram of the frontier analyzer of Fig. 1.
In the present embodiment, the processing flow in the frontier analyzer 4 includes the following steps:
step S201, first, initializing a URL as the seed URL list;
step S202, extracting a URL to be acquired from the URL list in each acquisition cycle;
step S203, extracting the target page corresponding to the URL to be acquired via HTTP (Hypertext Transfer Protocol); the result of step S203 is the target page corresponding to a seed URL or a filtered URL, which may contain other URL links that need to be extracted in step S204.
Step S204, parsing the target page and extracting all URL links and information in the target page;
in some embodiments, all of the URL links include, but are not limited to, URL information in a parent page, URL information in a sibling page, and the like.
Step S205, finally, adding the URL links that have not been visited to the frontier analyzer, and proceeding to the next URL to be visited.
In some embodiments, in the protocol processor 1, the Web page data is acquired through one or more of the HTTP, FTP, Gopher and BBS protocols, and is then stored in the frontier analyzer 4.
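The loop formed by steps S201 to S205 can be sketched as follows. This is a minimal single-threaded illustration, assuming a caller-supplied `fetch_links` function that stands in for fetching and parsing a page; the names `crawl`, `fetch_links`, and `max_pages` are hypothetical and do not come from the patent.

```python
from collections import deque


def crawl(seed_urls, fetch_links, max_pages=100):
    """Run the S201-S205 loop and return the URLs in visiting order.

    fetch_links(url) is a caller-supplied function that fetches and parses
    the page at url and returns the list of URL links found on it.
    """
    frontier = deque(seed_urls)          # S201: seed the list to be accessed
    seen = set(seed_urls)                # history of URLs already queued
    visited = []
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()         # S202: take the next URL (FIFO)
        links = fetch_links(url)         # S203 + S204: fetch and parse
        visited.append(url)
        for link in links:
            if link not in seen:         # S205: add only unvisited links
                seen.add(link)
                frontier.append(link)
    return visited
```

For example, with a toy link graph `site = {"a": ["b", "c"], "b": ["a", "d"], "c": ["d"], "d": []}`, calling `crawl(["a"], lambda u: site.get(u, []))` visits the pages breadth-first, each exactly once.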
In some embodiments, the URL processor 3 is based on a neural network model to collect URL links and page text together as target information.
Preferably, the URL links are compressed according to the ShortUrl algorithm.
Preferably, the page text is classified by topic.
Preferably, the page text is stored according to keywords.
In some embodiments, the URL processor performs a parallel search of WEB pages based on a Hopfield network.
Fig. 3 is a schematic flow chart of an intelligent network information collection method according to an embodiment of the present invention.
The acquisition method in the embodiment includes the following steps:
Step S301, acquiring data in a webpage according to a WEB protocol, extracting metadata in the webpage, and storing the metadata. As will be apparent to those skilled in the art, the metadata in step S301 includes, but is not limited to: { name, title, abstract }; { title, category, attribute }; { abstract, name, category }; { keyword, name }; { date, length, page type }. Analyzing the frequency of keywords in the webpage text, weighting the keywords that appear in the title, keywords and description fields, and performing hyperlink analysis can greatly improve the recognition rate of relevant webpages.
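The keyword-frequency weighting mentioned in step S301 can be illustrated with a short sketch. The field weights (title 3, keywords 2, description 2, body 1), the function name `page_relevance`, and the simple `\w+` tokenizer are assumptions chosen for the example; in particular, text without whitespace word boundaries, such as Chinese, would need a proper word segmenter instead.

```python
import re
from collections import Counter


def page_relevance(topic_words, title="", keywords="", description="",
                   body="", field_weights=None):
    """Weighted frequency of topic words across the fields of one page."""
    if field_weights is None:
        # Assumed weights: metadata fields count more than body text.
        field_weights = {"title": 3.0, "keywords": 2.0,
                         "description": 2.0, "body": 1.0}
    fields = {"title": title, "keywords": keywords,
              "description": description, "body": body}
    topic = {word.lower() for word in topic_words}
    score = 0.0
    for name, text in fields.items():
        # Count how often each token occurs in this field.
        counts = Counter(re.findall(r"\w+", text.lower()))
        score += field_weights[name] * sum(counts[word] for word in topic)
    return score
```

A page whose score exceeds a chosen threshold would be treated as relevant to the collection topic; the threshold itself is left to the caller.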
Step S302, when a new URL link is detected, analyzing the relevance between the new URL detected in the Web page and the collection topic, and filtering and classifying the new URL according to the relevance analysis result. Step S302 is based on a neural network model. A neural network can conveniently represent the organizational structure of the Web (especially the hyperlinks between Web pages): each network node represents a Web page, and the link strength between nodes represents the strength of relevance between the current page PA and a page PB that has a hyperlink relationship with it. In addition, the most notable characteristic of a neural network is that it lends itself well to parallel computation and search, so parallel search of the network by the information acquisition system is easy to implement. It should be emphasized that the collection system treats URL link mining and page text mining together as the target information.
In the embodiment, the Web is modelled as a weighted single-layer Hopfield network: knowledge and information are stored in single-layer interconnected neurons (nodes) and weighted synapses (links), and the nodes are activated in parallel, with activation propagated (traversed) repeatedly by the network's parallel relaxation method until a stable state is reached, so that the parallel computing and self-learning capabilities of the Hopfield network are fully utilized.
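The relaxation described above can be approximated by a simple spreading-activation loop over a weighted link graph. The sketch below is a simplified stand-in rather than a full Hopfield network or the patented procedure: seed pages are clamped at activation 1, all other nodes are updated synchronously through a sigmoid, and iteration stops when the total change falls below `epsilon`. The function name `spread_activation` and the parameters `bias`, `gain`, `threshold`, and `epsilon` are assumed values chosen for illustration.

```python
import math


def spread_activation(links, seeds, threshold=0.1, bias=0.5, gain=10.0,
                      epsilon=1e-4, max_iter=100):
    """Iterate activations over a weighted link graph until they settle.

    links maps page -> {linked_page: weight}; seeds start with activation 1.
    Returns the pages whose final activation reaches threshold, sorted from
    most to least active, together with the activation of every page.
    """
    nodes = set(links) | {t for targets in links.values() for t in targets}
    activation = {n: (1.0 if n in seeds else 0.0) for n in nodes}
    for _ in range(max_iter):
        new_activation = {}
        for node in nodes:
            if node in seeds:
                new_activation[node] = 1.0      # seeds stay clamped at 1
                continue
            # Weighted sum of the activations of pages linking to this node.
            net = sum(activation[src] * targets.get(node, 0.0)
                      for src, targets in links.items())
            new_activation[node] = 1.0 / (1.0 + math.exp(-(net - bias) * gain))
        change = sum(abs(new_activation[n] - activation[n]) for n in nodes)
        activation = new_activation
        if change < epsilon:                    # stable state reached
            break
    ranked = sorted((n for n in nodes if activation[n] >= threshold),
                    key=lambda n: activation[n], reverse=True)
    return ranked, activation
```

Pages that end below the threshold are pruned, and the remaining pages are returned in descending order of activation, which could serve as an acquisition order.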
Step S303, then taking the new URL as a URL to be visited, generating the list of URLs to be visited, and storing it; in step S303, the list is stored as a first-in-first-out (FIFO) queue.
Step S303 further includes maintaining a shared history data structure for quickly looking up web pages that have already been crawled.
Step S304, during acquisition, the multi-threaded acquisition process is monitored, and acquisition is optimized by evaluating the acquisition process. Step S304 uses multithreading; each thread follows a well-defined acquisition cycle, which effectively increases acquisition speed and makes efficient use of bandwidth. Each thread first locks the frontier analyzer 4 and extracts the next URL from the to-be-accessed list. After the web address has been extracted, the frontier analyzer is unlocked so that other threads can access it. Whenever a new URL needs to be added to the to-be-accessed list, the frontier analyzer 4 is locked again and is unlocked after the URL has been added successfully. Locking the frontier analyzer 4 is essential for keeping the acquisition cycles of the threads synchronized.
In some embodiments, the threads crawl data for a set number of days counting back from the current date.
Fig. 4 is a flowchart illustrating a specific method for monitoring the multithread collection process in fig. 3.
The specific method for monitoring the multithreading collection process in the embodiment comprises the following steps:
step S401, each thread in a multi-thread acquisition process firstly locks a URL list to be accessed and extracts the next URL from the URL list to be accessed;
step S402, after the website corresponding to the URL is extracted, unlocking a URL list to be accessed;
step S403, determining whether there is a new URL;
step S404, if a new URL is to be added to the URL list to be accessed, locking the URL list to be accessed again, and unlocking it after the new URL has been added successfully.
These steps ensure that the multithreaded acquisition cycle sequence stays synchronized.
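The lock, extract, unlock, and re-lock cycle of steps S401 to S404 can be sketched with Python's standard `threading` module. The URLs, the `discover_links` stand-in for fetching and parsing, and the thread count are assumptions made for illustration only.

```python
import threading
from collections import deque

# Hypothetical URL list to be accessed and the lock that guards it.
to_visit = deque(f"http://example.com/page{i}" for i in range(20))
to_visit_lock = threading.Lock()
fetched = []                      # URLs processed so far
fetched_lock = threading.Lock()   # guards the result list


def discover_links(url):
    """Stand-in for fetching and parsing a page: only page0 yields a new URL."""
    return ["http://example.com/extra"] if url.endswith("page0") else []


def worker():
    while True:
        # S401: lock the list and extract the next URL.
        with to_visit_lock:
            if not to_visit:
                return
            url = to_visit.popleft()
        # S402: the lock has been released here, before the slow fetch.
        new_urls = discover_links(url)
        with fetched_lock:
            fetched.append(url)
        # S403/S404: lock again only while new URLs are being added.
        if new_urls:
            with to_visit_lock:
                to_visit.extend(new_urls)


threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Holding the lock only while the list is read or modified keeps the threads from blocking each other during the fetch itself, which is where most of the time is spent.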
FIG. 5 is a flow chart illustrating a method for analyzing the correlation in FIG. 3.
The method for analyzing the relevance in the embodiment comprises the following steps:
step S501, based on a reinforcement learning model, judging the weight of the relevance in the form of an incentive value (reward), and making subsequent selections after learning and optimizing according to the resulting incentive value;
step S502, in the multithreaded acquisition process, each action of an acquisition thread receives corresponding incentive-value feedback, and the actions of the thread are planned so as to maximize the incentive value.
Reinforcement learning refers to a system framework that can learn the best decisions by itself from rewards or penalties. In the reinforcement learning model, judgments and selections are initially made according to pre-designed business logic. Once a decision has been made, however, an evaluator module tells the system how good or bad the choice was in the form of a scalar "reward" (Q value), and the system learns from that Q value and optimizes its subsequent choices. Taking an acquisition task as an example, an acquisition thread represents an agent. Each time the thread performs an action it receives the corresponding Q-value feedback, and its actions are accordingly planned under a policy of maximizing the Q value. The earlier an action that conforms to the policy is performed, the larger the "reward" (Q value); the longer the response time, the smaller the "reward" becomes, even for the same action.
The reinforcement learning model involves the following elements: the Markov decision process (MDP); value iteration and policy iteration; and parameter estimation in the MDP.
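As a reminder of what value iteration on an MDP looks like, the following Python sketch solves a toy two-state problem. Everything in it is hypothetical: the states, actions, transition probabilities, rewards, and discount factor `GAMMA` are invented for illustration and are not taken from the patent.

```python
# Hypothetical MDP: transitions[state][action] = [(probability, next_state, reward), ...]
transitions = {
    "off_topic": {
        "follow_link": [(0.3, "on_topic", 1.0), (0.7, "off_topic", 0.0)],
        "skip": [(1.0, "off_topic", 0.0)],
    },
    "on_topic": {
        "follow_link": [(0.8, "on_topic", 1.0), (0.2, "off_topic", 0.0)],
        "skip": [(1.0, "off_topic", 0.0)],
    },
}
GAMMA = 0.9  # assumed discount factor


def q_value(values, state, action):
    """Expected return of taking action in state, given current state values."""
    return sum(p * (r + GAMMA * values[s2]) for p, s2, r in transitions[state][action])


def value_iteration(tol=1e-9):
    """Apply the Bellman optimality update until the values stop changing."""
    values = {s: 0.0 for s in transitions}
    while True:
        new_values = {
            s: max(q_value(values, s, a) for a in transitions[s]) for s in transitions
        }
        if max(abs(new_values[s] - values[s]) for s in transitions) < tol:
            return new_values
        values = new_values


values = value_iteration()
# Greedy policy: in each state, choose the action with the largest Q value.
policy = {
    s: max(transitions[s], key=lambda a: q_value(values, s, a)) for s in transitions
}
```

The greedy policy extracted at the end corresponds to the "maximize the Q value" rule described for the acquisition threads.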
In this embodiment, the topic-based information acquisition system uses reinforcement learning, which not only makes it possible to define an optimal solution within the system framework but also allows system performance to be measured on the basis of response time.
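One way to picture a reward that shrinks with response time is the Python sketch below. The exponential discount, the `decay` parameter, and the candidate tuples are assumptions chosen for illustration; the text only states that a longer response time yields a smaller reward for the same action.

```python
def reward(relevance, response_seconds, decay=0.9):
    """Q-style reward: page relevance discounted by how long the action took.

    The exponential form is an assumption, not the patent's formula.
    """
    return relevance * decay ** response_seconds


def best_action(candidates):
    """Pick the (url, relevance, response_seconds) tuple with the largest reward."""
    return max(candidates, key=lambda c: reward(c[1], c[2]))


# Hypothetical candidates: a highly relevant but slow page and a quicker one.
candidates = [
    ("http://example.com/slow", 0.9, 8.0),
    ("http://example.com/fast", 0.7, 1.0),
]
chosen = best_action(candidates)
```

Under this assumed discount, a less relevant page that responds quickly can outrank a more relevant page that responds slowly, which is how response time becomes part of the performance measure.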
In some embodiments, the reinforcement learning model is topic-based. The topics include, but are not limited to, topics classified by URL, preset fixed topics, and the like.
Fig. 6 is a flowchart illustrating an excitation value obtaining method in fig. 5.
The excitation value obtaining method in this embodiment is as follows:
step S601, initializing a seed URL set, and setting the initial weight of every seed URL to 1;
step S602, entering the next iteration to obtain the weight of each node in the Hopfield network;
step S603, pruning and sorting the nodes according to their weights;
step S604, repeating the above steps until the set threshold number of Web pages has been collected.
Specifically:
(1) A seed URL set I is initialized. First, the initial weight of every seed is set to 1. With $u_i(t)$ denoting the weight of node i at iteration t, this is:
$$u_i(0) = 1, \quad \forall i \in I$$
In iteration 0 the crawler fetches and analyzes the seed Web pages, finds the hyperlinks on them, and adds the newly found URLs to the network. The seed set is chosen mainly by the user, based on the user's prior knowledge; it can also be taken from the most relevant results returned by other search engines. Examples are the URLs of web portals or of travel-themed websites.
(2) Entering the next iteration, the weight of each node can be obtained by the following iteration formula:
$$u_i(t+1) = f_s\Big(\sum_{j \in B_i} Q_{j,i}\, u_j(t)\Big)$$
where $B_i$ is the set of all parent nodes of node i, and $Q_{j,i}$ is the reward (Q value) obtained by crawling from node j to node i. The Q value represents the relevance of the content of page i to the target domain and thus reflects the analysis of the Web text content. The summation reflects the analysis of the Web hyperlink structure: topic relevance is transitive, meaning that if a Web page is related to the topic, the child pages its hyperlinks point to are also likely to be related to the topic, so the relevance of a parent page is passed on to its child pages through the hyperlinks. Moreover, when a page is referenced by several other pages, more parent pages are related to it, which strengthens the relevance of that child page. The iteration formula embodies exactly these rules. $f_s$ is a sigmoid transfer function, which is bounded above and below, monotonically increasing, continuous and smooth (i.e., differentiable). The usual sigmoid function takes the log-sigmoid (logistic) form,
$$f_s(x) = \frac{1}{1 + e^{-x}}$$
which normalizes the node weights to the interval (0, 1); the weights of all nodes in the first layer are calculated in this way.
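A single weight-update iteration can be sketched in Python as below. The sketch assumes that the update is the sigmoid of the Q-weighted sum of the parents' weights, as the definitions of $B_i$, $Q_{j,i}$ and $f_s$ suggest, and that $f_s$ is the plain logistic function. The node names, the Q values, and the choice to leave parentless nodes (the seeds) unchanged are all illustrative assumptions.

```python
import math


def sigmoid(x):
    """Logistic function, mapping any real input into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def update_weights(weights, parents, q):
    """One iteration: u_i(t+1) = sigmoid(sum over parents j of Q[j, i] * u_j(t)).

    weights: {node: u(t)}, parents: {node: [parent nodes]}, q: {(j, i): Q value}.
    Nodes without parents keep their current weight (an assumption for seeds).
    """
    new_weights = {}
    for node, current in weights.items():
        incoming = parents.get(node, [])
        if not incoming:
            new_weights[node] = current
            continue
        total = sum(q[(j, node)] * weights[j] for j in incoming)
        new_weights[node] = sigmoid(total)
    return new_weights


# Hypothetical three-node graph: the seed links to a and b, and a links to b.
weights = {"seed": 1.0, "a": 0.0, "b": 0.0}
parents = {"a": ["seed"], "b": ["seed", "a"]}
q = {("seed", "a"): 0.8, ("seed", "b"): 0.2, ("a", "b"): 0.5}
updated = update_weights(weights, parents, q)
```

Because every new weight is computed from the previous iteration's weights only, all nodes can be updated in parallel, which matches the parallel activation described for the Hopfield network.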
(3) Pruning and sorting the nodes. First, nodes whose weight is less than a certain threshold θ have too little or no relevance to the target domain and are deleted. Nodes whose weight is greater than θ are retained for the next iteration. The retained nodes are then sorted in descending order of weight and placed in the URL queue; this order determines the priority with which the crawler visits them in the next crawl and is therefore critical to the effectiveness of the whole algorithm. The choice of θ matters when the similarity is calculated. If θ is too large, the collected pages are highly accurate and precision is high, but too few pages are collected, relevant pages are easily missed, and recall drops. If θ is too small, recall increases and precision decreases. Experimental analysis indicates that θ should generally be less than 0.5 in this embodiment, and θ = 0.25 is preferred.
(4) Termination condition. The above process is repeated until a sufficient number of Web pages has been collected, or until the average weight of all nodes after an iteration is less than the maximum allowable error.
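Steps (1) to (4) can be put together in a short Python sketch. It is a simplified illustration, not the patented implementation: the link graph, the Q values, `MAX_PAGES` and `MAX_ERROR` are hypothetical, the plain logistic function is assumed for the sigmoid, each page is scored only once when it is first discovered, and negative Q values are assumed to act as penalties so that some nodes can fall below θ.

```python
import math

THETA = 0.25       # pruning threshold preferred in the text
MAX_PAGES = 4      # hypothetical "sufficient number" of pages
MAX_ERROR = 0.05   # hypothetical maximum allowable error

# Hypothetical link graph and Q values (a negative Q acts as a penalty).
links = {"seed": ["a", "b"], "a": ["c"], "b": ["d"], "c": [], "d": []}
q = {("seed", "a"): 1.5, ("seed", "b"): -2.0, ("a", "c"): 1.0, ("b", "d"): 1.0}


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def crawl(seeds):
    weights = {s: 1.0 for s in seeds}   # (1) every seed starts at weight 1
    collected = list(seeds)
    layer = list(seeds)
    while layer and len(collected) < MAX_PAGES:   # (4) enough pages collected?
        # (2) score each newly discovered child from its parents in this layer
        incoming = {}
        for parent in layer:
            for child in links[parent]:
                if child not in weights:
                    incoming[child] = incoming.get(child, 0.0) + q[(parent, child)] * weights[parent]
        scored = {child: sigmoid(total) for child, total in incoming.items()}
        # (4) stop when the average weight of this iteration is below the error bound
        if scored and sum(scored.values()) / len(scored) < MAX_ERROR:
            break
        # (3) prune nodes at or below THETA, then sort by descending weight
        kept = sorted((c for c in scored if scored[c] > THETA),
                      key=lambda c: scored[c], reverse=True)
        for child in kept:
            weights[child] = scored[child]
        collected.extend(kept)
        layer = kept
    return collected[:MAX_PAGES], weights


pages, final_weights = crawl(["seed"])
```

In this toy graph, page b receives a penalty and is pruned, so the page d that is reachable only through b is never visited; this is the intended effect of pruning on the crawl frontier.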
Those of ordinary skill in the art will understand that the present invention is not limited to the above embodiments, and that any modification, equivalent substitution, or improvement made within the spirit and principle of the present invention shall fall within its scope of protection.

Claims (5)

1. An intelligent network information acquisition system, characterized by comprising:
a protocol processor for acquiring data in a webpage according to a WEB protocol, wherein the protocol processor acquires webpage data through one or more of the Web protocols HTTP, FTP, Gopher and BBS;
a webpage tag extractor connected with the protocol processor and used for extracting metadata in the webpage;
a URL processor for detecting a new URL, analyzing the relevance between the new URL detected in the webpage and the acquisition subject, filtering and classifying the new URL according to the relevance analysis result, and then storing the new URL as a URL to be accessed in the front edge analyzer, wherein the URL processor is based on a neural network model and collects URL links and page text together as target information, and the URL processor is based on a Hopfield network and performs parallel search on WEB pages;
a front edge analyzer connected with the URL processor and used for storing a URL list to be accessed;
a URL database connected with the webpage tag extractor and used for storing the metadata and the URL links in the front edge analyzer; and
an acquisition monitor connected with the URL database and used for formulating and monitoring multithreaded acquisition and for optimizing the acquisition by evaluating the acquisition process;
the specific method for monitoring the multithreading acquisition process comprises the following steps:
each thread in the multi-thread acquisition process firstly locks a URL list to be accessed and extracts the next URL from the URL list to be accessed; after the website corresponding to the URL is extracted, unlocking a URL list to be accessed;
if a new URL is added into the URL list to be accessed, locking the URL list to be accessed again, and unlocking again after the new URL is added successfully;
the method for analyzing the relevance between the new URL detected in the webpage and the collection subject comprises the following steps:
judging the weight of the relevance in the form of an incentive value based on a reinforcement learning model, and making subsequent selections after learning and optimizing according to the resulting incentive value;
and, in the multithreaded acquisition process, receiving corresponding incentive-value feedback for each action of an acquisition thread, and planning the actions of the thread so as to maximize the incentive value.
2. The intelligent network information acquisition system of claim 1, wherein the front edge analyzer is further configured to: first, initialize URLs as a seed URL list;
second, extract a URL to be acquired from the URL list in each acquisition cycle;
extract the target page corresponding to the URL to be acquired according to HTTP (Hypertext Transfer Protocol);
analyze the target page and extract all URL links and information in the target page;
and finally, continue adding the URL links that have not been visited to the front edge analyzer.
3. An intelligent network information collection method for use in the intelligent network information collection system of claim 1, the method comprising the steps of:
acquiring data in a webpage according to a WEB protocol, extracting metadata in the webpage, and storing the metadata;
when a new URL link is detected, analyzing the relevance between the new URL detected in the webpage and the acquisition subject, filtering and classifying the new URL according to the relevance analysis result, then taking the new URL as a URL to be visited, generating a URL list to be visited, and storing the URL list;
in the collecting process, monitoring a multithreading collecting process, and simultaneously optimizing the collection by evaluating the collecting process;
the specific method for monitoring the multithreading acquisition process comprises the following steps:
each thread in the multi-thread acquisition process firstly locks a URL list to be accessed and extracts the next URL from the URL list to be accessed; after the website corresponding to the URL is extracted, unlocking a URL list to be accessed;
if a new URL is added into the URL list to be accessed, locking the URL list to be accessed again, and unlocking again after the new URL is added successfully;
the method for analyzing the relevance between the new URL detected in the webpage and the collection subject comprises the following steps:
judging the weight of the relevance in the form of an incentive value based on a reinforcement learning model, and making subsequent selections after learning and optimizing according to the resulting incentive value;
and, in the multithreaded acquisition process, receiving corresponding incentive-value feedback for each action of an acquisition thread, and planning the actions of the thread so as to maximize the incentive value.
4. The intelligent network information collection method according to claim 3, wherein the captured web pages are backed up as a log during the multi-thread collection.
5. The intelligent network information collection method according to claim 3, wherein the incentive values are obtained specifically by:
initializing a seed URL set, and setting the initial weight of every seed to 1;
entering the next iteration to obtain the weight of each node;
pruning and sorting the nodes according to their weights;
and repeating the above steps until the set threshold number of Web pages has been collected.
CN201610499521.5A 2016-06-29 2016-06-29 Intelligent network information acquisition system and method based on WEB content and structure mining Expired - Fee Related CN106126688B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610499521.5A CN106126688B (en) 2016-06-29 2016-06-29 Intelligent network information acquisition system and method based on WEB content and structure mining

Publications (2)

Publication Number Publication Date
CN106126688A CN106126688A (en) 2016-11-16
CN106126688B true CN106126688B (en) 2020-03-24

Family

ID=57284888

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610499521.5A Expired - Fee Related CN106126688B (en) 2016-06-29 2016-06-29 Intelligent network information acquisition system and method based on WEB content and structure mining

Country Status (1)

Country Link
CN (1) CN106126688B (en)

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20230407

Address after: Room 1003, No. 26 Jinqiao Road, Siming District, Xiamen City, Fujian Province, 361012

Patentee after: Xiamen Bokastong Information Technology Co.,Ltd.

Address before: Room C2202, No. 97, Huizhan Nanli, Siming District, Xiamen City, Fujian Province, 361001

Patentee before: XIAMEN QUCHU NETWORK TECHNOLOGY CO.,LTD.

CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20200324