CN107908698B

CN107908698B - Topic web crawler method, electronic device, storage medium and system

Info

Publication number: CN107908698B
Application number: CN201711071026.5A
Authority: CN
Inventors: 石忠民; 徐叶强; 钟力; 殷长涛
Original assignee: GUANGZHOU SUMMBA INFORMATION TECHNOLOGY CO LTD
Current assignee: Jiaxing Xiaoda Intelligent Technology Co.,Ltd.
Priority date: 2017-11-03
Filing date: 2017-11-03
Publication date: 2021-04-13
Anticipated expiration: 2037-11-03
Also published as: CN107908698A

Abstract

The invention provides a topic web crawler method comprising: obtaining the starting URLs of a topic crawler and loading them into a seed task queue; having the topic crawler take the starting URLs from the task queue in sequence, crawl them, and download the corresponding network documents to local storage; performing topic relevance classification on the network documents to obtain topic-related text; and storing the topic-related text in a data warehouse in a structured manner. The invention also relates to an electronic device and a readable storage medium for performing the topic web crawler method, and to a topic web crawler system. Word vectors of the text word segmentation result are obtained with the CBOW model of word2vec, the dimension of the word vectors is reduced with the principal component analysis (PCA) algorithm, and the dimension-reduced word vectors are classified with an LSTM model. Crawled documents are classified first and then stored selectively according to the specific topic, which improves the relevance between the topic and the crawled documents and increases the coverage rate of the search engine.

Description

Topic web crawler method, electronic device, storage medium and system

Technical Field

The invention relates to the technical field of web crawlers, and in particular to a topic web crawler method, electronic device, storage medium and system.

Background

The rapid growth of the network has brought people abundant information, but it has also made information retrieval much more challenging. A web crawler, also called a network robot, is a program that browses the network automatically. Web crawlers are now widely used by Internet search engines and similar websites: they automatically collect all page content that a search engine or website can access, so that users can quickly retrieve the information they need, and the collected page content can be further processed by the search engine or website, for example to train it on that content. At present, even a very large-scale search engine covers less than 50% of the information on the Internet, and the resources of retrieval servers fall far behind the growth of the network. If traditional information crawling methods continue to be used, the coverage of information retrieval will keep shrinking.

To solve the above problems, and given that a user is typically interested only in information on certain topics during information retrieval, the invention provides a topic web crawler method that uses text classification to target web pages on a specific topic.

Disclosure of Invention

In order to overcome the defects of the prior art, one of the purposes of the invention is to provide a topic web crawler method, which improves the correlation degree of topics and crawled documents and increases the coverage rate of a search engine by classifying the crawled documents.

One of the purposes of the invention is realized by adopting the following technical scheme:

a method of topic web crawler comprising the steps of:

initializing a topic crawler, acquiring a starting URL of the topic crawler, and loading the starting URL into a seed task queue;

crawling document contents, wherein the topic crawler sequentially acquires the starting URL from the task queue for crawling, and downloads the network document corresponding to the starting URL to local storage;

text classification, namely performing topic relevance classification on the network document to obtain topic relevant texts;

and text storage, namely storing the subject related text to a data warehouse in a structured manner.

Further, the step of text classification comprises the steps of:

preprocessing a document, namely performing word segmentation and stop word removal processing on the network document;

text data conversion, namely converting the text data after word segmentation into word vectors by adopting a CBOW model of word2vec;

reducing the dimension of the word vector, and reducing the dimension of the word vector by adopting a Principal Component Analysis (PCA) algorithm to obtain a dimension-reduced word vector;

and text classification, namely classifying the dimensionality reduction word vectors by adopting an LSTM model to obtain the classification result.

Further, a step of un-crawled URL collection is included between the step of document content crawling and the step of text classification: the URLs of the network document are obtained, and any of these URLs that has not yet been crawled is added to the seed task queue.

Further, the step of document preprocessing specifically includes performing word segmentation on the network document by using a forward maximum matching algorithm and a CRF word segmentation algorithm.

An electronic device, comprising: a processor;

a memory; and a program, wherein the program is stored in the memory and configured to be executed by the processor, the program comprising instructions for performing one of the subject web crawler methods described above.

A computer-readable storage medium having stored thereon a computer program for execution by a processor of a subject web crawler method as described above.

A topic web crawler system comprises an initialization module, a document crawling module, a text classification module and a text storage module, wherein the initialization module acquires a starting URL of a topic crawler, loads the starting URL into a seed task queue and maintains a crawling sequence of the seed task queue; the document crawling module sequentially acquires the initial URL from a task queue for crawling, downloads a network document corresponding to the initial URL to the local, the text classification module performs topic relevance classification on the network document to acquire a topic related text, and the text storage module structurally stores the topic related text to a data warehouse.

The text classification module further comprises a document preprocessing module, a text data conversion module and a word vector dimension reduction module, wherein the document preprocessing module is used for performing word segmentation and stop word removal on the network document; the text data conversion module converts the text data after word segmentation into word vectors by adopting a CBOW model of word2vec; the word vector dimension reduction module adopts a Principal Component Analysis (PCA) algorithm to reduce the dimension of the word vector to obtain a dimension-reduced word vector; and the text classification module classifies the dimension-reduced word vectors by adopting an LSTM model to obtain the classification result.

And further, the system also comprises an un-crawled URL collecting module, wherein the un-crawled URL collecting module acquires the URL of the network document and adds the un-crawled URL into the seed task queue.

Furthermore, the document preprocessing module performs word segmentation on the network document by adopting a forward maximum matching algorithm and a CRF word segmentation algorithm.

Compared with the prior art, the invention has the beneficial effects that:

the word vectors of text word segmentation results are obtained by adopting a CBOW model of word2vec, the PCA principal component analysis algorithm is adopted to reduce the dimension of the word vectors, the LSTM model is adopted to classify the reduced dimension word vectors, crawled documents are firstly classified, then the documents are selectively stored according to specific topics, the correlation degree of the topics and the crawled documents is improved, after the documents related to the topics are fully crawled and the document contents are structurally extracted, the non-crawled URLs related to the topics in the documents are further extracted, the non-crawled URLs are transmitted into a seed task queue, and the coverage rate of a search engine is increased.

The foregoing description is only an overview of the technical solutions of the present invention, and in order to make the technical solutions of the present invention more clearly understood and to implement them in accordance with the contents of the description, the following detailed description is given with reference to the preferred embodiments of the present invention and the accompanying drawings. The detailed description of the present invention is given in detail by the following examples and the accompanying drawings.

Drawings

The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention without limiting the invention. In the drawings:

FIG. 1 is a flow chart of a subject web crawler method of the present invention;

FIG. 2 is a flow chart of text classification according to an embodiment of the present invention;

FIG. 3 is a schematic diagram of an LSTM model according to an embodiment of the present invention;

FIG. 4 is a block diagram of a system for a topic web crawler according to the present invention.

Detailed Description

The present invention will be further described with reference to the accompanying drawings and the detailed description, and it should be noted that any combination of the embodiments or technical features described below can be used to form a new embodiment without conflict.

A subject web crawler method, as shown in fig. 1, comprising the steps of:

Initializing the topic crawler: a traditional crawler is used to collect navigation page content related to the topic on the Internet, yielding a large number of URLs (uniform resource locators) that serve as the starting URLs of the topic crawler; the starting URLs are then loaded into a seed task queue. Preferably, the step of initializing the topic crawler further includes popping a single starting URL from the seed task queue as required by the crawler. The seed task queue mainly maintains the seed URLs; its operations include the initial loading of URLs, the dequeue of a single URL during crawling, and the enqueue of the URLs extracted after a single document has been crawled.

Document content crawling: the topic crawler takes the starting URLs from the task queue in sequence, crawls them, and downloads the corresponding network documents to local storage. In this embodiment, the URLs in the seed task queue are crawled in order: the network document corresponding to the current URL is crawled and downloaded to local storage, the steps of un-crawled URL collection, text classification and text storage are applied to that document, and then the next URL in the seed task queue is taken. These steps repeat until all URLs in the seed task queue have been crawled.

In one embodiment, preferably, a step of un-crawled URL collection is further included between the step of document content crawling and the step of text classification: the URLs of the network document are obtained, and any of these URLs that has not yet been crawled is added to the seed task queue, which increases the coverage rate of the search engine.
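The crawl loop described above (dequeue a URL, download the document, enqueue URLs not yet crawled, classify, store) can be sketched as follows. This is a minimal illustration, not the patented implementation: `crawl`, `fetch`, `is_relevant` and `max_pages` are hypothetical names, `fetch` is an injected stand-in for the downloader, and `is_relevant` stands in for the whole text classification step.

```python
from collections import deque
from typing import Callable, Dict, Iterable, List, Tuple


def crawl(seed_urls: Iterable[str],
          fetch: Callable[[str], Tuple[str, List[str]]],
          is_relevant: Callable[[str], bool],
          max_pages: int = 100) -> Dict[str, str]:
    """Crawl from the seed URLs and return {url: text} for topic-relevant pages.

    fetch(url) is a hypothetical downloader returning (text, outgoing_links);
    is_relevant(text) is a placeholder for the text classification step.
    """
    queue = deque(seed_urls)           # seed task queue, first in first out
    seen = set(queue)                  # URLs already queued or crawled
    store: Dict[str, str] = {}         # stand-in for the data warehouse
    crawled = 0
    while queue and crawled < max_pages:
        url = queue.popleft()          # dequeue a single URL
        text, links = fetch(url)       # download the network document
        crawled += 1
        for link in links:             # collect URLs that were not crawled yet
            if link not in seen:
                seen.add(link)
                queue.append(link)     # enqueue into the seed task queue
        if is_relevant(text):          # keep only topic-relevant text
            store[url] = text
    return store


# Toy usage with an in-memory "web" instead of real HTTP requests.
pages = {
    "a": ("football match report", ["b", "c"]),
    "b": ("cooking recipe", ["a"]),
    "c": ("football transfer news", ["a", "b"]),
}
result = crawl(["a"], lambda u: pages[u], lambda t: "football" in t)
```

The `seen` set is what keeps already crawled URLs from re-entering the queue; the `max_pages` cap is an added safeguard that the source does not mention.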

Text classification: topic relevance classification is performed on the network documents to obtain topic-related text. As shown in FIG. 2, the step of text classification preferably includes the following steps:

Document preprocessing: word segmentation and stop word removal are performed on the network document. Preferably, the network document is segmented with a Chinese word segmentation algorithm that combines a forward maximum matching algorithm based on the Sogou corpus with a CRF word segmentation algorithm. Stop word removal mainly consists of removing from the document the classes of words that are widely used and appear frequently in documents, such as "fire", "eat", "and", "but", "this", etc.; the stop words mainly include modal words, adverbs, auxiliary words, conjunctions, prepositions, and the like.
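As an illustration of the preprocessing step, the sketch below implements only the forward maximum matching half of the segmentation, followed by stop word removal. The CRF segmenter that the embodiment combines with it is omitted. The dictionary, the stop word set and the `max_len` limit are illustrative assumptions, not values from the source.

```python
from typing import List, Set


def forward_max_match(text: str, dictionary: Set[str], max_len: int = 4) -> List[str]:
    """Segment text by repeatedly taking the longest dictionary word at the cursor."""
    tokens: List[str] = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until the dictionary matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)   # a single character is the fallback
                i += size
                break
    return tokens


def remove_stop_words(tokens: List[str], stop_words: Set[str]) -> List[str]:
    """Drop every token that appears in the stop word set."""
    return [t for t in tokens if t not in stop_words]


# Hypothetical toy dictionary and stop word list.
dictionary = {"主题", "网络", "爬虫", "网络爬虫", "方法"}
tokens = forward_max_match("主题网络爬虫的方法", dictionary)
content_tokens = remove_stop_words(tokens, {"的"})
```

Because the longest match wins, "网络爬虫" is kept as one token even though "网络" and "爬虫" are both in the dictionary.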

Text data conversion: the text data after word segmentation is converted into word vectors with the CBOW model of word2vec. The CBOW model is a probability model that predicts a middle word from its surrounding words: for the middle word W_t, the probability that W_t is a given word in the dictionary is calculated from the 2K surrounding words W_(t−K), W_(t−K+1), …, W_(t+K−1), W_(t+K). For a sentence S, the CBOW model gives the probability that S is natural language:

where P (S) represents the probability that S is natural language, T is the size of the dictionary, and P (W1, …, Wt) represents the joint probability of words in the text, the following model is constructed for the entire text:

making the value of P (S) as large as possible to obtain the objective function of the model:

for the model of the entire text, the objective function is:

and solving the maximum value of the objective function to obtain a CBOW model training target. A training strategy based on a hierarchical structure is adopted, specifically, a hierarchical Softmax algorithm is adopted, Huffman codes are added into an output layer, a Huffman tree is constructed, leaf nodes of the Huffman tree have unique codes, and words of a corpus are represented. The non-leaf nodes of the Huffman tree represent the category of words, the input layer is word vectors of context words, the hidden layer performs vector addition operation on the word vectors of the input layer, the output of the hidden layer is connected with each non-leaf node of the output Huffman tree, and each connection has a corresponding weight.
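Two pieces of the CBOW description can be sketched without a training framework: extracting the (context, middle word) pairs from up to K words on each side, and building the Huffman tree whose leaves carry the unique word codes used by hierarchical Softmax. The function names and the toy frequencies are assumptions for illustration; training the vectors and the weights on the non-leaf nodes is out of scope here.

```python
import heapq
from itertools import count
from typing import Dict, List, Tuple


def cbow_pairs(tokens: List[str], k: int) -> List[Tuple[List[str], str]]:
    """Return (context, middle word) pairs using up to k words on each side."""
    pairs = []
    for t, word in enumerate(tokens):
        context = tokens[max(0, t - k):t] + tokens[t + 1:t + 1 + k]
        pairs.append((context, word))
    return pairs


def huffman_codes(freq: Dict[str, int]) -> Dict[str, str]:
    """Build a Huffman tree over word frequencies and return each leaf's code."""
    if not freq:
        return {}
    tie = count()  # unique tie-breaker so the heap never compares tree nodes
    heap = [(f, next(tie), w) for w, f in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)    # two least frequent subtrees
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tie), (left, right)))
    codes: Dict[str, str] = {}

    def walk(node, prefix: str) -> None:
        if isinstance(node, tuple):          # non-leaf node: recurse into children
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:                                # leaf node: a word of the corpus
            codes[node] = prefix or "0"

    walk(heap[0][2], "")
    return codes


pairs = cbow_pairs(["a", "b", "c", "d"], 1)
codes = huffman_codes({"crawler": 5, "topic": 3, "web": 1, "url": 1})
```

Frequent words end up closer to the root and therefore get shorter codes, which is what makes the hierarchical output layer cheaper than a full Softmax over the dictionary.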

Word vector dimension reduction: the principal component analysis (PCA) algorithm is used to reduce the dimension of the word vectors obtained with the CBOW model, giving dimension-reduced word vectors. For a matrix A = {a_ij} (i = 1, …, m; j = 1, …, n), the dimension reduction proceeds as follows:

Calculate the mean ā_j of each column of the matrix A (j = 1, 2, …, n);

Calculate the adjusted matrix B = {b_ij}, where b_ij = a_ij − ā_j (i = 1, …, m);

Calculate the covariance matrix S = {s_jk} of B;

Calculate the eigenvalues λ and eigenvectors e of the matrix S;

Arrange the eigenvalues in descending order (λ₁ ≥ λ₂ ≥ … ≥ λ_n) and select the corresponding eigenvectors e in that order to construct the eigenvector matrix, in which each row vector represents a document.
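The PCA steps can be sketched in plain Python as below. Several choices are assumptions rather than details from the source: one document per row of A with one mean per feature column, the sample covariance (division by m − 1), and power iteration with deflation as a stand-in for whatever eigen-decomposition routine an implementation would actually use.

```python
import math
from typing import List

Matrix = List[List[float]]


def center_columns(a: Matrix) -> Matrix:
    """Subtract each column's mean so every feature has zero mean."""
    m, n = len(a), len(a[0])
    means = [sum(row[j] for row in a) / m for j in range(n)]
    return [[row[j] - means[j] for j in range(n)] for row in a]


def covariance(b: Matrix) -> Matrix:
    """Sample covariance (divide by m - 1) of an already centered matrix."""
    m, n = len(b), len(b[0])
    return [[sum(row[j] * row[k] for row in b) / (m - 1) for k in range(n)]
            for j in range(n)]


def top_eigenvectors(s: Matrix, components: int, iters: int = 200) -> Matrix:
    """Power iteration with deflation: eigenvectors in descending eigenvalue order."""
    n = len(s)
    s = [row[:] for row in s]                       # work on a copy
    vectors = []
    for _component in range(components):
        v = [1.0 + 0.37 * i for i in range(n)]      # arbitrary non-zero start
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
        for _step in range(iters):
            w = [sum(s[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:                        # no variance left to explain
                break
            v = [x / norm for x in w]
        lam = sum(v[i] * s[i][j] * v[j] for i in range(n) for j in range(n))
        vectors.append(v)
        # Deflate: remove the component just found before the next search.
        s = [[s[i][j] - lam * v[i] * v[j] for j in range(n)] for i in range(n)]
    return vectors


def pca(a: Matrix, components: int) -> Matrix:
    """Project the centered rows of a onto the leading eigenvectors."""
    b = center_columns(a)
    vectors = top_eigenvectors(covariance(b), components)
    return [[sum(x * e for x, e in zip(row, vec)) for vec in vectors] for row in b]


# Four 2-dimensional points on the line y = 2x collapse onto one component.
reduced = pca([[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]], 1)
```

Power iteration can stall if the start vector happens to be orthogonal to the leading eigenvector, so a production system would more likely rely on a tested linear algebra library.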

Text classification: an LSTM model with an attention model is used to classify the dimension-reduced word vectors and obtain the classification result. The attention model is applied in the word vector encoding stage. For the input sequence x₁, x₂, …, x_T of the text, the input vectors of the historical nodes are accumulated and averaged to obtain the input vector X′ of the text as a whole; X′ is the final input of the word vector encoding stage. h₁, h₂, h₃, …, h_T are the hidden layer state values corresponding to the input sequence x₁, x₂, …, x_T, and H_k is the hidden layer state value corresponding to the input X′. The model structure is shown in FIG. 3, where a_ki is the attention probability of a historical node with respect to the last node, x₁, x₂, …, x_T are the words of the text, and X′ is the input vector of the article as a whole. To reduce the semantic influence of non-keywords on the whole text, the influence weight of x₁, x₂, …, x_T on the article as a whole is calculated as follows:

semantic coding to compute attention distribution probability:

e_ki＝v tanh(Wh_k+Uh_i+b)

where a_ki is the attention probability weight of node i with respect to node k, T is the number of elements in the input sequence, v, W and U are weight matrices, h_k is the hidden layer state corresponding to the last input, and h_i is the hidden layer state value corresponding to the i-th element of the input sequence;

calculating semantic codes and feature vectors of attention distribution probability:

H_k′＝H(C，h_k，X′)

The final semantic encoding takes the semantic code of the attention probability distribution over the historical nodes, together with the overall text vector, as the input of a conventional LSTM module, and the hidden layer state value H_k of the final node is the final feature vector. Through the above steps, the acquired documents are classified according to the required topic, so as to judge whether a document belongs to a given topic.
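The attention scoring can be illustrated with a small numeric sketch. Only the score e_ki = v tanh(W h_k + U h_i + b) comes from the text. Normalizing the scores with a softmax to get a_ki, and forming the semantic code C as the a_ki-weighted sum of the hidden states h_i, are assumptions based on the usual additive attention formulation; the LSTM step H(C, h_k, X′) itself is not modeled.

```python
import math
from typing import List, Tuple

Vector = List[float]
Matrix = List[Vector]


def matvec(m: Matrix, x: Vector) -> Vector:
    """Multiply matrix m by vector x."""
    return [sum(a * b for a, b in zip(row, x)) for row in m]


def attention(h: List[Vector], h_k: Vector, v: Vector,
              w: Matrix, u: Matrix, b: Vector) -> Tuple[Vector, Vector]:
    """Return (attention weights a_ki, semantic code C) for hidden states h."""
    wh_k = matvec(w, h_k)
    scores = []
    for h_i in h:
        uh_i = matvec(u, h_i)
        hidden = [math.tanh(p + q + r) for p, q, r in zip(wh_k, uh_i, b)]
        scores.append(sum(vi * x for vi, x in zip(v, hidden)))   # e_ki
    top = max(scores)                        # subtract the max for stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]      # assumed softmax, sums to 1
    dim = len(h[0])
    context = [sum(a * h_i[d] for a, h_i in zip(weights, h)) for d in range(dim)]
    return weights, context


# Toy example with identity weight matrices and a zero bias.
identity = [[1.0, 0.0], [0.0, 1.0]]
weights, context = attention(
    h=[[1.0, 0.0], [0.0, 1.0]], h_k=[0.0, 1.0],
    v=[1.0, 1.0], w=identity, u=identity, b=[0.0, 0.0])
```

Nodes whose scores are higher receive larger weights, so words that matter less to the text as a whole contribute less to C.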

Text storage: the topic-related text is stored in a data warehouse in a structured manner. Because the data structures of the documents are not uniform, a NoSQL database is used to store the topic-related text.
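As a rough illustration of structured storage for documents with non-uniform fields, the sketch below keeps JSON records in an in-memory dictionary that stands in for a NoSQL document store. The `DocumentStore` class, the field names and the choice of a URL hash as the key are all hypothetical; the source does not name a specific database or schema.

```python
import hashlib
import json
from typing import Dict, List


def make_record(url: str, topic: str, text: str, **extra: str) -> Dict[str, str]:
    """Build a structured record; extra fields may differ from page to page."""
    record = {
        "_id": hashlib.sha1(url.encode("utf-8")).hexdigest(),  # stable key per URL
        "url": url,
        "topic": topic,
        "text": text,
    }
    record.update(extra)
    return record


class DocumentStore:
    """In-memory stand-in for a schemaless document database."""

    def __init__(self) -> None:
        self._docs: Dict[str, str] = {}

    def upsert(self, record: Dict[str, str]) -> None:
        """Insert the record, or overwrite the one with the same _id."""
        self._docs[record["_id"]] = json.dumps(record, ensure_ascii=False)

    def find_by_topic(self, topic: str) -> List[Dict[str, str]]:
        """Return every stored record labeled with the given topic."""
        docs = (json.loads(raw) for raw in self._docs.values())
        return [d for d in docs if d["topic"] == topic]


store = DocumentStore()
store.upsert(make_record("http://example.com/a", "football", "match report"))
store.upsert(make_record("http://example.com/a", "football", "updated report",
                         author="unknown"))
store.upsert(make_record("http://example.com/b", "cooking", "recipe"))
football_docs = store.find_by_topic("football")
```

Keying on the URL means a page that is crawled again replaces its earlier record instead of creating a duplicate.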

An electronic device, comprising: a processor; a memory; and a program, wherein the program is stored in the memory and configured to be executed by the processor, the program comprising instructions for performing one of the subject web crawler methods described above; a computer-readable storage medium having stored thereon a computer program for execution by a processor of a subject web crawler method as described above.

A topic web crawler system, as shown in FIG. 4, comprises an initialization module, a document crawling module, a text classification module and a text storage module. The initialization module acquires the starting URLs of the topic crawler, loads them into a seed task queue and maintains the crawling order of the task queue. The seed task queue mainly maintains the seed URLs; its operations include the initial loading of URLs, the dequeue of a single URL during crawling, and the enqueue of the URLs extracted after a single document has been crawled. The document crawling module takes the starting URLs from the task queue in sequence, crawls them, and downloads the corresponding network documents, which are unstructured text, to local storage. The text classification module performs topic relevance classification on the network documents to obtain topic-related text, and the text storage module stores the topic-related text in a data warehouse in a structured manner, for example a NoSQL data warehouse.

In an embodiment, preferably, the text classification module further includes a document preprocessing module, a text data conversion module, and a word vector dimension reduction module. The document preprocessing module performs word segmentation and stop word removal on the network document; further, it segments the network document with a forward maximum matching algorithm based on the Sogou corpus combined with a CRF word segmentation algorithm. The text data conversion module converts the segmented text data into word vectors with the CBOW model of word2vec. The word vector dimension reduction module reduces the dimension of the word vectors with the principal component analysis (PCA) algorithm to obtain dimension-reduced word vectors. The text classification module classifies the dimension-reduced word vectors with an LSTM model to obtain the classification result.

In an embodiment, the system preferably further includes an un-crawled URL collection module, which obtains the URLs of the network document and adds the un-crawled URLs to the seed task queue.

The foregoing is merely a preferred embodiment of the invention and is not intended to limit the invention in any manner; those skilled in the art can readily practice the invention as shown and described in the drawings and detailed description herein; however, those skilled in the art should appreciate that they can readily use the disclosed conception and specific embodiments as a basis for designing or modifying other structures for carrying out the same purposes of the present invention without departing from the scope of the invention as defined by the appended claims; meanwhile, any changes, modifications, and evolutions of the equivalent changes of the above embodiments according to the actual techniques of the present invention are still within the protection scope of the technical solution of the present invention.

Claims

1. A method for topic web crawler, comprising the steps of:

collecting the un-crawled URL, acquiring the URL of the network document, and adding the un-crawled URL into the seed task queue when the URL of the network document is not crawled;

the text storage is used for storing the subject related text to a data warehouse in a structured manner;

the step of text classification comprises the steps of:

2. The subject web crawler method of claim 1, wherein: the step of document preprocessing specifically comprises the step of performing word segmentation on the network document by adopting a forward maximum matching algorithm and a CRF word segmentation algorithm.

3. An electronic device, characterized by comprising: a processor;

a memory; and a program, wherein the program is stored in the memory and configured to be executed by the processor, the program comprising instructions for performing the method of any of claims 1-2.

4. A computer-readable storage medium having stored thereon a computer program, characterized in that: the computer program is executed by a processor for performing the method according to any of claims 1-2.

5. A subject web crawler system, comprising: the system comprises an initialization module, a document crawling module, a text classification module and a text storage module, wherein the initialization module acquires a starting URL of a topic crawler, loads the starting URL into a seed task queue and maintains a crawling sequence of the seed task queue; the document crawling module sequentially acquires the initial URL from a task queue for crawling, downloads a network document corresponding to the initial URL to the local, the text classification module performs topic relevance classification on the network document to acquire a topic related text, and the text storage module structurally stores the topic related text to a data warehouse; the non-crawl URL collection module acquires the URL of the network document and adds the non-crawl URL into the seed task queue;

the text classification module also comprises a document preprocessing module, a text data conversion module and a word vector dimension reduction module, wherein the document preprocessing module carries out word segmentation and stop word removal processing on the network document; the text data conversion module converts the text data after word segmentation into word vectors by adopting a CBOW model of word2vec; the word vector dimension reduction module adopts a Principal Component Analysis (PCA) algorithm to reduce the dimension of the word vector to obtain a dimension-reduced word vector; and the text classification module classifies the dimension-reduced word vectors by adopting an LSTM model to obtain the classification result.

6. The subject web crawler system of claim 5, wherein: and the document preprocessing module performs word segmentation on the network document by adopting a forward maximum matching algorithm and a CRF word segmentation algorithm.