CN106777043A - An LDA-based academic resource acquisition method - Google Patents

An LDA-based academic resource acquisition method

Info

Publication number
CN106777043A
CN106777043A (application CN201611128684.9A)
Authority
CN
China
Prior art keywords
academic
topic
crawler
theme
lda
Prior art date
Legal status
Pending
Application number
CN201611128684.9A
Other languages
Chinese (zh)
Inventor
刘柏嵩
费晨杰
王洋洋
尹丽玲
高元
Current Assignee
Ningbo University
Original Assignee
Ningbo University
Priority date
Filing date
Publication date
Application filed by Ningbo University filed Critical Ningbo University
Priority to CN201611128684.9A
Publication of CN106777043A

Classifications

    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
    • G06F16/986 Document structures and storage, e.g. HTML extensions
    • G06F16/951 Indexing; Web crawling techniques
    • G06F16/9558 Details of hyperlinks; Management of linked annotations
    • G06F16/9566 URL specific, e.g. using aliases, detecting broken or misspelled links
    • G06F2216/03 Data mining

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

An LDA-based academic resource acquisition method is provided. A topic crawler is used together with an LDA topic model: a training corpus is first supplied to train the LDA topic model and obtain topic documents. In addition to the components of a general web crawler, the topic crawler includes a topic determination module, a similarity calculation module and a URL prioritization module. During crawling, the topic documents guide the calculation of topic similarity; URLs whose topic similarity exceeds a set threshold are selected, and the crawler maintains a queue of unvisited URLs. The crawler visits the web page of each URL in queue order, crawls the corresponding academic resources, attaches classification labels to them and stores them in a database, until the queue of unvisited URLs is empty. An open API of the academic resource database is provided for display and invocation. By integrating machine learning into the acquisition of academic resources, the method improves the quality and efficiency of academic resource acquisition.

Description

Academic resource acquisition method based on LDA
Technical Field
The invention relates to machine learning, information retrieval and web page data mining, in particular to an academic resource acquisition method based on LDA.
Background
With the digitization of academic resources, discovering and mining academic resources in researchers' areas of interest from massive collections has become a research hotspot. To cope with the massive scale and multi-source heterogeneity of digital academic resources, and in contrast to traditional topic discovery methods based on keyword frequency such as co-word analysis and citation analysis, new methods and models based on machine learning and data mining are continually being applied to academic resource classification. In practice these methods have achieved good results in academic resource topic discovery; typical examples are the Latent Dirichlet Allocation model (LDA) and Social Network Analysis (SNA).
A web crawler is a program or script that automatically captures internet information according to certain rules. A topic crawler is a web crawler that selectively crawls pages related to a predefined topic. A topic is a defined professional or interest field, such as aerospace, biomedicine or information technology, and concretely refers to a set of related words.
LDA (Latent Dirichlet Allocation) is a generative document topic model, also called a three-layer Bayesian probability model, comprising word, topic and document layers. As a generative model, it regards each word of an article as obtained through a process of "selecting a topic with a certain probability and then selecting a word from that topic with a certain probability". The document-to-topic distribution and the topic-to-word distribution are both multinomial. LDA is an unsupervised machine learning technique that can identify latent topic information in large-scale document collections or corpora. It adopts the bag-of-words approach, treating each document as a word-frequency vector and thereby converting text into numerical information that is easy to model. The bag-of-words approach ignores word order, which simplifies the problem and also leaves room for model improvement. Each document is represented as a probability distribution over topics, and each topic as a probability distribution over words. The LDA topic model is a typical model for topic mining in natural language processing: it can extract latent topics from a text corpus and provides a way to quantify research topics, so it is widely applied to topic discovery in academic resources, such as research hotspot mining, research topic evolution and research trend prediction. A web topic crawler based on the LDA topic model is therefore designed. Judging from the current state of LDA applications, existing technical means for acquiring digitized academic resources (journal papers, patents and theses) all have certain limitations.
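For illustration only, the following Python sketch shows the bag-of-words step described above: documents are reduced to word-frequency vectors, with word order deliberately discarded. The toy corpus and vocabulary here are hypothetical and do not come from the patent.

```python
from collections import Counter

# Toy tokenized corpus; real input would be segmented abstracts of academic papers.
docs = [
    ["topic", "model", "lda", "topic", "mining"],
    ["crawler", "url", "queue", "crawler"],
]

# Build a shared vocabulary, then map each document to a word-frequency vector.
vocab = sorted({w for d in docs for w in d})
index = {w: i for i, w in enumerate(vocab)}

def bag_of_words(tokens):
    """Return a dense word-frequency vector; word order is discarded."""
    vec = [0] * len(vocab)
    for word, count in Counter(tokens).items():
        vec[index[word]] = count
    return vec

vectors = [bag_of_words(d) for d in docs]
print(vocab)
print(vectors)
```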
Academic research and technical development require access to existing academic resources and technical information. Usually the members of each research or development team search for and collect these resources themselves, which leads to a great deal of duplicated searching and consumes a large share of their time and energy. With the rapid development of the internet the number of web pages has grown explosively, but because of limits on computing, network and storage resources, traditional search technology can hardly cover the differing needs of massive numbers of users. Intelligent, personalized, domain-oriented search engine technology has therefore emerged, and research on vertical search engines has become a hot direction. Before a vertical search engine can be built, the most important link is how to use a topic crawler to capture content in the relevant topic field from the vast internet and obtain accurate and comprehensive academic resource information for the target field. Internet information is updated rapidly, and new terms, concepts and ideas appear continually in every discipline, so the topic crawler must have a self-learning capability in order to adapt to this rapid updating.
For literature and information service organizations, such as university libraries and scientific and technical information stations, it is important to acquire online literature information and push the resources relevant to each profession to the corresponding personnel. At present, methods that use an LDA-based topic crawler to acquire resources are all directed at the needs of a single academic research or technical development team, and the crawler is designed to capture only one academic or technical field, or a single topic. One crawl of such a topic crawler can therefore only provide academic or technical resources for a single field or topic. The question is how to let the topic crawler, in one crawl, obtain academic or technical resources for multiple fields or topics, so that the required resources can be provided to multiple research or development teams simultaneously, while the relevance and coverage of the resources still meet each team's requirements.
The present invention is directed to solving the above-mentioned problems.
Disclosure of Invention
The technical problem to be solved by the present invention, in view of the above state of the art, is to provide an LDA-based academic resource acquisition method. Aiming at the defects in the prior art, the invention provides an LDA-based academic resource topic crawler, in which topic similarity is calculated by a method combining VSM and SSRM, so as to acquire more accurately and effectively, from massive academic resources, the data most relevant to the topics that research users care about.
The technical scheme adopted by the invention for solving the technical problems is as follows:
An LDA-based academic resource acquisition method, in which the academic resources are electronic documents published on the internet and a computer-run topic crawler is used to obtain, from the internet, the electronic documents belonging to a target academic topic. The method is characterized in that a computer-run LDA topic model is used: a corpus is configured for the LDA topic model, the corpus is used to train the model, and the topic document for the crawler's current crawl is obtained from the LDA topic model, a topic document being a set of words associated with the topic. In addition to the components of a common web crawler, the topic crawler further comprises a topic determination module, a similarity calculation module and a URL prioritization module. During crawling, the topic determination module determines the target topic and its topic document, and the topic document guides the calculation of topic similarity. The similarity calculation module computes the topic similarity of each anchor text on a crawled page in combination with the page content; hyperlinks whose combined similarity is below a set threshold are discarded, and the URLs of those above the threshold are selected. The crawler maintains a queue of unvisited URLs pointed to by hyperlinks of the visited pages, ordered by descending similarity; it visits the pages in queue order, crawls the corresponding academic resources, continually attaches classification labels to the crawled resources according to the topic document of the current crawl and stores them in a database, until the queue of unvisited URLs is empty. An open API of the academic resource database is provided for display and invocation.
The academic resources crawled by the topic crawler in each round are used as new corpora for training the LDA topic model, and the crawling process described above is repeated continuously. In this way the associated words gathered in each topic document are continually supplemented and updated, the crawled academic resources are continually supplemented and updated, and the precision and recall of the academic resources obtained for the target academic topic keep improving.
To obtain related academic resources from the internet simultaneously for several demanders who pay attention to different academic topics, the academic topics are a plurality of manually set topics: for each topic, related academic resources for its keywords are collected manually, according to knowledge and experience, from relevant websites on the internet, and the collected resources are used as the initial corpus for LDA topic model training. The topic crawlers are a plurality of distributed crawlers, one per academic topic, so that together they obtain academic resources for all of the topics at the same time.
Alternatively, the academic topics may be a plurality of topics covering all disciplines and produced by LDA topic model training. A classification number covering all academic fields is determined manually according to how fine-grained the classification needs to be and is used as the number of academic topics; a sufficient quantity of text resources is collected at random, according to the operator's knowledge and experience, from relevant websites on the internet as the initial corpus for training; the LDA topic model is then trained to obtain, for that number of topics, topic documents covering all disciplines as classified by the model; the associated word list of each topic document is read and topic names are assigned manually according to knowledge and experience. The topic crawlers are a plurality of distributed crawlers, one per academic topic, so that together they obtain academic resources for all of the topics at the same time.
The electronic documents published on the internet include papers, periodicals, news and patent documents. Abstracts of academic resources are used as the training corpus; topics and topic documents are obtained by LDA topic model computation; the topic documents guide the calculation of topic similarity during crawling; the crawled content is then classified, labelled and stored in the database as new corpora for LDA training; finally an open API of the academic resource database is provided for display and invocation. The specific steps are as follows:
step one, download and preprocess abstracts of academic resources in several existing fields, manually classify them into categories by academic field, and use each category as the training corpus for one of the LDA topics;
step two, input the LDA topic model parameters K, α and β, where K is the number of topics, α is the prior weight distribution of the topics before sampling and β is the prior distribution of each topic over words; training yields a number of topics and more finely subdivided topic documents, and each topic document is used to guide a crawler (a minimal training sketch is given after this step list);
step three, starting from the selected high-quality seed URLs, each crawler maintains a queue of hyperlinks to unvisited pages, continuously computes the similarity between the topic and both the page text and the text pointed to by anchor-text links in the page, sorts and updates the crawl URL queue by the computed similarity, and captures the page content most relevant to the topic;
step four, after the academic resources acquired by the topic crawler are marked with the corresponding topic labels, they are stored in the database and used as new corpora for LDA training to update the topic documents;
and step five, provide an open API of the academic resource database for display and invocation.
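As a rough illustration of the step list above, the sketch below trains an LDA topic model on tokenized abstracts using the gensim library; the library choice, the tiny corpus and the parameter values are assumptions, since the patent does not name a toolkit. Note that gensim calls the topic-word prior `eta` rather than β.

```python
from gensim import corpora
from gensim.models import LdaModel

# Tokenized abstracts, one list of words per document (already segmented,
# stop words removed); the toy corpus here is a placeholder.
tokenized_abstracts = [
    ["学术", "资源", "主题", "爬虫"],
    ["机器", "学习", "文本", "挖掘"],
]

dictionary = corpora.Dictionary(tokenized_abstracts)
bow_corpus = [dictionary.doc2bow(doc) for doc in tokenized_abstracts]

K = 100                      # number of topics (step two's parameter K)
lda = LdaModel(
    corpus=bow_corpus,
    id2word=dictionary,
    num_topics=K,
    alpha="symmetric",       # document-topic prior (the patent's alpha)
    eta=0.01,                # topic-word prior (gensim's name for beta)
    passes=10,
)

# Each "topic document" is the list of words most strongly associated with a topic.
for topic_id, words in lda.show_topics(num_topics=5, num_words=10, formatted=False):
    print(topic_id, [(w, round(p, 4)) for w, p in words])
```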
The first step comprises the following specific sub-steps:
(a) corpus collection: downloading abstracts of academic resources in a plurality of existing fields as training corpora;
(b) text preprocessing: abstract extraction, Chinese word segmentation and stop-word removal (a preprocessing sketch follows this list);
(c) classification into corpora: manually classifying the texts into categories by academic field and using each category as the training corpus for one of the LDA topics.
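A minimal preprocessing sketch for sub-step (b), assuming the jieba segmenter and a toy stop-word list as stand-ins; the detailed embodiment later in this description actually uses the open-source IK analyzer with a loaded stop-word dictionary.

```python
import jieba

# A small stop-word list for illustration; a real run would load a full
# stop-word dictionary, as the embodiment does with the IK analyzer.
STOP_WORDS = {"的", "了", "因为", "所以", "和", "是"}

def preprocess(abstract: str) -> list[str]:
    """Segment a Chinese abstract into words and drop stop words."""
    tokens = jieba.lcut(abstract)
    return [t for t in tokens if t.strip() and t not in STOP_WORDS]

print(preprocess("基于LDA主题模型的学术资源获取方法研究"))
```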
The third step comprises the following specific substeps:
(a) initial seed URLs: select good seed sites oriented to the specific topic;
(b) web page content extraction: download the page pointed to by the highest-priority URL and extract the required content and URL information from the HTML tags;
(c) topic relevance calculation: compute the topic relevance of the page content and decide whether to keep or discard the page;
(d) URL ranking: rank the URLs of unvisited pages by importance;
(e) repeat (b) to (d) until the queue of unvisited URLs is empty.
In sub-step (c), when the topic crawler performs topic relevance calculation and judgement for each crawled electronic document, the topic relevance of the crawled page is computed with a generalized vector space model, GVSSM, which combines the VSM and SSRM topic similarity algorithms, and the result determines whether the page is kept.
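The following schematic sketch, under assumed interfaces, shows how sub-steps (b) to (e) fit together: a similarity-ordered priority queue of URLs is maintained, off-topic pages below the threshold are discarded, and kept resources are stored. The fetch, extract_links, similarity and store callables and the threshold value are placeholders, not the patent's implementation.

```python
import heapq

THRESHOLD = 0.3   # assumed topic-similarity cut-off; the patent leaves the value to configuration

def crawl(seed_urls, topic_doc, fetch, extract_links, similarity, store):
    """Priority-driven crawl loop: fetch, score, keep on-topic pages, re-rank URLs.

    fetch, extract_links, similarity and store are injected callables standing in
    for the page downloader, HTML parser, GVSSM scorer and database writer.
    """
    # Max-heap via negated similarity; seeds start with the highest priority.
    queue = [(-1.0, url) for url in seed_urls]
    heapq.heapify(queue)
    visited = set()

    while queue:                                  # sub-step (e): until the queue is empty
        _, url = heapq.heappop(queue)
        if url in visited:
            continue
        visited.add(url)

        page = fetch(url)                         # sub-step (b): download the page
        score = similarity(page.text, topic_doc)  # sub-step (c): topic relevance
        if score < THRESHOLD:
            continue                              # discard off-topic pages and their links
        store(url, page, score)                   # label and save the academic resource

        for link, anchor_text in extract_links(page):
            if link not in visited:
                link_score = similarity(anchor_text, topic_doc)
                heapq.heappush(queue, (-link_score, link))   # sub-step (d): re-rank
```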
An academic topic is represented by a set of semantically related words together with weights indicating how related each word is to the topic, i.e. academic topic Z = {(w_1, p_1), (w_2, p_2), …, (w_n, p_n)}, where w_1, w_2, …, w_n are the words related to topic Z and p_1, p_2, …, p_n are their relevance values with respect to Z. Letting w_i be the i-th word associated with topic Z, 1 ≤ i ≤ n, the topic is written in LDA as Z = {(w_1, p(w_1|z_j)), (w_2, p(w_2|z_j)), …, (w_n, p(w_n|z_j))}, where the j-th academic topic is denoted Z_j and p(w_i|z_j) is the probability of word w_i being selected for topic Z_j.
The topic document generation process is the probability sampling process of the model and comprises the following specific sub-steps (a sampling sketch follows this list):
(a) for any document d in the corpus, generate the document length N ~ Poisson(ξ), i.e. the length N follows a Poisson distribution;
(b) for any document d in the corpus, generate a topic vector θ ~ Dirichlet(α), i.e. θ follows a Dirichlet distribution;
(c) generation of the i-th word w_i of document d: first generate a topic z_j ~ Multinomial(θ), i.e. z_j follows a multinomial distribution; then, for topic z_j, generate the word distribution φ_j ~ Dirichlet(β), i.e. φ_j follows a Dirichlet distribution; finally generate the word with the highest probability p(w_i | z_j, φ_j).
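The generative process (a) to (c) can be sketched with numpy as follows. This is a standard LDA sampling illustration (it samples each word from the topic's word distribution rather than taking only the single most probable word), and the vocabulary, topic count and hyperparameter values are made up.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ["topic", "model", "crawler", "url", "corpus", "paper"]
V, K = len(vocab), 3          # vocabulary size and number of topics
alpha, beta = 0.5, 0.01       # Dirichlet hyperparameters (step two's alpha and beta)

# (c) per-topic word distributions phi_j ~ Dirichlet(beta)
phi = rng.dirichlet([beta] * V, size=K)

def generate_document():
    N = rng.poisson(8)                       # (a) document length N ~ Poisson(xi)
    theta = rng.dirichlet([alpha] * K)       # (b) topic vector theta ~ Dirichlet(alpha)
    words = []
    for _ in range(N):
        z = rng.choice(K, p=theta)           # choose a topic from the multinomial theta
        w = rng.choice(V, p=phi[z])          # choose a word from topic z's distribution
        words.append(vocab[w])
    return words

print(generate_document())
```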
According to the invention, topic semantic information is deeply mined through the LDA topic model, a good guidance basis is built for the academic resource topic crawler, and the academic resources obtained by the crawler are in turn used to update the LDA topics. Machine learning is integrated into the academic resource acquisition method, improving the accuracy and quality of acquisition. By adopting distributed multi-threaded crawlers, academic resources of several topics are obtained simultaneously, improving both the speed and the quantity of the resources obtained.
The academic resource acquisition method based on LDA has the following characteristics:
(1) by means of the LDA topic model, topic semantic information is deeply mined, a good guidance foundation is constructed for topic crawlers of academic resources, machine learning is integrated into the academic resource acquisition method, and the quality and efficiency of academic resource acquisition are improved.
(2) The academic resources obtained by the topic crawler are used to update the LDA topics, so the topic model is refreshed periodically, follows the development trends of academia and supplies researchers with up-to-date resources in the related fields.
(3) In the topic similarity calculation module of the topic crawler, a method combining VSM and SSRM balances cosine similarity and semantic similarity, achieving a better topic matching effect.
(4) Abstracts of academic resources are used as the training corpus of the LDA topic model, which gives advantages in both the breadth and the granularity of topic extraction compared with other corpora.
(5) With the distributed crawler framework, academic resources of different topics can be captured in parallel, which compensates for the time cost of the hybrid similarity calculation.
Drawings
FIG. 1 is a schematic flow diagram of the overall process of the present invention for a single subject;
FIG. 2 is a schematic representation of the subject crawler framework of the present invention;
FIG. 3 is a schematic view of an LDA model;
FIG. 4 is a schematic diagram of a topic and a topic document;
FIG. 5 is a diagram of a corpus text before preprocessing;
FIG. 6 is a schematic diagram of a preprocessed corpus text;
FIG. 7 is a flow chart illustrating the overall method of the present invention for multiple subjects.
Related concept noun interpretation
Web crawlers: and automatically capturing programs or scripts of the internet information according to certain rules.
Subject crawler: refers to a web crawler that selectively crawls pages that are related to a predefined topic.
Word: the basic discrete unit of the data to be processed; in text processing, a word is an English word or a Chinese word with independent meaning.
Topic: a defined professional field or field of interest, such as aerospace, biomedicine or information technology; concretely, a set of related words.
Topic document: a collection of words that describe the topic, the words themselves being highly related to the topic, like the keywords used in search engine queries.
Topic determination module: a functional module of the topic crawler that determines the predefined topics; there are two common methods: manually specifying keywords, or having the program extract keywords from the initial set of web pages.
Distributed crawler: when crawling internet resource information, information on different topics on the current page can be acquired at the same time; the goal is that one crawl obtains network resources for several topics, improving efficiency and yield compared with a single-topic crawler.
LDA model training: the corpus is processed by the LDA statistical model, the probability that each word of an article belongs to each topic is calculated, and words are assigned to topics according to the probability values.
Corpora required for training: the text required for model training.
Detailed Description
The following describes the embodiments of the present invention in detail.
An LDA-based academic resource acquisition method, in which the academic resources are electronic documents of various kinds published on the internet, including but not limited to papers, periodicals, news and patent documents. A computer-run topic crawler is used together with a computer-run LDA topic model; the LDA model is shown in FIG. 3. A corpus is configured for the LDA topic model and used to train it, and the topic document for the crawler's current crawl is obtained by LDA computation, a topic document being a set of words associated with the topic, as shown in FIG. 4. In addition to the components of a common web crawler, the topic crawler comprises a topic determination module, a similarity calculation module and a URL prioritization module, as shown in FIG. 2. During crawling, the topic determination module determines the target topic and its topic document, and the topic document guides the calculation of topic similarity. The similarity calculation module computes the topic similarity of each anchor text on a crawled page in combination with the page content; hyperlinks whose combined similarity is below the set threshold are discarded, and the URLs of those above the threshold are selected. The crawler maintains a queue of unvisited URLs pointed to by hyperlinks of the visited pages, ordered by descending similarity; it visits the pages in queue order, crawls the corresponding academic resources, continually labels the crawled resources according to the topic document of the current crawl and stores them in a database, until the queue of unvisited URLs is empty. An open API of the academic resource database is provided for display and invocation.
The academic resources crawled in each round are used as new corpora for LDA topic model training, and the crawling process described above is repeated continuously, so that the associated words of each topic document and the crawled academic resources are continually supplemented and updated, and the precision and recall of the resources obtained for the target academic topic keep improving. The academic topic may be a single topic; the flow of the whole method for a single topic is shown in FIG. 1.
The academic topics may be a plurality of manually set topics: for each topic, related academic resources for its keywords are collected manually, according to knowledge and experience, from relevant websites on the internet, and the collected resources are used as the initial corpus for LDA topic model training. The topic crawlers are a plurality of distributed crawlers, one per academic topic, which together obtain the academic resources of all the topics at the same time.
The academic topics may also be a plurality of topics covering all disciplines produced by LDA topic model training: a classification number covering all academic fields is determined manually according to how fine-grained the classification needs to be and is used as the number of academic topics; a sufficient quantity of text resources is collected at random, according to the operator's knowledge and experience, from relevant websites on the internet as the initial corpus for training; the LDA topic model is trained to obtain, for that number of topics, topic documents covering all disciplines as classified by the model; the associated word list of each topic document is read and topic names are assigned manually according to knowledge and experience.
For the above two cases of multiple academic topics, the topic crawlers are a plurality of distributed crawlers, one per academic topic, which together obtain the academic resources of all the topics at the same time. The flow of the whole method for multiple topics is shown in FIG. 7.
For convenience of operation, abstracts of academic resources can be used as the training corpus. Topics and topic documents are obtained by LDA topic model computation, the topic documents guide the calculation of topic similarity during crawling, the crawled content is classified, labelled and stored in the database as new corpora for LDA training, and finally an open API of the academic resource database is provided for display and invocation. The specific steps are as follows:
step one, download and preprocess abstracts of academic resources in several existing fields, manually classify them into categories by academic field, and use each category as the training corpus for one of the LDA topics;
step two, input the LDA topic model parameters K, α and β, where K is the number of topics, α is the prior weight distribution of the topics before sampling and β is the prior distribution of each topic over words; training yields a number of topics and more finely subdivided topic documents, and each topic document is used to guide a crawler;
step three, starting from the selected high-quality seed URLs, each crawler maintains a crawl URL queue, continuously computes the similarity between the topic and both the page text and the text pointed to by anchor-text links in the page, updates the queue by similarity ranking and captures the page content most relevant to the topic;
step four, after the academic resources acquired by the topic crawler are marked with the corresponding topic labels, they are stored in the database and used as new corpora for LDA training to update the topic documents;
and step five, provide an open API of the academic resource database for display and invocation.
The first step comprises the following specific sub-steps:
(a) corpus collection: downloading abstracts of academic resources in a plurality of existing fields as training corpora;
(b) text preprocessing: abstract extraction, Chinese word segmentation and stop-word removal;
(c) classification into corpora: manually classifying the texts into categories by academic field and using each category as the training corpus for one of the LDA topics.
The third step comprises the following specific sub-steps:
(a) initial seed URLs: select good seed sites oriented to the specific topic;
(b) web page content extraction: download the page pointed to by the highest-priority URL and extract the required content and URL information from the HTML tags;
(c) topic relevance analysis and judgement to decide whether to keep or discard the page; the invention mainly combines the existing VSM and SSRM techniques to calculate topic relevance;
(d) URL ranking: rank the URLs of unvisited pages by importance;
(e) repeat (b) to (d) until the queue of unvisited URLs is empty.
In sub-step (c), when the topic crawler analyses and judges topic relevance for each crawled electronic document, the generalized vector space model GVSSM, which combines the VSM and SSRM topic similarity algorithms, is used to compute the topic relevance of the crawled page and decide whether to keep it.
An academic topic is composed of a set of semantically related words together with weights indicating how related each word is to the topic, i.e. academic topic Z = {(w_1, p_1), (w_2, p_2), …, (w_n, p_n)}, where w_1, w_2, …, w_n are the words related to topic Z and p_1, p_2, …, p_n are their relevance values with respect to Z. Letting w_i be the i-th word associated with topic Z, 1 ≤ i ≤ n, the topic is written in LDA as Z = {(w_1, p(w_1|z_j)), (w_2, p(w_2|z_j)), …, (w_n, p(w_n|z_j))}, where the j-th academic topic is denoted Z_j and p(w_i|z_j) is the probability of word w_i being selected for topic Z_j.
The topic document generation process is a probability sampling process of a model and comprises the following specific sub-steps:
(a) for any document d in the corpus, generate the document length N ~ Poisson(ξ), i.e. the length N follows a Poisson distribution;
(b) for any document d in the corpus, generate a topic vector θ ~ Dirichlet(α), i.e. θ follows a Dirichlet distribution;
(c) generation of the i-th word w_i of document d: first generate a topic z_j ~ Multinomial(θ), i.e. z_j follows a multinomial distribution; then, for topic z_j, generate the word distribution φ_j ~ Dirichlet(β), i.e. φ_j follows a Dirichlet distribution; finally generate the word with the highest probability p(w_i | z_j, φ_j). The LDA model is shown in FIG. 3.
Here α represents the prior weight distribution of the topics before sampling, and β represents the prior distribution of each topic over words.
The distributions obeyed by the variables in the LDA model are as follows: θ ~ Dirichlet(α), φ ~ Dirichlet(β), z ~ Multinomial(θ), w ~ Multinomial(φ_z).
By integrating out the latent variables, the whole model can be expressed through the distribution P(w | z). Here w denotes the words, which are observable; z is the topic variable and is the target output of the model; α and β are the initial parameters of the model. Integrating out φ gives
P(w | z) = ( Γ(Nβ) / Γ(β)^N )^K · ∏_j [ ∏_w Γ(n_j^(w) + β) ] / Γ(n_j^(·) + Nβ),
where N is the length of the word list, w is a word, n_j^(w) is the number of times the feature word w is assigned to topic j, and n_j^(·) is the total number of feature words assigned to topic j. For θ ~ Dirichlet(α), integrating out θ gives
P(z) = ( Γ(Kα) / Γ(α)^K )^M · ∏_d [ ∏_j Γ(n_j^(d) + α) ] / Γ(n_·^(d) + Kα),
where M is the number of texts, n_j^(d) is the number of feature words in text d assigned to topic j, and n_·^(d) is the number of all feature words in text d assigned to any topic.
From the above it can be seen that the three variables that mainly influence LDA modelling are α, β and the number of topics K. To select a good topic number, the values of α and β are fixed first, and the change in the value of the expression obtained after integrating out the other variables is then examined.
When the LDA model is used to model the topics of a text set, the topic number K has a great influence on how well the model fits the text set, so the topic number must be set in advance. In this method the optimal topic number is determined by measuring the classification effect under different topic numbers and comparing it with the topic number at which the perplexity value indicates the best fit; on the one hand this yields a more intuitive and accurate optimal topic number, and on the other hand the gap between the classification effect corresponding to the perplexity-optimal topic number and the actual result can be observed. The perplexity is
perplexity(D) = exp( − Σ_m log P(d_m) / Σ_m N_m ),
where M is the number of texts in the text set, N_m is the length of the m-th text, and P(d_m), the probability of the LDA model generating the m-th text, is given by
P(d_m) = ∏_{n=1}^{N_m} Σ_z p(w_n | z) · p(z | d_m).
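A hedged sketch of perplexity-based selection of the topic number K, assuming gensim's LdaModel (whose log_perplexity reports a per-word bound with perplexity = 2^(-bound)). The patent additionally compares the perplexity-optimal K against the measured classification effect, which this sketch does not reproduce.

```python
import numpy as np
from gensim.models import LdaModel

def best_topic_number(bow_corpus, dictionary, candidates=(20, 50, 100, 150)):
    """Train one model per candidate K and keep the K with the lowest perplexity."""
    scores = {}
    for K in candidates:
        lda = LdaModel(corpus=bow_corpus, id2word=dictionary,
                       num_topics=K, alpha="symmetric", eta=0.01, passes=5)
        bound = lda.log_perplexity(bow_corpus)   # per-word likelihood bound
        scores[K] = np.exp2(-bound)              # gensim reports perplexity as 2**(-bound)
    return min(scores, key=scores.get), scores
```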
the subject crawler of the invention is additionally provided with three modules on the basis of the general crawler: the system comprises a theme determining module, a similarity calculating module and a URL priority ordering module, so that filtering and theme matching of a crawled page are completed, and finally, contents highly related to the theme are obtained.
1. Topic determination module: before the topic crawler works, its set of related topic words must be determined, i.e. a topic document must be established. The topic word set is usually determined in one of two ways: manually, or by extraction from an initial page set. Manual determination introduces subjectivity into the choice of keywords, while keywords extracted from the initial pages are noisy and have low coverage. The number of topic words is used as the dimension of the topic vector, and the corresponding weights are the component values of the topic vector. The topic word set vector is written K = {k1, k2, …, kn}, where n is the number of topic words.
2. Similarity calculation module: to ensure that the pages acquired by the crawler are as close to the topic as possible, pages must be filtered and those with low topic relevance (below the set threshold) removed, so that the links inside them are not processed in the next crawl. If the topic relevance of a page is very low, the page probably contains only a few incidental keywords and its topic has little to do with the specified topic, so processing its links is of little value; this is the fundamental difference between a topic crawler and an ordinary crawler. An ordinary crawler processes all links up to the configured search depth and returns a large number of useless pages, which further increases the workload. Using the whole text for similarity comparison is clearly infeasible; the text generally has to be refined and converted into a data structure suitable for comparison and calculation while still reflecting its topic as much as possible. The feature representation typically adopted by topic crawlers is VSM, together with the TF-IDF algorithm. This method also uses semantic similarity calculation based on HowNet, obtaining the similarity between a whole article and the topic by computing the similarity between the words of the document and the words of the topic document.
3. URL prioritization module: this module screens the unvisited URLs for potential pages with high topic similarity and ranks them by similarity; the higher the similarity, the higher the priority and the earlier the page is visited, which guarantees the high topic relevance of the visited pages. When ranking unvisited URLs, the similarity of the page containing the URL and the similarity of the URL's anchor text (the text describing the URL) can be combined as factors in the priority ranking.
The invention uses the definitions of the semantic information of each word in HowNet to calculate the similarity between words. In HowNet, for two words W_1 and W_2, let W_1 have n concepts c_11, c_12, …, c_1n and W_2 have m concepts c_21, c_22, …, c_2m. The similarity of W_1 and W_2 is the maximum of the similarities of each concept pair:
Sim(W_1, W_2) = max_{i,j} Sim(c_1i, c_2j), where 1 ≤ i ≤ n and 1 ≤ j ≤ m.
The similarity between two words is thus converted into a similarity calculation between concepts, and because every concept in HowNet is ultimately expressed in terms of sememes, the similarity between concepts can in turn be reduced to the similarity between the corresponding sememes. Suppose concept c_1 has p sememes s_11, s_12, …, s_1p and concept c_2 has q sememes s_21, s_22, …, s_2q. The similarity of c_1 and c_2 is the maximum of the similarities of each sememe pair:
Sim(c_1, c_2) = max_{i,j} Sim(s_1i, s_2j), where 1 ≤ i ≤ p and 1 ≤ j ≤ q.
All concepts in HowNet are ultimately expressed in terms of sememes, so the calculation of concept similarity can be reduced to the calculation of similarity between the corresponding sememes. Because all sememes form a tree-like sememe hierarchy based on hypernym-hyponym relations, sememe similarity can be computed from the semantic distance of the sememes in the hierarchy, which then yields the concept similarity [27]. Let the path distance of two sememes s_1 and s_2 in the sememe hierarchy be Dis(s_1, s_2); the sememe similarity is then
Sim(s_1, s_2) = α / (Dis(s_1, s_2) + α),
where Dis(s_1, s_2) is the path length between s_1 and s_2 in the sememe hierarchy, a positive integer, and α is an adjustable parameter.
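A toy illustration of the HowNet-style computation above: word similarity is the maximum similarity over concept pairs, reduced here to sememe similarity α / (Dis + α) over a small hand-made sememe tree. The hierarchy, the α value and the single-sememe concepts are assumptions for illustration, not HowNet data.

```python
ALPHA = 1.6   # adjustable scaling parameter in Sim(s1, s2) = ALPHA / (Dis + ALPHA); value assumed

# Toy sememe hierarchy (child -> parent) standing in for HowNet's sememe tree.
PARENT = {"laptop": "computer", "desktop": "computer", "computer": "machine",
          "machine": "entity", "entity": None}

def ancestor_chain(s):
    chain = []
    while s is not None:
        chain.append(s)
        s = PARENT[s]
    return chain

def sememe_distance(s1, s2):
    """Path length between two sememes in the hierarchy, via their common ancestor."""
    c1, c2 = ancestor_chain(s1), ancestor_chain(s2)
    common = next(x for x in c1 if x in c2)
    return c1.index(common) + c2.index(common)

def sememe_similarity(s1, s2):
    return ALPHA / (sememe_distance(s1, s2) + ALPHA)

# A word maps to one or more concepts, each described here by a single sememe;
# word similarity is the maximum over all concept pairs, as in the formulas above.
def word_similarity(concepts1, concepts2):
    return max(sememe_similarity(a, b) for a in concepts1 for b in concepts2)

print(word_similarity(["laptop"], ["desktop"]))   # 1.6 / (2 + 1.6)
```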
The topic crawler of the invention is designed on the basis of an ordinary crawler with extended functions. The overall processing of a web page comprises initial seed URL determination, web page content extraction, topic relevance analysis and URL ranking.
(a) Initial seed URLs: good seed sites oriented to the specific topic are selected so that the topic crawler can start crawling smoothly.
(b) Web page content extraction: the page pointed to by the highest-priority URL is downloaded, and the required content and URL information are extracted from the HTML tags.
(c) Topic relevance analysis is the core module of the topic crawler and determines whether a page is kept. The invention mainly uses the generalized vector space model GVSSM, which combines the existing VSM and SSRM techniques, to calculate topic relevance.
For topic relevance analysis, TF-IDF is used to extract the text keywords and calculate the word weights, after which the relevance analysis of the page is carried out.
TF-IDF weight calculation:
w_di = tf_i × idf_i, with tf_i = f_i / f_max and idf_i = log(N / n_i),
where w_di is the weight of word i in document d, tf_i is the term frequency of word i, idf_i is the inverse document frequency of word i, f_i is the number of times word i appears in document d, f_max is the highest occurrence count of any word in document d, N is the total number of documents and n_i is the number of documents containing word i. TF-IDF is still the most effective method for extracting keywords and calculating word weights.
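A small sketch of the TF-IDF weighting just described, using the normalized term frequency and log inverse document frequency defined above; the toy documents are placeholders.

```python
import math
from collections import Counter

def tf_idf_weights(doc_tokens, all_docs):
    """w_di = tf_i * idf_i with tf_i = f_i / f_max and idf_i = log(N / n_i)."""
    counts = Counter(doc_tokens)
    f_max = max(counts.values())
    N = len(all_docs)
    weights = {}
    for word, f_i in counts.items():
        n_i = sum(1 for d in all_docs if word in d)   # documents containing word i
        weights[word] = (f_i / f_max) * math.log(N / n_i)
    return weights

docs = [["lda", "topic", "model", "topic"], ["crawler", "url", "topic"]]
print(tf_idf_weights(docs[0], docs))
```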
VSM topic relevance calculation:
Sim_VSM(d, t) = ( Σ_{i=1}^{n} w_di · w_ti ) / ( sqrt(Σ_i w_di²) · sqrt(Σ_i w_ti²) ),
where the two vectors are the word vectors of document d and topic t, w_di and w_ti are the TF-IDF values of word i in document d and in topic t respectively, and n is the number of words common to document d and topic t. This algorithm uses only the frequency vector of the words shared by document and topic as the similarity judgement and ignores semantic relations between words, such as near-synonyms and synonyms, which affects the accuracy of the similarity.
SSRM topic relevance calculation:
Sim_SSRM(d, t) = ( Σ_{i=1}^{n} Σ_{j=1}^{m} w_di · w_tj · Sem_ij ) / ( Σ_{i=1}^{n} Σ_{j=1}^{m} w_di · w_tj ),
where w_di and w_tj are the TF-IDF values of word i in document d and word j in topic t, n and m are the numbers of words in document d and topic t respectively, and Sem_ij is the semantic similarity of word i and word j.
The word-to-word similarity is computed from concept similarity,
Sem(C_1, C_2) = 2·Depth(C_3) / ( Path(C_1, C_3) + Path(C_2, C_3) + 2·Depth(C_3) ),
where C_1 and C_2 are two concepts, corresponding to word w_1 and word w_2, Sem(C_1, C_2) is the semantic similarity of C_1 and C_2, C_3 is the lowest common concept shared by C_1 and C_2, Path(C_1, C_3) is the number of nodes on the path from C_1 to C_3, Path(C_2, C_3) is the number of nodes on the path from C_2 to C_3, and Depth(C_3) is the number of nodes on the path from C_3 to the root node. The SSRM algorithm considers only semantic relations: if the words of two articles are all near-synonyms or synonyms of each other, the computed document similarity is 1, i.e. the documents are judged completely identical, which is clearly a shortcoming.
The invention calculates similarity with a method combining VSM and SSRM, called the generalized vector space model, abbreviated GVSSM. The combined score Sim(d_k, T) is the topic similarity between document d_k and topic T; by taking into account both the word-frequency factor of the document and the semantic relations between words, the combination of VSM and SSRM effectively improves the accuracy of topic similarity calculation.
(d) The URLs of unvisited web pages are ranked by importance using the following priority:
priority(h) = λ · (1/N) · Σ_p Sim(f_p, T) + (1 − λ) · Sim(a_h, T),
where priority(h) is the priority value of the unvisited hyperlink h, N is the number of crawled pages containing h and the sum runs over those pages p, Sim(f_p, T) is the topic similarity of the full text of page p (which contains hyperlink h), Sim(a_h, T) is the topic similarity of the anchor text of h, and λ is the weight balancing full text against anchor text. The similarity calculations in the formula also use the combined VSM and SSRM method, which optimizes the priority order of the queue of uncrawled URLs and further improves the accuracy of acquiring topical academic resources.
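A hedged sketch of the URL priority just described. The exact combination in the patent's formula is not reproduced here; this version averages over the crawled pages containing the link, with λ weighting full-text against anchor-text similarity, and the default λ value is an assumption.

```python
def url_priority(pages_with_link, anchor_text, topic_doc, similarity, lam=0.7):
    """Priority of an unvisited hyperlink h, averaged over the crawled pages containing it.

    `similarity` is the GVSSM scorer; `lam` (the patent's lambda) weights the full text
    of each page against the anchor text of h; the 0.7 default is an assumed value.
    """
    if not pages_with_link:
        return 0.0
    anchor_score = similarity(anchor_text, topic_doc)
    total = sum(lam * similarity(page_text, topic_doc) + (1 - lam) * anchor_score
                for page_text in pages_with_link)
    return total / len(pages_with_link)
```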
Compared with an ordinary web crawler, the topic crawler aims to capture page information related to specific topic content: whether a page is captured is decided by calculating the relevance between the page and the topic, a queue of URLs to be crawled is maintained, and pages are visited according to URL priority so that highly relevant pages are visited first.
Current topic crawlers have several drawbacks. (1) A topic crawler must determine its set of related topic words before it runs. The set is usually determined in one of two ways: manually, or by analysis of an initial page set. The manual method is somewhat subjective, and the method of extracting keywords from initial pages generally falls short in topic coverage; both traditional methods introduce some deviation when the crawler computes page topic similarity. (2) The core of current text-heuristic topic crawlers is page similarity calculation, which judges whether the page currently being crawled is close to the topic. Apart from the accuracy of the topic determination module, the most important element is the similarity algorithm. A VSM (vector space model) is usually adopted: under the assumption that different words are unrelated, the text is represented as word vectors and the similarity between documents is computed from shared word frequencies. This algorithm ignores the semantic relations between words and lowers the similarity values of semantically related articles.
The topic crawler of the invention is designed on the basis of a general crawler with three added core modules: a topic determination module, a topic similarity calculation module and a module for ordering the URLs to be crawled. To overcome the above drawbacks, the invention provides a topic crawler based on the LDA topic model, improves the topic similarity algorithm and the URL prioritization algorithm, and improves the content quality and accuracy of the crawler from the initial crawl onwards. The main contributions are: (1) with the LDA topic model, the topic semantic information of the corpus is deeply mined, building a good guidance basis for the topic crawler; machine learning is integrated into the resource acquisition method, improving the accuracy and quality of resource acquisition; (2) in the topic similarity calculation module, a HowNet-based semantic similarity method is used to balance cosine similarity and semantic similarity, achieving a better topic matching effect.
An example of an application is illustrated below:
as an application in a library resource recommendation service, profession-related resources (papers, patents, blogs, news, etc.) captured from the internet need to be pushed to researchers (teachers and students) of different schools, and resources for the corresponding professions must be pushed to the researchers of many schools at the same time. All academic fields are divided into 100 fields in advance, i.e. 100 topics are defined; the topic crawlers are 100 distributed crawlers, one per academic topic, and together they obtain the academic resources of the 100 topics simultaneously. The flow of the whole method is shown in FIG. 7.
The first step: a web crawler is used to randomly collect 60,000 news texts from sites such as the science network, the Chinese science and technology network and the Chinese news network to serve as the LDA training corpus.
The second step: the texts in the corpus are preprocessed, including word segmentation (splitting the whole document into words, the minimum processing unit), stop-word removal (filtering out words unrelated to the article content, such as connecting words and modal particles, e.g. "的" and "因为") and conversion into the input format required by the topic model. The open-source IK analyzer is mainly used, with a stop-word dictionary loaded. For one article, the text before preprocessing is shown in FIG. 5 and after preprocessing in FIG. 6.
The third step: all academic fields are divided into 100 fields in advance, i.e. 100 topics are defined; the 60,000 collected text resources are used as the initial corpus for LDA topic model training; after training, 100 academic topic documents classified by the LDA topic model are obtained, the associated word list of each topic document is read, and topic names are assigned manually according to knowledge and experience. The topic documents are shown in FIG. 4.
The fourth step: the topics obtained through LDA training are used to guide the crawler's topic judgement, i.e. the similarity judgement of web pages. A web page contains information on several topics, and the goal of the topic crawler is to capture the required information according to the predefined topics while crawling. For example, a news page may contain, in the source code behind several news titles, hyperlinks to the original news articles; topic similarity is judged from the page content and the titles, using the similarity calculation described above, and the information with high similarity is the required information and is placed in the crawl queue.
The fifth step: the crawl queue is sorted by topic similarity from high to low, pages with high similarity are crawled and visited first, and the crawled content is given the corresponding topic labels and stored in the database. Each topic crawler maintains its own queue, and the corresponding topic name is stored in the database together with the labelled content. Finally, the newly added data serve as new LDA corpora and are made available to the recommendation and classification system.
In practice, the topic word library obtained by LDA training is clearly superior, in yield and precision during actual crawling, to methods that extract topic words from keywords and pages. On this basis, expanding the keyword and page topic words with the LDA topic word library performs better than the original single topic-word determination scheme, demonstrating the feasibility and effectiveness of the LDA-expanded word library. Compared with traditional VSM text similarity, the HowNet-based similarity calculation raises the similarity values between documents through word semantics and performs noticeably better in acquisition rate and precision. Applying the two techniques to the topic crawler gives good results, and combining them with the specific application greatly improves the quality and efficiency of acquiring information on a specific topic from massive resources.

Claims (10)

1. An LDA-based academic resource acquisition method, wherein the academic resources are electronic documents published on the internet and a computer-run topic crawler is used to obtain, from the internet, the electronic documents belonging to a target academic topic, characterized in that a computer-run LDA topic model is used: a corpus is configured for the LDA topic model and used to train it, and the topic document for the crawler's current crawl is obtained by LDA topic model computation, the topic document being a set of words associated with the topic; in addition to the components of a common web crawler, the topic crawler further comprises a topic determination module, a similarity calculation module and a URL prioritization module; during crawling, the topic determination module determines the target topic and its topic document, and the topic document guides the calculation of topic similarity; the similarity calculation module computes and judges the topic similarity of each anchor text on a crawled page in combination with the page content, hyperlinks whose combined similarity is below a set threshold are discarded, and the URLs of those above the threshold are selected; the topic crawler maintains a queue of unvisited URLs pointed to by hyperlinks of the visited pages, ordered by descending similarity, visits the pages in queue order, crawls the corresponding academic resources, continually attaches classification labels to the crawled resources according to the topic document of the current crawl and stores them in a database, until the queue of unvisited URLs is empty; and an open API of the academic resource database is provided for display and invocation.
2. The academic resource acquisition method according to claim 1, wherein the academic resources crawled each time by the topic crawler are used as new corpora for LDA topic model training, and the topic crawler crawling process of claim 1 is repeated continuously; in this way the associated words gathered in each topic document are continuously supplemented and updated, the crawled academic resources are continuously supplemented and updated, and the precision and recall of the academic resources obtained for the target academic topic keep improving.
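As one way the iterative re-training of claim 2 might look in practice (the patent names no library), gensim's LdaModel.update can fold newly crawled abstracts into an existing model; words outside the original vocabulary are silently dropped by doc2bow.

from gensim import corpora
from gensim.models import LdaModel

def update_lda_with_new_abstracts(lda, dictionary, new_abstract_tokens):
    """Fold newly crawled abstracts (lists of tokens) into an existing LDA model."""
    new_corpus = [dictionary.doc2bow(toks) for toks in new_abstract_tokens]
    lda.update(new_corpus)  # incremental (online) re-training
    return lda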
3. The academic resource acquisition method of claim 1, which simultaneously acquires related academic resources from the Internet for a plurality of academic resource demanders who each pay attention to different academic topics, wherein the academic topics are a plurality of manually defined academic topics; keywords of each academic topic are given manually according to knowledge and experience, related academic resources are collected from related websites on the Internet, and the collected related academic resources are used as the initial corpus for training the LDA topic model; the topic crawlers are a plurality of distributed crawlers allocated according to the number of academic topics, each distributed crawler corresponds to one academic topic, and the distributed crawlers together obtain academic resources of the plurality of academic topics simultaneously.
4. The academic resource acquisition method of claim 1, wherein the academic topics are a plurality of academic topics covering all disciplines trained by the LDA topic model; a classification number for all academic fields is determined manually according to the desired granularity of classification of the academic fields, and this classification number is used as the number of academic topics; a sufficient quantity of text resources is collected at random from related websites on the Internet, according to the knowledge and experience of the operator, as the initial corpus for LDA topic model training; LDA topic model training is performed to obtain topic documents covering the plurality of academic topics corresponding to the academic topic number of all the classified disciplines, the associated words of each topic document are read, and topic names are assigned manually according to knowledge and experience; the topic crawlers are a plurality of distributed crawlers allocated according to the number of academic topics, each distributed crawler corresponds to one academic topic, and the distributed crawlers together obtain academic resources of the plurality of academic topics simultaneously.
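Claims 3 and 4 deploy one crawler per academic topic. As a local stand-in for such a distributed deployment (the patent does not prescribe the mechanism), the sketch below starts one process per topic with Python's multiprocessing, assuming a crawl entry point like the topic_crawl sketch after claim 1.

from multiprocessing import Process

def launch_topic_crawlers(topic_docs, seeds_by_topic, crawl_fn):
    """Start one crawler process per academic topic.

    topic_docs: dict {topic_name: topic_document}
    seeds_by_topic: dict {topic_name: [seed URLs]}
    crawl_fn: callable(topic_name, topic_document, seed_urls)
    """
    workers = []
    for name, doc in topic_docs.items():
        p = Process(target=crawl_fn, args=(name, doc, seeds_by_topic[name]))
        p.start()
        workers.append(p)
    for p in workers:
        p.join()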
5. The academic resource acquisition method of any one of claims 1 to 4, wherein the electronic documents published on the Internet comprise papers, periodicals, news and patent documents, characterized in that abstracts of academic resources are used as the training corpus, topics and topic documents are obtained through LDA topic model computation, the topic documents guide the calculation of topic similarity during the crawling of the topic crawler, the crawled content is then classified, labeled and stored in the database as new corpora for the LDA training model, and finally an open API of the academic resource database is provided for display and calling; the method comprises the following specific steps:
step one, downloading and preprocessing abstracts of academic resources in a plurality of existing fields, manually classifying the abstracts into different categories according to academic field, and using each category as the training corpus of one of the LDA topics;
step two, inputting the LDA topic model parameters K, alpha and beta, where K is the number of topics, alpha is the prior weight distribution over topics before sampling, and beta is the prior distribution of each topic over words; training yields a plurality of topics and more finely subdivided topic documents, each topic document being used to guide a crawler (a training sketch using gensim follows these steps);
step three, starting from selected high-quality seed URLs, each crawler maintains a hyperlink queue of unvisited webpages, continuously calculates the similarity between the topic and both the page text and the text of the anchor links in the page, orders and updates the URL queue according to similarity, and captures the webpage content most relevant to the topic;
step four, after the academic resources acquired by the topic crawler are marked with the corresponding topic labels, they are stored in the database and serve as new corpora for LDA (Latent Dirichlet Allocation) training, used to update the topic documents;
and step five, providing an open API of the academic resource database for display and calling.
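A training sketch for step two, assuming gensim (which the patent does not name); texts is a list of tokenized abstracts, and gensim exposes the beta prior under the parameter name eta.

from gensim import corpora
from gensim.models import LdaModel

def train_lda(texts, num_topics=20, alpha="symmetric", beta=0.01):
    """Train an LDA topic model on tokenized abstracts.

    texts: list of token lists, e.g. [["topic", "crawler", ...], ...]
    num_topics: K, the number of topics.
    alpha: document-topic prior; beta: topic-word prior (gensim's eta).
    """
    dictionary = corpora.Dictionary(texts)
    corpus = [dictionary.doc2bow(t) for t in texts]
    lda = LdaModel(corpus=corpus, id2word=dictionary,
                   num_topics=num_topics, alpha=alpha, eta=beta,
                   passes=10, random_state=42)
    return lda, dictionary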
6. The academic resource acquisition method of claim 5, wherein the step one comprises the following specific sub-steps:
(a) corpus collection: downloading abstracts of academic resources in a plurality of existing fields as training corpora;
(b) text preprocessing: abstract extraction, Chinese word segmentation and stop-word removal (a preprocessing sketch follows these sub-steps);
(c) corpus classification: manually classifying the texts into different categories according to academic field, each category serving as the training corpus of one of the LDA topics.
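A preprocessing sketch for the sub-steps above, assuming jieba for Chinese word segmentation and a caller-supplied stop-word set; the patent does not prescribe a particular segmenter.

import jieba

def preprocess_abstract(abstract: str, stopwords: set) -> list:
    """Segment a Chinese abstract and drop stop words and single characters."""
    tokens = jieba.lcut(abstract)
    return [t for t in tokens
            if t.strip() and t not in stopwords and len(t) > 1]

def build_corpus_by_field(abstracts_by_field: dict, stopwords: set) -> dict:
    """Group preprocessed abstracts by academic field for per-topic training."""
    return {field: [preprocess_abstract(a, stopwords) for a in abstracts]
            for field, abstracts in abstracts_by_field.items()}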
7. The academic resource acquisition method according to claim 5, wherein the third step comprises the following specific sub-steps:
(a) selecting high-quality seed sites oriented to the specific topic as the initial seed URLs;
(b) extracting webpage content: downloading the page pointed to by the highest-priority URL, and extracting the required content and URL information according to the HTML tags;
(c) calculating the topic relevance of the webpage content and deciding whether the webpage is kept or discarded;
(d) ranking the unvisited webpage URLs by importance;
(e) repeating processes (b) to (d) continuously until the queue of unvisited URLs is empty.
8. The academic resource acquisition method of claim 7, wherein in sub-step (c), when each electronic document is crawled for topic relevance analysis and judgment, the topic crawler calculates the topic relevance of the crawled page using a generalized vector space model (GVSM) that combines the two topic similarity calculation algorithms VSM and SSRM, and decides whether to keep or discard the page.
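The claim does not give the combination formula. As one plausible reading, the sketch below blends a standard VSM cosine similarity with a semantic term-matching score standing in for SSRM, using a caller-supplied word-similarity function; the blend weight lam is an assumption.

import math
from collections import Counter

def cosine_vsm(tokens_a, tokens_b):
    """Classic VSM cosine similarity over raw term frequencies."""
    va, vb = Counter(tokens_a), Counter(tokens_b)
    common = set(va) & set(vb)
    dot = sum(va[w] * vb[w] for w in common)
    na = math.sqrt(sum(c * c for c in va.values()))
    nb = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

def semantic_match(tokens_a, tokens_b, word_sim):
    """Average best semantic match of each word in A against B (SSRM-like)."""
    if not tokens_a or not tokens_b:
        return 0.0
    return sum(max(word_sim(wa, wb) for wb in tokens_b)
               for wa in tokens_a) / len(tokens_a)

def combined_similarity(tokens_a, tokens_b, word_sim, lam=0.5):
    """Blend of surface (VSM) and semantic (SSRM-like) similarity."""
    return lam * cosine_vsm(tokens_a, tokens_b) + \
        (1 - lam) * semantic_match(tokens_a, tokens_b, word_sim)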
9. The academic resource acquisition method of claim 1, wherein an academic topic is represented by a set of semantically related words together with weights indicating how strongly each word is related to the academic topic, namely academic topic Z = {(w1, p1), (w2, p2), …, (wi, pi), …, (wn, pn)}, where w1, w2, …, wn are the words related to academic topic Z and p1, p2, …, pn are the correlation values of the words w1, w2, …, wn with academic topic Z; let wi be the i-th word associated with academic topic Z, 1 ≤ i ≤ n; in LDA this is expressed as academic topic Z = {(w1, p(w1|zj)), (w2, p(w2|zj)), …, (wn, p(wn|zj))}, where any j-th academic topic is denoted zj and p(wi|zj) denotes the probability that word wi is selected for academic topic zj.
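With a gensim model as in the earlier sketches, the (word, p(w|zj)) pairs of claim 9 can be read directly from the trained model; lda is assumed to be the model returned by the train_lda sketch above.

def topic_as_word_prob_pairs(lda, topic_id, top_n=30):
    """Return an academic topic Z as [(w1, p(w1|z)), ..., (wn, p(wn|z))]."""
    return lda.show_topic(topic_id, topn=top_n)  # list of (word, probability)

def all_topics(lda, top_n=30):
    """Map every topic id to its (word, probability) list."""
    return {k: topic_as_word_prob_pairs(lda, k, top_n)
            for k in range(lda.num_topics)}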
10. The academic resource acquisition method of claim 1, wherein the topic document generation process is the probabilistic sampling process of the model and comprises the following specific sub-steps (a numerical sketch of this sampling process follows the sub-steps):
(a) for any document d in the corpus, generating the document length N, N ~ Poisson(), i.e., N obeys a Poisson distribution;
(b) for any document d in the corpus, generating a topic vector theta, theta ~ Dirichlet(alpha), i.e., theta obeys a Dirichlet distribution;
(c) generation of the i-th word wi in document d: first, a topic zj is generated, zj ~ Multinomial(theta), i.e., zj obeys a multinomial distribution; then, for topic zj, a discrete word distribution phi_zj is generated, phi_zj ~ Dirichlet(beta), i.e., phi_zj obeys a Dirichlet distribution; finally, the word with the highest probability under phi_zj is generated.
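A numerical sketch of the sampling process of claim 10 using NumPy; the vocabulary, alpha, beta and the Poisson rate are toy assumptions, and phi_z denotes the per-topic word distribution named in sub-step (c).

import numpy as np

def generate_document(vocab, alpha, beta, poisson_rate=50, rng=None):
    """Simulate claim 10: N ~ Poisson, theta ~ Dirichlet(alpha),
    z ~ Multinomial(theta), phi_z ~ Dirichlet(beta), then emit words."""
    rng = rng or np.random.default_rng()
    K, V = len(alpha), len(vocab)
    n_words = rng.poisson(poisson_rate)           # (a) document length N
    theta = rng.dirichlet(alpha)                  # (b) topic vector of d
    phi = rng.dirichlet([beta] * V, size=K)       # per-topic word distributions
    words = []
    for _ in range(n_words):
        z = rng.choice(K, p=theta)                # (c) sample a topic
        words.append(vocab[int(np.argmax(phi[z]))])  # highest-probability word under phi_z
    return words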
CN201611128684.9A 2016-12-09 2016-12-09 A kind of academic resources acquisition methods based on LDA Pending CN106777043A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611128684.9A CN106777043A (en) 2016-12-09 2016-12-09 A kind of academic resources acquisition methods based on LDA

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201611128684.9A CN106777043A (en) 2016-12-09 2016-12-09 A kind of academic resources acquisition methods based on LDA

Publications (1)

Publication Number Publication Date
CN106777043A true CN106777043A (en) 2017-05-31

Family

ID=58875659

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611128684.9A Pending CN106777043A (en) 2016-12-09 2016-12-09 A kind of academic resources acquisition methods based on LDA

Country Status (1)

Country Link
CN (1) CN106777043A (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102662954A (en) * 2012-03-02 2012-09-12 杭州电子科技大学 Method for implementing topical crawler system based on learning URL string information
CN103475688A (en) * 2013-05-24 2013-12-25 北京网秦天下科技有限公司 Distributed method and distributed system for downloading website data
CN105550365A (en) * 2016-01-15 2016-05-04 中国科学院自动化研究所 Visualization analysis system based on text topic model

Cited By (33)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108415941A (en) * 2018-01-29 2018-08-17 湖北省楚天云有限公司 A kind of spiders method, apparatus and electronic equipment
CN108596801A (en) * 2018-04-18 2018-09-28 蔡丽煌 A kind of simulation training system based on management big data
CN108595593A (en) * 2018-04-19 2018-09-28 南京大学 Meeting research hotspot based on topic model and development trend information analysis method
CN108595593B (en) * 2018-04-19 2021-11-23 南京大学 Topic model-based conference research hotspot and development trend information analysis method
CN109214435A (en) * 2018-08-21 2019-01-15 北京睦合达信息技术股份有限公司 A kind of data classification method and device
CN109492092B (en) * 2018-09-29 2020-07-17 北京智通云联科技有限公司 Document classification method and system based on L DA topic model
CN109492092A (en) * 2018-09-29 2019-03-19 北明智通(北京)科技有限公司 Document classification method and system based on LDA topic model
CN109446324A (en) * 2018-10-16 2019-03-08 北京字节跳动网络技术有限公司 Processing method, device, storage medium and the electronic equipment of sample data
CN109446324B (en) * 2018-10-16 2020-12-15 北京字节跳动网络技术有限公司 Sample data processing method and device, storage medium and electronic equipment
CN109359302A (en) * 2018-10-26 2019-02-19 重庆大学 A kind of optimization method of field term vector and fusion sort method based on it
CN109918476A (en) * 2019-01-26 2019-06-21 北京工业大学 A kind of subject retrieval method based on topic model
CN109871518B (en) * 2019-02-02 2020-01-10 北京仁和汇智信息技术有限公司 Method and device for generating scientific and technological paper network version PDF document
CN109871518A (en) * 2019-02-02 2019-06-11 北京仁和汇智信息技术有限公司 A kind of technical paper network edition PDF document generation method and device
CN112905866B (en) * 2019-03-14 2022-06-07 福建省天奕网络科技有限公司 Historical data tracing and crawling method and terminal without manual participation
CN112905867B (en) * 2019-03-14 2022-06-07 福建省天奕网络科技有限公司 Efficient historical data tracing and crawling method and terminal
CN112905867A (en) * 2019-03-14 2021-06-04 福建省天奕网络科技有限公司 Efficient historical data tracing and crawling method and terminal
CN112905866A (en) * 2019-03-14 2021-06-04 福建省天奕网络科技有限公司 Historical data tracing and crawling method and terminal without manual participation
CN109992245A (en) * 2019-04-11 2019-07-09 河南师范大学 A kind of method and system carrying out the modeling of science and technology in enterprise demand for services based on topic model
CN112579853A (en) * 2019-09-30 2021-03-30 顺丰科技有限公司 Method and device for sequencing crawling links and storage medium
CN111143649A (en) * 2019-12-09 2020-05-12 杭州迪普科技股份有限公司 Webpage searching method and device
CN111339286A (en) * 2020-02-14 2020-06-26 重庆邮电大学 Method for researching research condition of exploration institution based on topic visualization
CN111339286B (en) * 2020-02-14 2024-02-09 四川超易宏科技有限公司 Method for exploring mechanism research conditions based on theme visualization
CN112035723A (en) * 2020-08-28 2020-12-04 光大科技有限公司 Resource library determination method and device, storage medium and electronic device
CN112765940B (en) * 2021-01-20 2024-04-19 南京万得资讯科技有限公司 Webpage deduplication method based on theme features and content semantics
CN112765940A (en) * 2021-01-20 2021-05-07 南京万得资讯科技有限公司 Novel webpage duplicate removal method based on subject characteristics and content semantics
CN113487143A (en) * 2021-06-15 2021-10-08 中国农业大学 Fish shoal feeding decision method and device, electronic equipment and storage medium
CN113536085A (en) * 2021-06-23 2021-10-22 西华大学 Topic word search crawler scheduling method and system based on combined prediction method
CN113761314A (en) * 2021-09-03 2021-12-07 中国人民解放军国防科技大学 Web crawler autonomous behavior control method based on reinforcement learning
CN113761314B (en) * 2021-09-03 2024-05-03 中国人民解放军国防科技大学 Network crawler autonomous behavior control method based on reinforcement learning
CN113921016A (en) * 2021-10-15 2022-01-11 阿波罗智联(北京)科技有限公司 Voice processing method, device, electronic equipment and storage medium
CN114154501A (en) * 2022-02-09 2022-03-08 南京擎天科技有限公司 Chinese address word segmentation method and system based on unsupervised learning
CN114707516A (en) * 2022-03-29 2022-07-05 北京理工大学 Long text semantic similarity calculation method based on contrast learning
CN114707516B (en) * 2022-03-29 2024-08-13 北京理工大学 Long text semantic similarity calculation method based on contrast learning

Similar Documents

Publication Publication Date Title
CN106777043A (en) A kind of academic resources acquisition methods based on LDA
Jalal et al. Text documents clustering using data mining techniques.
CN106815297B (en) Academic resource recommendation service system and method
US9183281B2 (en) Context-based document unit recommendation for sensemaking tasks
CN103425799B (en) Individuation research direction commending system and recommend method based on theme
CN110597981B (en) Network news summary system for automatically generating summary by adopting multiple strategies
CN102364473B (en) Netnews search system and method based on geographic information and visual information
CN103177090B (en) A kind of topic detection method and device based on big data
CN101968819B (en) Audio and video intelligent cataloging information acquisition method facing wide area network
CN109271477A (en) A kind of method and system by internet building taxonomy library
CN107885793A (en) A kind of hot microblog topic analyzing and predicting method and system
CN110555154B (en) Theme-oriented information retrieval method
Das et al. A CV parser model using entity extraction process and big data tools
Rithish et al. Automated assessment of question quality on online community forums
Gargiulo et al. A deep learning approach for scientific paper semantic ranking
Khatter et al. Content curation algorithm on blog posts using hybrid computing
Darmawiguna et al. The development of integrated Bali tourism information portal using web scrapping and clustering methods
Silwattananusarn et al. A text mining and topic modeling based bibliometric exploration of information science research
Bu et al. An FAR-SW based approach for webpage information extraction
Gulo et al. Text Mining Scientific Articles using the R
CN112597370A (en) Webpage information autonomous collecting and screening system with specified demand range
Sharma et al. A trend analysis of significant topics over time in machine learning research
KR102434880B1 (en) System for providing knowledge sharing service based on multimedia platform
Li et al. A practical method for the expert academic personas classification based on text classifier
Chandrika et al. Extractive text summarization of kannada text documents using page ranking technique

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
Application publication date: 20170531