CN110543564A - Method for acquiring domain label based on topic model - Google Patents

Method for acquiring domain label based on topic model

Info

Publication number
CN110543564A
CN110543564A (application CN201910784200.3A)
Authority
CN
China
Prior art keywords
word
words
model
topic
weight
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910784200.3A
Other languages
Chinese (zh)
Other versions
CN110543564B (en)
Inventor
黄改娟
王胜
张仰森
蒋玉茹
段瑞雪
张雯
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Information Science and Technology University
Original Assignee
Beijing Information Science and Technology University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Information Science and Technology University filed Critical Beijing Information Science and Technology University
Priority to CN201910784200.3A priority Critical patent/CN110543564B/en
Publication of CN110543564A publication Critical patent/CN110543564A/en
Application granted granted Critical
Publication of CN110543564B publication Critical patent/CN110543564B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/06Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063Operations research, analysis or management
    • G06Q10/0639Performance analysis of employees; Performance analysis of enterprise or organisation operations
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Human Resources & Organizations (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Development Economics (AREA)
  • Strategic Management (AREA)
  • Economics (AREA)
  • Educational Administration (AREA)
  • Marketing (AREA)
  • Tourism & Hospitality (AREA)
  • Quality & Reliability (AREA)
  • General Business, Economics & Management (AREA)
  • Operations Research (AREA)
  • Game Theory and Decision Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a method for acquiring a domain label based on a topic model. On the basis of massive academic data, the method analyzes the inherent characteristics of academic data, introduces academic word-frequency characteristics to construct an FLDA topic model, and uses the topic model to extract "topic-phrase" pairs from the academic documents of a given scholar. A domain taxonomy is then introduced: the extraction results of the topic model and the taxonomy labels are represented as vectors, position weighting is applied, and taxonomy mapping is performed using similarity, finally yielding the scholar's domain labels. Experiments show that, compared with the traditional LDA model, the statistics-based TFIDF algorithm and the network-graph-based TextRank algorithm, the FLDA model produces better label words with higher accuracy, and the topic-model-based label extraction method has good applicability in the academic field.

Description

Method for acquiring domain label based on topic model
Technical Field
The invention relates to a method for acquiring a domain label based on a topic model, in particular to a method for acquiring the domain labels of scholars, and belongs to the technical field of information processing.
Background
The vigorous development of the economy and society has given rise to a continuous stream of scientific and technological projects, which require the participation of leading scholars from project establishment through review and acceptance. Traditionally, scholars have been selected manually by designated staff, who manually survey the scholars' research fields in order to choose scholars matching the project field. However, such prior-art methods have the following disadvantages: a large number of projects exist at the same time and all require scholar participation, which greatly increases the workload of manual selection; and manual selection is easily affected by subjectivity and individual limitations, since the selector's own knowledge level, social relationships, personal preferences and interests influence the whole selection process, so that judgments about a scholar's field are incomplete, which in turn reduces the accuracy of the selection results.
Domain label acquisition in the prior art falls mainly into two categories: traditional domain label acquisition and keyword-based domain label acquisition.
In the traditional category, one method extracts so-called web tags from a scholar's profiles on the various platforms of the Internet. Unlike extracted labels, web tags are usually added informally by the user or by others and follow no uniform standard, so the obtained tags are heterogeneous and of low usability. In addition, because Internet content is arbitrary and bears the writing characteristics of its authors, it is difficult to separate correct data from useless information during tag extraction, and a dedicated extraction scheme must be designed for each specific platform and scholar, which again increases the workload.
Another method, based on ontology technology, designs a P2P-mode management system for scholars' research fields and uses RDF technology to solve the problem of acquiring a scholar's field; however, this method relies on a specific template and therefore lacks extensibility.
A further method implements a scholar information management system using J2EE technology, in which a scholar's basic information, research fields and other data are updated manually; it also provides a consulting-scholar recommendation module that computes the Pearson similarity between a user's question text and the scholars' research fields to realize scholar recommendation.
In keyword-based domain label extraction there are many extraction methods; commonly used foundations for keyword extraction include statistics, topics and network graphs. Since the field emerged, researchers have proposed various methods, generally divided into two categories, supervised and unsupervised. Supervised methods require manually labeled corpora and suit small text collections, but as massive Internet data grow the cost of manual labeling rises, and research has gradually turned to unsupervised methods in recent years. The core idea of statistical methods is to exploit the statistical information of words in the text; they need no training data and judge and filter directly on word frequency, position and the like. For example, one method modifies the TFIDF result weights with a weighting factor and a word contribution degree to improve keyword extraction in subdivided fields; an N-gram language model computes directly on word sequences and can classify documents by field without word segmentation or feature extraction, thereby finding scholars' domain labels. Topic models realize keyword extraction via probability distributions, and the LDA (Latent Dirichlet Allocation) model is currently the mainstream. To study the evolution of user behavior, one approach extracts users' historical interests using a static LDA and an improved fLDA to extract subject terms; to address data sparsity in trending microblog events, time-series features and word-frequency weighting features have been introduced into the LDA algorithm, and the resulting topic keywords are highly interpretable and represent topic content well. The prior art further provides an LDA-based topic clustering method, which first clusters the keywords obtained by LDA and uses the result to optimize the LDA topics, effectively improving the precision and recall of the clustering results. Among network-graph methods, the TextRank algorithm adapted from PageRank is the best known. To address the poor extraction of academic keywords, one method uses prior knowledge to compute the weights of candidate results in the academic field and then ranks the candidates comprehensively with TextRank, finally obtaining highly relevant academic keywords; another constructs a probability transition matrix from word vectors to improve the TextRank algorithm and raise its performance.
Unsupervised methods based on statistics, network graphs and the like need no manual corpus labeling in advance, but they depend heavily on the quality and scale of the corpus. Methods such as TFIDF have simple structures, and the keywords they extract lack distributional and semantic information. Methods such as TextRank can capture the distribution information of keywords, but constructing the network graph requires large amounts of data to build edges, and the extracted keywords lack topical relevance. Despite these disadvantages, unsupervised methods still hold an advantage in workload.
The present invention uses a topic-based extraction method: a set of academic documents is treated as the corpus to be processed and is extracted with an improved FLDA topic model to obtain a topic distribution matrix, thereby realizing automatic label acquisition.
Disclosure of Invention
In order to solve the problems in the prior art, the invention provides a method for acquiring a domain label based on a topic model. On the basis of massive academic data, the method analyzes the inherent characteristics of academic data, introduces academic word-frequency characteristics to construct an FLDA topic model, and uses the topic model to extract "topic-phrase" pairs from the academic documents of a given scholar. A domain taxonomy is then introduced: the topic-model extraction results and the taxonomy labels are represented as vectors, position weighting is applied, and taxonomy mapping is performed using similarity, finally yielding the scholar's domain labels.
In order to achieve the above technical object, the present invention adopts the following technical solution.
A method for acquiring a domain label based on a topic model, referring to FIG. 1, comprises the following steps:
S1, data preprocessing
Acquiring an initial data set;
S2, keyword extraction
Extracting "topic-phrase" pairs through FLDA (a word-frequency-weighted LDA model, detailed below), assigning a weight to each phrase according to the position where it appears, and characterizing the phrase as a vector using word2vec;
S3, domain taxonomy mapping
Mapping the "topic-phrase" pairs onto the taxonomy to realize unified management of scholars' fields;
S4, comprehensive ranking
Weighting and ranking the vector characterization results and the weight assignments, and obtaining, via a threshold, the label words most representative of the scholar.
According to the above topic-model-based domain label acquisition method, specifically, S1, data preprocessing comprises S11, data deduplication, and S12, word segmentation.
Specifically, in order to eliminate the influence on the calculation results of duplicate data introduced by crawling multiple data sources, S11, data deduplication is performed in the S1 data preprocessing stage to obtain a scholar document set.
The invention constructs a cleaning model using word co-occurrence, author co-occurrence and keyword coincidence rate:
For two texts to be compared, first judge whether their DOIs are identical; texts with identical DOIs are filtered directly;
For DOIs which differ or are absent, first judge the titles using word co-occurrence; if the co-occurrence degree exceeds 80%, further judge the author co-occurrence count and the keyword coincidence rate; if the author co-occurrence count is greater than 1 and the keyword coincidence rate is greater than 0.5, the pair is judged duplicate and removed.
The co-occurrence formula is as follows:
co(A, B) = len(A ∩ B) / min{len(A), len(B)}
where A and B are the word sets of the two titles, len(A) is the length of the word set of title A, len(B) is the length of the word set of title B, len(A ∩ B) is the size of the intersection of the two title word sets, and min{len(A), len(B)} is the smaller of the two lengths.
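As a minimal sketch of this cleaning model, assuming papers are represented as dictionaries with doi, title_words, authors and keywords fields (the field names, and using the smaller keyword set as the denominator of the coincidence rate, are illustrative assumptions):

```python
def title_cooccurrence(title_a_words, title_b_words):
    """Word co-occurrence of two titles: len(A ∩ B) / min{len(A), len(B)}."""
    a, b = set(title_a_words), set(title_b_words)
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))

def is_duplicate(paper_a, paper_b):
    # Papers with identical DOIs are filtered directly.
    if paper_a.get("doi") and paper_a["doi"] == paper_b.get("doi"):
        return True
    # Otherwise judge the titles first, then author and keyword overlap.
    if title_cooccurrence(paper_a["title_words"], paper_b["title_words"]) > 0.8:
        author_overlap = len(set(paper_a["authors"]) & set(paper_b["authors"]))
        kw_a, kw_b = set(paper_a["keywords"]), set(paper_b["keywords"])
        kw_rate = (len(kw_a & kw_b) / min(len(kw_a), len(kw_b))
                   if kw_a and kw_b else 0.0)   # denominator is an assumption
        return author_overlap > 1 and kw_rate > 0.5
    return False
```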
S12, word segmentation.
In the word segmentation stage, the paper keyword data are extracted and added to the user dictionary of the word segmentation tool; at the same time, key phrases are extracted with TextRank before calculation and likewise added to the user dictionary of the word segmentation tool.
In addition, the whole corpus is sorted by word frequency, high-frequency irrelevant words are screened manually, and these irrelevant words are added to the stop-word list of the word segmentation tool.
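A sketch of this segmentation setup using the jieba toolkit (the segmentation tool named in the experimental example below); the toy corpus, the stop words and the TextRank topK value are illustrative assumptions:

```python
import jieba
import jieba.analyse

# Toy stand-ins for the corpus-derived data described above.
paper_keywords = ["主题模型", "领域标签"]   # keywords harvested from the papers
corpus_text = "本文提出一种基于主题模型的领域标签获取方法"

for kw in paper_keywords:
    jieba.add_word(kw)                       # paper keywords -> user dictionary
for phrase in jieba.analyse.textrank(corpus_text, topK=20):
    jieba.add_word(phrase)                   # TextRank key phrases -> user dictionary

stop_words = {"的", "一种", "本文"}          # manually screened high-frequency words

def tokenize(text):
    return [w for w in jieba.cut(text) if w.strip() and w not in stop_words]

print(tokenize(corpus_text))
```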
According to the topic-model-based domain label acquisition method, specifically, in S2, keyword extraction, the "topic-phrase" extraction is performed through FLDA.
Among topic-based keyword extraction models, the LDA topic model is the mainstream. The model assumes that a batch of documents contains multiple topics and that each topic can be approximated by a series of phrases. A document is generated by selecting a topic with a certain probability, then selecting a phrase under the current topic with a certain probability, and repeating this process until the document is formed. LDA topic extraction is the inverse of this generative process. The LDA topic model is widely applied in the news field, but in the scientific literature field the particular word-frequency distribution of scientific documents affects the topic modeling effect.
Statistical analysis of scholars' academic documents shows that their word-frequency information follows a power-function distribution. For example, FIG. 2 plots the first 2000 high-frequency words of the academic documents, where the abscissa is the rank of each word after sorting by descending frequency and the ordinate is its frequency.
Statistics show that the top 10% of words by frequency account for 81.1% of the whole academic-document word set, consistent with a Zipf distribution, and word-frequency studies find that the words most representative of a topic are usually neither extremely high-frequency nor extremely low-frequency words but mid-to-high-frequency words. Extracting documents directly with the LDA model therefore loses some mid-frequency words; meanwhile, feature words of higher frequency often appear in pairs, and high-frequency words have a higher probability of being assigned to topics, so the discrimination between topics is low. Although stop-word filtering is performed during the S1 data preprocessing, it cannot filter completely.
Therefore, the invention proposes a word-frequency-weighted LDA topic model: the word-frequency information in the documents is first counted, the word-frequency characteristics are introduced into the Gibbs sampling process to reduce the influence of high-frequency words and raise the influence of mid-frequency feature words, and an FLDA model is constructed so that the model does not over-emphasize high-frequency feature words. The FLDA model is as follows:
The LDA model obtains the sampling parameters φ and θ through Gibbs sampling, which constructs a converged Markov chain from which suitable samples are drawn.
LDA assigns topics to phrases by sampling z_i, whose posterior is:
P(z_i = j | z_{-i}, w) ∝ P(w_i | z_i = j, z_{-i}, w_{-i}) · P(z_i = j | z_{-i})
where z_i = j denotes assigning topic j to the current word w_i, z_{-i} denotes the topic assignments of all words other than the current one, and w_{-i} denotes the words at the non-current positions.
P(w | z) is known to be related only to φ, so φ can be integrated out. φ_j, the Gibbs sampling parameter corresponding to the current topic j, is the "topic-phrase" multinomial distribution:
w_i | z_i = j, φ_j ~ Multinomial(φ_j)
In addition, φ_j follows the Dirichlet prior Dirichlet(β), so integrating the posterior probability over φ gives:
P(w_i | z_i = j, z_{-i}, w_{-i}) = (F_{-i,j}^{(w_i)} + β) / (F_{-i,j}^{(·)} + Vβ)
where F_{-i,j}^{(w_i)} is the sum of the weights of the words that are assigned to topic j and identical to word w_i, F_{-i,j}^{(·)} is the sum of the weights of all words assigned to topic j, β is a parameter of the Dirichlet distribution, and V is the size of the lexicon.
Similarly, P(z) is related only to θ, so integrating over θ yields:
P(z_i = j | z_{-i}) = (F_{-i,j}^{(d_i)} + α) / (F_{-i}^{(d_i)} + Tα)
where F_{-i,j}^{(d_i)} is the sum of the word weights assigned to topic j in document d_i, F_{-i}^{(d_i)} is the sum of the word weights in document d_i, and T is the number of topics.
Combining the above formulas gives:
P(z_i = j | z_{-i}, w) ∝ (F_{-i,j}^{(w_i)} + β) / (F_{-i,j}^{(·)} + Vβ) · (F_{-i,j}^{(d_i)} + α) / (F_{-i}^{(d_i)} + Tα)
This is a non-normalized distribution of LDA; dividing out the sum of the probabilities of all "topic-phrase" assignments gives the recalculated posterior probability:
P(z_i = j | z_{-i}, w) = [(F_{-i,j}^{(w_i)} + β) / (F_{-i,j}^{(·)} + Vβ) · (F_{-i,j}^{(d_i)} + α) / (F_{-i}^{(d_i)} + Tα)] / Σ_{j'=1}^{T} [(F_{-i,j'}^{(w_i)} + β) / (F_{-i,j'}^{(·)} + Vβ) · (F_{-i,j'}^{(d_i)} + α) / (F_{-i}^{(d_i)} + Tα)]
where w_i is the i-th word, the F terms are the weighted counts defined above (with raw counts replaced by the word-frequency weights of the model), V is the size of the lexicon and T is the number of topics.
The word-frequency weighting formula of the model is:
C_i = 2 − |n_i − n_mid| / max{n_max − n_mid, n_mid − n_min}
where n_i is the current word's frequency, n_mid is the frequency of the selected mid-frequency word, n_max is the maximum in the word-frequency statistics, n_min is the minimum in the word-frequency statistics, and C_i is the weight of the current word, with value range [1, 2]. To ensure that the total amount of feature words after weighting is unchanged, the weight of each feature word is adjusted as:
F_i = C_i · n_i · (Σ_k n_k) / (Σ_k C_k n_k)
where F_i is the adjusted weight of the feature word, n_i is the count of the current word, and Σ_k C_k n_k is the weight sum over all words. Referring to FIG. 3, since the probability that a word w is assigned to a topic z at Gibbs initialization is random, the computed F_i replaces the random value initialized in the Gibbs sampling process; on this basis, the loop is computed until convergence and the parameters φ and θ are obtained.
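A compact sketch of the weighted sampling step, assuming the count matrices nw ("topic-phrase") and nd ("document-topic") accumulate the adjusted weights F_i instead of raw counts; the exact form of freq_weight is the reconstruction used above and is an assumption, since the text fixes only the variables and the [1, 2] range:

```python
import numpy as np

def freq_weight(n_i, n_mid, n_max, n_min):
    # Assumed form of C_i: peaks at 2 for the chosen mid-frequency word and
    # falls to 1 at the frequency extreme farthest from n_mid.
    span = max(n_max - n_mid, n_mid - n_min) or 1
    return 2.0 - abs(n_i - n_mid) / span

def sample_topic(w, d, nw, nd, nwsum, ndsum, alpha, beta, rng):
    """One collapsed-Gibbs draw of P(z_i = j | z_-i, w) for word w in doc d.
    nw: V x T weighted "topic-phrase" counts; nd: D x T weighted doc-topic counts;
    nwsum: per-topic weight totals; ndsum: per-document weight totals."""
    V, T = nw.shape
    p = ((nw[w, :] + beta) / (nwsum + V * beta) *       # "topic-phrase" term
         (nd[d, :] + alpha) / (ndsum[d] + T * alpha))   # "document-topic" term
    p /= p.sum()                                        # normalize over all topics
    return rng.choice(T, p=p)

rng = np.random.default_rng(0)   # usage: pass into sample_topic per iteration
```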
Word vector characterization by the word2vec method
Word2vec is trained through deep learning on a million-word dictionary and billions of training tokens; the training result is a word vector model, and the word vectors effectively express the semantic information of words in vector space. The vector training model is a shallow neural network, CBOW or Skip-gram; the CBOW model is shown in FIG. 4.
The CBOW model predicts the current word from its context. During training, an N-dimensional word vector is initialized for every word; the model sums the context words within the input window, constructs a Huffman tree according to word frequency to obtain a Huffman path, computes the probability of the leaf node along that path, and then adjusts the parameters of the non-leaf nodes and the context word vectors by gradient descent, converging to the true result after multiple iterations.
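A minimal gensim (4.x API) sketch of CBOW training, where sg=0 selects CBOW and hs=1 enables the Huffman-tree hierarchical softmax described above; the toy corpus and hyperparameters are illustrative, and the experiments below actually use a pre-trained Tencent AI Lab vector model:

```python
from gensim.models import Word2Vec

sentences = [["主题", "模型", "领域", "标签"],   # tokenized documents (toy data)
             ["学者", "学术", "文献", "主题"]]
model = Word2Vec(sentences, vector_size=200, window=5, min_count=1,
                 sg=0, hs=1, epochs=10)          # CBOW + hierarchical softmax
vec = model.wv["主题"]                            # N-dimensional word vector
print(vec.shape)
```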
Weight assignment
Academic papers are generally divided into title, abstract, keywords, body and other fields. By past experience, the title usually carries the central idea of the full text and is an important summary of its content, so the final weight of words in the title is increased. The keyword field is fairly representative of the overall subject matter, and the abstract is regarded as a brief summary of the full-text content.
Preferably, in the weight assignment of the invention, the weight of the title is set to 4, the weight of the keywords to 3, and the weight of the abstract to 2.
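As a trivial sketch, these position weights can be kept in a lookup used later by the taxonomy mapping; the default value for other fields is an assumption:

```python
# Position coefficients L_A as stated above; names are illustrative.
POSITION_WEIGHT = {"title": 4, "keyword": 3, "abstract": 2}

def position_weight(field):
    return POSITION_WEIGHT.get(field, 1)   # default for body/other fields assumed
```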
Selection of FLDA model parameters
In the invention, 20 is selected as the optimal number of topics.
S3, domain taxonomy mapping
Because the phrases obtained from the academic documents of different scholars through the topic model differ and cannot be managed uniformly, an academic domain taxonomy is introduced to realize a uniform measure across scholars.
The taxonomy is established with reference to the domain taxonomy of the National Natural Science Foundation of China and can cover the research scope of each field to the greatest extent.
The invention maps the topic-model results into the domain taxonomy, with the following mapping formula:
F(A, B) = sim(A, B) · C_A · L_A
where A is a phrase obtained from the topic model and B is a taxonomy word; their corresponding word vectors are obtained from the vector model, and for out-of-vocabulary words the word vector is spliced from the vectors of the constituent words. sim(A, B) is the computed cosine similarity, C_A is the probability assigned by the topic model, L_A is the position coefficient of the phrase in the document, with value range {2, 3, 4}, and F(A, B) is the weighted similarity. C_B, the final score of a taxonomy word B, is accumulated from the weighted similarities F(A, B) of the phrases mapped to it.
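A sketch of this mapping under the stated weights. The vec_of lookup, and accumulating F(A, B) into C_B by summation, are assumptions; the original's "splicing" of constituent-word vectors for out-of-vocabulary phrases would live inside vec_of:

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def taxonomy_scores(phrases, taxonomy, vec_of):
    """phrases: list of (phrase, C_A, L_A) triples from the topic model;
    taxonomy: list of taxonomy words; vec_of: text -> vector (or None if OOV)."""
    scores = {b: 0.0 for b in taxonomy}
    for a, c_a, l_a in phrases:
        va = vec_of(a)
        if va is None:
            continue
        for b in taxonomy:
            vb = vec_of(b)
            if vb is not None:
                scores[b] += cosine(va, vb) * c_a * l_a   # F(A,B) = sim * C_A * L_A
    return scores

# S4 usage: take the four highest-scoring taxonomy words as the scholar's labels.
# top4 = sorted(scores, key=scores.get, reverse=True)[:4]
```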
S4, comprehensive ranking
The final score C_B is obtained through the mapping formula; all taxonomy words corresponding to the current scholar are sorted from high to low by the score C_B, and the four taxonomy words with the highest scores are taken as the domain label words most representative of the scholar's research field.
By adopting the above technical solution, the invention achieves the following technical effects.
Compared with the traditional LDA model, the statistics-based TFIDF algorithm and the network-graph-based TextRank algorithm, the FLDA model produces better label words with higher accuracy, and the topic-model-based label extraction method has good applicability in the academic field.
Drawings
FIG. 1 is a schematic framework diagram of the topic-model-based domain label acquisition method of the present invention;
FIG. 2 is a document word-frequency distribution graph;
FIG. 3 is a Gibbs sampling flow chart;
FIG. 4 is a schematic diagram of the CBOW model;
FIG. 5 is a graph of perplexity versus the number of topics.
Detailed Description
in order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below with reference to the embodiments of the present invention and the accompanying drawings. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Embodiment:
A method for obtaining a domain label based on a topic model comprises the following steps:
S1, data preprocessing
Acquiring an initial data set, specifically by the following method:
S11, data deduplication
In this embodiment, a cleaning model is constructed using word co-occurrence, author co-occurrence and keyword coincidence rate:
For two texts to be compared, first judge whether their DOIs are identical; texts with identical DOIs are filtered directly;
For DOIs which differ or are absent, first judge the titles using word co-occurrence; if the co-occurrence degree exceeds 80%, further judge the author co-occurrence count and the keyword coincidence rate; if the author co-occurrence count is greater than 1 and the keyword coincidence rate is greater than 0.5, the pair is judged duplicate and removed.
The co-occurrence formula is as follows:
co(A, B) = len(A ∩ B) / min{len(A), len(B)}
where A and B are the word sets of the two titles, len(A) is the length of the word set of title A, len(B) is the length of the word set of title B, len(A ∩ B) is the size of the intersection of the two title word sets, and min{len(A), len(B)} is the smaller of the two lengths.
S12, word segmentation
In the word segmentation stage, the paper keyword data are extracted and added to the user dictionary of the word segmentation tool; at the same time, key phrases are extracted with TextRank before calculation and likewise added to the user dictionary of the word segmentation tool.
In addition, the whole corpus is sorted by word frequency, high-frequency irrelevant words are screened manually, and these irrelevant words are added to the stop-word list of the word segmentation tool.
S2, keyword extraction
Extracting "topic-phrase" pairs through FLDA, assigning a weight to each phrase according to the position where it appears in the text, and characterizing the phrase as a vector using word2vec.
The "topic-phrase" extraction is performed through FLDA:
The LDA model obtains the sampling parameters φ and θ through Gibbs sampling, which constructs a converged Markov chain from which suitable samples are drawn.
LDA assigns topics to phrases by sampling z_i, whose posterior is:
P(z_i = j | z_{-i}, w) ∝ P(w_i | z_i = j, z_{-i}, w_{-i}) · P(z_i = j | z_{-i})
where z_i = j denotes assigning topic j to the current word w_i, z_{-i} denotes the topic assignments of all words other than the current one, and w_{-i} denotes the words at the non-current positions.
P(w | z) is known to be related only to φ, so φ can be integrated out. φ_j, the Gibbs sampling parameter corresponding to the current topic j, is the "topic-phrase" multinomial distribution:
w_i | z_i = j, φ_j ~ Multinomial(φ_j)
In addition, φ_j follows the Dirichlet prior Dirichlet(β), so integrating the posterior probability over φ gives:
P(w_i | z_i = j, z_{-i}, w_{-i}) = (F_{-i,j}^{(w_i)} + β) / (F_{-i,j}^{(·)} + Vβ)
where F_{-i,j}^{(w_i)} is the sum of the weights of the words that are assigned to topic j and identical to word w_i, F_{-i,j}^{(·)} is the sum of the weights of all words assigned to topic j, β is a parameter of the Dirichlet distribution, and V is the size of the lexicon.
Similarly, P(z) is related only to θ, so integrating over θ yields:
P(z_i = j | z_{-i}) = (F_{-i,j}^{(d_i)} + α) / (F_{-i}^{(d_i)} + Tα)
where F_{-i,j}^{(d_i)} is the sum of the word weights assigned to topic j in document d_i, F_{-i}^{(d_i)} is the sum of the word weights in document d_i, and T is the number of topics.
Combining the above formulas gives:
P(z_i = j | z_{-i}, w) ∝ (F_{-i,j}^{(w_i)} + β) / (F_{-i,j}^{(·)} + Vβ) · (F_{-i,j}^{(d_i)} + α) / (F_{-i}^{(d_i)} + Tα)
This is a non-normalized distribution of LDA; dividing out the sum of the probabilities of all "topic-phrase" assignments gives the recalculated posterior probability:
P(z_i = j | z_{-i}, w) = [(F_{-i,j}^{(w_i)} + β) / (F_{-i,j}^{(·)} + Vβ) · (F_{-i,j}^{(d_i)} + α) / (F_{-i}^{(d_i)} + Tα)] / Σ_{j'=1}^{T} [(F_{-i,j'}^{(w_i)} + β) / (F_{-i,j'}^{(·)} + Vβ) · (F_{-i,j'}^{(d_i)} + α) / (F_{-i}^{(d_i)} + Tα)]
where w_i is the i-th word, the F terms are the weighted counts defined above (with raw counts replaced by the word-frequency weights of the model), V is the size of the lexicon and T is the number of topics.
The word-frequency weighting formula of the model is:
C_i = 2 − |n_i − n_mid| / max{n_max − n_mid, n_mid − n_min}
where n_i is the current word's frequency, n_mid is the frequency of the selected mid-frequency word, n_max is the maximum in the word-frequency statistics, n_min is the minimum in the word-frequency statistics, and C_i is the weight of the current word, with value range [1, 2]. To ensure that the total amount of feature words after weighting is unchanged, the weight of each feature word is adjusted as:
F_i = C_i · n_i · (Σ_k n_k) / (Σ_k C_k n_k)
where F_i is the adjusted weight of the feature word, n_i is the count of the current word, and Σ_k C_k n_k is the weight sum over all words. Referring to FIG. 3, since the probability that a word w is assigned to a topic z at Gibbs initialization is random, the computed F_i replaces the random value initialized in the Gibbs sampling process; on this basis, the loop is computed until convergence and the parameters φ and θ are obtained.
Word vector characterization by the word2vec method
Word2vec is trained through deep learning on a million-word dictionary and billions of training tokens; the training result is a word vector model, and the word vectors effectively express the semantic information of words in vector space.
Specifically, the CBOW model is adopted for training the vectors. The CBOW model predicts the current word from its context. During training, an N-dimensional word vector is initialized for every word; the model sums the context words within the input window, constructs a Huffman tree according to word frequency to obtain a Huffman path, computes the probability of the leaf node along that path, and then adjusts the parameters of the non-leaf nodes and the context word vectors by gradient descent, converging to the true result after multiple iterations.
Weight assignment
In this embodiment, the weight of the title is set to 4, the weight of the keywords to 3, and the weight of the abstract to 2.
As a preferred scheme, 20 is selected as the optimal number of topics in this embodiment.
S3, domain taxonomy mapping
Mapping the "topic-phrase" pairs onto the taxonomy to realize unified management of scholars' fields;
In this embodiment, the topic-model results are mapped into the domain taxonomy, with the following mapping formula:
F(A, B) = sim(A, B) · C_A · L_A
where A is a phrase obtained from the topic model and B is a taxonomy word; their corresponding word vectors are obtained from the vector model, and for out-of-vocabulary words the word vector is spliced from the vectors of the constituent words. sim(A, B) is the computed cosine similarity, C_A is the probability assigned by the topic model, L_A is the position coefficient of the phrase in the document, with value range {2, 3, 4}, and F(A, B) is the weighted similarity. C_B is the final score of the taxonomy word.
S4, comprehensive ranking
The vector characterization results and the weight assignments are weighted and ranked to obtain the label words most representative of the scholar.
Specifically, the final score C_B is obtained through the mapping formula; all taxonomy words corresponding to the current scholar are sorted from high to low by score, and the four taxonomy words with the highest scores are taken as the domain label words most representative of the scholar's research field.
Experimental example:
In order to obtain experimental data that are as authentic as possible, a web crawler was used to crawl paper data from CNKI and the Wanfang database; the data were segmented with jieba, and vector representation used the word vector model published by Tencent AI Lab. This experimental example is presented in four parts: evaluation criteria, data preprocessing, selection of the FLDA topic-number parameter, and evaluation of the topic-model-based label algorithm.
Evaluation criteria:
Because the LDA topic model is an unsupervised model, there is no very intuitive evaluation standard for measuring model quality. In this experimental example, the "topic-phrase" matrix of the topic model is selected for evaluation, and perplexity is introduced as the evaluation criterion of the model; in general, the lower the perplexity, the better the model. The perplexity is calculated as:
perplexity(D) = exp( − Σ_{d=1}^{M} log p(w_d) / Σ_{d=1}^{M} N_d ),  with  p(w) = Σ_z p(z | d) · p(w | z)
where perplexity(D) is the perplexity of the current model, d is an academic document, M is the number of academic documents, Σ N_d is the total number of words in the current corpus, p(w) is the probability of word w appearing in the matrix, p(z | d) is the probability of topic z for academic document d, and p(w | z) is the probability of word w under topic z. Perplexity measures how well the results predicted by the topic model conform to the original sample information.
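A direct implementation sketch of this formula, assuming documents are given as lists of word indices and the model's distributions as dense matrices (all names are illustrative):

```python
import numpy as np

def perplexity(docs, p_z_given_d, p_w_given_z):
    """docs: list of documents, each a list of word indices;
    p_z_given_d: M x T document-topic matrix; p_w_given_z: T x V topic-word matrix."""
    log_prob, n_words = 0.0, 0
    for d, doc in enumerate(docs):
        for w in doc:
            p_w = np.dot(p_z_given_d[d], p_w_given_z[:, w])  # sum_z p(z|d) p(w|z)
            log_prob += np.log(p_w)
            n_words += 1
    return np.exp(-log_prob / n_words)   # lower is better
```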
In calculating label accuracy, the F1 value is used to measure the accuracy of the scholars' labels: several scholars are randomly selected, and for each scholar the 4 most suitable labels are manually selected from the domain taxonomy, combining knowledge of the scholar, as the correct labels. The top four labels produced by the algorithm's ranking are evaluated against the correct labels with the following indexes, and finally the average F1 value is computed:
P_i = |h_i ∩ m_i| / |m_i|   (15)
R_i = |h_i ∩ m_i| / |h_i|   (16)
F1 = (1/N) Σ_{i=1}^{N} 2 · P_i · R_i / (P_i + R_i)   (17)
where h_i denotes the set of standard labels, m_i denotes the set of labels obtained by the algorithm, |h_i ∩ m_i| is the number of correct labels obtained by the algorithm, and N is the total number of samples.
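A sketch of the averaged F1 computation over N scholars following equations (15)-(17); treating a zero-overlap scholar as contributing F1 = 0 is an assumption:

```python
def average_f1(standard_labels, predicted_labels):
    """standard_labels, predicted_labels: parallel lists of label sets per scholar."""
    f1_sum = 0.0
    for h, m in zip(standard_labels, predicted_labels):
        h, m = set(h), set(m)
        correct = len(h & m)                 # |h_i ∩ m_i|
        if correct == 0:
            continue                         # zero-overlap contributes 0 (assumed)
        p, r = correct / len(m), correct / len(h)
        f1_sum += 2 * p * r / (p + r)
    return f1_sum / len(standard_labels)     # divide by N, the sample count
```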
Data pre-processing
In order to eliminate the influence of duplicate data from multi-source crawling on the calculation results, the data are deduplicated in the preprocessing stage. A cleaning model is constructed using word co-occurrence, author co-occurrence and keyword coincidence rate: it first judges whether the DOIs of the two texts to be compared are identical and directly filters identical DOIs; for DOIs which differ or are absent, it first judges the titles using word co-occurrence, and if the co-occurrence degree exceeds 80% it further judges the author co-occurrence count and the keyword coincidence rate; if the author co-occurrence count is greater than 1 and the keyword coincidence rate is greater than 0.5, the pair is judged duplicate and removed.
In the word segmentation stage, the massive paper keyword data are first extracted and added to the user dictionary of the word segmentation tool; at the same time, key phrases are extracted with TextRank before calculation and added to the user dictionary of the word segmentation tool. In addition, the whole corpus is sorted by word frequency, high-frequency irrelevant words are screened manually, and the irrelevant words are added to the stop-word list of the word segmentation tool.
Selection of FLDA model parameters
The number of topics is an important parameter affecting the topic clustering result: if it is set too small, the clustering result lacks discrimination, and if it is set too large, the current document is wrongly divided among other topics. The experiment varies only the number of topics while keeping the other parameters unchanged; the results are shown in FIG. 5.
The LDA curve is the perplexity curve of the LDA topic model, and the FLDA curve is the perplexity curve of the word-frequency-weighted LDA topic model. In the graph, the abscissa is the number of topics and the ordinate is the perplexity. Each configuration was run three times with the same parameters and the results averaged.
From the experimental results, the models' perplexities decline as the number of topics increases, and the decline slows and even converges at around 20 topics, so 20 is selected as the optimal number of topics.
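As a baseline, this topic-number sweep can be reproduced with a standard LDA implementation; the gensim sketch below runs on toy data (gensim implements plain LDA only, so the FLDA word-frequency weighting would require the custom sampler sketched earlier):

```python
from gensim import corpora
from gensim.models import LdaModel

texts = [["主题", "模型", "标签"], ["学术", "文献", "主题", "模型"]]  # toy tokenized docs
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

for k in range(5, 30, 5):
    lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=k,
                   passes=5, random_state=0)
    # log_perplexity returns a per-word log-2 likelihood bound;
    # perplexity = 2 ** (-bound), lower is better.
    print(k, 2 ** (-lda.log_perplexity(corpus)))
```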
Evaluation of the label algorithms
A domain taxonomy is introduced to obtain the scholars' academic fields. The taxonomy is established with reference to the domain taxonomy of the National Natural Science Foundation of China and appropriately modified on that basis; an example is shown in Table 1.
Table 1. Domain taxonomy example
Because the academic-field label extraction task lacks a relatively authoritative data set, to verify the effectiveness of the algorithm the experiments use the academic paper data of 12 scholars; the TFIDF, TextRank, LDA and FLDA algorithms are each used to acquire the scholars' academic labels for comparison, and specific examples are shown in Table 2.
For the evaluation, suitable taxonomy words are manually selected as the standard answers for a scholar's domain labels according to information such as the scholar's homepage introduction; the label words obtained by an algorithm are called its current answers, and the current answers are evaluated against the standard answers.
Table 2. Comparison of the labels extracted by each algorithm
Table 3. Comparison of the algorithms' F1 values
As can be seen from Table 2, the labels obtained by the FLDA algorithm coincide more closely with the standard answers; the FLDA algorithm performs better than LDA, which in turn performs better than the TFIDF algorithm.
The data in Table 3 are F1 values calculated with reference to equations (15), (16) and (17), where 4-2 means that the 4 highest-scoring labels obtained by the algorithm are evaluated against two standard answers, 4-3 means that the 4 highest-scoring labels are evaluated against three standard answers, and 4-4 means evaluation against four standard answers. The analysis shows that the F1 value of FLDA is higher than those of the traditional LDA algorithm, the statistics-based TFIDF algorithm and the network-graph-based TextRank algorithm under all prediction conditions. This shows that the word-frequency-weighted FLDA model can analyze document content and the relations between documents at the chapter level, reduces the dimensionality of the academic data, is better suited to processing academic documents of a certain order of magnitude, and benefits subsequent label mapping and calculation. The model reflects scholars' research directions to a certain extent, helping users understand a scholar comprehensively and conveniently while saving time and effort, and it indirectly shows that the word-frequency-weighted FLDA algorithm extracts the key information in academic text better than the traditional algorithms.
The technical solutions provided by the present invention are not limited to the above embodiments; all technical solutions formed by transforming or substituting the structures and modes of the present invention fall within the protection scope of the present invention.

Claims (8)

1. A method for acquiring a domain label based on a topic model, characterized by comprising the following steps:
S1, data preprocessing
acquiring an initial data set;
S2, keyword extraction
extracting "topic-phrase" pairs through FLDA, assigning a weight to each phrase according to the position where it appears, and characterizing the phrase as a vector using word2vec;
S3, domain taxonomy mapping
mapping the "topic-phrase" pairs onto the taxonomy to realize unified management of scholars' fields;
S4, comprehensive ranking
weighting and ranking the vector characterization results and the weight assignments, and obtaining, via a threshold, the label words most representative of the scholar.
2. The method of claim 1, wherein the method comprises:
S1, data preprocessing comprises S11, data deduplication, and S12, word segmentation;
S11, data deduplication is performed to obtain a scholar document set;
a cleaning model is constructed using word co-occurrence, author co-occurrence and keyword coincidence rate:
for two texts to be compared, first judging whether their DOIs are identical, and directly filtering texts with identical DOIs;
for DOIs which differ or are absent, first judging the titles using word co-occurrence; if the co-occurrence degree exceeds 80%, further judging the author co-occurrence count and the keyword coincidence rate; and if the author co-occurrence count is greater than 1 and the keyword coincidence rate is greater than 0.5, judging the pair duplicate and removing it;
the co-occurrence formula is as follows:
co(A, B) = len(A ∩ B) / min{len(A), len(B)}
wherein A and B are the word sets of the two titles, len(A) is the length of the word set of title A, len(B) is the length of the word set of title B, len(A ∩ B) is the size of the intersection of the two title word sets, and min{len(A), len(B)} is the smaller of the two lengths;
S12, word segmentation is performed to obtain the initial data set;
first extracting the paper keyword data and adding them to the user dictionary of the word segmentation tool, while extracting key phrases with TextRank before calculation and adding them to the user dictionary of the word segmentation tool as well;
and sorting the whole corpus by word frequency, manually screening high-frequency irrelevant words, and adding the irrelevant words to the stop-word list of the word segmentation tool.
3. The method of claim 1, wherein the method comprises: the "topic-phrase" extraction through FLDA comprises obtaining the sampling parameters φ and θ through Gibbs sampling;
the posterior formula of z_i is:
P(z_i = j | z_{-i}, w) ∝ P(w_i | z_i = j, z_{-i}, w_{-i}) · P(z_i = j | z_{-i}),
wherein z_i = j denotes assigning topic j to the current word w_i, z_{-i} denotes the topic assignments of all words other than the current one, and w_{-i} denotes the words at the non-current positions;
P(w | z) is known to be related only to φ, so φ is integrated out; φ_j, the Gibbs sampling parameter corresponding to the current topic j, is the "topic-phrase" multinomial distribution, and its prior is the Dirichlet distribution, so integrating the posterior probability yields:
P(w_i | z_i = j, z_{-i}, w_{-i}) = (F_{-i,j}^{(w_i)} + β) / (F_{-i,j}^{(·)} + Vβ),
wherein F_{-i,j}^{(w_i)} is the sum of the weights of the words assigned to topic j and identical to word w_i, F_{-i,j}^{(·)} is the sum of the weights of all words assigned to topic j, β is a parameter of the Dirichlet distribution, and V is the size of the lexicon;
similarly, P(z) is related only to θ, so integrating over θ yields:
P(z_i = j | z_{-i}) = (F_{-i,j}^{(d_i)} + α) / (F_{-i}^{(d_i)} + Tα),
wherein F_{-i,j}^{(d_i)} is the sum of the word weights assigned to topic j in document d_i, F_{-i}^{(d_i)} is the sum of the word weights in document d_i, and T is the number of topics;
combining with the formula P(z_i = j | z_{-i}, w) ∝ P(w_i | z_i = j, z_{-i}, w_{-i}) P(z_i = j | z_{-i}) gives the non-normalized distribution of LDA, and dividing out the sum of the probabilities of all "topic-phrase" assignments gives:
P(z_i = j | z_{-i}, w) = [(F_{-i,j}^{(w_i)} + β) / (F_{-i,j}^{(·)} + Vβ) · (F_{-i,j}^{(d_i)} + α) / (F_{-i}^{(d_i)} + Tα)] / Σ_{j'=1}^{T} [(F_{-i,j'}^{(w_i)} + β) / (F_{-i,j'}^{(·)} + Vβ) · (F_{-i,j'}^{(d_i)} + α) / (F_{-i}^{(d_i)} + Tα)],
wherein w_i is the i-th word, the F terms are the weighted counts defined above, V is the size of the lexicon, T is the number of topics, and P(z_i = j | z_{-i}, w) is the recalculated posterior probability;
the word-frequency weighting formula of the model is:
C_i = 2 − |n_i − n_mid| / max{n_max − n_mid, n_mid − n_min},
wherein n_i is the current word frequency, n_mid is the word frequency of the selected mid-frequency word, n_max is the maximum in the word-frequency statistics, n_min is the minimum in the word-frequency statistics, and C_i is the weight of the current word, with value range [1, 2]; to ensure that the total amount of feature words after weighting is unchanged, the weight of each feature word is adjusted as F_i = C_i · n_i · (Σ_k n_k) / (Σ_k C_k n_k), wherein F_i is the adjusted weight of the feature word, n_i is the count of the current word, and Σ_k C_k n_k is the weight sum over all words;
the computed F_i replaces the random value initialized in the Gibbs sampling process, and on this basis the loop is computed until convergence to obtain the parameters φ and θ.
4. The method of claim 3, wherein the method comprises: the vector characterization method is a word2vec method.
5. The method of claim 3, wherein the method comprises: the title weight is set to 4, the keyword weight to 3, and the abstract weight to 2.
6. The method for acquiring a domain label based on a topic model according to any one of claims 3 to 5, wherein: the number of topics of the FLDA model is 20.
7. The method of claim 1, wherein the method comprises: in S3, the mapping formula of the domain taxonomy mapping is as follows:
F(A, B) = sim(A, B) · C_A · L_A
wherein A is a phrase obtained from the topic model and B is a taxonomy word, their corresponding word vectors are obtained from the vector model, and for out-of-vocabulary words the word vector is spliced from the vectors of the constituent words; sim(A, B) is the computed cosine similarity, C_A is the probability assigned by the topic model, L_A is the position coefficient of the phrase in the document, with value range {2, 3, 4}, F(A, B) is the weighted similarity, and C_B is the final score of the taxonomy word.
8. The method of claim 1, wherein the method comprises: in S4, comprehensive ranking, all taxonomy words corresponding to the current scholar are sorted from high to low by the score C_B, and the several highest-scoring taxonomy words are taken as the domain label words most representative of the scholar's research field.
CN201910784200.3A 2019-08-23 2019-08-23 Domain label acquisition method based on topic model Active CN110543564B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910784200.3A CN110543564B (en) 2019-08-23 2019-08-23 Domain label acquisition method based on topic model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910784200.3A CN110543564B (en) 2019-08-23 2019-08-23 Domain label acquisition method based on topic model

Publications (2)

Publication Number Publication Date
CN110543564A true CN110543564A (en) 2019-12-06
CN110543564B CN110543564B (en) 2023-06-20

Family

ID=68712039

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910784200.3A Active CN110543564B (en) 2019-08-23 2019-08-23 Domain label acquisition method based on topic model

Country Status (1)

Country Link
CN (1) CN110543564B (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111241283A (en) * 2020-01-15 2020-06-05 电子科技大学 Rapid characterization method for portrait of scientific research student
CN111831804A (en) * 2020-06-29 2020-10-27 深圳价值在线信息科技股份有限公司 Key phrase extraction method and device, terminal equipment and storage medium
CN112446204A (en) * 2020-12-07 2021-03-05 北京明略软件系统有限公司 Document tag determination method, system and computer equipment
CN112508376A (en) * 2020-11-30 2021-03-16 中国科学院深圳先进技术研究院 Index system construction method
CN112883148A (en) * 2021-01-15 2021-06-01 上海柏观数据科技有限公司 Subject talent evaluation control method and device based on research trend matching
CN113190672A (en) * 2021-05-12 2021-07-30 上海热血网络科技有限公司 Advertisement judgment model, advertisement filtering method and system
CN113298399A (en) * 2021-05-31 2021-08-24 西南大学 Scientific research project analysis method based on big data
CN114492425A (en) * 2021-12-30 2022-05-13 中科大数据研究院 Method for communicating multi-dimensional data by adopting one set of field label system

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150193535A1 (en) * 2014-01-07 2015-07-09 International Business Machines Corporation Identifying influencers for topics in social media
CN105740342A (en) * 2016-01-22 2016-07-06 天津中科智能识别产业技术研究院有限公司 Social relation topic model based social network friend recommendation method
CN109766544A (en) * 2018-12-24 2019-05-17 中国科学院合肥物质科学研究院 Document keyword abstraction method and device based on LDA and term vector

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150193535A1 (en) * 2014-01-07 2015-07-09 International Business Machines Corporation Identifying influencers for topics in social media
CN105740342A (en) * 2016-01-22 2016-07-06 天津中科智能识别产业技术研究院有限公司 Social relation topic model based social network friend recommendation method
CN109766544A (en) * 2018-12-24 2019-05-17 中国科学院合肥物质科学研究院 Document keyword abstraction method and device based on LDA and term vector

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
XIMING LI et al.: "Supervised topic models for multi-label classification", NEUROCOMPUTING *
李熙铭: "Research on several problems of multi-label text classification and streaming-text data modeling based on topic models", China Doctoral Dissertations Full-text Database *
王胜 et al.: "Domain label acquisition method based on SL-LDA", Computer Science (计算机科学) *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111241283A (en) * 2020-01-15 2020-06-05 电子科技大学 Rapid characterization method for portrait of scientific research student
CN111831804A (en) * 2020-06-29 2020-10-27 深圳价值在线信息科技股份有限公司 Key phrase extraction method and device, terminal equipment and storage medium
CN111831804B (en) * 2020-06-29 2024-04-26 深圳价值在线信息科技股份有限公司 Method and device for extracting key phrase, terminal equipment and storage medium
CN112508376A (en) * 2020-11-30 2021-03-16 中国科学院深圳先进技术研究院 Index system construction method
CN112446204A (en) * 2020-12-07 2021-03-05 北京明略软件系统有限公司 Document tag determination method, system and computer equipment
CN112883148A (en) * 2021-01-15 2021-06-01 上海柏观数据科技有限公司 Subject talent evaluation control method and device based on research trend matching
CN113190672A (en) * 2021-05-12 2021-07-30 上海热血网络科技有限公司 Advertisement judgment model, advertisement filtering method and system
CN113298399A (en) * 2021-05-31 2021-08-24 西南大学 Scientific research project analysis method based on big data
CN114492425A (en) * 2021-12-30 2022-05-13 中科大数据研究院 Method for communicating multi-dimensional data by adopting one set of field label system

Also Published As

Publication number Publication date
CN110543564B (en) 2023-06-20

Similar Documents

Publication Publication Date Title
CN110543564A (en) Method for acquiring domain label based on topic model
CN109492157B (en) News recommendation method and theme characterization method based on RNN and attention mechanism
Buber et al. Web page classification using RNN
CN111414461B (en) Intelligent question-answering method and system fusing knowledge base and user modeling
CN113011533A (en) Text classification method and device, computer equipment and storage medium
CN108090070B (en) Chinese entity attribute extraction method
CN108132927B (en) Keyword extraction method for combining graph structure and node association
CN111125349A (en) Graph model text abstract generation method based on word frequency and semantics
CN109408743B (en) Text link embedding method
CN109885675B (en) Text subtopic discovery method based on improved LDA
CN112989802B (en) Bullet screen keyword extraction method, bullet screen keyword extraction device, bullet screen keyword extraction equipment and bullet screen keyword extraction medium
CN109670014A (en) A kind of Authors of Science Articles name disambiguation method of rule-based matching and machine learning
Rezaei et al. Multi-document extractive text summarization via deep learning approach
CN112559684A (en) Keyword extraction and information retrieval method
CN114706972B (en) Automatic generation method of unsupervised scientific and technological information abstract based on multi-sentence compression
CN110705247A (en) Based on x2-C text similarity calculation method
CN115952292B (en) Multi-label classification method, apparatus and computer readable medium
CN116304063B (en) Simple emotion knowledge enhancement prompt tuning aspect-level emotion classification method
CN111581364B (en) Chinese intelligent question-answer short text similarity calculation method oriented to medical field
Chang et al. A METHOD OF FINE-GRAINED SHORT TEXT SENTIMENT ANALYSIS BASED ON MACHINE LEARNING.
CN112686025A (en) Chinese choice question interference item generation method based on free text
CN109614490A (en) Money article proneness analysis method based on LSTM
CN113032573B (en) Large-scale text classification method and system combining topic semantics and TF-IDF algorithm
CN111563361B (en) Text label extraction method and device and storage medium
CN112084312A (en) Intelligent customer service system constructed based on knowledge graph

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant