CN110543564A - Method for acquiring domain label based on topic model - Google Patents

Method for acquiring domain label based on topic model

Info

Publication number
CN110543564A
CN110543564A (application CN201910784200.3A)
Authority
CN
China
Prior art keywords
word
words
model
topic
weight
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910784200.3A
Other languages
Chinese (zh)
Other versions
CN110543564B (en)
Inventor
黄改娟
王胜
张仰森
蒋玉茹
段瑞雪
张雯
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Information Science and Technology University
Original Assignee
Beijing Information Science and Technology University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Information Science and Technology University filed Critical Beijing Information Science and Technology University
Priority to CN201910784200.3A priority Critical patent/CN110543564B/en
Publication of CN110543564A publication Critical patent/CN110543564A/en
Application granted granted Critical
Publication of CN110543564B publication Critical patent/CN110543564B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/06Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063Operations research, analysis or management
    • G06Q10/0639Performance analysis of employees; Performance analysis of enterprise or organisation operations
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Human Resources & Organizations (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Development Economics (AREA)
  • Strategic Management (AREA)
  • Economics (AREA)
  • Educational Administration (AREA)
  • Marketing (AREA)
  • Tourism & Hospitality (AREA)
  • Quality & Reliability (AREA)
  • General Business, Economics & Management (AREA)
  • Operations Research (AREA)
  • Game Theory and Decision Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a method for acquiring a domain label based on a topic model. On the basis of massive academic data, the method analyzes the inherent characteristics of academic data, introduces academic word-frequency characteristics to construct an FLDA topic model, and uses the topic model to extract "topic-phrase" pairs from the academic documents of a given scholar. A domain taxonomy is then introduced: the extraction results of the topic model and the taxonomy labels are represented as vectors, position weighting is applied, and taxonomy mapping is performed using similarity, finally yielding the scholar's domain labels. Experiments show that, compared with the traditional LDA model, the statistics-based TFIDF algorithm and the network-graph-based TextRank algorithm, the FLDA model produces better label words with higher accuracy, and the topic-model-based label extraction method has good applicability in the academic field.

Description

Method for acquiring domain label based on topic model
Technical Field
The invention relates to a method for acquiring a domain label based on a topic model, in particular to a method for acquiring the domain labels of scholars, and belongs to the technical field of information processing.
Background
The vigorous development of the economy and society has given rise to a continuous stream of scientific and technological projects, which require the participation of leading scholars from project establishment through review and acceptance. Traditionally, scholars have been selected manually by designated staff, who manually survey the scholars' research fields in order to choose scholars matching the project field. However, such prior-art methods have the following disadvantages: a large number of projects exist at the same time and all require scholar participation, which greatly increases the workload of manual selection; and manual selection is easily affected by subjectivity and individual limitations, since the selector's own knowledge level, social relationships, personal preferences and interests influence the whole selection process, so that judgments about a scholar's field are incomplete, which in turn reduces the accuracy of the selection results.
Domain label acquisition in the prior art falls mainly into two categories: traditional domain label acquisition and keyword-based domain label acquisition.
In the traditional category, one method extracts so-called web tags from a scholar's profiles on the various platforms of the Internet. Unlike extracted labels, web tags are usually added informally by the user or by others and follow no uniform standard, so the obtained tags are heterogeneous and of low usability. In addition, because Internet content is arbitrary and bears the writing characteristics of its authors, it is difficult to separate correct data from useless information during tag extraction, and a dedicated extraction scheme must be designed for each specific platform and scholar, which again increases the workload.
Another method, based on ontology technology, designs a P2P-mode management system for scholars' research fields and uses RDF technology to solve the problem of acquiring a scholar's field; however, this method relies on a specific template and therefore lacks extensibility.
A further method implements a scholar information management system using J2EE technology, in which a scholar's basic information, research fields and other data are updated manually; it also provides a consulting-scholar recommendation module that computes the Pearson similarity between a user's question text and the scholars' research fields to realize scholar recommendation.
In keyword-based domain label extraction there are many extraction methods; commonly used foundations for keyword extraction include statistics, topics and network graphs. Since the field emerged, researchers have proposed various methods, generally divided into two categories, supervised and unsupervised. Supervised methods require manually labeled corpora and suit small text collections, but as massive Internet data grow the cost of manual labeling rises, and research has gradually turned to unsupervised methods in recent years. The core idea of statistical methods is to exploit the statistical information of words in the text; they need no training data and judge and filter directly on word frequency, position and the like. For example, one method modifies the TFIDF result weights with a weighting factor and a word contribution degree to improve keyword extraction in subdivided fields; an N-gram language model computes directly on word sequences and can classify documents by field without word segmentation or feature extraction, thereby finding scholars' domain labels. Topic models realize keyword extraction via probability distributions, and the LDA (Latent Dirichlet Allocation) model is currently the mainstream. To study the evolution of user behavior, one approach extracts users' historical interests using a static LDA and an improved fLDA to extract subject terms; to address data sparsity in trending microblog events, time-series features and word-frequency weighting features have been introduced into the LDA algorithm, and the resulting topic keywords are highly interpretable and represent topic content well. The prior art further provides an LDA-based topic clustering method, which first clusters the keywords obtained by LDA and uses the result to optimize the LDA topics, effectively improving the precision and recall of the clustering results. Among network-graph methods, the TextRank algorithm adapted from PageRank is the best known. To address the poor extraction of academic keywords, one method uses prior knowledge to compute the weights of candidate results in the academic field and then ranks the candidates comprehensively with TextRank, finally obtaining highly relevant academic keywords; another constructs a probability transition matrix from word vectors to improve the TextRank algorithm and raise its performance.
Unsupervised methods based on statistics, network graphs and the like need no manual corpus labeling in advance, but they depend heavily on the quality and scale of the corpus. Methods such as TFIDF have simple structures, and the keywords they extract lack distributional and semantic information. Methods such as TextRank can capture the distribution information of keywords, but constructing the network graph requires large amounts of data to build edges, and the extracted keywords lack topical relevance. Despite these disadvantages, unsupervised methods still hold an advantage in workload.
The present invention uses a topic-based extraction method: a set of academic documents is treated as the corpus to be processed and is extracted with an improved FLDA topic model to obtain a topic distribution matrix, thereby realizing automatic label acquisition.
Disclosure of Invention
In order to solve the problems in the prior art, the invention provides a method for acquiring a domain label based on a topic model. On the basis of massive academic data, the method analyzes the inherent characteristics of academic data, introduces academic word-frequency characteristics to construct an FLDA topic model, and uses the topic model to extract "topic-phrase" pairs from the academic documents of a given scholar. A domain taxonomy is then introduced: the topic-model extraction results and the taxonomy labels are represented as vectors, position weighting is applied, and taxonomy mapping is performed using similarity, finally yielding the scholar's domain labels.
In order to achieve the above technical object, the present invention adopts the following technical solution.
A method for acquiring a domain label based on a topic model, referring to FIG. 1, comprises the following steps:
S1, data preprocessing
Acquiring an initial data set;
S2, keyword extraction
Extracting "topic-phrase" pairs through FLDA (a word-frequency-weighted LDA model, detailed below), assigning a weight to each phrase according to the position where it appears, and characterizing the phrase as a vector using word2vec;
S3, domain taxonomy mapping
Mapping the "topic-phrase" pairs onto the taxonomy to realize unified management of scholars' fields;
S4, comprehensive ranking
Weighting and ranking the vector characterization results and the weight assignments, and obtaining, via a threshold, the label words most representative of the scholar.
According to the above topic-model-based domain label acquisition method, specifically, S1, data preprocessing comprises S11, data deduplication, and S12, word segmentation.
Specifically, in order to eliminate the influence on the calculation results of duplicate data introduced by crawling multiple data sources, S11, data deduplication is performed in the S1 data preprocessing stage to obtain a scholar document set.
The invention constructs a cleaning model using word co-occurrence, author co-occurrence and keyword coincidence rate:
For two texts to be compared, first judge whether their DOIs are identical; texts with identical DOIs are filtered directly;
For DOIs which differ or are absent, first judge the titles using word co-occurrence; if the co-occurrence degree exceeds 80%, further judge the author co-occurrence count and the keyword coincidence rate; if the author co-occurrence count is greater than 1 and the keyword coincidence rate is greater than 0.5, the pair is judged duplicate and removed.
The co-occurrence formula is as follows:
co(A, B) = len(A ∩ B) / min{len(A), len(B)}
where A and B are the word sets of the two titles, len(A) is the length of the word set of title A, len(B) is the length of the word set of title B, len(A ∩ B) is the size of the intersection of the two title word sets, and min{len(A), len(B)} is the smaller of the two lengths.
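As a minimal sketch of this cleaning model, assuming papers are represented as dictionaries with doi, title_words, authors and keywords fields (the field names, and using the smaller keyword set as the denominator of the coincidence rate, are illustrative assumptions):

```python
def title_cooccurrence(title_a_words, title_b_words):
    """Word co-occurrence of two titles: len(A ∩ B) / min{len(A), len(B)}."""
    a, b = set(title_a_words), set(title_b_words)
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))

def is_duplicate(paper_a, paper_b):
    # Papers with identical DOIs are filtered directly.
    if paper_a.get("doi") and paper_a["doi"] == paper_b.get("doi"):
        return True
    # Otherwise judge the titles first, then author and keyword overlap.
    if title_cooccurrence(paper_a["title_words"], paper_b["title_words"]) > 0.8:
        author_overlap = len(set(paper_a["authors"]) & set(paper_b["authors"]))
        kw_a, kw_b = set(paper_a["keywords"]), set(paper_b["keywords"])
        kw_rate = (len(kw_a & kw_b) / min(len(kw_a), len(kw_b))
                   if kw_a and kw_b else 0.0)   # denominator is an assumption
        return author_overlap > 1 and kw_rate > 0.5
    return False
```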
S12, word segmentation.
In the word segmentation stage, the paper keyword data are extracted and added to the user dictionary of the word segmentation tool; at the same time, key phrases are extracted with TextRank before calculation and likewise added to the user dictionary of the word segmentation tool.
In addition, the whole corpus is sorted by word frequency, high-frequency irrelevant words are screened manually, and these irrelevant words are added to the stop-word list of the word segmentation tool.
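A sketch of this segmentation setup using the jieba toolkit (the segmentation tool named in the experimental example below); the toy corpus, the stop words and the TextRank topK value are illustrative assumptions:

```python
import jieba
import jieba.analyse

# Toy stand-ins for the corpus-derived data described above.
paper_keywords = ["主题模型", "领域标签"]   # keywords harvested from the papers
corpus_text = "本文提出一种基于主题模型的领域标签获取方法"

for kw in paper_keywords:
    jieba.add_word(kw)                       # paper keywords -> user dictionary
for phrase in jieba.analyse.textrank(corpus_text, topK=20):
    jieba.add_word(phrase)                   # TextRank key phrases -> user dictionary

stop_words = {"的", "一种", "本文"}          # manually screened high-frequency words

def tokenize(text):
    return [w for w in jieba.cut(text) if w.strip() and w not in stop_words]

print(tokenize(corpus_text))
```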
According to the topic-model-based domain label acquisition method, specifically, in S2, keyword extraction, the "topic-phrase" extraction is performed through FLDA.
Among topic-based keyword extraction models, the LDA topic model is the mainstream. The model assumes that a batch of documents contains multiple topics and that each topic can be approximated by a series of phrases. A document is generated by selecting a topic with a certain probability, then selecting a phrase under the current topic with a certain probability, and repeating this process until the document is formed. LDA topic extraction is the inverse of this generative process. The LDA topic model is widely applied in the news field, but in the scientific literature field the particular word-frequency distribution of scientific documents affects the topic modeling effect.
Statistical analysis of scholars' academic documents shows that their word-frequency information follows a power-function distribution. For example, FIG. 2 plots the first 2000 high-frequency words of the academic documents, where the abscissa is the rank of each word after sorting by descending frequency and the ordinate is its frequency.
Statistics show that the top 10% of words by frequency account for 81.1% of the whole academic-document word set, consistent with a Zipf distribution, and word-frequency studies find that the words most representative of a topic are usually neither extremely high-frequency nor extremely low-frequency words but mid-to-high-frequency words. Extracting documents directly with the LDA model therefore loses some mid-frequency words; meanwhile, feature words of higher frequency often appear in pairs, and high-frequency words have a higher probability of being assigned to topics, so the discrimination between topics is low. Although stop-word filtering is performed during the S1 data preprocessing, it cannot filter completely.
Therefore, the invention proposes a word-frequency-weighted LDA topic model: the word-frequency information in the documents is first counted, the word-frequency characteristics are introduced into the Gibbs sampling process to reduce the influence of high-frequency words and raise the influence of mid-frequency feature words, and an FLDA model is constructed so that the model does not over-emphasize high-frequency feature words. The FLDA model is as follows:
The LDA model obtains the sampling parameters φ and θ through Gibbs sampling, which constructs a converged Markov chain from which suitable samples are drawn.
LDA assigns topics to phrases by sampling z_i, whose posterior is:
P(z_i = j | z_{-i}, w) ∝ P(w_i | z_i = j, z_{-i}, w_{-i}) · P(z_i = j | z_{-i})
where z_i = j denotes assigning topic j to the current word w_i, z_{-i} denotes the topic assignments of all words other than the current one, and w_{-i} denotes the words at the non-current positions.
P(w | z) is known to be related only to φ, so φ can be integrated out. φ_j, the Gibbs sampling parameter corresponding to the current topic j, is the "topic-phrase" multinomial distribution:
w_i | z_i = j, φ_j ~ Multinomial(φ_j)
In addition, φ_j follows the Dirichlet prior Dirichlet(β), so integrating the posterior probability over φ gives:
P(w_i | z_i = j, z_{-i}, w_{-i}) = (F_{-i,j}^{(w_i)} + β) / (F_{-i,j}^{(·)} + Vβ)
where F_{-i,j}^{(w_i)} is the sum of the weights of the words that are assigned to topic j and identical to word w_i, F_{-i,j}^{(·)} is the sum of the weights of all words assigned to topic j, β is a parameter of the Dirichlet distribution, and V is the size of the lexicon.
Similarly, P(z) is related only to θ, so integrating over θ yields:
P(z_i = j | z_{-i}) = (F_{-i,j}^{(d_i)} + α) / (F_{-i}^{(d_i)} + Tα)
where F_{-i,j}^{(d_i)} is the sum of the word weights assigned to topic j in document d_i, F_{-i}^{(d_i)} is the sum of the word weights in document d_i, and T is the number of topics.
Combining the above formulas gives:
P(z_i = j | z_{-i}, w) ∝ (F_{-i,j}^{(w_i)} + β) / (F_{-i,j}^{(·)} + Vβ) · (F_{-i,j}^{(d_i)} + α) / (F_{-i}^{(d_i)} + Tα)
This is a non-normalized distribution of LDA; dividing out the sum of the probabilities of all "topic-phrase" assignments gives the recalculated posterior probability:
P(z_i = j | z_{-i}, w) = [(F_{-i,j}^{(w_i)} + β) / (F_{-i,j}^{(·)} + Vβ) · (F_{-i,j}^{(d_i)} + α) / (F_{-i}^{(d_i)} + Tα)] / Σ_{j'=1}^{T} [(F_{-i,j'}^{(w_i)} + β) / (F_{-i,j'}^{(·)} + Vβ) · (F_{-i,j'}^{(d_i)} + α) / (F_{-i}^{(d_i)} + Tα)]
where w_i is the i-th word, the F terms are the weighted counts defined above (with raw counts replaced by the word-frequency weights of the model), V is the size of the lexicon and T is the number of topics.
The word-frequency weighting formula of the model is:
C_i = 2 − |n_i − n_mid| / max{n_max − n_mid, n_mid − n_min}
where n_i is the current word's frequency, n_mid is the frequency of the selected mid-frequency word, n_max is the maximum in the word-frequency statistics, n_min is the minimum in the word-frequency statistics, and C_i is the weight of the current word, with value range [1, 2]. To ensure that the total amount of feature words after weighting is unchanged, the weight of each feature word is adjusted as:
F_i = C_i · n_i · (Σ_k n_k) / (Σ_k C_k n_k)
where F_i is the adjusted weight of the feature word, n_i is the count of the current word, and Σ_k C_k n_k is the weight sum over all words. Referring to FIG. 3, since the probability that a word w is assigned to a topic z at Gibbs initialization is random, the computed F_i replaces the random value initialized in the Gibbs sampling process; on this basis, the loop is computed until convergence and the parameters φ and θ are obtained.
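A compact sketch of the weighted sampling step, assuming the count matrices nw ("topic-phrase") and nd ("document-topic") accumulate the adjusted weights F_i instead of raw counts; the exact form of freq_weight is the reconstruction used above and is an assumption, since the text fixes only the variables and the [1, 2] range:

```python
import numpy as np

def freq_weight(n_i, n_mid, n_max, n_min):
    # Assumed form of C_i: peaks at 2 for the chosen mid-frequency word and
    # falls to 1 at the frequency extreme farthest from n_mid.
    span = max(n_max - n_mid, n_mid - n_min) or 1
    return 2.0 - abs(n_i - n_mid) / span

def sample_topic(w, d, nw, nd, nwsum, ndsum, alpha, beta, rng):
    """One collapsed-Gibbs draw of P(z_i = j | z_-i, w) for word w in doc d.
    nw: V x T weighted "topic-phrase" counts; nd: D x T weighted doc-topic counts;
    nwsum: per-topic weight totals; ndsum: per-document weight totals."""
    V, T = nw.shape
    p = ((nw[w, :] + beta) / (nwsum + V * beta) *       # "topic-phrase" term
         (nd[d, :] + alpha) / (ndsum[d] + T * alpha))   # "document-topic" term
    p /= p.sum()                                        # normalize over all topics
    return rng.choice(T, p=p)

rng = np.random.default_rng(0)   # usage: pass into sample_topic per iteration
```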
Word vector characterization by the word2vec method
Word2vec is trained through deep learning on a million-word dictionary and billions of training tokens; the training result is a word vector model, and the word vectors effectively express the semantic information of words in vector space. The vector training model is a shallow neural network, CBOW or Skip-gram; the CBOW model is shown in FIG. 4.
The CBOW model predicts the current word from its context. During training, an N-dimensional word vector is initialized for every word; the model sums the context words within the input window, constructs a Huffman tree according to word frequency to obtain a Huffman path, computes the probability of the leaf node along that path, and then adjusts the parameters of the non-leaf nodes and the context word vectors by gradient descent, converging to the true result after multiple iterations.
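A minimal gensim (4.x API) sketch of CBOW training, where sg=0 selects CBOW and hs=1 enables the Huffman-tree hierarchical softmax described above; the toy corpus and hyperparameters are illustrative, and the experiments below actually use a pre-trained Tencent AI Lab vector model:

```python
from gensim.models import Word2Vec

sentences = [["主题", "模型", "领域", "标签"],   # tokenized documents (toy data)
             ["学者", "学术", "文献", "主题"]]
model = Word2Vec(sentences, vector_size=200, window=5, min_count=1,
                 sg=0, hs=1, epochs=10)          # CBOW + hierarchical softmax
vec = model.wv["主题"]                            # N-dimensional word vector
print(vec.shape)
```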
Weight assignment
Academic papers are generally divided into title, abstract, keywords, body and other fields. By past experience, the title usually carries the central idea of the full text and is an important summary of its content, so the final weight of words in the title is increased. The keyword field is fairly representative of the overall subject matter, and the abstract is regarded as a brief summary of the full-text content.
Preferably, in the weight assignment of the invention, the weight of the title is set to 4, the weight of the keywords to 3, and the weight of the abstract to 2.
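As a trivial sketch, these position weights can be kept in a lookup used later by the taxonomy mapping; the default value for other fields is an assumption:

```python
# Position coefficients L_A as stated above; names are illustrative.
POSITION_WEIGHT = {"title": 4, "keyword": 3, "abstract": 2}

def position_weight(field):
    return POSITION_WEIGHT.get(field, 1)   # default for body/other fields assumed
```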
Selection of FLDA model parameters
In the invention, 20 is selected as the optimal number of topics.
S3, domain taxonomy mapping
Because the phrases obtained from the academic documents of different scholars through the topic model differ and cannot be managed uniformly, an academic domain taxonomy is introduced to realize a uniform measure across scholars.
The taxonomy is established with reference to the domain taxonomy of the National Natural Science Foundation of China and can cover the research scope of each field to the greatest extent.
The invention maps the topic-model results into the domain taxonomy, with the following mapping formula:
F(A, B) = sim(A, B) · C_A · L_A
where A is a phrase obtained from the topic model and B is a taxonomy word; their corresponding word vectors are obtained from the vector model, and for out-of-vocabulary words the word vector is spliced from the vectors of the constituent words. sim(A, B) is the computed cosine similarity, C_A is the probability assigned by the topic model, L_A is the position coefficient of the phrase in the document, with value range {2, 3, 4}, and F(A, B) is the weighted similarity. C_B, the final score of a taxonomy word B, is accumulated from the weighted similarities F(A, B) of the phrases mapped to it.
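A sketch of this mapping under the stated weights. The vec_of lookup, and accumulating F(A, B) into C_B by summation, are assumptions; the original's "splicing" of constituent-word vectors for out-of-vocabulary phrases would live inside vec_of:

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def taxonomy_scores(phrases, taxonomy, vec_of):
    """phrases: list of (phrase, C_A, L_A) triples from the topic model;
    taxonomy: list of taxonomy words; vec_of: text -> vector (or None if OOV)."""
    scores = {b: 0.0 for b in taxonomy}
    for a, c_a, l_a in phrases:
        va = vec_of(a)
        if va is None:
            continue
        for b in taxonomy:
            vb = vec_of(b)
            if vb is not None:
                scores[b] += cosine(va, vb) * c_a * l_a   # F(A,B) = sim * C_A * L_A
    return scores

# S4 usage: take the four highest-scoring taxonomy words as the scholar's labels.
# top4 = sorted(scores, key=scores.get, reverse=True)[:4]
```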
S4, comprehensive ranking
The final score C_B is obtained through the mapping formula; all taxonomy words corresponding to the current scholar are sorted from high to low by the score C_B, and the four taxonomy words with the highest scores are taken as the domain label words most representative of the scholar's research field.
By adopting the above technical solution, the invention achieves the following technical effects.
Compared with the traditional LDA model, the statistics-based TFIDF algorithm and the network-graph-based TextRank algorithm, the FLDA model produces better label words with higher accuracy, and the topic-model-based label extraction method has good applicability in the academic field.
Drawings
FIG. 1 is a schematic framework diagram of the topic-model-based domain label acquisition method of the present invention;
FIG. 2 is a document word-frequency distribution graph;
FIG. 3 is a Gibbs sampling flow chart;
FIG. 4 is a schematic diagram of the CBOW model;
FIG. 5 is a graph of perplexity versus the number of topics.
Detailed Description
in order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below with reference to the embodiments of the present invention and the accompanying drawings. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Embodiment:
A method for obtaining a domain label based on a topic model comprises the following steps:
S1, data preprocessing
Acquiring an initial data set, specifically by the following method:
S11, data deduplication
In this embodiment, a cleaning model is constructed using word co-occurrence, author co-occurrence and keyword coincidence rate:
For two texts to be compared, first judge whether their DOIs are identical; texts with identical DOIs are filtered directly;
For DOIs which differ or are absent, first judge the titles using word co-occurrence; if the co-occurrence degree exceeds 80%, further judge the author co-occurrence count and the keyword coincidence rate; if the author co-occurrence count is greater than 1 and the keyword coincidence rate is greater than 0.5, the pair is judged duplicate and removed.
The co-occurrence formula is as follows:
co(A, B) = len(A ∩ B) / min{len(A), len(B)}
where A and B are the word sets of the two titles, len(A) is the length of the word set of title A, len(B) is the length of the word set of title B, len(A ∩ B) is the size of the intersection of the two title word sets, and min{len(A), len(B)} is the smaller of the two lengths.
S12, word segmentation
In the word segmentation stage, the paper keyword data are extracted and added to the user dictionary of the word segmentation tool; at the same time, key phrases are extracted with TextRank before calculation and likewise added to the user dictionary of the word segmentation tool.
In addition, the whole corpus is sorted by word frequency, high-frequency irrelevant words are screened manually, and these irrelevant words are added to the stop-word list of the word segmentation tool.
S2, keyword extraction
Extracting "topic-phrase" pairs through FLDA, assigning a weight to each phrase according to the position where it appears in the text, and characterizing the phrase as a vector using word2vec.
The "topic-phrase" extraction is performed through FLDA:
The LDA model obtains the sampling parameters φ and θ through Gibbs sampling, which constructs a converged Markov chain from which suitable samples are drawn.
LDA assigns topics to phrases by sampling z_i, whose posterior is:
P(z_i = j | z_{-i}, w) ∝ P(w_i | z_i = j, z_{-i}, w_{-i}) · P(z_i = j | z_{-i})
where z_i = j denotes assigning topic j to the current word w_i, z_{-i} denotes the topic assignments of all words other than the current one, and w_{-i} denotes the words at the non-current positions.
P(w | z) is known to be related only to φ, so φ can be integrated out. φ_j, the Gibbs sampling parameter corresponding to the current topic j, is the "topic-phrase" multinomial distribution:
w_i | z_i = j, φ_j ~ Multinomial(φ_j)
In addition, φ_j follows the Dirichlet prior Dirichlet(β), so integrating the posterior probability over φ gives:
P(w_i | z_i = j, z_{-i}, w_{-i}) = (F_{-i,j}^{(w_i)} + β) / (F_{-i,j}^{(·)} + Vβ)
where F_{-i,j}^{(w_i)} is the sum of the weights of the words that are assigned to topic j and identical to word w_i, F_{-i,j}^{(·)} is the sum of the weights of all words assigned to topic j, β is a parameter of the Dirichlet distribution, and V is the size of the lexicon.
Similarly, P(z) is related only to θ, so integrating over θ yields:
P(z_i = j | z_{-i}) = (F_{-i,j}^{(d_i)} + α) / (F_{-i}^{(d_i)} + Tα)
where F_{-i,j}^{(d_i)} is the sum of the word weights assigned to topic j in document d_i, F_{-i}^{(d_i)} is the sum of the word weights in document d_i, and T is the number of topics.
Combining the above formulas gives:
P(z_i = j | z_{-i}, w) ∝ (F_{-i,j}^{(w_i)} + β) / (F_{-i,j}^{(·)} + Vβ) · (F_{-i,j}^{(d_i)} + α) / (F_{-i}^{(d_i)} + Tα)
This is a non-normalized distribution of LDA; dividing out the sum of the probabilities of all "topic-phrase" assignments gives the recalculated posterior probability:
P(z_i = j | z_{-i}, w) = [(F_{-i,j}^{(w_i)} + β) / (F_{-i,j}^{(·)} + Vβ) · (F_{-i,j}^{(d_i)} + α) / (F_{-i}^{(d_i)} + Tα)] / Σ_{j'=1}^{T} [(F_{-i,j'}^{(w_i)} + β) / (F_{-i,j'}^{(·)} + Vβ) · (F_{-i,j'}^{(d_i)} + α) / (F_{-i}^{(d_i)} + Tα)]
where w_i is the i-th word, the F terms are the weighted counts defined above (with raw counts replaced by the word-frequency weights of the model), V is the size of the lexicon and T is the number of topics.
The word-frequency weighting formula of the model is:
C_i = 2 − |n_i − n_mid| / max{n_max − n_mid, n_mid − n_min}
where n_i is the current word's frequency, n_mid is the frequency of the selected mid-frequency word, n_max is the maximum in the word-frequency statistics, n_min is the minimum in the word-frequency statistics, and C_i is the weight of the current word, with value range [1, 2]. To ensure that the total amount of feature words after weighting is unchanged, the weight of each feature word is adjusted as:
F_i = C_i · n_i · (Σ_k n_k) / (Σ_k C_k n_k)
where F_i is the adjusted weight of the feature word, n_i is the count of the current word, and Σ_k C_k n_k is the weight sum over all words. Referring to FIG. 3, since the probability that a word w is assigned to a topic z at Gibbs initialization is random, the computed F_i replaces the random value initialized in the Gibbs sampling process; on this basis, the loop is computed until convergence and the parameters φ and θ are obtained.
Word vector characterization by the word2vec method
Word2vec is trained through deep learning on a million-word dictionary and billions of training tokens; the training result is a word vector model, and the word vectors effectively express the semantic information of words in vector space.
Specifically, the CBOW model is adopted for training the vectors. The CBOW model predicts the current word from its context. During training, an N-dimensional word vector is initialized for every word; the model sums the context words within the input window, constructs a Huffman tree according to word frequency to obtain a Huffman path, computes the probability of the leaf node along that path, and then adjusts the parameters of the non-leaf nodes and the context word vectors by gradient descent, converging to the true result after multiple iterations.
Weight assignment
In this embodiment, the weight of the title is set to 4, the weight of the keywords to 3, and the weight of the abstract to 2.
As a preferred scheme, 20 is selected as the optimal number of topics in this embodiment.
S3, domain taxonomy mapping
Mapping the "topic-phrase" pairs onto the taxonomy to realize unified management of scholars' fields;
In this embodiment, the topic-model results are mapped into the domain taxonomy, with the following mapping formula:
F(A, B) = sim(A, B) · C_A · L_A
where A is a phrase obtained from the topic model and B is a taxonomy word; their corresponding word vectors are obtained from the vector model, and for out-of-vocabulary words the word vector is spliced from the vectors of the constituent words. sim(A, B) is the computed cosine similarity, C_A is the probability assigned by the topic model, L_A is the position coefficient of the phrase in the document, with value range {2, 3, 4}, and F(A, B) is the weighted similarity. C_B is the final score of the taxonomy word.
S4, comprehensive ranking
The vector characterization results and the weight assignments are weighted and ranked to obtain the label words most representative of the scholar.
Specifically, the final score C_B is obtained through the mapping formula; all taxonomy words corresponding to the current scholar are sorted from high to low by score, and the four taxonomy words with the highest scores are taken as the domain label words most representative of the scholar's research field.
Experimental example:
In order to obtain experimental data that are as authentic as possible, a web crawler was used to crawl paper data from CNKI and the Wanfang database; the data were segmented with jieba, and vector representation used the word vector model published by Tencent AI Lab. This experimental example is presented in four parts: evaluation criteria, data preprocessing, selection of the FLDA topic-number parameter, and evaluation of the topic-model-based label algorithm.
Evaluation criteria:
Because the LDA topic model is an unsupervised model, there is no very intuitive evaluation standard for measuring model quality. In this experimental example, the "topic-phrase" matrix of the topic model is selected for evaluation, and perplexity is introduced as the evaluation criterion of the model; in general, the lower the perplexity, the better the model. The perplexity is calculated as:
perplexity(D) = exp( − Σ_{d=1}^{M} log p(w_d) / Σ_{d=1}^{M} N_d ),  with  p(w) = Σ_z p(z | d) · p(w | z)
where perplexity(D) is the perplexity of the current model, d is an academic document, M is the number of academic documents, Σ N_d is the total number of words in the current corpus, p(w) is the probability of word w appearing in the matrix, p(z | d) is the probability of topic z for academic document d, and p(w | z) is the probability of word w under topic z. Perplexity measures how well the results predicted by the topic model conform to the original sample information.
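A direct implementation sketch of this formula, assuming documents are given as lists of word indices and the model's distributions as dense matrices (all names are illustrative):

```python
import numpy as np

def perplexity(docs, p_z_given_d, p_w_given_z):
    """docs: list of documents, each a list of word indices;
    p_z_given_d: M x T document-topic matrix; p_w_given_z: T x V topic-word matrix."""
    log_prob, n_words = 0.0, 0
    for d, doc in enumerate(docs):
        for w in doc:
            p_w = np.dot(p_z_given_d[d], p_w_given_z[:, w])  # sum_z p(z|d) p(w|z)
            log_prob += np.log(p_w)
            n_words += 1
    return np.exp(-log_prob / n_words)   # lower is better
```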
In calculating label accuracy, the F1 value is used to measure the accuracy of the scholars' labels: several scholars are randomly selected, and for each scholar the 4 most suitable labels are manually selected from the domain taxonomy, combining knowledge of the scholar, as the correct labels. The top four labels produced by the algorithm's ranking are evaluated against the correct labels with the following indexes, and finally the average F1 value is computed:
P_i = |h_i ∩ m_i| / |m_i|   (15)
R_i = |h_i ∩ m_i| / |h_i|   (16)
F1 = (1/N) Σ_{i=1}^{N} 2 · P_i · R_i / (P_i + R_i)   (17)
where h_i denotes the set of standard labels, m_i denotes the set of labels obtained by the algorithm, |h_i ∩ m_i| is the number of correct labels obtained by the algorithm, and N is the total number of samples.
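A sketch of the averaged F1 computation over N scholars following equations (15)-(17); treating a zero-overlap scholar as contributing F1 = 0 is an assumption:

```python
def average_f1(standard_labels, predicted_labels):
    """standard_labels, predicted_labels: parallel lists of label sets per scholar."""
    f1_sum = 0.0
    for h, m in zip(standard_labels, predicted_labels):
        h, m = set(h), set(m)
        correct = len(h & m)                 # |h_i ∩ m_i|
        if correct == 0:
            continue                         # zero-overlap contributes 0 (assumed)
        p, r = correct / len(m), correct / len(h)
        f1_sum += 2 * p * r / (p + r)
    return f1_sum / len(standard_labels)     # divide by N, the sample count
```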
Data pre-processing
In order to eliminate the influence of duplicate data from multi-source crawling on the calculation results, the data are deduplicated in the preprocessing stage. A cleaning model is constructed using word co-occurrence, author co-occurrence and keyword coincidence rate: it first judges whether the DOIs of the two texts to be compared are identical and directly filters identical DOIs; for DOIs which differ or are absent, it first judges the titles using word co-occurrence, and if the co-occurrence degree exceeds 80% it further judges the author co-occurrence count and the keyword coincidence rate; if the author co-occurrence count is greater than 1 and the keyword coincidence rate is greater than 0.5, the pair is judged duplicate and removed.
In the word segmentation stage, the massive paper keyword data are first extracted and added to the user dictionary of the word segmentation tool; at the same time, key phrases are extracted with TextRank before calculation and added to the user dictionary of the word segmentation tool. In addition, the whole corpus is sorted by word frequency, high-frequency irrelevant words are screened manually, and the irrelevant words are added to the stop-word list of the word segmentation tool.
Selection of FLDA model parameters
The number of topics is an important parameter affecting the topic clustering result: if it is set too small, the clustering result lacks discrimination, and if it is set too large, the current document is wrongly divided among other topics. The experiment varies only the number of topics while keeping the other parameters unchanged; the results are shown in FIG. 5.
The LDA curve is the perplexity curve of the LDA topic model, and the FLDA curve is the perplexity curve of the word-frequency-weighted LDA topic model. In the graph, the abscissa is the number of topics and the ordinate is the perplexity. Each configuration was run three times with the same parameters and the results averaged.
From the experimental results, the models' perplexities decline as the number of topics increases, and the decline slows and even converges at around 20 topics, so 20 is selected as the optimal number of topics.
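As a baseline, this topic-number sweep can be reproduced with a standard LDA implementation; the gensim sketch below runs on toy data (gensim implements plain LDA only, so the FLDA word-frequency weighting would require the custom sampler sketched earlier):

```python
from gensim import corpora
from gensim.models import LdaModel

texts = [["主题", "模型", "标签"], ["学术", "文献", "主题", "模型"]]  # toy tokenized docs
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

for k in range(5, 30, 5):
    lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=k,
                   passes=5, random_state=0)
    # log_perplexity returns a per-word log-2 likelihood bound;
    # perplexity = 2 ** (-bound), lower is better.
    print(k, 2 ** (-lda.log_perplexity(corpus)))
```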
Evaluation of the label algorithms
A domain taxonomy is introduced to obtain the scholars' academic fields. The taxonomy is established with reference to the domain taxonomy of the National Natural Science Foundation of China and appropriately modified on that basis; an example is shown in Table 1.
Table 1. Domain taxonomy example
Because the academic-field label extraction task lacks a relatively authoritative data set, to verify the effectiveness of the algorithm the experiments use the academic paper data of 12 scholars; the TFIDF, TextRank, LDA and FLDA algorithms are each used to acquire the scholars' academic labels for comparison, and specific examples are shown in Table 2.
For the evaluation, suitable taxonomy words are manually selected as the standard answers for a scholar's domain labels according to information such as the scholar's homepage introduction; the label words obtained by an algorithm are called its current answers, and the current answers are evaluated against the standard answers.
Table 2. Comparison of the labels extracted by each algorithm
Table 3. Comparison of the algorithms' F1 values
As can be seen from Table 2, the labels obtained by the FLDA algorithm coincide more closely with the standard answers; the FLDA algorithm performs better than LDA, which in turn performs better than the TFIDF algorithm.
The data in Table 3 are F1 values calculated with reference to equations (15), (16) and (17), where 4-2 means that the 4 highest-scoring labels obtained by the algorithm are evaluated against two standard answers, 4-3 means that the 4 highest-scoring labels are evaluated against three standard answers, and 4-4 means evaluation against four standard answers. The analysis shows that the F1 value of FLDA is higher than those of the traditional LDA algorithm, the statistics-based TFIDF algorithm and the network-graph-based TextRank algorithm under all prediction conditions. This shows that the word-frequency-weighted FLDA model can analyze document content and the relations between documents at the chapter level, reduces the dimensionality of the academic data, is better suited to processing academic documents of a certain order of magnitude, and benefits subsequent label mapping and calculation. The model reflects scholars' research directions to a certain extent, helping users understand a scholar comprehensively and conveniently while saving time and effort, and it indirectly shows that the word-frequency-weighted FLDA algorithm extracts the key information in academic text better than the traditional algorithms.
The technical solutions provided by the present invention are not limited to the above embodiments; all technical solutions formed by transforming or substituting the structures and modes of the present invention fall within the protection scope of the present invention.

Claims (8)

1. A method for acquiring a domain label based on a topic model, characterized by comprising the following steps:
S1, data preprocessing
acquiring an initial data set;
S2, keyword extraction
extracting "topic-phrase" pairs through FLDA, assigning a weight to each phrase according to the position where it appears, and characterizing the phrase as a vector using word2vec;
S3, domain taxonomy mapping
mapping the "topic-phrase" pairs onto the taxonomy to realize unified management of scholars' fields;
S4, comprehensive ranking
weighting and ranking the vector characterization results and the weight assignments, and obtaining, via a threshold, the label words most representative of the scholar.
2. The method of claim 1, wherein the method comprises:
S1, data preprocessing comprises S11, data deduplication, and S12, word segmentation;
S11, data deduplication is performed to obtain a scholar document set;
a cleaning model is constructed using word co-occurrence, author co-occurrence and keyword coincidence rate:
for two texts to be compared, first judging whether their DOIs are identical, and directly filtering texts with identical DOIs;
for DOIs which differ or are absent, first judging the titles using word co-occurrence; if the co-occurrence degree exceeds 80%, further judging the author co-occurrence count and the keyword coincidence rate; and if the author co-occurrence count is greater than 1 and the keyword coincidence rate is greater than 0.5, judging the pair duplicate and removing it;
the co-occurrence formula is as follows:
co(A, B) = len(A ∩ B) / min{len(A), len(B)}
wherein A and B are the word sets of the two titles, len(A) is the length of the word set of title A, len(B) is the length of the word set of title B, len(A ∩ B) is the size of the intersection of the two title word sets, and min{len(A), len(B)} is the smaller of the two lengths;
S12, word segmentation is performed to obtain the initial data set;
first extracting the paper keyword data and adding them to the user dictionary of the word segmentation tool, while extracting key phrases with TextRank before calculation and adding them to the user dictionary of the word segmentation tool as well;
and sorting the whole corpus by word frequency, manually screening high-frequency irrelevant words, and adding the irrelevant words to the stop-word list of the word segmentation tool.
3. The method of claim 1, wherein the method comprises: the "topic-phrase" extraction through FLDA comprises obtaining the sampling parameters φ and θ through Gibbs sampling;
the posterior formula of z_i is:
P(z_i = j | z_{-i}, w) ∝ P(w_i | z_i = j, z_{-i}, w_{-i}) · P(z_i = j | z_{-i}),
wherein z_i = j denotes assigning topic j to the current word w_i, z_{-i} denotes the topic assignments of all words other than the current one, and w_{-i} denotes the words at the non-current positions;
P(w | z) is known to be related only to φ, so φ is integrated out; φ_j, the Gibbs sampling parameter corresponding to the current topic j, is the "topic-phrase" multinomial distribution, and its prior is the Dirichlet distribution, so integrating the posterior probability yields:
P(w_i | z_i = j, z_{-i}, w_{-i}) = (F_{-i,j}^{(w_i)} + β) / (F_{-i,j}^{(·)} + Vβ),
wherein F_{-i,j}^{(w_i)} is the sum of the weights of the words assigned to topic j and identical to word w_i, F_{-i,j}^{(·)} is the sum of the weights of all words assigned to topic j, β is a parameter of the Dirichlet distribution, and V is the size of the lexicon;
similarly, P(z) is related only to θ, so integrating over θ yields:
P(z_i = j | z_{-i}) = (F_{-i,j}^{(d_i)} + α) / (F_{-i}^{(d_i)} + Tα),
wherein F_{-i,j}^{(d_i)} is the sum of the word weights assigned to topic j in document d_i, F_{-i}^{(d_i)} is the sum of the word weights in document d_i, and T is the number of topics;
combining with the formula P(z_i = j | z_{-i}, w) ∝ P(w_i | z_i = j, z_{-i}, w_{-i}) P(z_i = j | z_{-i}) gives the non-normalized distribution of LDA, and dividing out the sum of the probabilities of all "topic-phrase" assignments gives:
P(z_i = j | z_{-i}, w) = [(F_{-i,j}^{(w_i)} + β) / (F_{-i,j}^{(·)} + Vβ) · (F_{-i,j}^{(d_i)} + α) / (F_{-i}^{(d_i)} + Tα)] / Σ_{j'=1}^{T} [(F_{-i,j'}^{(w_i)} + β) / (F_{-i,j'}^{(·)} + Vβ) · (F_{-i,j'}^{(d_i)} + α) / (F_{-i}^{(d_i)} + Tα)],
wherein w_i is the i-th word, the F terms are the weighted counts defined above, V is the size of the lexicon, T is the number of topics, and P(z_i = j | z_{-i}, w) is the recalculated posterior probability;
the word-frequency weighting formula of the model is:
C_i = 2 − |n_i − n_mid| / max{n_max − n_mid, n_mid − n_min},
wherein n_i is the current word frequency, n_mid is the word frequency of the selected mid-frequency word, n_max is the maximum in the word-frequency statistics, n_min is the minimum in the word-frequency statistics, and C_i is the weight of the current word, with value range [1, 2]; to ensure that the total amount of feature words after weighting is unchanged, the weight of each feature word is adjusted as F_i = C_i · n_i · (Σ_k n_k) / (Σ_k C_k n_k), wherein F_i is the adjusted weight of the feature word, n_i is the count of the current word, and Σ_k C_k n_k is the weight sum over all words;
the computed F_i replaces the random value initialized in the Gibbs sampling process, and on this basis the loop is computed until convergence to obtain the parameters φ and θ.
4. The method of claim 3, wherein the method comprises: the vector characterization method is a word2vec method.
5. The method of claim 3, wherein the method comprises: the title weight is set to 4, the keyword weight to 3, and the abstract weight to 2.
6. The method for acquiring a domain label based on a topic model according to any one of claims 3 to 5, wherein: the number of topics of the FLDA model is 20.
7. The method of claim 1, wherein the method comprises: in S3, the mapping formula of the domain taxonomy mapping is as follows:
F(A, B) = sim(A, B) · C_A · L_A
wherein A is a phrase obtained from the topic model and B is a taxonomy word, their corresponding word vectors are obtained from the vector model, and for out-of-vocabulary words the word vector is spliced from the vectors of the constituent words; sim(A, B) is the computed cosine similarity, C_A is the probability assigned by the topic model, L_A is the position coefficient of the phrase in the document, with value range {2, 3, 4}, F(A, B) is the weighted similarity, and C_B is the final score of the taxonomy word.
8. The method of claim 1, wherein the method comprises: in S4, comprehensive ranking, all taxonomy words corresponding to the current scholar are sorted from high to low by the score C_B, and the several highest-scoring taxonomy words are taken as the domain label words most representative of the scholar's research field.
CN201910784200.3A 2019-08-23 2019-08-23 Domain label acquisition method based on topic model Active CN110543564B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910784200.3A CN110543564B (en) 2019-08-23 2019-08-23 Domain label acquisition method based on topic model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910784200.3A CN110543564B (en) 2019-08-23 2019-08-23 Domain label acquisition method based on topic model

Publications (2)

Publication Number Publication Date
CN110543564A true CN110543564A (en) 2019-12-06
CN110543564B CN110543564B (en) 2023-06-20

Family

ID=68712039

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910784200.3A Active CN110543564B (en) 2019-08-23 2019-08-23 Domain label acquisition method based on topic model

Country Status (1)

Country Link
CN (1) CN110543564B (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111241283A (en) * 2020-01-15 2020-06-05 电子科技大学 Rapid characterization method for portrait of scientific research student
CN111831804A (en) * 2020-06-29 2020-10-27 深圳价值在线信息科技股份有限公司 Key phrase extraction method and device, terminal equipment and storage medium
CN112446204A (en) * 2020-12-07 2021-03-05 北京明略软件系统有限公司 Document tag determination method, system and computer equipment
CN112508376A (en) * 2020-11-30 2021-03-16 中国科学院深圳先进技术研究院 Index system construction method
CN112883148A (en) * 2021-01-15 2021-06-01 上海柏观数据科技有限公司 Subject talent evaluation control method and device based on research trend matching
CN113190672A (en) * 2021-05-12 2021-07-30 上海热血网络科技有限公司 Advertisement judgment model, advertisement filtering method and system
CN113298399A (en) * 2021-05-31 2021-08-24 西南大学 Scientific research project analysis method based on big data
CN114492425A (en) * 2021-12-30 2022-05-13 中科大数据研究院 Method for communicating multi-dimensional data by adopting one set of field label system

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150193535A1 (en) * 2014-01-07 2015-07-09 International Business Machines Corporation Identifying influencers for topics in social media
CN105740342A (en) * 2016-01-22 2016-07-06 天津中科智能识别产业技术研究院有限公司 Social relation topic model based social network friend recommendation method
CN109766544A (en) * 2018-12-24 2019-05-17 中国科学院合肥物质科学研究院 Document keyword abstraction method and device based on LDA and term vector

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150193535A1 (en) * 2014-01-07 2015-07-09 International Business Machines Corporation Identifying influencers for topics in social media
CN105740342A (en) * 2016-01-22 2016-07-06 天津中科智能识别产业技术研究院有限公司 Social relation topic model based social network friend recommendation method
CN109766544A (en) * 2018-12-24 2019-05-17 中国科学院合肥物质科学研究院 Document keyword abstraction method and device based on LDA and term vector

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
XIMING LI et al.: "Supervised topic models for multi-label classification", NEUROCOMPUTING *
李熙铭: "Research on several problems of multi-label text classification and streaming-text data modeling based on topic models", China Doctoral Dissertations Full-text Database *
王胜 et al.: "Domain label acquisition method based on SL-LDA", Computer Science (计算机科学) *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111241283A (en) * 2020-01-15 2020-06-05 电子科技大学 Rapid characterization method for portrait of scientific research student
CN111831804A (en) * 2020-06-29 2020-10-27 深圳价值在线信息科技股份有限公司 Key phrase extraction method and device, terminal equipment and storage medium
CN111831804B (en) * 2020-06-29 2024-04-26 深圳价值在线信息科技股份有限公司 Method and device for extracting key phrase, terminal equipment and storage medium
CN112508376A (en) * 2020-11-30 2021-03-16 中国科学院深圳先进技术研究院 Index system construction method
CN112446204A (en) * 2020-12-07 2021-03-05 北京明略软件系统有限公司 Document tag determination method, system and computer equipment
CN112883148A (en) * 2021-01-15 2021-06-01 上海柏观数据科技有限公司 Subject talent evaluation control method and device based on research trend matching
CN113190672A (en) * 2021-05-12 2021-07-30 上海热血网络科技有限公司 Advertisement judgment model, advertisement filtering method and system
CN113298399A (en) * 2021-05-31 2021-08-24 西南大学 Scientific research project analysis method based on big data
CN114492425A (en) * 2021-12-30 2022-05-13 中科大数据研究院 Method for communicating multi-dimensional data by adopting one set of field label system

Also Published As

Publication number Publication date
CN110543564B (en) 2023-06-20

Similar Documents

Publication Publication Date Title
CN110543564A (en) Method for acquiring domain label based on topic model
CN109492157B (en) News recommendation method and theme characterization method based on RNN and attention mechanism
Buber et al. Web page classification using RNN
CN111414461B (en) Intelligent question-answering method and system fusing knowledge base and user modeling
CN113011533A (en) Text classification method and device, computer equipment and storage medium
CN108090070B (en) Chinese entity attribute extraction method
CN108132927B (en) Keyword extraction method for combining graph structure and node association
CN111125349A (en) Graph model text abstract generation method based on word frequency and semantics
CN109408743B (en) Text link embedding method
CN109885675B (en) Text subtopic discovery method based on improved LDA
CN112989802B (en) Bullet screen keyword extraction method, bullet screen keyword extraction device, bullet screen keyword extraction equipment and bullet screen keyword extraction medium
CN109670014A (en) A kind of Authors of Science Articles name disambiguation method of rule-based matching and machine learning
Rezaei et al. Multi-document extractive text summarization via deep learning approach
CN112559684A (en) Keyword extraction and information retrieval method
CN114706972B (en) Automatic generation method of unsupervised scientific and technological information abstract based on multi-sentence compression
CN110705247A (en) Based on x2-C text similarity calculation method
CN115952292B (en) Multi-label classification method, apparatus and computer readable medium
CN116304063B (en) Simple emotion knowledge enhancement prompt tuning aspect-level emotion classification method
CN111581364B (en) Chinese intelligent question-answer short text similarity calculation method oriented to medical field
Chang et al. A METHOD OF FINE-GRAINED SHORT TEXT SENTIMENT ANALYSIS BASED ON MACHINE LEARNING.
CN112686025A (en) Chinese choice question interference item generation method based on free text
CN109614490A (en) Money article proneness analysis method based on LSTM
CN113032573B (en) Large-scale text classification method and system combining topic semantics and TF-IDF algorithm
CN111563361B (en) Text label extraction method and device and storage medium
CN112084312A (en) Intelligent customer service system constructed based on knowledge graph

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant