CN110543564B - Domain label acquisition method based on topic model - Google Patents

Domain label acquisition method based on topic model Download PDF

Info

Publication number
CN110543564B
CN110543564B CN201910784200.3A CN201910784200A CN110543564B CN 110543564 B CN110543564 B CN 110543564B CN 201910784200 A CN201910784200 A CN 201910784200A CN 110543564 B CN110543564 B CN 110543564B
Authority
CN
China
Prior art keywords
word
topic
model
words
phrase
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910784200.3A
Other languages
Chinese (zh)
Other versions
CN110543564A (en
Inventor
黄改娟
王胜
张仰森
蒋玉茹
段瑞雪
张雯
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Information Science and Technology University
Original Assignee
Beijing Information Science and Technology University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Information Science and Technology University filed Critical Beijing Information Science and Technology University
Priority to CN201910784200.3A priority Critical patent/CN110543564B/en
Publication of CN110543564A publication Critical patent/CN110543564A/en
Application granted granted Critical
Publication of CN110543564B publication Critical patent/CN110543564B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/06Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063Operations research, analysis or management
    • G06Q10/0639Performance analysis of employees; Performance analysis of enterprise or organisation operations
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Human Resources & Organizations (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Development Economics (AREA)
  • Strategic Management (AREA)
  • Economics (AREA)
  • Educational Administration (AREA)
  • Marketing (AREA)
  • Tourism & Hospitality (AREA)
  • Quality & Reliability (AREA)
  • General Business, Economics & Management (AREA)
  • Operations Research (AREA)
  • Game Theory and Decision Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a field label acquisition method based on a topic model, which is characterized in that inherent characteristics of academic data are analyzed on the basis of massive academic data, an FLDA topic model is constructed by introducing academic word frequency characteristics, and academic documents of the same scholars are extracted by using the topic model. Secondly, introducing a domain system, carrying out vector characterization on the extraction result of the subject model and a system label, carrying out system mapping by using similarity after position weighting, and finally obtaining the domain label of the scholars. Experiments show that compared with the traditional LDA model, the TFIDF algorithm based on statistics and the textRank algorithm based on network diagrams, the FLDA model has better effect and higher accuracy of finally obtained tag words, and the tag extraction method based on the topic model has good applicability in the academic field.

Description

Domain label acquisition method based on topic model
Technical Field
The invention relates to a domain label acquisition method based on a topic model, in particular to a domain label acquisition method for a learner, and belongs to the technical field of information processing.
Background
The vigorous development of the economy and society promotes the continuous generation of various technological projects, and the projects need the participation of leading scholars from the stand, the review and the acceptance. In conventional experience, a learner often selects a study area by a person, and a learner corresponding to the project area is selected by manually counting the study area of each learner. However, the prior art methods often have the following drawbacks: the front scholars are required to participate in a large number of items in the same time, so that the workload of artificial selection is increased in an intangible way; the artificial selection is easily influenced by subjectivity and limitation of people, is easily influenced by own knowledge level, social relationship, personal preference, benefit and other factors in the whole selection process, and is not comprehensive in the field judgment of students, so that the accuracy of the selection result is influenced.
The prior art domain label acquisition is mainly divided into traditional domain label acquisition and domain label acquisition based on keywords.
In the traditional field, one method is to use the introduction of a learner at each platform of the internet to extract what is called a web label. The network labels are generally summarized and added by the inventor or other people, have no unified specification and are random in terms, so that the obtained labels are complex and various, and the usability is low. In addition, because the internet content is more random and has the writing characteristics of authors, correct data information and useless information are difficult to distinguish in the label extraction process, a specific extraction scheme is designed aiming at a specific platform and a specific scholars in the extraction process, and the workload is increased in an intangible way.
Another method is to design a domain management system for studying a learner in a P2P mode based on an ontology technology, and solve the problem of obtaining the domain of the learner by using an RDF technology, but the method uses a specific template, so that the expansibility of the method is insufficient.
The method is characterized in that a J2EE technology is utilized to realize a student information management system, basic information of a student, information of research fields and the like are updated manually, a consultation student recommendation module is provided, similarity between a user problem text and the research fields of the student is calculated by using Pearson similarity, and a student recommendation function is realized.
In terms of keyword-based domain label extraction, there are various extraction methods, and the common keyword extraction basis includes statistics, topics, network diagrams and the like. Keywords are a series of words with high generalization to articles, automatic keyword extraction is a technology for identifying representative words or phrases in texts, since the field is proposed, related researchers have sequentially proposed various methods, and generally classified into two categories of supervised and unsupervised, wherein the supervised method needs manual labeling of corpus, and is applicable to small texts, but with the increase of massive internet data, the manual labeling cost is higher and higher, and the method gradually shifts to an unsupervised method in recent years. The core idea of the statistical method is to extract the statistical information of words in the text, and the method does not need training data, and directly uses word frequency, position and the like to judge and screen. For example, one method is to correct TFIDF result weights using weighting factors and word contribution to promote keyword extraction in the subdivision domain; the method also comprises the steps of utilizing the N-element language model and the document weight to merge to realize automatic recognition of the domain of the learner, directly using word sequences to calculate the N-element language model, classifying the domain of the document without the operations of word segmentation, feature extraction and the like, and further finding out the domain label of the learner. The topic model is to realize keyword extraction by using probability distribution, and is currently mainly popular as an LDA (latent Di Ke Lee distribution) model. Extracting subject words by using static LDA and improved fLDA to extract historical interests of a user in order to study the behavior evolution process of the user; in order to reduce the problem of sparse data in microblog hot events, time sequence features and word frequency weighting features are introduced into an LDA algorithm, topic keywords obtained by the method have high interpretability, and content displayed by topics can be well indicated. The prior art further provides a topic clustering method based on LDA, the method firstly clusters keywords obtained by the LDA, and results are used for optimizing the results of topics obtained by the LDA, so that the accuracy and recall rate of clustering results can be effectively improved. In terms of network diagrams, textRank algorithm adapted based on PageRank is the most well known. The method also solves the problem of poor academic keyword extraction results, calculates the weight of candidate results in the academic field by using priori knowledge, and then comprehensively sorts the candidate keywords by combining with TextRank to finally obtain academic keywords with higher correlation degree; and constructing a probability offset matrix by using the word vector, improving the texttrank algorithm and improving the performance of the algorithm.
The unsupervised methods based on statistics, network diagrams and the like do not need to manually label the language materials in advance, but the unsupervised methods depend on the effect and scale of a corpus seriously, for example, the methods such as TFIDF and the like have simple structures, and extracted keywords lack information such as distribution conditions and semantics. Although the TextRank method can acquire the distribution information of the keywords, the network diagram construction requires a large amount of data to form edges, and the extracted keywords lack subject relevance. Despite the above drawbacks, the unsupervised approach still has advantages in terms of workload.
According to the method, an extraction method based on a theme is used, an academic document set is regarded as a corpus to be extracted, and an improved FLDA theme model is used for extracting the theme-phrase to obtain a theme distribution matrix, so that automatic acquisition of labels is realized.
Disclosure of Invention
In order to solve the problems in the prior art, the invention provides a field label acquisition method based on a topic model, which is characterized by analyzing inherent characteristics of academic data on the basis of massive academic data, introducing academic word frequency characteristics to construct an FLDA topic model, and extracting academic documents of the same scholars by using the topic model. Secondly, introducing a domain system, carrying out vector characterization on the extraction result of the subject model and a system label, carrying out system mapping by using similarity after position weighting, and finally obtaining the domain label of the scholars
In order to achieve the technical purpose, the invention adopts the following technical scheme.
Referring to fig. 1, a topic model-based domain label acquisition method includes the following steps:
s1, data preprocessing
Acquiring an initial data set;
s2, keyword extraction
Extracting a 'theme-phrase' through the FLDA, carrying out weight assignment on the phrase according to the position appearing therein, and carrying out vector characterization on the phrase by using word2 vec;
s3, mapping field system
Mapping the theme-phrase to a system to realize unified management of the domain of the scholars;
s4, comprehensive sequencing
And (5) weighting and sequencing the vector characterization result and the weight assignment result, and obtaining the tag word which can best represent the scholars through a threshold value.
According to the aforementioned method for acquiring the domain label based on the topic model, specifically, S1, data preprocessing includes S11, data deduplication processing, and S12, word segmentation.
Specifically, in order to eliminate the influence of repeated data caused by multi-data source crawling on the calculation result, in the step S1, data preprocessing is required to be performed in the step S11, and data deduplication processing is required to obtain a document set of a learner.
The invention uses word co-occurrence, author contribution and keyword coincidence rate to construct a cleaning model:
for two texts to be compared, firstly judging whether the DOI is the same or not, and directly filtering the same DOI;
and for the fact that the DOI is different or does not exist, firstly judging the title by using the word co-occurrence, if the co-occurrence degree is more than 80%, continuously judging the co-occurrence times of the authors and the co-occurrence times of the keywords, and if the co-occurrence times of the authors are more than 1 and the co-occurrence rate of the keywords is more than 0.5, repeating and clearing the judging result.
The co-occurrence formula is shown as follows:
Figure GDA0002235565440000031
wherein A, B is the word set of two titles, len (A) is the length of the word set of title A, len (B) is the length of the word set of title B, len (A.u.B) is the length of the intersection of the two sets of title words, and min { len (A), len (B) } is the minimum of the two lengths.
S12, word segmentation.
In the word segmentation stage, firstly, the paper keyword data are extracted and added into a user dictionary of a word segmentation tool, meanwhile, the TextRank is used for extracting key phrases before calculation, and the key phrases are also added into the user dictionary of the word segmentation tool.
In addition, the whole corpus data is ordered according to word frequency, high-frequency irrelevant words are manually screened, and the irrelevant words are added into a stop word list of the word segmentation tool.
According to the aforementioned method for obtaining a domain label based on a topic model, specifically, S2, in keyword extraction, extraction of a "topic-phrase" is performed by FLDA.
In the prior art, in a topic-based keyword extraction model, an LDA topic model is mainly used, and the model considers that a batch of documents contain a plurality of topics, and each topic can be approximated by a plurality of column phrases to represent the current topic. A document is formed by selecting a topic with a certain probability, then selecting phrases under the current topic with a certain probability, and repeating the process until a document is formed. The LDA topic extraction process is the reverse of the above process. LDA topic models are generally widely used in news fields, however in technical literature fields, topic modeling effects are affected by the distribution of special word frequencies of technical literature.
Through statistical analysis of academic documents of scholars, the frequency information of the academic documents is found to meet power function distribution, as shown in fig. 2, the top 2000 high-frequency words of the academic documents are shown, wherein the abscissa represents the corresponding sequence numbers after the high-frequency words are sorted according to the descending order of frequency, and the ordinate represents the frequency of the high-frequency words.
Statistics shows that the words with the top 10% of the frequency word ranks occupy 81.1% of the word set of all academic documents, accords with Zipf distribution, and research on word frequency finds that words which can most represent the subject are often not very high frequency words and very low frequency words, but medium and high frequency words with higher frequency. If the LDA model is directly used for extracting the document, the deletion of certain medium frequency words is caused, meanwhile, the characteristic words with higher word frequency usually appear in a combination way, and the probability that the high frequency words are distributed to the topics is relatively high, so that the distinction degree of each topic is not high, and in S1, the complete filtering can not be realized although the filtering of stop words is carried out in the data preprocessing process.
Therefore, the invention provides the word frequency weighted LDA topic model, firstly, word frequency information in a document is counted, word frequency characteristics are introduced into the Gibbs sampling process, the influence of high-frequency words is reduced, the influence of medium-frequency characteristic words is improved, and the FLDA model is constructed, so that the model is excessively heavier than the high-frequency characteristic word words. The FLDA model is as follows:
obtaining sampling parameters by using Gibbs sampling through LDA model
Figure GDA0002235565440000041
And θ, acquisition parameter->
Figure GDA0002235565440000042
And θ is to construct a converging markov chain from which appropriate samples are extracted.
The distribution process of LDA to phrase is z i Is a sample of (a). Wherein z is i The posterior formula of (2) is as follows:
P(z i =j|z -i ,w)→P(w i |z i =j,z -i ,w -i )P(z i =j|z -i )
z i =j is assigning topic j to the current word W i ,z -i To be allocated to non-z i Word weight sum, W of (2) -i Is a word that is not the current location.
P (w|z) is known to be the only
Figure GDA0002235565440000043
Correlation, thus by->
Figure GDA0002235565440000044
Integrating above to obtain the following formula:
Figure GDA0002235565440000045
Figure GDA0002235565440000046
for Gibbs sampling parameter, < >>
Figure GDA0002235565440000047
Gibbs sampling parameter corresponding to current topic j, < ->
Figure GDA0002235565440000048
In order to integrate the parameters of the device,
Figure GDA0002235565440000051
a polynomial distribution that is a "topic-phrase" follows the following formula:
Figure GDA0002235565440000052
in addition, in the case of the optical fiber,
Figure GDA0002235565440000053
at the same time (I)>
Figure GDA0002235565440000054
Is->
Figure GDA0002235565440000055
Is thus +.>
Figure GDA0002235565440000056
Integrating to obtain the following formula:
Figure GDA0002235565440000057
wherein ,
Figure GDA0002235565440000058
is the sum of the weights assigned to topic j and the same word as word w, ++>
Figure GDA0002235565440000059
For the sum of weights of all words assigned to topic j, β is the parameter of the Dirichlet distribution, v is the size of the word stock.
Similarly, P (z) is related to θ only, so the following formula can be obtained by integrating over θ:
Figure GDA00022355654400000510
Figure GDA00022355654400000511
representation d i The term weight sum assigned to the topic i, T is the topic number.
By the formula P (z i =j|z -i ,w)→P(w i |z i =j,z -i ,w -i )P(z i =j|z -i ),
Figure GDA00022355654400000512
Figure GDA00022355654400000513
The combination gives the following formula:
Figure GDA00022355654400000514
through the above calculation, a non-standard distribution of LDA is obtained, but the probability sum of all "topic-phrase" assignments needs to be removed, as shown in the following formula:
Figure GDA00022355654400000515
wherein ,wi Ith word, z i =j is assigning the current topic j to the current word w i ,z -i To be allocated to non-z i Is used for the word weight sum of (a),
Figure GDA00022355654400000516
representing the topic j and the word w i Equal weight sum, ++>
Figure GDA00022355654400000517
Representing document d i Weight sum, ++of words with subject i>
Figure GDA00022355654400000518
The sum of weights of words with topics in the current document is represented, V represents the word stock size, T represents the number of topics, and P (z) i =j|z -i ,w i ) Is the recalculated posterior probability.
The word frequency weighting formula of the model is as follows:
Figure GDA0002235565440000061
Figure GDA0002235565440000062
wherein ,ni Representing the current word frequency, n mid Representing word frequency of selecting intermediate frequency word, n max Represents the maximum value, n, in word frequency statistics min Representing the minimum value in word frequency statistics, C i The weight of the current word is represented, and the value range is [1,2 ]]To ensure that the number of the weighted total feature words is unchanged, the weight of each feature word needs to be adjusted, wherein F i For the adjusted weights of the feature words,
Figure GDA0002235565440000063
for the number of occurrences of the current word, +.>
Figure GDA0002235565440000064
Is the sum of the weights of all words. Referring to FIG. 3, because the probability that the word w is assigned to the topic z when Gibbs samples initial is random, the calculated Fi is replacedThe random value initialized in the Gibbs sampling process is replaced, and the calculation is circularly carried out on the basis of the random value until convergence is achieved, and the parameter +.>
Figure GDA0002235565440000065
and θ.
word2vec method for word vector characterization
Word2vec is trained on a million-level dictionary and billions-level training corpus through deep learning, a training result is a Word vector model, and Word vectors effectively express Word semantic information in space. The training model of the vector refers to a shallow neural network CBOW or Skip-gram model, where the CBOW model is shown in fig. 4.
The CBOW model is characterized in that a current word is predicted according to context, during training, an N-dimensional word vector is initialized for all words, the model accumulates the context words in an input window period, a Huffman path is obtained by constructing a Huffman tree according to word frequency, probability of leaf nodes is calculated according to the path, then parameters of non-leaf nodes and word vectors of the context are adjusted by adopting a gradient descent method, and the result is converged to a real result after multiple iterations.
Weight assignment
Since academic papers are generally classified into information such as titles, summaries, keywords, contents, etc. According to past experience, the title often contains a central idea of the full text and is an important summary of the full text content, so that the invention increases the final weight of words in the title. The keyword part also has certain representative capability on the whole subject matter, and the abstract part is considered to be a brief summary of the whole subject matter.
Preferably, the weight assignment of the present invention resets the title weight to 4, the keyword weight to 3, and the abstract weight to 2.
Selection of FLDA model parameters
In the invention, 20 is selected as the optimal theme number.
S3, mapping field system
Because the academic documents of the students have differences among phrases obtained through the topic model, unified management of the students cannot be performed, and therefore unified measurement of the students is achieved by introducing an academic field system.
The field system is formulated by referring to the national natural science foundation field system, and can cover the research scope of each field to the greatest extent.
The invention maps the result of the topic model into the domain system, and the mapping formula is as follows:
Figure GDA0002235565440000071
F (A,B) =sim(A,B)*C A *L A
Figure GDA0002235565440000072
wherein A is a phrase obtained by the topic model, B is a systematic word, a corresponding word vector is obtained by using the vector model, and word vectors are spliced into word vectors by using word vectors for non-registered words. sim (a, B) is the final computed cosine similarity. C (C) A Probability assigned to topic model, L A The value range of the position coefficient of the phrase in the document is [2,3,4 ]],F (A,B) The weighted similarity is obtained. C (C) B The final score for the system word.
S4, comprehensive sequencing
Obtaining a final score C through a mapping formula B According to the score C, all system words corresponding to the current scholars B And (3) sorting from high to low, and taking the systematic words with the highest scores of the first four items as domain tag words which can most represent the research domain of the scholars.
By adopting the technical scheme, the following technical effects are achieved.
Compared with the traditional LDA model, the statistical-based TFIDF algorithm and the network graph-based textRank algorithm, the FLDA model has better effect and higher accuracy of finally obtained tag words, and the tag extraction method based on the topic model has good applicability in the academic field.
Drawings
FIG. 1 is a schematic frame diagram of a subject model-based domain label acquisition method of the present invention;
FIG. 2 is a document-word frequency distribution diagram;
FIG. 3 is a Gibbs sampling flow chart;
FIG. 4 is a schematic view of a CBOW model;
fig. 5 is a confusion-topic number relationship diagram.
Detailed Description
In order to make the objects, technical solutions and advantageous effects of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below with reference to the embodiments of the present invention and the accompanying drawings, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Examples:
a topic model-based domain label acquisition method comprises the following steps:
s1, data preprocessing
The initial data set is acquired, specifically, the following method is adopted:
s11, data deduplication processing
The embodiment uses word co-occurrence, author contribution and keyword coincidence rate to construct a cleaning model:
for two texts to be compared, firstly judging whether the DOI is the same or not, and directly filtering the same DOI;
and for the DOI which is different or does not exist, firstly judging the title by using the word co-occurrence, if the co-occurrence degree is more than 80%, continuously judging the co-occurrence times of authors and the co-occurrence times of keywords, and if the co-occurrence times of the authors are more than 1 and the key co-occurrence rate is more than 0.5, repeating and clearing the judgment result.
The co-occurrence formula is shown as follows:
Figure GDA0002235565440000081
wherein A, B is the word set of two titles, len (A) is the length of the word set of title A, len (B) is the length of the word set of title B, len (A.u.B) is the length of the intersection of the two sets of title words, and min { len (A), len (B) } is the minimum of the two lengths.
S12, word segmentation
In the word segmentation stage, firstly, the paper keyword data are extracted and added into a user dictionary of a word segmentation tool, meanwhile, the TextRank is used for extracting key phrases before calculation, and the key phrases are also added into the user dictionary of the word segmentation tool.
In addition, the whole corpus data is ordered according to word frequency, high-frequency irrelevant words are manually screened, and the irrelevant words are added into a stop word list of the word segmentation tool.
S2, keyword extraction
The "topic-phrase" extraction is performed by the FLDA, the phrases are weighted according to the locations where they appear in the text, and they are vector-characterized using word2 vec.
The "topic-phrase" extraction is performed by FLDA:
obtaining sampling parameters by using Gibbs sampling through LDA model
Figure GDA0002235565440000082
And θ, acquisition parameter->
Figure GDA0002235565440000083
And θ is to construct a converging markov chain from which appropriate samples are extracted.
The distribution process of LDA to phrase is z i Is a sample of (a). Wherein z is i The posterior formula of (2) is as follows:
P(z i =j|z -i ,w)→P(w i |z i =j,z -i ,w -i )P(z i =j|z -i )
z i =j is assigning topic j to the current word W i ,z -i To be allocated to non-z i Word weight sum, W of (2) -i Is a word that is not the current location.
P (w|z) is known to be the only
Figure GDA0002235565440000091
Correlation, thus by->
Figure GDA0002235565440000092
Integrating above to obtain the following formula:
Figure GDA0002235565440000093
Figure GDA0002235565440000094
for Gibbs sampling parameter, < >>
Figure GDA0002235565440000095
Gibbs sampling parameter corresponding to current topic j, < ->
Figure GDA0002235565440000096
In order to integrate the parameters of the device,
Figure GDA0002235565440000097
a polynomial distribution that is a "topic-phrase" follows the following formula:
Figure GDA0002235565440000098
in addition, in the case of the optical fiber,
Figure GDA0002235565440000099
at the same time (I)>
Figure GDA00022355654400000910
Is->
Figure GDA00022355654400000911
Is thus +.>
Figure GDA00022355654400000912
Integrating to obtain the following formula:
Figure GDA00022355654400000913
wherein ,
Figure GDA00022355654400000914
is the sum of the weights assigned to topic j and the same word as word w, ++>
Figure GDA00022355654400000915
For the sum of weights of all words assigned to topic j, β is the parameter of the Dirichlet distribution, v is the size of the word stock.
Similarly, P (z) is related to θ only, so the following formula can be obtained by integrating over θ:
Figure GDA00022355654400000916
Figure GDA00022355654400000917
representation d i The term weight sum assigned to the topic i, T is the topic number.
By the formula P (z i =j|z -i ,w)→P(w i |z i =j,z -i ,w -i )P(z i =j|z -i ),
Figure GDA00022355654400000918
Figure GDA00022355654400000919
The combination gives the following formula:
Figure GDA00022355654400000920
through the above calculation, a non-standard distribution of LDA is obtained, but the probability sum of all "topic-phrase" assignments needs to be removed, as shown in the following formula:
Figure GDA0002235565440000101
wherein ,wi Ith word, z i =j is assigning the current topic j to the current word w i ,z -i To be allocated to non-z i Is used for the word weight sum of (a),
Figure GDA0002235565440000102
representing the topic j and the word w i Equal weight sum, ++>
Figure GDA0002235565440000103
Representing document d i Weight sum, ++of words with subject i>
Figure GDA0002235565440000104
The sum of weights of words with topics in the current document is represented, V represents the word stock size, T represents the number of topics, and P (z) i =j|z -i ,w i ) Is the recalculated posterior probability.
The word frequency weighting formula of the model is as follows:
Figure GDA0002235565440000105
Figure GDA0002235565440000106
wherein ,ni Representing the current word frequency, n mid Representing word frequency of selecting intermediate frequency word, n max Representing the maximum in word frequency statisticsValue n min Representing the minimum value in word frequency statistics, C i The weight of the current word is represented, and the value range is [1,2 ]]To ensure that the number of the weighted total feature words is unchanged, the weight of each feature word needs to be adjusted, wherein F i For the adjusted weights of the feature words,
Figure GDA0002235565440000107
for the number of occurrences of the current word, +.>
Figure GDA0002235565440000108
Is the sum of the weights of all words. Referring to fig. 3, since the probability that the word w is assigned to the topic z at the time of Gibbs sampling initial is random, fi calculated is replaced by a random value initialized during Gibbs sampling, and is circularly calculated on the basis of this until convergence and a parameter is obtained +>
Figure GDA0002235565440000109
and θ.
word2vec method for word vector characterization
Word2vec is trained on a million-level dictionary and billions-level training corpus through deep learning, a training result is a Word vector model, and Word vectors effectively express Word semantic information in space.
Specifically, a CBOW model is adopted for vector training. The CBOW model is characterized in that a current word is predicted according to context, during training, an N-dimensional word vector is initialized for all words, the model accumulates the context words in an input window period, a Huffman path is obtained by constructing a Huffman tree according to word frequency, probability of leaf nodes is calculated according to the path, then parameters of non-leaf nodes and word vectors of the context are adjusted by adopting a gradient descent method, and the result is converged to a real result after multiple iterations.
Weight assignment
In this embodiment, the title weight is set to 4, the keyword weight is set to 3, and the abstract weight is set to 2.
As a preferred solution, 20 is selected as the optimal number of topics in this embodiment.
S3, mapping field system
Mapping the theme-phrase to a system to realize unified management of the domain of the scholars;
in this embodiment, the result of the topic model is mapped into the domain system, and the mapping formula is as follows:
Figure GDA0002235565440000111
F (A,B) =sim(A,B)*C A *L A
Figure GDA0002235565440000112
wherein A is a phrase obtained by the topic model, B is a systematic word, a corresponding word vector is obtained by using the vector model, and word vectors are spliced into word vectors by using word vectors for non-registered words. sim (a, B) is the final computed cosine similarity. C (C) A Probability assigned to topic model, L A The value range of the position coefficient of the phrase in the document is [2,3,4 ]],F (A,B) The weighted similarity is obtained. C (C) B The final score for the system word.
S4, comprehensive sequencing
And (5) weighting and sequencing the vector characterization result and the weight assignment result to obtain the tag word which can represent the scholars most.
Specifically, a final score C is obtained through a mapping formula B And ordering all system words corresponding to the current scholars according to the scores from high to low, and taking the system word with the highest score of the first four items as the domain label word which can most represent the research domain of the scholars.
Experimental example:
in order to acquire experimental data as real as possible, the invention uses a web crawler technology to crawl CNKI and paper data sources in a mastership database, uses jieba for word segmentation processing of the data, uses a word vector model published by a Tencent AI laboratory for vector characterization, and introduces four aspects of evaluation criteria introduction, data preprocessing, selection of FLDA model parameter topic number and evaluation of a topic model-based label algorithm.
Evaluation criteria:
because the LDA topic model belongs to an unsupervised model, no visual evaluation standard is compared to measure the quality of the model. In the experimental example, a matrix of a theme-phrase of a theme model is selected for evaluation, and the confusion degree is introduced as an evaluation standard of the model, and generally, the lower the confusion degree is, the better the effect of the model is. The calculation formula of the confusion degree is shown as a formula.
Figure GDA0002235565440000113
p(w)=p(z|d)*p(w|z)。
Wherein Perplexity (D) represents the confusion of the current model, D table is academic document text, M represents the number of academic documents,
Figure GDA0002235565440000121
the sum of the numbers of all words in the current corpus, p (w) is the probability that word w appears in the matrix, p (z|d) represents the probability that the academic document d is the topic z, and p (w|z) represents the probability that word w appears in the topic z. The confusion measures the degree of coincidence of the result predicted by the topic model with the original sample information.
When the label accuracy is calculated, F1 values are used for measuring the accuracy of labels of scholars, a plurality of scholars are selected randomly, labels in the field are manually referred, and the 4 most suitable labels are selected as correct labels by combining knowledge of the scholars. The first four labels and the correct label obtained according to the algorithm sequence are evaluated and calculated by using the indexes, and finally an average F1 value is calculated, wherein the calculation formula is as follows:
Figure GDA0002235565440000122
Figure GDA0002235565440000123
Figure GDA0002235565440000124
wherein ,hi Represents the number of standard labels, m i Representing the number of labels obtained by the algorithm, h i ∩m i Indicating the correct number of labels the algorithm gets. N is the total number of samples.
Data preprocessing
In order to eliminate the influence of repeated data caused by crawling multiple data sources on a calculation result, the data is required to be subjected to de-duplication processing in a preprocessing stage, a cleaning model is constructed by using word co-occurrence, author contribution and keyword coincidence rate, for two texts to be compared, whether DOI is the same or not is firstly judged, direct filtering is carried out on the same DOI, and for the DOI which is different or does not exist, firstly, the title is judged by using word co-occurrence, if the co-occurrence degree is more than 80%, the author co-occurrence number and the keyword co-occurrence number are continuously judged, and if the author co-occurrence number is more than 1 and the keyword co-occurrence rate is more than 0.5, the judgment result is repeated and cleared.
In the word segmentation stage, massive paper keyword data are firstly extracted and added into a user dictionary of a word segmentation tool, meanwhile, a TextRank is used for extracting key phrases before calculation, and the key phrases are also added into the user dictionary of the word segmentation tool. In addition, the whole corpus data is ordered according to word frequency, high-frequency irrelevant words are manually screened, and the irrelevant words are added into a stop word list of the word segmentation tool.
Selection of FLDA model parameters
The selection of the number of topics in the topic model is an important factor influencing the topic clustering result, if the number of topics is too small, the model clustering result is not distinguished, and if the number of topics is too large, the current document is divided into other topics in an error mode, so that the section calculates the confusion degree under different topic numbers through experiments, and determines the final topic number according to the confusion degree. The experiment only changed the number of subjects with the remaining parameters unchanged, and the experimental results obtained are shown in fig. 4.
The LDA curve is a confusion curve of the LDA topic model, and the FLDA curve is a confusion curve of the LDA topic model weighted by word frequency. The abscissa in the relation graph is the subject number, and the ordinate is the confusion degree. The same set of parameters is used for repeating the experiment three times, and the average value of the experimental results is taken.
According to experimental results, as the number of topics increases, the confusion degree of the three models is in a descending trend, and the descending amplitude is slowed down or even converged when the number of topics is about 20, so that 20 is selected as the optimal number of topics.
Evaluation of tag algorithm
The domain system is introduced to achieve the academic domain of the unified survey scholars, wherein the domain system is formulated with reference to the national natural science foundation domain system, and is appropriately modified on the basis, and the domain system is shown in table 1.
Table 1 field architecture example
Figure GDA0002235565440000131
Because of the lack of more authoritative data sets in the aspect of label extraction in the academic field, in order to verify the effectiveness of the algorithm, the experimental data of this time use academic paper data of 12 scholars, and respectively use a TFIDF algorithm, a textRank algorithm, an LDA algorithm and an FLDA algorithm to obtain academic labels of the scholars and compare the academic labels, and a specific example is shown in Table 2.
When the algorithm is evaluated, proper words in a system are selected manually according to the homepage introduction of a scholars, the brief seal is incurred and the like to serve as standard answers of labels in the field of the scholars, the label words obtained by the algorithm are called as current answers, and the effect evaluation is carried out on the current answers and the standard answers.
Table 2 algorithm extraction tag results comparison
Figure GDA0002235565440000132
Table 3 Algorithm F1 value comparison
Figure GDA0002235565440000133
Figure GDA0002235565440000141
As can be seen from table 2, the label obtained by FLDA algorithm has higher coincidence with the standard answer. The algorithm effect is better than that of LDA in TFIDF algorithm.
The data in Table 3 are calculated by referring to formulas (15) (16) (17) to obtain F1 value, wherein 4-2 represents the label with the highest score obtained by using the algorithm, and the label is calculated with two standard answers; 4-3 represents obtaining 4 labels with highest scores by using an algorithm, and calculating with three standard answers; 4-4 is calculated for four standard answers. It is analytically available that in many prediction cases, the F1 value of FLDA is higher than that of the conventional LDA algorithm, the statistical-based TFIDF algorithm, and the TextRank algorithm based on network map. The method and the device have the advantages that through introducing the FLDA model weighted by the multi-word frequency characteristics, not only can the content of the document and the relation between the content of the document be analyzed from the chapter and the relation, but also the academic data can be reduced in dimension, so that the method and the device are more suitable for processing academic documents with certain orders of magnitude, and are beneficial to subsequent label mapping and calculation. The model can reflect the research direction of the scholars to a certain extent, so that the user can conveniently and comprehensively understand the scholars, and the time and energy of the user are saved. The method also indirectly illustrates that the FLDA algorithm weighted by the multi-word frequency features can better extract key information in academic texts compared with the traditional algorithm.
The technical scheme provided by the invention is not limited by the embodiment, and all the technical schemes formed by using the structure and the mode of the invention through transformation and substitution are within the protection scope of the invention.

Claims (7)

1. A topic model-based domain label acquisition method is characterized by comprising the following steps of: comprising the steps of the following procedure of the method,
s1, data preprocessing
Acquiring an initial data set;
s2, keyword extraction
Extracting a 'theme-phrase' through the FLDA, carrying out weight assignment on the phrase according to the position appearing therein, and carrying out vector characterization on the phrase by using word2 vec;
s3, mapping field system
Mapping the theme-phrase to a system to realize unified management of the domain of the scholars;
s4, comprehensive sequencing
The vector characterization result and the weight assignment result are weighted and sequenced, and the label word which can represent the scholars best is obtained through a threshold value;
the method for extracting the topic-phrase through FLDA comprises the steps of obtaining sampling parameters through Gibbs sampling
Figure FDA0004014428710000011
and θ,
z i the posterior formula of (2) is as follows:
P(z i =j|z -i ,w)→P(w i |z i =j,z -i ,w -i )P(z i =j|z -i ),
wherein ,zi =j is assigning topic j to the current word W i ,z -i To be allocated to non-z i Word weight sum, W of (2) -i As a word that is not in the current position,
p (w|z) is known to be the only
Figure FDA0004014428710000012
Correlation, thus by->
Figure FDA0004014428710000013
Integrating above to obtain the following formula:
Figure FDA0004014428710000014
wherein ,
Figure FDA0004014428710000015
for Gibbs sampling parameter, < >>
Figure FDA0004014428710000016
Gibbs sampling parameter corresponding to current topic j, < ->
Figure FDA0004014428710000017
In order to integrate the parameters of the device,
Figure FDA0004014428710000018
a polynomial distribution that is a "topic-phrase" follows the following formula:
Figure FDA0004014428710000019
in addition, in the case of the optical fiber,
Figure FDA00040144287100000110
at the same time (I)>
Figure FDA00040144287100000111
Is->
Figure FDA00040144287100000112
Is thus +.>
Figure FDA00040144287100000113
Integrating to obtain the following formula:
Figure FDA00040144287100000114
wherein ,
Figure FDA00040144287100000115
is the sum of the weights assigned to topic j and the same word as word w, ++>
Figure FDA00040144287100000116
For the sum of the weights of all words assigned to the topic j, beta is the parameter of the Dirichlet distribution, v is the size of the word stock,
similarly, P (z) is related to θ only, so the following formula can be obtained by integrating over θ:
Figure FDA0004014428710000021
Figure FDA0004014428710000022
representation d i The sum of the word weights assigned to topic i, T is the topic number,
by the formula P (z i =j|z -i ,w)→P(w i |z i =j,z -i ,w -i )P(z i =j|z -i ),
Figure FDA0004014428710000023
Figure FDA0004014428710000024
The combination gives the following formula:
Figure FDA0004014428710000025
through the above calculation, a non-standard distribution of LDA is obtained, and then the probability sum of all the "topic-phrase" assignments is removed, as shown in the following formula:
Figure FDA0004014428710000026
wherein ,wi Ith word, z i =j is assigning the current topic j to the current word w i ,z -i To be allocated to non-z i Is used for the word weight sum of (a),
Figure FDA0004014428710000027
representing the topic j and the word w i Equal weight sum, ++>
Figure FDA0004014428710000028
Representing document d i Weight sum, ++of words with subject i>
Figure FDA0004014428710000029
The sum of weights of words with topics in the current document is represented, V represents the word stock size, T represents the number of topics, and P (z) i =j|z -i ,w i ) Is the recalculated posterior probability;
the word frequency weighting formula of the model is as follows:
Figure FDA00040144287100000210
Figure FDA00040144287100000211
wherein ,ni Representing the current word frequency, n mid Representing word frequency of selecting intermediate frequency word, n max Represents the maximum value, n, in word frequency statistics min Representing the minimum value in word frequency statistics, C i The weight of the current word is represented, and the value range is [1,2 ]]To ensure that the number of the weighted total feature words is unchanged, the weight of each feature word needs to be adjusted, wherein F i For the adjusted weights of the feature words,
Figure FDA00040144287100000212
for the number of occurrences of the current word, +.>
Figure FDA00040144287100000213
The sum of weights for all words;
replacing Fi obtained by calculation with random value initialized in Gibbs sampling process, and circularly calculating to convergence to obtain parameters
Figure FDA00040144287100000214
and θ.
2. The topic model-based domain tag acquisition method of claim 1, wherein:
s1, data preprocessing comprises S11, data duplication removal processing and S12, word segmentation;
s11, performing data deduplication processing to obtain a literature set of a learner;
constructing a cleaning model by using the co-occurrence of the words, the contribution of the authors and the coincidence rate of the keywords:
for two texts to be compared, firstly judging whether the DOI is the same or not, and directly filtering the same DOI;
for the fact that DOI is different or does not exist, firstly judging the title by using word co-occurrence, if the co-occurrence degree is more than 80%, continuously judging the co-occurrence times of authors and the co-occurrence times of keywords, and if the co-occurrence times of the authors are more than 1 and the co-occurrence rate of the keywords is more than 0.5, repeating and clearing the judging result;
the co-occurrence formula is shown as follows:
Figure FDA0004014428710000031
wherein A, B is the word set of two titles, len (A) is the length of the word set of title A, len (B) is the length of the word set of title B, len (A n B) is the length of the intersection of the two sets of title words, min { len (A), len (B) } is the minimum of the two lengths;
s12, word segmentation is carried out to obtain an initial data set;
firstly extracting paper keyword data and adding the paper keyword data into a user dictionary of a word segmentation tool, extracting key phrases before calculation by using a TextRank, and adding the key phrases into the user dictionary of the word segmentation tool;
and sorting the whole corpus data according to word frequency, manually screening high-frequency irrelevant words, and adding the irrelevant words into a stop word list of the word segmentation tool.
3. The topic model-based domain tag acquisition method of claim 1, wherein: the vector characterization method is word2vec method.
4. The topic model-based domain tag acquisition method of claim 1, wherein: the title weight is set to 4, the keyword weight is set to 3, and the abstract weight is set to 2.
5. The topic model-based domain label acquisition method according to any one of claims 1 to 4, wherein: the number of subjects of the FLDA model is 20.
6. The topic model-based domain tag acquisition method of claim 1, wherein: s3, a mapping formula of the domain system mapping is as follows:
Figure FDA0004014428710000032
F (A,B) =sim(A,B)*C A *L A
Figure FDA0004014428710000033
wherein A is a phrase obtained by a topic model, B is a systematic word, and a vector model is used for obtaining pairsCorresponding word vectors, for the unknown words, word vectors are spliced into word vectors by using word vectors, sim (A, B) is the cosine similarity calculated finally, C A Probability assigned to topic model, L A The value range of the position coefficient of the phrase in the document is [2,3,4 ]],F (A,B) For weighted similarity, C B The final score for the system word.
7. The topic model-based domain tag acquisition method of claim 1, wherein: s4, comprehensively sequencing, and scoring all system words corresponding to the current scholars according to the score C B And (5) sorting from high to low, and taking the number system words with the highest score as domain label words which can most represent the research domain of the scholars.
CN201910784200.3A 2019-08-23 2019-08-23 Domain label acquisition method based on topic model Active CN110543564B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910784200.3A CN110543564B (en) 2019-08-23 2019-08-23 Domain label acquisition method based on topic model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910784200.3A CN110543564B (en) 2019-08-23 2019-08-23 Domain label acquisition method based on topic model

Publications (2)

Publication Number Publication Date
CN110543564A CN110543564A (en) 2019-12-06
CN110543564B true CN110543564B (en) 2023-06-20

Family

ID=68712039

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910784200.3A Active CN110543564B (en) 2019-08-23 2019-08-23 Domain label acquisition method based on topic model

Country Status (1)

Country Link
CN (1) CN110543564B (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111241283B (en) * 2020-01-15 2023-04-07 电子科技大学 Rapid characterization method for portrait of scientific research student
CN111831804B (en) * 2020-06-29 2024-04-26 深圳价值在线信息科技股份有限公司 Method and device for extracting key phrase, terminal equipment and storage medium
CN112508376A (en) * 2020-11-30 2021-03-16 中国科学院深圳先进技术研究院 Index system construction method
CN112446204B (en) * 2020-12-07 2024-08-02 北京明略软件系统有限公司 Method, system and computer equipment for determining document label
CN112883148B (en) * 2021-01-15 2023-03-28 博观创新(上海)大数据科技有限公司 Subject talent evaluation control method and device based on research trend matching
CN113190672A (en) * 2021-05-12 2021-07-30 上海热血网络科技有限公司 Advertisement judgment model, advertisement filtering method and system
CN113298399B (en) * 2021-05-31 2023-04-07 西南大学 Scientific research project analysis method based on big data
CN113887198A (en) * 2021-10-11 2022-01-04 平安国际智慧城市科技股份有限公司 Project splitting method, device and equipment based on topic prediction and storage medium
CN114492425B (en) * 2021-12-30 2023-04-07 中科大数据研究院 Method for communicating multi-dimensional data by adopting one set of field label system

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105740342A (en) * 2016-01-22 2016-07-06 天津中科智能识别产业技术研究院有限公司 Social relation topic model based social network friend recommendation method
CN109766544A (en) * 2018-12-24 2019-05-17 中国科学院合肥物质科学研究院 Document keyword abstraction method and device based on LDA and term vector

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9449096B2 (en) * 2014-01-07 2016-09-20 International Business Machines Corporation Identifying influencers for topics in social media

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105740342A (en) * 2016-01-22 2016-07-06 天津中科智能识别产业技术研究院有限公司 Social relation topic model based social network friend recommendation method
CN109766544A (en) * 2018-12-24 2019-05-17 中国科学院合肥物质科学研究院 Document keyword abstraction method and device based on LDA and term vector

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Supervised topic models for multi-label classification;Ximing Li 等;《Neurocomputing》;20151231;全文 *
基于SL-LDA的领域标签获取方法;王胜 等;《计算机科学》;20201130;全文 *
基于主题模型的多标签文本分类和流文本数据建模若干问题研究;李熙铭;《中国优秀博士学位论文库》;20150815;全文 *

Also Published As

Publication number Publication date
CN110543564A (en) 2019-12-06

Similar Documents

Publication Publication Date Title
CN110543564B (en) Domain label acquisition method based on topic model
CN109492157B (en) News recommendation method and theme characterization method based on RNN and attention mechanism
CN110059311B (en) Judicial text data-oriented keyword extraction method and system
CN109858028B (en) Short text similarity calculation method based on probability model
CN105653706B (en) A kind of multilayer quotation based on literature content knowledge mapping recommends method
CN109271477B (en) Method and system for constructing classified corpus by means of Internet
CN108132927B (en) Keyword extraction method for combining graph structure and node association
CN106997382A (en) Innovation intention label automatic marking method and system based on big data
Wang et al. Ptr: Phrase-based topical ranking for automatic keyphrase extraction in scientific publications
CN107220295A (en) A kind of people&#39;s contradiction reconciles case retrieval and mediation strategy recommends method
KR20020049164A (en) The System and Method for Auto - Document - classification by Learning Category using Genetic algorithm and Term cluster
US20180341686A1 (en) System and method for data search based on top-to-bottom similarity analysis
KR20200007713A (en) Method and Apparatus for determining a topic based on sentiment analysis
CN114706972B (en) Automatic generation method of unsupervised scientific and technological information abstract based on multi-sentence compression
CN106776672A (en) Technology development grain figure determines method
Sabuna et al. Summarizing Indonesian text automatically by using sentence scoring and decision tree
CN111221968B (en) Author disambiguation method and device based on subject tree clustering
CN112559684A (en) Keyword extraction and information retrieval method
CN109670014A (en) A kind of Authors of Science Articles name disambiguation method of rule-based matching and machine learning
CN112035658A (en) Enterprise public opinion monitoring method based on deep learning
CN116501875B (en) Document processing method and system based on natural language and knowledge graph
CN108647322A (en) The method that word-based net identifies a large amount of Web text messages similarities
CN113868387A (en) Word2vec medical similar problem retrieval method based on improved tf-idf weighting
CN114265935A (en) Science and technology project establishment management auxiliary decision-making method and system based on text mining
CN115952292A (en) Multi-label classification method, device and computer readable medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant