CN110543564B - Domain label acquisition method based on topic model - Google Patents
Domain label acquisition method based on topic model
- Publication number
- CN110543564B (application CN201910784200.3A)
- Authority
- CN
- China
- Prior art keywords: word, topic, model, words, phrase
- Legal status: Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q10/00—Administration; Management
- G06Q10/06—Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
- G06Q10/063—Operations research, analysis or management
- G06Q10/0639—Performance analysis of employees; Performance analysis of enterprise or organisation operations
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Business, Economics & Management (AREA)
- Human Resources & Organizations (AREA)
- Theoretical Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Physics & Mathematics (AREA)
- Entrepreneurship & Innovation (AREA)
- Development Economics (AREA)
- Strategic Management (AREA)
- Economics (AREA)
- Educational Administration (AREA)
- Marketing (AREA)
- Tourism & Hospitality (AREA)
- Quality & Reliability (AREA)
- General Business, Economics & Management (AREA)
- Operations Research (AREA)
- Game Theory and Decision Science (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- General Engineering & Computer Science (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention provides a domain label acquisition method based on a topic model. On the basis of massive academic data, the inherent characteristics of academic literature are analyzed, an FLDA topic model is constructed by introducing academic word-frequency features, and the topic model is used to extract "topic-phrase" pairs from the academic documents of a given scholar. Next, a domain system is introduced: the extraction results of the topic model and the system labels are represented as vectors, system mapping is performed using position-weighted similarity, and the scholar's domain labels are finally obtained. Experiments show that, compared with the traditional LDA model, the statistics-based TFIDF algorithm, and the network-graph-based TextRank algorithm, the FLDA model achieves a better effect and more accurate final label words, and the topic-model-based label extraction method has good applicability in the academic field.
Description
Technical Field
The invention relates to a domain label acquisition method based on a topic model, in particular to a method for acquiring scholars' domain labels, and belongs to the technical field of information processing.
Background
The vigorous development of the economy and society continuously gives rise to scientific and technological projects, and these projects require leading scholars to participate in their initiation, review, and acceptance. Traditionally, the matching of scholars to a project's domain is done manually, by having a person catalogue each scholar's research areas and then select the scholars whose areas correspond to the project. This practice has the following drawbacks: leading scholars participate in a large number of projects at the same time, which invisibly increases the workload of manual selection; and manual selection is easily affected by human subjectivity and limitations, since the selector's own knowledge level, social relationships, personal preferences, interests, and other factors influence the whole selection process, so the judgment of a scholar's domain is not comprehensive and the accuracy of the selection result suffers.
Prior-art domain label acquisition falls mainly into two categories: traditional domain label acquisition and keyword-based domain label acquisition.
Among traditional methods, one approach extracts so-called web labels from a scholar's introductions on various Internet platforms. Such web labels are generally written and added by the scholars themselves or by others; they follow no unified specification and use arbitrary wording, so the resulting labels are heterogeneous and of low usability. In addition, because Internet content is informal and bears the writing style of its authors, it is difficult to separate correct information from useless information during label extraction, and a dedicated extraction scheme must be designed for each specific platform and scholar, which invisibly increases the workload.
Another method designs a domain management system for scholars in a P2P mode based on ontology technology and uses RDF technology to obtain a scholar's domain, but this method relies on a specific template, so its extensibility is insufficient.
Yet another method implements a scholar information management system using J2EE technology, in which scholars' basic information, research-domain information, and the like are updated manually. It provides a consulting-scholar recommendation module that computes the Pearson similarity between a user's question text and scholars' research domains to realize scholar recommendation.
In keyword-based domain label extraction there are various extraction methods; the common bases for keyword extraction include statistics, topics, network graphs, and the like. Keywords are a series of words that strongly summarize an article, and automatic keyword extraction is the technology of identifying representative words or phrases in text. Since the field was proposed, researchers have put forward a variety of methods, generally divided into supervised and unsupervised. Supervised methods need manually labeled corpora and are applicable to small text collections, but as massive Internet data grows, the cost of manual labeling rises, and research has shifted toward unsupervised methods in recent years.

The core idea of statistical methods is to exploit the statistical information of words in text; such methods need no training data and judge and screen directly by word frequency, position, and the like. For example, one method corrects TFIDF result weights with weighting factors and word contribution to improve keyword extraction in subdivided domains. Another combines an N-gram language model with document weights to automatically recognize a scholar's domain: it computes the N-gram model directly on word sequences and classifies the document's domain without word segmentation, feature extraction, or similar operations, thereby finding the scholar's domain labels.

Topic models realize keyword extraction through probability distributions; the currently most popular is the LDA (Latent Dirichlet Allocation) model. One line of work extracts topic words with static LDA and an improved fLDA to capture users' historical interests when studying the evolution of user behavior. To mitigate data sparsity in microblog hot events, another line introduces time-series features and word-frequency weighting into the LDA algorithm; the topic keywords obtained in this way are highly interpretable and indicate well the content displayed by each topic. The prior art further provides an LDA-based topic clustering method that first clusters the keywords obtained by LDA and uses the results to optimize the topics produced by LDA, which effectively improves the accuracy and recall of the clustering results.

Among network-graph methods, the TextRank algorithm adapted from PageRank is the best known. One method addresses the poor extraction results for academic keywords by using prior knowledge to weight candidate results in the academic field and then comprehensively ranking the candidates with TextRank, finally obtaining academic keywords with higher relevance. Another constructs a probability offset matrix from word vectors to improve the TextRank algorithm and raise its performance.
Unsupervised methods based on statistics, network graphs, and the like need no manually pre-labeled corpora, but they depend heavily on the quality and scale of the corpus. Methods such as TFIDF have a simple structure, and the keywords they extract lack distributional and semantic information. Although the TextRank method can capture the distribution of keywords, constructing the network graph requires a large amount of data to form edges, and the extracted keywords lack topical relevance. Despite these drawbacks, unsupervised approaches still hold an advantage in workload.
The present invention uses a topic-based extraction method: the set of academic documents is regarded as the corpus to be processed, and an improved FLDA topic model is used to extract "topic-phrase" pairs and obtain a topic distribution matrix, thereby realizing automatic label acquisition.
Disclosure of Invention
To solve the problems in the prior art, the invention provides a domain label acquisition method based on a topic model. On the basis of massive academic data, the inherent characteristics of academic literature are analyzed, an FLDA topic model is constructed by introducing academic word-frequency features, and the topic model is used to extract "topic-phrase" pairs from the academic documents of a given scholar. Next, a domain system is introduced: the extraction results of the topic model and the system labels are represented as vectors, system mapping is performed using position-weighted similarity, and the scholar's domain labels are finally obtained.
In order to achieve the technical purpose, the invention adopts the following technical scheme.
Referring to Fig. 1, a topic-model-based domain label acquisition method includes the following steps:
S1, data preprocessing
Acquiring an initial data set;
S2, keyword extraction
Extracting "topic-phrase" pairs through the FLDA, assigning each phrase a weight according to the position where it appears in the document, and characterizing the phrases as vectors using word2vec;
S3, domain system mapping
Mapping the "topic-phrase" results onto the system to realize unified management of scholars' domains;
S4, comprehensive ranking
Weighting and ranking the vector characterization results together with the weight assignments, and obtaining, via a threshold, the label words that best represent the scholar.
According to the aforementioned topic-model-based domain label acquisition method, specifically, S1, data preprocessing, includes S11, data deduplication, and S12, word segmentation.
Specifically, to eliminate the influence on the calculation results of repeated data caused by crawling multiple data sources, step S11 performs data deduplication to obtain each scholar's document set.
The invention constructs a cleaning model using word co-occurrence, author co-occurrence, and keyword coincidence rate:
For two texts to be compared, first judge whether their DOIs are the same, and directly filter out texts with identical DOIs;
If the DOIs differ or are absent, first judge the titles using word co-occurrence. If the co-occurrence degree exceeds 80%, continue to judge the author co-occurrence count and the keyword coincidence rate; if the author co-occurrence count is greater than 1 and the keyword coincidence rate is greater than 0.5, the texts are judged duplicates and removed.
The co-occurrence degree is computed as follows:

$$co(A,B)=\frac{len(A\cap B)}{\min\{len(A),\,len(B)\}}$$

where A and B are the word sets of the two titles, len(A) and len(B) are the sizes of the word sets of titles A and B, len(A∩B) is the size of the intersection of the two title word sets, and min{len(A), len(B)} is the smaller of the two lengths.
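For illustration, the cleaning model above can be sketched in Python. This is a minimal sketch: the record fields (doi, title_words, authors, keywords) are hypothetical names for this example, and the keyword coincidence rate is assumed to be computed like the title co-occurrence degree.

```python
def cooccurrence(a, b):
    """Co-occurrence degree of two word sets: len(A ∩ B) / min(len(A), len(B))."""
    a, b = set(a), set(b)
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))

def is_duplicate(doc1, doc2):
    """Duplicate test following the cleaning model described above."""
    # Identical, non-empty DOIs are filtered directly.
    if doc1.get("doi") and doc1.get("doi") == doc2.get("doi"):
        return True
    # Otherwise judge the titles by word co-occurrence first ...
    if cooccurrence(doc1["title_words"], doc2["title_words"]) > 0.8:
        # ... then the author co-occurrence count and keyword coincidence rate.
        shared_authors = len(set(doc1["authors"]) & set(doc2["authors"]))
        kw_rate = cooccurrence(doc1["keywords"], doc2["keywords"])
        if shared_authors > 1 and kw_rate > 0.5:
            return True
    return False
```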
S12, word segmentation.
In the word segmentation stage, the keyword fields of the papers are first extracted and added to the user dictionary of the word segmentation tool; at the same time, key phrases are extracted with TextRank before the calculation and also added to the user dictionary.
In addition, the whole corpus is sorted by word frequency, high-frequency irrelevant words are screened manually, and these irrelevant words are added to the stop-word list of the word segmentation tool.
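A minimal sketch of this segmentation setup with jieba (the segmentation tool used in the experimental example below); the toy corpus, keyword list, and stop-word set are illustrative placeholders.

```python
import jieba
import jieba.analyse

documents = ["基于主题模型的领域标签获取方法研究", "学者学术文献的关键词抽取"]  # toy corpus
paper_keywords = ["主题模型", "领域标签"]                                      # illustrative

# Paper keywords and TextRank key phrases go into the user dictionary
# so the segmenter keeps them intact.
for kw in paper_keywords:
    jieba.add_word(kw)
for phrase in jieba.analyse.textrank(" ".join(documents), topK=50):
    jieba.add_word(phrase)

stop_words = {"研究", "方法"}  # manually screened high-frequency irrelevant words

def segment(text):
    """Segment a document and drop stop words."""
    return [w for w in jieba.cut(text) if w.strip() and w not in stop_words]

print(segment(documents[0]))
```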
According to the aforementioned method, specifically, in S2, keyword extraction, "topic-phrase" pairs are extracted by the FLDA.
In prior-art topic-based keyword extraction, the LDA topic model is mainly used. The model assumes that a batch of documents contains several topics, and each topic can be approximately represented by a number of phrases. A document is generated by selecting a topic with a certain probability, then selecting phrases under the current topic with a certain probability, and repeating this process until the document is complete. LDA topic extraction is the reverse of this generative process. LDA topic models are widely used in the news field; in the technical-literature field, however, topic modeling is affected by the particular word-frequency distribution of technical documents.
Statistical analysis of scholars' academic documents shows that their word-frequency information follows a power-law distribution. Fig. 2 plots the top 2000 high-frequency words of the academic documents, where the abscissa is the rank of each word after sorting by descending frequency and the ordinate is its frequency.
Statistics show that the words in the top 10% by frequency account for 81.1% of the word occurrences of all academic documents, consistent with a Zipf distribution. Research on word frequency finds that the words most representative of a topic are usually neither the very-high-frequency nor the very-low-frequency words, but mid-to-high-frequency words. If the LDA model is applied to the documents directly, certain mid-frequency words are lost; moreover, high-frequency feature words usually appear in combination and have a relatively high probability of being assigned to topics, so the topics are poorly distinguished from one another. And although stop words are filtered during data preprocessing in S1, the filtering cannot be complete.
Therefore, the invention proposes a word-frequency-weighted LDA topic model: word-frequency information in the documents is first counted, and word-frequency features are introduced into the Gibbs sampling process to reduce the influence of high-frequency words and raise the influence of mid-frequency feature words, constructing the FLDA model so that it no longer over-weights high-frequency feature words. The FLDA model is as follows:
The LDA model obtains the sampling parameters $\varphi$ and $\theta$ by Gibbs sampling; obtaining $\varphi$ and $\theta$ amounts to constructing a converging Markov chain and drawing appropriate samples from it.

LDA assigns phrases to topics by sampling the topic variable $z_i$, whose posterior satisfies:

$$P(z_i=j\mid z_{-i},w)\propto P(w_i\mid z_i=j,z_{-i},w_{-i})\,P(z_i=j\mid z_{-i})$$

where $z_i=j$ assigns topic $j$ to the current word $w_i$, $z_{-i}$ denotes the topic assignments of all words other than position $i$, and $w_{-i}$ denotes all words other than the current position.

$P(w\mid z)$ depends only on $\varphi$, so integrating over $\varphi$ gives:

$$P(w_i\mid z_i=j,z_{-i},w_{-i})=\frac{n^{(w_i)}_{-i,j}+\beta}{n^{(\cdot)}_{-i,j}+V\beta}$$

where $n^{(w_i)}_{-i,j}$ is the sum of the weights of the words identical to $w_i$ that are assigned to topic $j$, $n^{(\cdot)}_{-i,j}$ is the sum of the weights of all words assigned to topic $j$, $\beta$ is the parameter of the Dirichlet distribution, and $V$ is the size of the vocabulary.

Similarly, $P(z)$ depends only on $\theta$, so integrating over $\theta$ gives:

$$P(z_i=j\mid z_{-i})=\frac{n^{(j)}_{-i,d_i}+\alpha}{n^{(\cdot)}_{-i,d_i}+T\alpha}$$

Combining the two factors according to $P(z_i=j\mid z_{-i},w)\propto P(w_i\mid z_i=j,z_{-i},w_{-i})\,P(z_i=j\mid z_{-i})$ yields:

$$P(z_i=j\mid z_{-i},w)\propto\frac{n^{(w_i)}_{-i,j}+\beta}{n^{(\cdot)}_{-i,j}+V\beta}\cdot\frac{n^{(j)}_{-i,d_i}+\alpha}{n^{(\cdot)}_{-i,d_i}+T\alpha}$$

This gives an unnormalized distribution of LDA; dividing by the sum of the probabilities of all "topic-phrase" assignments yields the normalized posterior:

$$P(z_i=j\mid z_{-i},w_i)=\frac{\dfrac{\left(n^{(w_i)}_{-i,j}+\beta\right)\left(n^{(j)}_{-i,d_i}+\alpha\right)}{\left(n^{(\cdot)}_{-i,j}+V\beta\right)\left(n^{(\cdot)}_{-i,d_i}+T\alpha\right)}}{\sum_{j'=1}^{T}\dfrac{\left(n^{(w_i)}_{-i,j'}+\beta\right)\left(n^{(j')}_{-i,d_i}+\alpha\right)}{\left(n^{(\cdot)}_{-i,j'}+V\beta\right)\left(n^{(\cdot)}_{-i,d_i}+T\alpha\right)}}$$

where $w_i$ is the i-th word, $z_i=j$ assigns the current topic $j$ to $w_i$, $n^{(w_i)}_{-i,j}$ is the weight sum of words equal to $w_i$ under topic $j$, $n^{(j)}_{-i,d_i}$ is the weight sum of the words in document $d_i$ with topic $j$, $n^{(\cdot)}_{-i,d_i}$ is the weight sum of all topic-assigned words in the current document, $\alpha$ and $\beta$ are the Dirichlet prior parameters, $V$ is the vocabulary size, $T$ is the number of topics, and $P(z_i=j\mid z_{-i},w_i)$ is the recalculated posterior probability.
The word-frequency weighting of the model assigns the largest weight to mid-frequency words and smaller weights toward both frequency extremes:

$$C_i=2-\frac{\left|n_i-n_{mid}\right|}{\max\{n_{max}-n_{mid},\,n_{mid}-n_{min}\}}$$

where $n_i$ is the frequency of the current word, $n_{mid}$ is the frequency of the selected mid-frequency word, $n_{max}$ and $n_{min}$ are the maximum and minimum values in the word-frequency statistics, and $C_i$ is the weight of the current word, with value range [1, 2]. To ensure that the weighted total number of feature words is unchanged, the weight of each feature word is adjusted:

$$F_i=C_i\,n_i\cdot\frac{\sum_k n_k}{\sum_k C_k\,n_k}$$

where $F_i$ is the adjusted weight of the feature word, $n_i$ is the number of occurrences of the current word, and $\sum_k n_k$ is the sum of the weights of all words. Referring to Fig. 3: because the assignment of word w to topic z at Gibbs initialization is random, the calculated $F_i$ replaces the random value initialized during Gibbs sampling, and the calculation iterates on this basis until convergence, yielding the parameters $\varphi$ and $\theta$.
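A compact sketch of a word-frequency-weighted collapsed Gibbs sampler in this spirit. It is not the patent's exact implementation: the mid-frequency choice (median), the hyperparameters, and the exact weighting follow the constraints stated above (weights in [1, 2], peaking at the mid frequency) but are assumptions of this sketch.

```python
import numpy as np
from collections import Counter

def flda_gibbs(docs, V, T=20, alpha=0.5, beta=0.01, iters=100, seed=0):
    """Word-frequency-weighted collapsed Gibbs sampling (FLDA-style sketch).

    docs: list of documents, each a list of word ids in [0, V).
    Topic counts are accumulated with a per-token weight instead of 1.
    """
    # Corpus frequencies and the [1, 2] mid-frequency weight C.
    freq = Counter(w for d in docs for w in d)
    n = np.array([freq.get(w, 0) for w in range(V)], dtype=float)
    n_mid = np.median(n[n > 0])                        # "mid-frequency" word: assumed median
    denom = max(n.max() - n_mid, n_mid - n.min(), 1.0)
    C = 2.0 - np.abs(n - n_mid) / denom                # weight in [1, 2]
    F = C * n * (n.sum() / (C * n).sum())              # renormalized: sum(F) == sum(n)
    tok_w = np.divide(F, n, out=np.ones_like(F), where=n > 0)  # per-token weight

    rng = np.random.default_rng(seed)
    z = [rng.integers(T, size=len(d)) for d in docs]   # random initial topic assignments
    nwt = np.zeros((V, T))                             # weighted word-topic counts
    ndt = np.zeros((len(docs), T))                     # weighted doc-topic counts
    for di, d in enumerate(docs):
        for i, w in enumerate(d):
            nwt[w, z[di][i]] += tok_w[w]
            ndt[di, z[di][i]] += tok_w[w]

    for _ in range(iters):
        for di, d in enumerate(docs):
            for i, w in enumerate(d):
                t = z[di][i]
                nwt[w, t] -= tok_w[w]; ndt[di, t] -= tok_w[w]
                # Weighted collapsed-Gibbs posterior over the T topics.
                p = ((nwt[w] + beta) / (nwt.sum(axis=0) + V * beta)
                     * (ndt[di] + alpha) / (ndt[di].sum() + T * alpha))
                t = rng.choice(T, p=p / p.sum())
                z[di][i] = t
                nwt[w, t] += tok_w[w]; ndt[di, t] += tok_w[w]

    phi = (nwt.T + beta) / (nwt.T.sum(axis=1, keepdims=True) + V * beta)  # (T, V) topic-word
    theta = (ndt + alpha) / (ndt.sum(axis=1, keepdims=True) + T * alpha)  # (M, T) doc-topic
    return phi, theta
```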
Word2vec method for word vector characterization
Word2vec is trained by deep learning on a million-word dictionary and a corpus of hundreds of millions of tokens; the training result is a word vector model, and the word vectors effectively express word semantics in a vector space. The vectors are trained with a shallow neural network, either the CBOW or the Skip-gram model; the CBOW model is shown in Fig. 4.
The CBOW model predicts the current word from its context. During training, an N-dimensional word vector is initialized for every word; the model accumulates the context words within the input window, builds a Huffman tree from word frequencies to obtain each word's Huffman path, computes the probability of the leaf node along that path, and then adjusts the parameters of the non-leaf nodes and the context word vectors by gradient descent. After many iterations the result converges to the true result.
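As a sketch of this training setup with gensim (an assumption; the patent does not name a library), CBOW with hierarchical softmax corresponds to the Huffman-tree formulation described above. The toy corpus and dimensions are illustrative.

```python
from gensim.models import Word2Vec

sentences = [["topic", "model", "domain", "label"],
             ["word", "vector", "semantic", "space"]]  # illustrative toy corpus

# sg=0 selects CBOW (predict the current word from its context);
# hs=1 selects hierarchical softmax, i.e. Huffman-tree path probabilities.
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=0, hs=1)
print(model.wv["topic"][:5])  # first components of the learned word vector
```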
Weight assignment
An academic paper is generally divided into a title, abstract, keywords, body, and other parts. By common experience, the title usually contains the central idea of the full text and is an important summary of its content, so the invention increases the final weight of words appearing in the title. The keyword part is also quite representative of the overall subject matter, and the abstract is considered a brief summary of the whole text.
Preferably, the weight assignment of the invention sets the title weight to 4, the keyword weight to 3, and the abstract weight to 2.
Selection of FLDA model parameters
In the invention, 20 is selected as the optimal number of topics.
S3, mapping field system
Because the academic documents of the students have differences among phrases obtained through the topic model, unified management of the students cannot be performed, and therefore unified measurement of the students is achieved by introducing an academic field system.
The field system is formulated by referring to the national natural science foundation field system, and can cover the research scope of each field to the greatest extent.
The invention maps the results of the topic model into the domain system. The mapping formula is as follows:

F_(A,B) = sim(A,B) * C_A * L_A

where A is a phrase obtained by the topic model and B is a system word; their word vectors are obtained from the vector model, and for out-of-vocabulary phrases a vector is assembled by splicing the vectors of the component words. sim(A,B) is the cosine similarity finally computed, C_A is the probability assigned by the topic model, L_A is the position coefficient of the phrase in the document, with value range [2, 3, 4], and F_(A,B) is the weighted similarity. C_B denotes the final score of a system word.
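A minimal sketch of the mapping and scoring, assuming `wv` is a word-vector lookup (e.g. gensim KeyedVectors); averaging component-word vectors stands in for the splicing of out-of-vocabulary phrase vectors, and summing F_(A,B) into C_B is an assumption of this sketch.

```python
import numpy as np

POSITION_WEIGHT = {"title": 4, "keyword": 3, "abstract": 2}  # position coefficient L_A

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def phrase_vector(words, wv):
    """Phrase vector from component-word vectors (averaging stands in for splicing)."""
    return np.mean([wv[w] for w in words], axis=0)

def score_system_words(topic_phrases, system_words, wv):
    """topic_phrases: (phrase_words, C_A, position) triples; returns C_B per system word."""
    scores = {}
    for b_name, b_words in system_words:
        v_b = phrase_vector(b_words, wv)
        scores[b_name] = sum(
            cosine(phrase_vector(a_words, wv), v_b) * c_a * POSITION_WEIGHT[pos]
            for a_words, c_a, pos in topic_phrases)   # aggregate F_(A,B) into C_B
    return scores

# S4 then sorts system words by C_B and keeps the four best as domain labels:
# top4 = sorted(scores, key=scores.get, reverse=True)[:4]
```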
S4, comprehensive sequencing
Obtaining a final score C through a mapping formula B According to the score C, all system words corresponding to the current scholars B And (3) sorting from high to low, and taking the systematic words with the highest scores of the first four items as domain tag words which can most represent the research domain of the scholars.
Adopting the above technical scheme achieves the following technical effects.
Compared with the traditional LDA model, the statistics-based TFIDF algorithm, and the network-graph-based TextRank algorithm, the FLDA model achieves a better effect and more accurate final label words, and the topic-model-based label extraction method has good applicability in the academic field.
Drawings
FIG. 1 is a schematic framework diagram of the topic-model-based domain label acquisition method of the invention;
FIG. 2 is a document word-frequency distribution diagram;
FIG. 3 is a Gibbs sampling flow chart;
FIG. 4 is a schematic diagram of the CBOW model;
FIG. 5 is a diagram of the relation between perplexity and topic number.
Detailed Description
In order to make the objects, technical solutions and advantageous effects of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below with reference to the embodiments of the present invention and the accompanying drawings, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Examples:
A topic-model-based domain label acquisition method includes the following steps:
S1, data preprocessing
The initial data set is acquired, specifically as follows:
S11, data deduplication
This embodiment constructs a cleaning model using word co-occurrence, author co-occurrence, and keyword coincidence rate:
For two texts to be compared, first judge whether their DOIs are the same, and directly filter out texts with identical DOIs;
If the DOIs differ or are absent, first judge the titles using word co-occurrence. If the co-occurrence degree exceeds 80%, continue to judge the author co-occurrence count and the keyword coincidence rate; if the author co-occurrence count is greater than 1 and the keyword coincidence rate is greater than 0.5, the texts are judged duplicates and removed.
The co-occurrence degree is computed as follows:

$$co(A,B)=\frac{len(A\cap B)}{\min\{len(A),\,len(B)\}}$$

where A and B are the word sets of the two titles, len(A) and len(B) are the sizes of the word sets of titles A and B, len(A∩B) is the size of the intersection of the two title word sets, and min{len(A), len(B)} is the smaller of the two lengths.
S12, word segmentation
In the word segmentation stage, the keyword fields of the papers are first extracted and added to the user dictionary of the word segmentation tool; at the same time, key phrases are extracted with TextRank before the calculation and also added to the user dictionary.
In addition, the whole corpus is sorted by word frequency, high-frequency irrelevant words are screened manually, and these irrelevant words are added to the stop-word list of the word segmentation tool.
S2, keyword extraction
The "topic-phrase" extraction is performed by the FLDA, the phrases are weighted according to the locations where they appear in the text, and they are vector-characterized using word2 vec.
The "topic-phrase" extraction is performed by FLDA:
The LDA model obtains the sampling parameters $\varphi$ and $\theta$ by Gibbs sampling; obtaining $\varphi$ and $\theta$ amounts to constructing a converging Markov chain and drawing appropriate samples from it.

LDA assigns phrases to topics by sampling the topic variable $z_i$, whose posterior satisfies:

$$P(z_i=j\mid z_{-i},w)\propto P(w_i\mid z_i=j,z_{-i},w_{-i})\,P(z_i=j\mid z_{-i})$$

where $z_i=j$ assigns topic $j$ to the current word $w_i$, $z_{-i}$ denotes the topic assignments of all words other than position $i$, and $w_{-i}$ denotes all words other than the current position.

$P(w\mid z)$ depends only on $\varphi$, so integrating over $\varphi$ gives:

$$P(w_i\mid z_i=j,z_{-i},w_{-i})=\frac{n^{(w_i)}_{-i,j}+\beta}{n^{(\cdot)}_{-i,j}+V\beta}$$

where $n^{(w_i)}_{-i,j}$ is the sum of the weights of the words identical to $w_i$ that are assigned to topic $j$, $n^{(\cdot)}_{-i,j}$ is the sum of the weights of all words assigned to topic $j$, $\beta$ is the parameter of the Dirichlet distribution, and $V$ is the size of the vocabulary.

Similarly, $P(z)$ depends only on $\theta$, so integrating over $\theta$ gives:

$$P(z_i=j\mid z_{-i})=\frac{n^{(j)}_{-i,d_i}+\alpha}{n^{(\cdot)}_{-i,d_i}+T\alpha}$$

Combining the two factors yields:

$$P(z_i=j\mid z_{-i},w)\propto\frac{n^{(w_i)}_{-i,j}+\beta}{n^{(\cdot)}_{-i,j}+V\beta}\cdot\frac{n^{(j)}_{-i,d_i}+\alpha}{n^{(\cdot)}_{-i,d_i}+T\alpha}$$

This gives an unnormalized distribution of LDA; dividing by the sum of the probabilities of all "topic-phrase" assignments yields the normalized posterior:

$$P(z_i=j\mid z_{-i},w_i)=\frac{\dfrac{\left(n^{(w_i)}_{-i,j}+\beta\right)\left(n^{(j)}_{-i,d_i}+\alpha\right)}{\left(n^{(\cdot)}_{-i,j}+V\beta\right)\left(n^{(\cdot)}_{-i,d_i}+T\alpha\right)}}{\sum_{j'=1}^{T}\dfrac{\left(n^{(w_i)}_{-i,j'}+\beta\right)\left(n^{(j')}_{-i,d_i}+\alpha\right)}{\left(n^{(\cdot)}_{-i,j'}+V\beta\right)\left(n^{(\cdot)}_{-i,d_i}+T\alpha\right)}}$$

where $w_i$ is the i-th word, $z_i=j$ assigns the current topic $j$ to $w_i$, $n^{(w_i)}_{-i,j}$ is the weight sum of words equal to $w_i$ under topic $j$, $n^{(j)}_{-i,d_i}$ is the weight sum of the words in document $d_i$ with topic $j$, $n^{(\cdot)}_{-i,d_i}$ is the weight sum of all topic-assigned words in the current document, $\alpha$ and $\beta$ are the Dirichlet prior parameters, $V$ is the vocabulary size, $T$ is the number of topics, and $P(z_i=j\mid z_{-i},w_i)$ is the recalculated posterior probability.
The word-frequency weighting of the model assigns the largest weight to mid-frequency words and smaller weights toward both frequency extremes:

$$C_i=2-\frac{\left|n_i-n_{mid}\right|}{\max\{n_{max}-n_{mid},\,n_{mid}-n_{min}\}}$$

where $n_i$ is the frequency of the current word, $n_{mid}$ is the frequency of the selected mid-frequency word, $n_{max}$ and $n_{min}$ are the maximum and minimum values in the word-frequency statistics, and $C_i$ is the weight of the current word, with value range [1, 2]. To ensure that the weighted total number of feature words is unchanged, the weight of each feature word is adjusted:

$$F_i=C_i\,n_i\cdot\frac{\sum_k n_k}{\sum_k C_k\,n_k}$$

where $F_i$ is the adjusted weight of the feature word, $n_i$ is the number of occurrences of the current word, and $\sum_k n_k$ is the sum of the weights of all words. Referring to Fig. 3: because the assignment of word w to topic z at Gibbs initialization is random, the calculated $F_i$ replaces the random value initialized during Gibbs sampling, and the calculation iterates on this basis until convergence, yielding the parameters $\varphi$ and $\theta$.
Word2vec method for word vector characterization
Word2vec is trained by deep learning on a million-word dictionary and a corpus of hundreds of millions of tokens; the training result is a word vector model, and the word vectors effectively express word semantics in a vector space.
Specifically, the CBOW model is adopted for vector training. The CBOW model predicts the current word from its context. During training, an N-dimensional word vector is initialized for every word; the model accumulates the context words within the input window, builds a Huffman tree from word frequencies to obtain each word's Huffman path, computes the probability of the leaf node along that path, and then adjusts the parameters of the non-leaf nodes and the context word vectors by gradient descent. After many iterations the result converges to the true result.
Weight assignment
In this embodiment, the title weight is set to 4, the keyword weight is set to 3, and the abstract weight is set to 2.
As a preferred solution, 20 is selected as the optimal number of topics in this embodiment.
S3, mapping field system
Mapping the theme-phrase to a system to realize unified management of the domain of the scholars;
In this embodiment, the results of the topic model are mapped into the domain system. The mapping formula is as follows:

F_(A,B) = sim(A,B) * C_A * L_A

where A is a phrase obtained by the topic model and B is a system word; their word vectors are obtained from the vector model, and for out-of-vocabulary phrases a vector is assembled by splicing the vectors of the component words. sim(A,B) is the cosine similarity finally computed, C_A is the probability assigned by the topic model, L_A is the position coefficient of the phrase in the document, with value range [2, 3, 4], and F_(A,B) is the weighted similarity. C_B denotes the final score of a system word.
S4, comprehensive sequencing
And (5) weighting and sequencing the vector characterization result and the weight assignment result to obtain the tag word which can represent the scholars most.
Specifically, a final score C is obtained through a mapping formula B And ordering all system words corresponding to the current scholars according to the scores from high to low, and taking the system word with the highest score of the first four items as the domain label word which can most represent the research domain of the scholars.
Experimental example:
To obtain experimental data as realistic as possible, the invention uses web crawler technology to crawl paper data from CNKI and a master's dissertation database, uses jieba for word segmentation, and uses the word vector model published by the Tencent AI Lab for vector characterization. The experiments cover four aspects: evaluation criteria, data preprocessing, selection of the FLDA topic-number parameter, and evaluation of the topic-model-based label algorithm.
Evaluation criteria:
because the LDA topic model belongs to an unsupervised model, no visual evaluation standard is compared to measure the quality of the model. In the experimental example, a matrix of a theme-phrase of a theme model is selected for evaluation, and the confusion degree is introduced as an evaluation standard of the model, and generally, the lower the confusion degree is, the better the effect of the model is. The calculation formula of the confusion degree is shown as a formula.
p(w)=p(z|d)*p(w|z)。
Wherein Perplexity (D) represents the confusion of the current model, D table is academic document text, M represents the number of academic documents,the sum of the numbers of all words in the current corpus, p (w) is the probability that word w appears in the matrix, p (z|d) represents the probability that the academic document d is the topic z, and p (w|z) represents the probability that word w appears in the topic z. The confusion measures the degree of coincidence of the result predicted by the topic model with the original sample information.
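A sketch of the perplexity computation under the stated formula, assuming theta (document-topic) and phi (topic-word) probability matrices such as those produced by Gibbs sampling:

```python
import numpy as np

def perplexity(docs, theta, phi):
    """docs: lists of word ids; theta: (M, T) doc-topic; phi: (T, V) topic-word."""
    log_lik, n_words = 0.0, 0
    for d, doc in enumerate(docs):
        for w in doc:
            log_lik += np.log(theta[d] @ phi[:, w])  # p(w) = sum_z p(z|d) p(w|z)
            n_words += 1
    return float(np.exp(-log_lik / n_words))
```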
When computing label accuracy, the F1 value is used to measure the accuracy of scholars' labels. Several scholars are selected at random, their domain labels are assigned manually with reference to expert knowledge of those scholars, and the 4 most suitable labels are chosen as the correct labels. The top four labels produced by each algorithm are evaluated against the correct labels, and finally an average F1 value is computed as follows:

$$P_i=\frac{h_i\cap m_i}{m_i},\qquad R_i=\frac{h_i\cap m_i}{h_i},\qquad F1=\frac{1}{N}\sum_{i=1}^{N}\frac{2P_iR_i}{P_i+R_i}$$

where $h_i$ is the number of standard labels, $m_i$ is the number of labels obtained by the algorithm, $h_i\cap m_i$ denotes the number of correct labels obtained by the algorithm, and N is the total number of samples.
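The average F1 computation above, sketched in Python; each scholar contributes one pair of label sets (the four standard labels and the algorithm's top four):

```python
def average_f1(standard, predicted):
    """standard, predicted: lists of label sets, one pair per scholar."""
    total = 0.0
    for h, m in zip(standard, predicted):
        correct = len(h & m)
        if correct:
            p, r = correct / len(m), correct / len(h)
            total += 2 * p * r / (p + r)
    return total / len(standard)

print(average_f1([{"machine learning", "NLP", "data mining", "information retrieval"}],
                 [{"machine learning", "NLP", "deep learning", "computer vision"}]))  # 0.5
```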
Data preprocessing
To eliminate the influence on the calculation results of repeated data caused by crawling multiple data sources, the data are deduplicated in the preprocessing stage. A cleaning model is constructed using word co-occurrence, author co-occurrence, and keyword coincidence rate. For two texts to be compared, first judge whether their DOIs are the same, and directly filter out texts with identical DOIs. If the DOIs differ or are absent, first judge the titles using word co-occurrence; if the co-occurrence degree exceeds 80%, continue to judge the author co-occurrence count and the keyword coincidence rate, and if the author co-occurrence count is greater than 1 and the keyword coincidence rate is greater than 0.5, the texts are judged duplicates and removed.
In the word segmentation stage, the massive paper keyword data are first extracted and added to the user dictionary of the word segmentation tool; at the same time, key phrases are extracted with TextRank before the calculation and also added to the user dictionary. In addition, the whole corpus is sorted by word frequency, high-frequency irrelevant words are screened manually, and these irrelevant words are added to the stop-word list of the word segmentation tool.
Selection of FLDA model parameters
The number of topics is an important factor affecting topic clustering results: too few topics leave the clusters undifferentiated, while too many cause documents to be wrongly split across other topics. This section therefore measures the perplexity under different topic numbers experimentally and determines the final topic number accordingly. In the experiments only the number of topics is changed, with the remaining parameters fixed; the results are shown in Fig. 5.
The LDA curve is the perplexity curve of the LDA topic model, and the FLDA curve is the perplexity curve of the word-frequency-weighted LDA topic model. The abscissa is the number of topics and the ordinate is the perplexity. Each experiment is repeated three times with the same parameters and the results are averaged.
The results show that as the number of topics increases, the perplexity of the models decreases; the decline slows and even converges at about 20 topics, so 20 is selected as the optimal number of topics.
Evaluation of the label algorithm
The domain system is introduced to measure scholars' academic domains uniformly. It is formulated with reference to the domain system of the National Natural Science Foundation, with appropriate modifications; an example is shown in Table 1.
Table 1. Example of the domain system
Because the academic field lacks authoritative data sets for label extraction, the experiments use the academic paper data of 12 scholars to verify the effectiveness of the algorithm. The TFIDF, TextRank, LDA, and FLDA algorithms are used to obtain each scholar's academic labels for comparison; a specific example is shown in Table 2.
For the evaluation, suitable system words are selected manually, based on each scholar's homepage introduction, curriculum vitae, and similar material, as the standard answers for that scholar's domain labels. The label words obtained by each algorithm are called the current answers, and the current answers are evaluated against the standard answers.
Table 2. Comparison of the labels extracted by each algorithm
Table 3. Comparison of algorithm F1 values
As can be seen from Table 2, the labels obtained by the FLDA algorithm coincide more closely with the standard answers, and its effect is better than that of the TFIDF and LDA algorithms.
The data in Table 3 are F1 values computed with the formulas above, where "4-2" denotes taking the four highest-scoring labels obtained by the algorithm and computing against two standard answers, "4-3" denotes taking the four highest-scoring labels and computing against three standard answers, and "4-4" is computed against four standard answers. The analysis shows that in most prediction cases the F1 value of FLDA is higher than those of the traditional LDA algorithm, the statistics-based TFIDF algorithm, and the network-graph-based TextRank algorithm. By introducing the word-frequency-weighted FLDA model, the method can analyze both the content of documents and the relations among them, and it also reduces the dimensionality of the academic data, making it better suited to processing academic documents at scale and facilitating subsequent label mapping and calculation. The model reflects a scholar's research direction to a certain extent, helping users understand the scholar quickly and comprehensively and saving their time and energy. This also indirectly shows that, compared with traditional algorithms, the word-frequency-weighted FLDA algorithm better extracts the key information in academic texts.
The technical scheme provided by the invention is not limited to the above embodiment; all technical schemes formed by transformation and substitution based on the structure and approach of the invention fall within the protection scope of the invention.
Claims (7)
1. A topic-model-based domain label acquisition method, characterized by comprising the following steps:
S1, data preprocessing
Acquiring an initial data set;
S2, keyword extraction
Extracting "topic-phrase" pairs through the FLDA, assigning each phrase a weight according to the position where it appears in the document, and characterizing the phrases as vectors using word2vec;
S3, domain system mapping
Mapping the "topic-phrase" results onto the system to realize unified management of scholars' domains;
S4, comprehensive ranking
Weighting and ranking the vector characterization results together with the weight assignments, and obtaining, via a threshold, the label words that best represent the scholar;
wherein extracting "topic-phrase" pairs through the FLDA comprises obtaining the sampling parameters $\varphi$ and $\theta$ through Gibbs sampling;

the posterior of $z_i$ is:

$$P(z_i=j\mid z_{-i},w)\propto P(w_i\mid z_i=j,z_{-i},w_{-i})\,P(z_i=j\mid z_{-i})$$

wherein $z_i=j$ assigns topic $j$ to the current word $w_i$, $z_{-i}$ denotes the topic assignments of all words other than position $i$, and $w_{-i}$ denotes all words other than the current position;

$P(w\mid z)$ depends only on $\varphi$, so integrating over $\varphi$ gives:

$$P(w_i\mid z_i=j,z_{-i},w_{-i})=\frac{n^{(w_i)}_{-i,j}+\beta}{n^{(\cdot)}_{-i,j}+V\beta}$$

wherein $n^{(w_i)}_{-i,j}$ is the sum of the weights of the words identical to $w_i$ that are assigned to topic $j$, $n^{(\cdot)}_{-i,j}$ is the sum of the weights of all words assigned to topic $j$, $\beta$ is the parameter of the Dirichlet distribution, and $V$ is the size of the vocabulary;

similarly, $P(z)$ depends only on $\theta$, so integrating over $\theta$ gives:

$$P(z_i=j\mid z_{-i})=\frac{n^{(j)}_{-i,d_i}+\alpha}{n^{(\cdot)}_{-i,d_i}+T\alpha}$$

combining the two factors gives:

$$P(z_i=j\mid z_{-i},w)\propto\frac{n^{(w_i)}_{-i,j}+\beta}{n^{(\cdot)}_{-i,j}+V\beta}\cdot\frac{n^{(j)}_{-i,d_i}+\alpha}{n^{(\cdot)}_{-i,d_i}+T\alpha}$$

this yields an unnormalized distribution of LDA; dividing by the sum of the probabilities of all "topic-phrase" assignments gives the normalized posterior:

$$P(z_i=j\mid z_{-i},w_i)=\frac{\dfrac{\left(n^{(w_i)}_{-i,j}+\beta\right)\left(n^{(j)}_{-i,d_i}+\alpha\right)}{\left(n^{(\cdot)}_{-i,j}+V\beta\right)\left(n^{(\cdot)}_{-i,d_i}+T\alpha\right)}}{\sum_{j'=1}^{T}\dfrac{\left(n^{(w_i)}_{-i,j'}+\beta\right)\left(n^{(j')}_{-i,d_i}+\alpha\right)}{\left(n^{(\cdot)}_{-i,j'}+V\beta\right)\left(n^{(\cdot)}_{-i,d_i}+T\alpha\right)}}$$

wherein $w_i$ is the i-th word, $z_i=j$ assigns the current topic $j$ to $w_i$, $n^{(w_i)}_{-i,j}$ is the weight sum of words equal to $w_i$ under topic $j$, $n^{(j)}_{-i,d_i}$ is the weight sum of the words in document $d_i$ with topic $j$, $n^{(\cdot)}_{-i,d_i}$ is the weight sum of all topic-assigned words in the current document, $V$ is the vocabulary size, $T$ is the number of topics, and $P(z_i=j\mid z_{-i},w_i)$ is the recalculated posterior probability;

the word-frequency weighting formula of the model is:

$$C_i=2-\frac{\left|n_i-n_{mid}\right|}{\max\{n_{max}-n_{mid},\,n_{mid}-n_{min}\}}$$

wherein $n_i$ is the frequency of the current word, $n_{mid}$ is the frequency of the selected mid-frequency word, $n_{max}$ and $n_{min}$ are the maximum and minimum values in the word-frequency statistics, and $C_i$ is the weight of the current word, with value range [1, 2]; to ensure that the weighted total number of feature words is unchanged, the weight of each feature word is adjusted as

$$F_i=C_i\,n_i\cdot\frac{\sum_k n_k}{\sum_k C_k\,n_k}$$

wherein $F_i$ is the adjusted weight of the feature word, $n_i$ is the number of occurrences of the current word, and $\sum_k n_k$ is the sum of the weights of all words.
2. The topic-model-based domain label acquisition method of claim 1, wherein:
S1, data preprocessing, comprises S11, data deduplication, and S12, word segmentation;
S11, data deduplication is performed to obtain a scholar's document set;
a cleaning model is constructed using word co-occurrence, author co-occurrence, and keyword coincidence rate:
for two texts to be compared, first judge whether their DOIs are the same, and directly filter out texts with identical DOIs;
if the DOIs differ or are absent, first judge the titles using word co-occurrence; if the co-occurrence degree exceeds 80%, continue to judge the author co-occurrence count and the keyword coincidence rate, and if the author co-occurrence count is greater than 1 and the keyword coincidence rate is greater than 0.5, the texts are judged duplicates and removed;
the co-occurrence degree is computed as

$$co(A,B)=\frac{len(A\cap B)}{\min\{len(A),\,len(B)\}}$$

wherein A and B are the word sets of the two titles, len(A) and len(B) are the sizes of the word sets of titles A and B, len(A∩B) is the size of the intersection of the two title word sets, and min{len(A), len(B)} is the smaller of the two lengths;
S12, word segmentation is performed to obtain the initial data set;
the paper keyword data are first extracted and added to the user dictionary of the word segmentation tool; key phrases are extracted with TextRank before the calculation and also added to the user dictionary;
the whole corpus is sorted by word frequency, high-frequency irrelevant words are screened manually, and these irrelevant words are added to the stop-word list of the word segmentation tool.
3. The topic-model-based domain label acquisition method of claim 1, wherein: the vector characterization method is the word2vec method.
4. The topic-model-based domain label acquisition method of claim 1, wherein: the title weight is set to 4, the keyword weight is set to 3, and the abstract weight is set to 2.
5. The topic-model-based domain label acquisition method of any one of claims 1 to 4, wherein: the number of topics of the FLDA model is 20.
6. The topic-model-based domain label acquisition method of claim 1, wherein: in S3, the mapping formula of the domain system mapping is:

F_(A,B) = sim(A,B) * C_A * L_A

wherein A is a phrase obtained by the topic model and B is a system word; the corresponding word vectors are obtained from the vector model, and for out-of-vocabulary phrases a vector is assembled by splicing the vectors of the component words; sim(A,B) is the cosine similarity finally computed, C_A is the probability assigned by the topic model, L_A is the position coefficient of the phrase in the document with value range [2, 3, 4], F_(A,B) is the weighted similarity, and C_B is the final score of the system word.
7. The topic-model-based domain label acquisition method of claim 1, wherein: in S4, comprehensive ranking, all system words corresponding to the current scholar are sorted by score C_B from high to low, and the four highest-scoring system words are taken as the domain label words that best represent the scholar's research domain.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910784200.3A CN110543564B (en) | 2019-08-23 | 2019-08-23 | Domain label acquisition method based on topic model |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910784200.3A CN110543564B (en) | 2019-08-23 | 2019-08-23 | Domain label acquisition method based on topic model |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110543564A CN110543564A (en) | 2019-12-06 |
CN110543564B true CN110543564B (en) | 2023-06-20 |
Family
ID=68712039
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910784200.3A Active CN110543564B (en) | 2019-08-23 | 2019-08-23 | Domain label acquisition method based on topic model |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110543564B (en) |
Families Citing this family (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111241283B (en) * | 2020-01-15 | 2023-04-07 | 电子科技大学 | Rapid characterization method for portrait of scientific research student |
CN111831804B (en) * | 2020-06-29 | 2024-04-26 | 深圳价值在线信息科技股份有限公司 | Method and device for extracting key phrase, terminal equipment and storage medium |
CN112508376A (en) * | 2020-11-30 | 2021-03-16 | 中国科学院深圳先进技术研究院 | Index system construction method |
CN112446204B (en) * | 2020-12-07 | 2024-08-02 | 北京明略软件系统有限公司 | Method, system and computer equipment for determining document label |
CN112883148B (en) * | 2021-01-15 | 2023-03-28 | 博观创新(上海)大数据科技有限公司 | Subject talent evaluation control method and device based on research trend matching |
CN113190672A (en) * | 2021-05-12 | 2021-07-30 | 上海热血网络科技有限公司 | Advertisement judgment model, advertisement filtering method and system |
CN113298399B (en) * | 2021-05-31 | 2023-04-07 | 西南大学 | Scientific research project analysis method based on big data |
CN113887198A (en) * | 2021-10-11 | 2022-01-04 | 平安国际智慧城市科技股份有限公司 | Project splitting method, device and equipment based on topic prediction and storage medium |
CN114492425B (en) * | 2021-12-30 | 2023-04-07 | 中科大数据研究院 | Method for communicating multi-dimensional data by adopting one set of field label system |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105740342A (en) * | 2016-01-22 | 2016-07-06 | 天津中科智能识别产业技术研究院有限公司 | Social relation topic model based social network friend recommendation method |
CN109766544A (en) * | 2018-12-24 | 2019-05-17 | 中国科学院合肥物质科学研究院 | Document keyword abstraction method and device based on LDA and term vector |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9449096B2 (en) * | 2014-01-07 | 2016-09-20 | International Business Machines Corporation | Identifying influencers for topics in social media |
- 2019-08-23: application CN201910784200.3A filed; patent CN110543564B granted and active
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105740342A (en) * | 2016-01-22 | 2016-07-06 | 天津中科智能识别产业技术研究院有限公司 | Social relation topic model based social network friend recommendation method |
CN109766544A (en) * | 2018-12-24 | 2019-05-17 | 中国科学院合肥物质科学研究院 | Document keyword abstraction method and device based on LDA and term vector |
Non-Patent Citations (3)
- Supervised topic models for multi-label classification; Ximing Li et al.; Neurocomputing; 2015.
- Domain label acquisition method based on SL-LDA; Wang Sheng et al.; Computer Science; Nov. 2020.
- Research on multi-label text classification and stream text data modeling based on topic models; Li Ximing; China Doctoral Dissertations Database; Aug. 2015.
Also Published As
Publication number | Publication date |
---|---|
CN110543564A (en) | 2019-12-06 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110543564B (en) | Domain label acquisition method based on topic model | |
CN109492157B (en) | News recommendation method and theme characterization method based on RNN and attention mechanism | |
CN110059311B (en) | Judicial text data-oriented keyword extraction method and system | |
CN109858028B (en) | Short text similarity calculation method based on probability model | |
CN105653706B (en) | A kind of multilayer quotation based on literature content knowledge mapping recommends method | |
CN109271477B (en) | Method and system for constructing classified corpus by means of Internet | |
CN108132927B (en) | Keyword extraction method for combining graph structure and node association | |
CN106997382A (en) | Innovation intention label automatic marking method and system based on big data | |
Wang et al. | Ptr: Phrase-based topical ranking for automatic keyphrase extraction in scientific publications | |
CN107220295A (en) | A kind of people's contradiction reconciles case retrieval and mediation strategy recommends method | |
KR20020049164A (en) | The System and Method for Auto - Document - classification by Learning Category using Genetic algorithm and Term cluster | |
US20180341686A1 (en) | System and method for data search based on top-to-bottom similarity analysis | |
KR20200007713A (en) | Method and Apparatus for determining a topic based on sentiment analysis | |
CN114706972B (en) | Automatic generation method of unsupervised scientific and technological information abstract based on multi-sentence compression | |
CN106776672A (en) | Technology development grain figure determines method | |
Sabuna et al. | Summarizing Indonesian text automatically by using sentence scoring and decision tree | |
CN111221968B (en) | Author disambiguation method and device based on subject tree clustering | |
CN112559684A (en) | Keyword extraction and information retrieval method | |
CN109670014A (en) | A kind of Authors of Science Articles name disambiguation method of rule-based matching and machine learning | |
CN112035658A (en) | Enterprise public opinion monitoring method based on deep learning | |
CN116501875B (en) | Document processing method and system based on natural language and knowledge graph | |
CN108647322A (en) | The method that word-based net identifies a large amount of Web text messages similarities | |
CN113868387A (en) | Word2vec medical similar problem retrieval method based on improved tf-idf weighting | |
CN114265935A (en) | Science and technology project establishment management auxiliary decision-making method and system based on text mining | |
CN115952292A (en) | Multi-label classification method, device and computer readable medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |