CN112784009A - Subject term mining method and device, electronic equipment and storage medium - Google Patents

Subject term mining method and device, electronic equipment and storage medium

Info

Publication number
CN112784009A
CN112784009A (application CN202011580178.XA)
Authority
CN
China
Prior art keywords
candidate
word
importance
text data
words
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011580178.XA
Other languages
Chinese (zh)
Other versions
CN112784009B (en)
Inventor
熊永平
曹滔宇
朱承治
谷纪亭
徐翀
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
State Grid Zhejiang Electric Power Co Ltd
Beijing University of Posts and Telecommunications
Original Assignee
State Grid Zhejiang Electric Power Co Ltd
Beijing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by State Grid Zhejiang Electric Power Co Ltd, Beijing University of Posts and Telecommunications filed Critical State Grid Zhejiang Electric Power Co Ltd
Priority to CN202011580178.XA priority Critical patent/CN112784009B/en
Publication of CN112784009A publication Critical patent/CN112784009A/en
Application granted granted Critical
Publication of CN112784009B publication Critical patent/CN112784009B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

One or more embodiments of the present application provide a topic word mining method, apparatus, electronic device, and storage medium, including: acquiring text data; filtering the text data based on a language model to determine a set of candidate words; screening the candidate word set based on an unsupervised algorithm and a prediction model to determine an importance result of the candidate word set; and determining the topic words according to the importance result of the candidate word set. The method uses the language model to filter out character sequences with low internal cohesion in the text data, reducing the influence of loosely joined characters on topic word mining; it uses the degree of freedom of a word in the text data to reflect the uncertainty of the characters adjacent to it on the left and right, finding words that can be used freely and independently and narrowing the mining range. The complex structure of Chinese text corpora is fully considered, and through layer-by-layer screening the method identifies the topic words of the text data while also mining, by importance ranking, potential topic words formed from emerging professional vocabulary.

Description

Subject term mining method and device, electronic equipment and storage medium
Technical Field
One or more embodiments of the present application relate to the field of data mining technologies, and in particular, to a topic word mining method and apparatus, an electronic device, and a storage medium.
Background
In the prior art, duplicate checking of project texts involves a large volume of text and high text granularity, so rapidly retrieving similar documents from a document database has become the primary problem in improving the accuracy and efficiency of duplicate checking. Since scientific and technological projects and research documents usually revolve around several keywords, and these keywords reflect the gist of the text to a certain extent, the similarity between texts can be checked simply by finding and comparing the industry topic words found in each text.
Chinese text has a more complex organization structure than English text. In the prior art, topic words in Chinese text are generally mined by directly applying methods designed for English text or are extracted manually, which yields low accuracy and cannot accurately mine the potential topic words formed from emerging professional vocabulary in Chinese text.
Disclosure of Invention
In view of the above, one or more embodiments of the present application are directed to a topic word mining method, apparatus, electronic device, and storage medium, so as to solve at least one of the above problems in the prior art.
In view of the above, one or more embodiments of the present application provide a topic word mining method, including:
acquiring text data;
filtering the text data based on a language model to determine a set of candidate words;
screening the candidate word set based on an unsupervised algorithm and a prediction model to determine an importance result of the candidate word set;
and determining the subject word according to the importance result of the candidate word set.
Optionally, the acquiring text data specifically includes:
acquiring an industry text corpus to be processed;
preprocessing the industry text corpus to obtain the text data; the preprocessing operation comprises: deleting redundant characters, determining text granularity and performing line division processing.
Optionally, the filtering the text data based on the language model to determine a candidate word set specifically includes:
determining word length and word frequency of words in the text data according to the text data based on the language model;
selecting the vocabulary in the text data with the word length not greater than a word length threshold and the word frequency not less than a word frequency threshold by using a data mining strategy to determine candidate words;
and determining the candidate word set from the candidate words based on a solidity and degree-of-freedom screening strategy.
Optionally, the determining the candidate word set from the candidate words based on the solidity and degree-of-freedom screening strategy specifically includes:
determining a solidity SD(Wi) and a degree of freedom FD(Wi) of the candidate word; the solidity SD(Wi) of the candidate word is expressed as
SD(Wi) = min over 1 ≤ k < n of p(C1C2...Cn) / ( p(C1...Ck) · p(Ck+1...Cn) )
wherein Wi represents the candidate word, Wi = C1C2...Cn, C1C2...Cn represent characters, and p() represents a probability function;
the degree of freedom FD(Wi) of the candidate word is expressed as
FD(Wi) = min{ LE(Wi), RE(Wi) }
wherein LE(Wi) represents the left-adjacent entropy of the candidate word and RE(Wi) represents the right-adjacent entropy of the candidate word;
and selecting, based on the solidity and degree-of-freedom screening strategy, the candidate words whose solidity SD(Wi) is not less than a solidity threshold and whose degree of freedom FD(Wi) is not less than a degree-of-freedom threshold, to determine the candidate word set.
Optionally, the importance result of the candidate word set includes a first importance EMS(Wi);
the screening of the candidate word set based on the unsupervised algorithm and the prediction model to determine the importance result of the candidate word set specifically includes:
determining, according to the unsupervised algorithm, the first importance EMS(Wi) of each candidate word in the candidate word set with respect to the text data; the first importance EMS(Wi) is expressed as
[formula given only as an image in the original publication]
wherein Tj represents a text segment obtained after segmenting the text data and ri() represents an iteration function.
Optionally, the importance result of the candidate word set further includes a second importance LCS(Wi);
the screening of the candidate word set based on the unsupervised algorithm and the prediction model to determine the importance result of the candidate word set specifically includes:
determining, according to the prediction model, the second importance LCS(Wi) of each candidate word in the candidate word set with respect to the text data.
Optionally, the determining a subject word according to the result of the importance of the candidate word set specifically includes:
determining the importance scores of the candidate words according to the importance results of the candidate word set;
and arranging the importance scores of the candidate words in descending order, selecting a preset number of candidate words in order starting from the candidate word with the largest importance score, and determining the selected candidate words as the topic words.
Based on the same inventive concept, one or more embodiments of the present application further provide a topic word mining apparatus, including:
an acquisition module configured to acquire text data;
a filtering module configured to filter the textual data based on a language model to determine a set of candidate words;
a screening module configured to screen the set of candidate words based on an unsupervised algorithm and a predictive model to determine an importance result of the set of candidate words;
and the determining module is configured to determine the subject word according to the importance result of the candidate word set.
Based on the same inventive concept, one or more embodiments of the present application further provide an electronic device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and when the processor executes the computer program, the processor implements the topic word mining method described in any one of the above items.
Based on the same inventive concept, one or more embodiments of the present application further propose a non-transitory computer-readable storage medium storing computer instructions for causing the computer to execute the topic word mining method described in any one of the above.
As can be seen from the above description, one or more embodiments of the present application provide a topic word mining method, including: acquiring text data; filtering the text data based on a language model to determine a set of candidate words; screening the candidate word set based on an unsupervised algorithm and a prediction model to determine an importance result of the candidate word set; and determining the topic words according to the importance result of the candidate word set. The method obtains the text data by preprocessing the acquired text corpus to be processed, cleaning it and reducing the influence of redundant characters on the topic words in the text data. It filters the text data through the language model to determine a candidate word set, removing character sequences with low internal cohesion and reducing the influence of loosely joined characters on topic word mining; the degree of freedom of a word in the text data reflects the uncertainty of its left and right adjacent characters, which is used to find words that can be used freely and independently and to narrow the mining range. The candidate word set is then further screened by an unsupervised algorithm and a supervised prediction model, and the topic words are mined by ranking the importance of each candidate word in the set. The complex structure of Chinese text corpora is fully considered; through layer-by-layer screening, the method identifies the topic words of the text data while also mining, by importance ranking, potential topic words formed from emerging professional vocabulary.
Drawings
In order to more clearly illustrate one or more embodiments or prior art solutions in the present application, the drawings that are needed in the description of the embodiments or prior art will be briefly described below, and it is obvious that the drawings in the description below are only one or more embodiments in the present application, and that other drawings can be obtained by those skilled in the art without inventive effort from these drawings.
FIG. 1 is a flow diagram of a topic word mining method in one or more embodiments of the application;
fig. 2 is a schematic structural diagram of a topic word mining apparatus according to one or more embodiments of the present application;
fig. 3 is a schematic structural diagram of an electronic device in one or more embodiments of the present application.
Detailed Description
For the purpose of promoting a better understanding of the objects, aspects and advantages of the present disclosure, reference is made to the following detailed description taken in conjunction with the accompanying drawings.
It is to be noted that unless otherwise defined, technical or scientific terms used in one or more embodiments of the present application shall have the ordinary meaning as understood by one of ordinary skill in the art to which this disclosure belongs. The use of "first," "second," and similar terms in one or more embodiments of the present application does not denote any order, quantity, or importance, but rather the terms are used to distinguish one element from another. The word "comprising" or "comprises", and the like, means that the element or item listed before the word covers the element or item listed after the word and its equivalents, but does not exclude other elements or items. The terms "connected" or "coupled" and the like are not restricted to physical or mechanical connections, but may include electrical connections, whether direct or indirect. "upper", "lower", "left", "right", and the like are used merely to indicate relative positional relationships, and when the absolute position of the object being described is changed, the relative positional relationships may also be changed accordingly.
As described in the background section, in the prior art, as texts in professional fields become increasingly rich, the number of documents in document databases grows exponentially, and in order to direct scientific research funding toward research projects with high value density, a multi-stage review process needs to be set up when scientific research projects are established.
The applicant has found through research that, in the prior art, the above manual duplicate-checking approach places high demands on the professional skill of the staff and suffers from low query efficiency, missed duplicates, and errors. In addition, the duplicate checking of scientific research projects currently faces the challenges of a larger text corpus volume and higher granularity of project text corpora. Therefore, a method is needed that can improve the accuracy and efficiency of duplicate checking for scientific and technological projects and research documents, that is, one that can quickly retrieve similar documents from a document database. The applicant has found that scientific research projects and research documents usually center on several keywords that reflect, to a certain extent, the gist of the text, so for industry duplicate checking the similarity between texts can be checked simply by finding and comparing the industry topic words in each text. Moreover, unlike English text, Chinese text has a more complex organization structure, so Chinese documents are more difficult to process than English documents, and prior-art methods based on manual duplicate searching or on topic word mining methods designed for English text suffer from low accuracy and cannot accurately mine the potential topic words formed from emerging professional vocabulary in Chinese text.
Therefore, the method provided by the application obtains the text data by preprocessing the acquired text corpus to be processed, cleaning it and reducing the influence of redundant characters on the topic words in the text data. It filters the text data through the language model to determine a candidate word set, removing character sequences with low internal cohesion and reducing the influence of loosely joined characters on topic word mining; the degree of freedom of a word in the text data reflects the uncertainty of its left and right adjacent characters, which is used to find words that can be used freely and independently and to narrow the mining range. The candidate word set is then further screened by an unsupervised algorithm and a supervised prediction model, and the topic words are mined by ranking the importance of each candidate word in the set. The complex structure of Chinese text corpora is fully considered; through layer-by-layer screening, the method identifies the topic words of the text data while also mining, by importance ranking, potential topic words formed from emerging professional vocabulary, improving the accuracy and efficiency of topic word mining and extraction.
Hereinafter, the technical means of the present disclosure will be described in further detail with reference to specific examples.
Referring to fig. 1, a topic word mining method provided in one or more embodiments of the present application specifically includes the following steps:
s101: text data is acquired.
In this embodiment, the text data on which topic word mining is to be performed is obtained. Specifically, the text data is obtained by acquiring an industry text corpus to be processed, performing a preprocessing operation on it, and cleaning it. The industry text corpus to be processed may be text-type data such as electronic books, news documents, research papers, digital libraries, web pages, e-mails, and database records.
In some optional embodiments, the preprocessing operation may include: deleting redundant characters, determining text granularity, line-division processing, Chinese word segmentation, deleting format marks, and so on. For example, for web page text data, the web page tags need to be removed to obtain plain text.
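To make the step concrete, the following is a minimal Python sketch of such a preprocessing pass; the regular expressions, the sentence-level granularity, and the function name are illustrative assumptions, not the patent's own code.

    import re

    def preprocess(raw_corpus):
        """Clean an industry text corpus and split it into text segments."""
        text = re.sub(r"<[^>]+>", "", raw_corpus)    # delete markup remnants
        text = re.sub(r"[ \t\u3000]+", "", text)     # delete redundant whitespace
        # Determine granularity: here one segment per sentence, splitting on
        # Chinese sentence-final punctuation (line-division processing).
        segments = re.split(r"[。！？；\n]+", text)
        return [seg for seg in segments if seg]      # drop empty segments

    segments = preprocess("主题词挖掘方法。<b>基于语言模型</b>过滤文本数据！")
    print(segments)  # ['主题词挖掘方法', '基于语言模型过滤文本数据']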
S102: filtering the text data based on a language model to determine a set of candidate words.
In this embodiment, after the text data is obtained, because the vocabulary in the text data is huge, the text data may be filtered based on the language model to determine the candidate word set, whose members represent the vocabulary of the text data. Specifically, the word length and word frequency of words in the text data can be determined based on the language model; a data mining strategy is then used to select the words in the text data whose word length is not greater than a word length threshold and whose word frequency is not less than a word frequency threshold as candidate words; after the candidate words are determined, the candidate word set is determined from them based on a solidity and degree-of-freedom screening strategy.
In some alternative embodiments, an N-gram model may be selected as the language model, where N is in the range [2, 10]. After the industry text corpus to be processed has been preprocessed, the granularity of the text data can be determined and the text data segmented into several text segments, where a text segment may be a sentence, a paragraph, or even a whole document; the text data is arranged into a set of text segments, and punctuation marks in the text segments are filtered out.
In some alternative embodiments, for each vocabulary item in the text data, the frequency F(Wi) with which it appears in the text data and its word length are calculated; a word length threshold TL of 10 and a word frequency threshold TF of 3 may be set in advance. According to the granularity of the text segments, a data mining strategy (such as an Apriori strategy) is used to select the words in the text data whose word length is not greater than the word length threshold and whose word frequency is not less than the word frequency threshold, and the words meeting these conditions are determined as candidate words. An overcomplete dictionary may be composed from the candidate words, and the usage of each candidate word is initialized with its normalized word frequency (e.g., in the interval [0, 1]).
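A minimal sketch of this candidate-generation step, assuming the thresholds quoted above (word length TL = 10, word frequency TF = 3); single characters are also counted so that the later solidity computation has their probabilities available, and the function name is an assumption.

    from collections import Counter

    def candidate_ngrams(segments, max_len=10, min_freq=3):
        """Count n-grams (1 <= n <= max_len) and keep frequent ones.

        Returns an overcomplete dictionary mapping each kept n-gram to its
        normalized word frequency in [0, 1]."""
        counts = Counter()
        for seg in segments:
            for n in range(1, max_len + 1):
                for i in range(len(seg) - n + 1):
                    counts[seg[i:i + n]] += 1
        total = sum(counts.values())
        return {w: c / total for w, c in counts.items() if c >= min_freq}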
In some alternative embodiments, after the candidate words are determined, a candidate word set may be determined from them based on a solidity and degree-of-freedom screening strategy. Since each candidate word may be composed of several characters, a candidate word may be represented as Wi = C1C2...Cn-1Cn, where C1C2...Cn represent characters. For any candidate word Wi, internally, its degree of cohesion can be analyzed, i.e., whether its characters are joined together tightly enough; the higher the cohesion, the more likely the candidate word is an independent word. Externally, it can be analyzed whether the candidate word can be used independently and freely across the whole text data, i.e., whether the uncertainty of the characters adjacent to it on the left and right is large (large entropy); the larger the entropy value, the more freely and independently the candidate word can be used.
It should be noted that determining the candidate word set from the candidate words based on the solidity and degree-of-freedom screening strategy may specifically include: determining a solidity SD(Wi) and a degree of freedom FD(Wi) of each candidate word, where the solidity SD(Wi) of a candidate word is expressed as

SD(Wi) = min over 1 ≤ k < n of p(C1C2...Cn) / ( p(C1...Ck) · p(Ck+1...Cn) )

where p() represents a probability function. When determining the degree of freedom of a candidate word, its left-adjacent entropy and right-adjacent entropy may be determined first. Specifically, when calculating the left-adjacent entropy, the set of all left-adjacent characters of a candidate word is defined as

Sleft(Wi) = { C0,i , i = 1, 2, ..., m }

where C0,i represents a left-adjacent character. Further, the probability of occurrence of each left-adjacent character may be expressed as

p(C0,i | Wi) = count(C0,i Wi) / count(Wi)

where count() represents a counting function. Combining the above parameters, the left-adjacent entropy of any candidate word can be expressed as

LE(Wi) = − Σ (i = 1 to m) p(C0,i | Wi) · log p(C0,i | Wi)

Similarly, when calculating the right-adjacent entropy, the set of all right-adjacent characters of a candidate word is defined as

Sright(Wi) = { Cn+1,i , i = 1, 2, ..., m }

where Cn+1,i represents a right-adjacent character. Thus, the right-adjacent entropy of any candidate word can be expressed as

RE(Wi) = − Σ (i = 1 to m) p(Cn+1,i | Wi) · log p(Cn+1,i | Wi)

After the solidity and degree of freedom of the candidate words are determined, the candidate words whose solidity is not less than a preset solidity threshold and whose degree of freedom is not less than a preset degree-of-freedom threshold are selected based on the solidity and degree-of-freedom screening strategy, yielding the candidate word set. The preset solidity threshold may be 5.3 and the preset degree-of-freedom threshold 0.75; in practice, both thresholds may be adjusted dynamically according to the corpus.
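The screen described above can be sketched as follows. It assumes the reconstructed ratio form of the solidity (minimum over binary splits, without a log scale, which the image-only formula in the source does not confirm) and the thresholds 5.3 and 0.75 quoted in the text; the dictionary `p` is the normalized-frequency dictionary from the previous step.

    import math
    from collections import Counter

    def solidity(word, p):
        """SD(Wi): minimum over binary splits of p(word) / (p(left) * p(right))."""
        ratios = [p[word] / (p[word[:k]] * p[word[k:]])
                  for k in range(1, len(word))
                  if word[:k] in p and word[k:] in p]
        return min(ratios) if ratios else 0.0

    def entropy(counter):
        total = sum(counter.values())
        return -sum((c / total) * math.log(c / total) for c in counter.values())

    def freedom(word, segments):
        """FD(Wi) = min{LE(Wi), RE(Wi)} over left/right neighboring characters."""
        left, right = Counter(), Counter()
        for seg in segments:
            pos = seg.find(word)
            while pos != -1:
                if pos > 0:
                    left[seg[pos - 1]] += 1
                end = pos + len(word)
                if end < len(seg):
                    right[seg[end]] += 1
                pos = seg.find(word, pos + 1)
        if not left or not right:
            return 0.0
        return min(entropy(left), entropy(right))

    def screen(p, segments, sd_threshold=5.3, fd_threshold=0.75):
        """Keep multi-character candidates meeting both thresholds."""
        return {w for w in p
                if len(w) > 1
                and solidity(w, p) >= sd_threshold
                and freedom(w, segments) >= fd_threshold}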
S103: screening the set of candidate words based on an unsupervised algorithm and a predictive model to determine an importance result for the set of candidate words.
In this embodiment, after the candidate word set is determined, it may be screened based on an unsupervised algorithm combined with a prediction model obtained by supervised training, so as to obtain the importance result of the candidate word set. Before the unsupervised algorithm is used to screen the candidate word set, the following parameters may be defined: assuming that a sentence is composed of a set of words and each word is composed of a set of characters, the dictionary over the text data is defined as

D = { W1, W2, ..., WN }

Suppose that each sentence in the text data is constructed by concatenating a set of candidate words randomly sampled from the dictionary D, where word Wi is sampled with probability θi. The dictionary probability parameter is defined as

θ = ( θ1, θ2, ..., θN )

A sentence generated from K vocabulary items can be expressed as

S = Wi1 Wi2 ... WiK

and the probability of generating this sentence is expressed as

P(S | θ) = ∏ (k = 1 to K) θik

where P() represents a probability function built from the individual sampling probabilities. For a given unsegmented text segment T, define CT as the set of all sentences obtained by segmenting T according to the dictionary D; the probability of the unsegmented text segment T can then be expressed as

P(T | θ) = Σ (S ∈ CT) P(S | θ)
It should be noted that the importance result of the candidate word set includes a first importance EMS(Wi). Screening the candidate word set based on the unsupervised algorithm specifically includes: determining, according to the unsupervised algorithm (also called the EMwords algorithm), the first importance EMS(Wi) of each candidate word in the candidate word set with respect to the text data. To determine the first importance EMS(Wi), define θr as the parameter estimate of the r-th iteration round; the expectation step (E-step) of the unsupervised algorithm then determines an iterative function, expressed as

Q(θ, θr) = Σj Σ (S ∈ CTj) P(S | Tj, θr) · log P(S | θ)

With the iterative function Q(θ, θr) obtained, the maximization step (M-step) of the unsupervised algorithm updates θr using the following expression:

θi(r+1) = Σj ni(Tj) / ( Σi′ Σj ni′(Tj) )

where ni(S) represents the number of occurrences of the candidate word in sentence S, and ni(Tj), the expected count of the candidate word in text segment Tj, can be expressed as

ni(Tj) = Σ (S ∈ CTj) P(S | Tj, θr) · ni(S)

The sum of these expected counts over all text segments can be expressed as

Σj ni(Tj)

Integrating the above parameters, the first importance EMS(Wi) is expressed as

[formula given only as an image in the original publication]

where ri() represents an iteration function and ri(Tj) represents an importance parameter, which can be expressed as

[formula given only as an image in the original publication]

where I() denotes an indicator function that is 1 when the expression in parentheses is true and 0 otherwise, and θ̂ represents the optimal iteration parameters. Note that ri(Tj) is an importance parameter, and the first importance EMS(Wi) is defined by a negative mapping of it while being limited to the interval [0, 1]. For example, a word with high importance in a sentence usually appears frequently, but the importance parameter ri(Tj) takes smaller values for high-frequency words, i.e., the value computed by ri(Tj) is inversely related to the importance of the word; therefore a logarithmic function can be used to limit the interval while the negative mapping is performed.
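A compact sketch of one E-step/M-step round under the model above. Enumerating the segmentation set CT is exponential, so this is only workable for short text segments, and since the exact EMS(Wi) and ri(Tj) formulas are images in the source, the sketch stops at the expected counts ni(Tj) and the updated θ.

    import math
    from collections import defaultdict

    def segmentations(text, theta, max_len=10):
        """Enumerate every segmentation of `text` into dictionary words
        (falling back to single characters), i.e. the set C_T."""
        if not text:
            yield []
            return
        for n in range(1, min(max_len, len(text)) + 1):
            w = text[:n]
            if w in theta or n == 1:
                for rest in segmentations(text[n:], theta, max_len):
                    yield [w] + rest

    def em_round(segments, theta):
        """One E-step/M-step pass: expected word counts, then renormalized theta."""
        expected = defaultdict(float)
        for T in segments:
            segs = list(segmentations(T, theta))
            probs = [math.prod(theta.get(w, 1e-8) for w in S) for S in segs]
            Z = sum(probs)
            if Z == 0:
                continue
            for S, pr in zip(segs, probs):   # E-step: P(S | T, theta_r) * n_i(S)
                for w in S:
                    expected[w] += pr / Z
        total = sum(expected.values())
        return {w: c / total for w, c in expected.items()}  # M-step update

    # theta can be initialized with the candidates' normalized frequencies and
    # refined iteratively, e.g.: for _ in range(5): theta = em_round(segments, theta)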
In some optional embodiments, using the unsupervised algorithm for topic word discovery has the advantages of a solid theoretical basis, no need for labeled data, and no dependence on a knowledge base; after the candidate word set is determined from the text data and screened by importance, the preliminary screening result has a high lower bound. A well-trained supervised prediction model can then be used for further screening. For text data in different fields, publicly available, high-quality expert annotations of domain vocabulary can be collected and used as a knowledge base or training set for the supervised model, improving the prediction model's effect. For example, in the training data preparation stage, relevant authoritative domain dictionaries and journal paper keywords were collected to obtain 110 domain professional words, and over 1.2 million general words were obtained from the 30 GB Sogou news corpus. The domain words and general words are labeled separately: B-COMMON and I-COMMON denote general-vocabulary labels, B-ELEC and I-ELEC denote domain-vocabulary labels, and the prefixes B and I denote the start mark and the internal mark of a vocabulary label, respectively. The words in the existing professional lexicon and general lexicon are then split into single characters, preserving word structure, and fed into a Word2vec network for training, producing character vectors of dimension m.
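A sketch of the character-vector pretraining just described, assuming the gensim Word2Vec API; the toy lexicons and the dimension m are placeholders, not the patent's training data.

    from gensim.models import Word2Vec

    # Toy stand-ins for the labeled domain / general lexicons described above.
    domain_words = ["变压器", "断路器"]
    common_words = ["我们", "今天"]
    m = 100  # character-vector dimension (placeholder)

    # Each lexicon word is split into single characters, preserving word structure.
    char_sentences = [list(word) for word in domain_words + common_words]

    w2v = Word2Vec(sentences=char_sentences, vector_size=m, window=5,
                   min_count=1, sg=1)  # skip-gram character embeddings
    char_vec = w2v.wv["变"]            # m-dimensional vector for one character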
In some optional embodiments, the importance result of the candidate word set further includes a second importance LCS(Wi). Screening the candidate word set based on the unsupervised algorithm and the prediction model to determine the importance result of the candidate word set further includes: determining, according to the prediction model, the second importance LCS(Wi) of each candidate word in the candidate word set with respect to the text data. Specifically, the prediction model for topic word discovery may adopt a character-vector-based BiLSTM + CRF structure; compared with the traditional BiLSTM model, a CRF layer is added to learn an optimal path. The output vector dimension of the BiLSTM layer is defined as tag-size, which amounts to mapping each candidate word wi to a label tagj. Let P be the output matrix of the BiLSTM layer; then Pi,j represents the non-normalized probability of mapping candidate word wi to tagj. For the CRF layer, assume there is a transition matrix A, where Ai,j represents the transition probability from tagi to tagj. For the tag sequence y output for an input sequence X, the score is defined as

score(X, y) = Σ (i = 0 to n) A(yi, yi+1) + Σ (i = 1 to n) P(i, yi)

where A(yi, yi+1) denotes the transition probability from the i-th label to the (i+1)-th label, and P(i, yi) denotes the non-normalized probability of mapping candidate word wi to the i-th label.

The score of each tag sequence y is normalized with a Softmax function to obtain a probability value, i.e., the likelihood, expressed as

p(y | X) = exp( score(X, y) ) / Σ (y′ ∈ YX) exp( score(X, y′) )

where YX represents the set of all tag sequences y corresponding to the input sequence X. Thus, in training it is only necessary to maximize the likelihood p(y | X); the log-likelihood is used here, i.e.

log p(y | X) = score(X, y) − log Σ (y′ ∈ YX) exp( score(X, y′) )

Finally, the trained prediction model can be used to decode the data set composed of the test corpus, obtaining a probability value for each candidate word; this probability value can be recorded as the second importance LCS(Wi) of the corresponding candidate word.
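The path score and log-likelihood reconstructed above can be sketched as follows (PyTorch assumed; real BiLSTM + CRF layers also add start/stop tags, which this sketch omits).

    import torch

    def path_score(P, A, y):
        """P: (seq_len, tag_size) BiLSTM emissions; A: (tag_size, tag_size)
        transitions; y: (seq_len,) gold tag indices. Returns score(X, y)."""
        emit = P[torch.arange(len(y)), y].sum()   # sum of P(i, y_i)
        trans = A[y[:-1], y[1:]].sum()            # sum of A(y_i, y_{i+1})
        return emit + trans

    def log_likelihood(P, A, y):
        """log p(y|X) = score(X, y) - log-sum-exp over all tag sequences,
        computed with the forward algorithm in O(seq_len * tag_size^2)."""
        alpha = P[0]
        for t in range(1, P.size(0)):
            alpha = torch.logsumexp(alpha.unsqueeze(1) + A, dim=0) + P[t]
        log_Z = torch.logsumexp(alpha, dim=0)
        return path_score(P, A, y) - log_Z

    P = torch.randn(6, 4); A = torch.randn(4, 4); y = torch.randint(0, 4, (6,))
    loss = -log_likelihood(P, A, y)   # minimized during training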
S104: and determining the subject word according to the importance result of the candidate word set.
In this embodiment, after the importance result of the candidate word set is determined, the topic words may be determined according to the importance result by combining the first importance EMS(Wi) and the second importance LCS(Wi). Specifically, for the candidate word set screened by the unsupervised algorithm and the prediction model, an importance score is calculated for each candidate word Wi in the set, i.e., the importance score is determined according to the importance result of the candidate word set and is expressed as

S(Wi) = (1 − μ) · EMS(Wi) + μ · LCS(Wi)

where μ denotes a weighting coefficient; for example, μ takes the value 0.3. The importance scores of the candidate words are arranged in descending order, a preset number (for example, top-N) of candidate words are selected in order starting from the candidate word with the largest importance score, and the selected candidate words are determined as the topic words, completing the topic word mining.
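A sketch of this final ranking step, with the weight μ = 0.3 quoted above; the top-N count is a placeholder.

    def rank_topic_words(ems, lcs, mu=0.3, top_n=20):
        """ems, lcs: dicts mapping candidate word -> EMS(Wi), LCS(Wi)."""
        scores = {w: (1 - mu) * ems[w] + mu * lcs.get(w, 0.0) for w in ems}
        # Sort importance scores from large to small and keep the top-N words.
        return sorted(scores, key=scores.get, reverse=True)[:top_n]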
As can be seen from the above description, one or more embodiments of the present application provide a topic word mining method, including: acquiring text data; filtering the text data based on a language model to determine a set of candidate words; screening the candidate word set based on an unsupervised algorithm and a prediction model to determine an importance result of the candidate word set; and determining the topic words according to the importance result of the candidate word set. The method obtains the text data by preprocessing the acquired text corpus to be processed, cleaning it and reducing the influence of redundant characters on the topic words in the text data. It filters the text data through the language model to determine a candidate word set, removing character sequences with low internal cohesion and reducing the influence of loosely joined characters on topic word mining; the degree of freedom of a word in the text data reflects the uncertainty of its left and right adjacent characters, which is used to find words that can be used freely and independently and to narrow the mining range. The candidate word set is then further screened by an unsupervised algorithm and a supervised prediction model, and the topic words are mined by ranking the importance of each candidate word in the set. The complex structure of Chinese text corpora is fully considered; through layer-by-layer screening, the method identifies the topic words of the text data while also mining, by importance ranking, potential topic words formed from emerging professional vocabulary, improving the accuracy and efficiency of topic word mining and extraction.
It is to be appreciated that the method can be performed by any apparatus, device, platform, cluster of devices having computing and processing capabilities.
It should be noted that the method of one or more embodiments of the present disclosure may be performed by a single device, such as a computer or server. The method of the embodiment can also be applied to a distributed scene and completed by the mutual cooperation of a plurality of devices. In such a distributed scenario, one of the devices may perform only one or more steps of the method of one or more embodiments of the present disclosure, and the devices may interact with each other to complete the method.
It should be noted that the above description describes certain embodiments of the present disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.
Based on the same inventive concept, one or more embodiments of the present application further provide a topic word mining apparatus, which, with reference to fig. 2, includes:
an acquisition module configured to acquire text data;
a filtering module configured to filter the textual data based on a language model to determine a set of candidate words;
a screening module configured to screen the set of candidate words based on an unsupervised algorithm and a predictive model to determine an importance result of the set of candidate words;
and the determining module is configured to determine the subject word according to the importance result of the candidate word set.
In some optional embodiments, the acquiring text data specifically includes:
acquiring a text corpus of an industry to be processed;
preprocessing the industry text corpus to obtain the text data; the preprocessing operation comprises: deleting redundant characters, determining text granularity and performing line division processing.
In some optional embodiments, the filtering the text data based on the language model to determine the candidate word set specifically includes:
determining word length and word frequency of words in the text data according to the text data based on the language model;
selecting the vocabulary in the text data with the word length not greater than a word length threshold and the word frequency not less than a word frequency threshold by using a data mining strategy to determine candidate words;
and determining the candidate word set from the candidate words based on a solidity and degree-of-freedom screening strategy.
In some optional embodiments, the determining the candidate word set from the candidate words based on the solidity and degree-of-freedom screening strategy specifically includes:
determining a solidity SD(Wi) and a degree of freedom FD(Wi) of the candidate word; the solidity SD(Wi) of the candidate word is expressed as
SD(Wi) = min over 1 ≤ k < n of p(C1C2...Cn) / ( p(C1...Ck) · p(Ck+1...Cn) )
wherein Wi represents the candidate word, Wi = C1C2...Cn, C1C2...Cn represent characters, and p() represents a probability function;
the degree of freedom FD(Wi) of the candidate word is expressed as
FD(Wi) = min{ LE(Wi), RE(Wi) }
wherein LE(Wi) represents the left-adjacent entropy of the candidate word and RE(Wi) represents the right-adjacent entropy of the candidate word;
and selecting, based on the solidity and degree-of-freedom screening strategy, the candidate words whose solidity SD(Wi) is not less than a solidity threshold and whose degree of freedom FD(Wi) is not less than a degree-of-freedom threshold, to determine the candidate word set.
In some optional embodiments, the importance result of the candidate word set includes a first importance EMS(Wi);
the screening of the candidate word set based on the unsupervised algorithm and the prediction model to determine the importance result of the candidate word set specifically includes:
determining, according to the unsupervised algorithm, the first importance EMS(Wi) of each candidate word in the candidate word set with respect to the text data; the first importance EMS(Wi) is expressed as
[formula given only as an image in the original publication]
wherein Tj represents a text segment obtained after segmenting the text data and ri() represents an iteration function.
In some optional embodiments, the importance result of the candidate word set further includes a second importance LCS(Wi);
the screening of the candidate word set based on the unsupervised algorithm and the prediction model to determine the importance result of the candidate word set specifically includes:
determining, according to the prediction model, the second importance LCS(Wi) of each candidate word in the candidate word set with respect to the text data.
In some optional embodiments, the determining a subject word according to the result of the importance of the candidate word set specifically includes:
determining the importance scores of the candidate words according to the importance result of the candidate word set;
and arranging the importance scores of the candidate words in descending order, selecting a preset number of candidate words in order starting from the candidate word with the largest importance score, and determining the selected candidate words as the topic words.
For convenience of description, the above devices are described as being divided into various modules by functions, and are described separately. Of course, the functionality of the modules may be implemented in the same one or more software and/or hardware implementations in implementing one or more embodiments of the present description.
The apparatus of the foregoing embodiment is used to implement the corresponding method in the foregoing embodiment, and has the beneficial effects of the corresponding method embodiment, which are not described herein again.
Based on the same inventive concept, corresponding to any of the above embodiments, one or more embodiments of the present specification further provide an electronic device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and when the processor executes the computer program, the processor implements the topic word mining method according to any of the above embodiments.
Fig. 3 is a schematic diagram illustrating a more specific hardware structure of an electronic device according to this embodiment, where the electronic device may include: a processor 310, a memory 320, an input/output interface 330, a communication interface 340, and a bus 350. Wherein the processor 310, memory 320, input/output interface 330, and communication interface 340 are communicatively coupled to each other within the device via bus 350.
The processor 310 may be implemented by a general-purpose CPU (Central Processing Unit), a microprocessor, an Application Specific Integrated Circuit (ASIC), or one or more Integrated circuits, and is configured to execute related programs to implement the technical solutions provided in the embodiments of the present specification.
The Memory 320 may be implemented in the form of a ROM (Read Only Memory), a RAM (Random Access Memory), a static storage device, a dynamic storage device, or the like. The memory 320 may store an operating system and other application programs, and when the technical solution provided by the embodiments of the present specification is implemented by software or firmware, the relevant program codes are stored in the memory 320 and called to be executed by the processor 310.
The input/output interface 330 is used for connecting an input/output module to realize information input and output. The i/o module may be configured as a component in a device (not shown) or may be external to the device to provide a corresponding function. The input device may include a keyboard, a mouse, a touch screen, a microphone, various sensors, etc., and the output device may include a display, a speaker, a vibrator, an indicator light, etc.
The communication interface 340 is used for connecting a communication module (not shown in the figure) to implement communication interaction between the present device and other devices. The communication module can realize communication in a wired mode (such as USB, network cable and the like) and also can realize communication in a wireless mode (such as mobile network, WIFI, Bluetooth and the like).
Bus 350 includes a path that transfers information between the various components of the device, such as processor 310, memory 320, input/output interface 330, and communication interface 340.
It should be noted that although the above-mentioned device only shows the processor 310, the memory 320, the input/output interface 330, the communication interface 340 and the bus 350, in a specific implementation, the device may also include other components necessary for normal operation. In addition, those skilled in the art will appreciate that the above-described apparatus may also include only those components necessary to implement the embodiments of the present description, and not necessarily all of the components shown in the figures.
The electronic device of the foregoing embodiment is used to implement the corresponding method in the foregoing embodiment, and has the beneficial effects of the corresponding method embodiment, which are not described herein again.
Based on the same inventive concept, corresponding to any of the above-described embodiment methods, one or more embodiments of the present specification further provide a non-transitory computer-readable storage medium storing computer instructions for causing the computer to perform the subject word mining method according to any of the above-described embodiment.
The non-transitory computer-readable storage medium of the present embodiments, including permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device.
The computer instructions stored in the storage medium of the above embodiment are used to enable the computer to execute the topic word mining method according to any one of the above embodiments, and have the beneficial effects of the corresponding method embodiment, which are not described herein again.
Those of ordinary skill in the art will understand that: the discussion of any embodiment above is meant to be exemplary only, and is not intended to intimate that the scope of the disclosure, including the claims, is limited to these examples; within the spirit of the present disclosure, features from the above embodiments or from different embodiments may also be combined, steps may be implemented in any order, and there are many other variations of different aspects of one or more embodiments in this application as described above, which are not provided in detail for the sake of brevity.
In addition, well-known power/ground connections to Integrated Circuit (IC) chips and other components may or may not be shown in the provided figures, for simplicity of illustration and discussion, and so as not to obscure one or more embodiments of the disclosure. Furthermore, devices may be shown in block diagram form in order to avoid obscuring the understanding of one or more embodiments of the present description, and this also takes into account the fact that specifics with respect to implementation of such block diagram devices are highly dependent upon the platform within which the one or more embodiments of the present description are to be implemented (i.e., specifics should be well within purview of one skilled in the art). Where specific details (e.g., circuits) are set forth in order to describe example embodiments of the disclosure, it should be apparent to one skilled in the art that one or more embodiments of the disclosure can be practiced without, or with variation of, these specific details. Accordingly, the description is to be regarded as illustrative instead of restrictive.
While the present disclosure has been described in conjunction with specific embodiments thereof, many alternatives, modifications, and variations of these embodiments will be apparent to those of ordinary skill in the art in light of the foregoing description. For example, other memory architectures (e.g., dynamic ram (dram)) may use the discussed embodiments.
It is intended that the one or more embodiments of the present application embrace all such alternatives, modifications and variations as fall within the broad scope of the appended claims. Therefore, any omissions, modifications, substitutions, improvements, and the like that may be made without departing from the spirit and principles of one or more embodiments of the present disclosure are intended to be included within the scope of the present disclosure.

Claims (10)

1. A topic word mining method, comprising:
acquiring text data;
filtering the text data based on a language model to determine a set of candidate words;
screening the candidate word set based on an unsupervised algorithm and a prediction model to determine an importance result of the candidate word set;
and determining the subject word according to the importance result of the candidate word set.
2. The topic word mining method according to claim 1, wherein the obtaining text data specifically comprises:
acquiring an industry text corpus to be processed;
preprocessing the industry text corpus to obtain the text data; the preprocessing operation comprises: deleting redundant characters, determining text granularity and performing line division processing.
3. The topic word mining method of claim 1, wherein filtering the textual data based on a language model to determine a set of candidate words comprises:
determining word length and word frequency of words in the text data according to the text data based on the language model;
selecting the vocabulary in the text data with the word length not greater than a word length threshold and the word frequency not less than a word frequency threshold by using a data mining strategy to determine candidate words;
and determining the candidate word set from the candidate words based on a solidity and degree-of-freedom screening strategy.
4. The topic word mining method according to claim 3, wherein the determining the candidate word set according to the candidate words based on the solidity and freedom degree screening strategy specifically comprises:
determining a solidity SD(Wi) and a degree of freedom FD(Wi) of the candidate word; the solidity SD(Wi) of the candidate word is expressed as
SD(Wi) = min over 1 ≤ k < n of p(C1C2...Cn) / ( p(C1...Ck) · p(Ck+1...Cn) )
wherein Wi represents the candidate word, Wi = C1C2...Cn, C1C2...Cn represent characters, and p() represents a probability function;
the degree of freedom FD(Wi) of the candidate word is expressed as
FD(Wi) = min{ LE(Wi), RE(Wi) }
wherein LE(Wi) represents the left-adjacent entropy of the candidate word and RE(Wi) represents the right-adjacent entropy of the candidate word;
and selecting, based on the solidity and degree-of-freedom screening strategy, the candidate words whose solidity SD(Wi) is not less than a solidity threshold and whose degree of freedom FD(Wi) is not less than a degree-of-freedom threshold, to determine the candidate word set.
5. The topic word mining method of claim 3, wherein the importance result of the candidate word set comprises a first importance EMS(Wi);
the screening of the candidate word set based on the unsupervised algorithm and the prediction model to determine the importance result of the candidate word set specifically includes:
determining, according to the unsupervised algorithm, the first importance EMS(Wi) of each candidate word in the candidate word set with respect to the text data; the first importance EMS(Wi) is expressed as
[formula given only as an image in the original publication]
wherein Tj represents a text segment obtained after segmenting the text data and ri() represents an iteration function.
6. The topic word mining method of claim 5, wherein the importance result of the candidate word set further comprises a second importance LCS(Wi);
the screening of the candidate word set based on the unsupervised algorithm and the prediction model to determine the importance result of the candidate word set specifically includes:
determining, according to the prediction model, the second importance LCS(Wi) of each candidate word in the candidate word set with respect to the text data.
7. The method according to claim 6, wherein the determining the topic word according to the importance result of the candidate word set specifically comprises:
determining the importance scores of the candidate words according to the importance results of the candidate word set;
and arranging the importance scores of the candidate words in descending order, selecting a preset number of candidate words in order starting from the candidate word with the largest importance score, and determining the selected candidate words as the topic words.
8. A topic word mining device, comprising:
an acquisition module configured to acquire text data;
a filtering module configured to filter the textual data based on a language model to determine a set of candidate words;
a screening module configured to screen the set of candidate words based on an unsupervised algorithm and a predictive model to determine an importance result of the set of candidate words;
and the determining module is configured to determine the subject word according to the importance result of the candidate word set.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the topic word mining method of any one of claims 1 to 7 when executing the program.
10. A non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the topic word mining method of any one of claims 1 to 7.
CN202011580178.XA 2020-12-28 2020-12-28 Method and device for mining subject term, electronic equipment and storage medium Active CN112784009B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011580178.XA CN112784009B (en) 2020-12-28 2020-12-28 Method and device for mining subject term, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011580178.XA CN112784009B (en) 2020-12-28 2020-12-28 Method and device for mining subject term, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN112784009A true CN112784009A (en) 2021-05-11
CN112784009B CN112784009B (en) 2023-08-18

Family

ID=75752926

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011580178.XA Active CN112784009B (en) 2020-12-28 2020-12-28 Method and device for mining subject term, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112784009B (en)


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190095526A1 (en) * 2017-09-22 2019-03-28 Druva Technologies Pte. Ltd. Keyphrase extraction system and method
CN109299230A (en) * 2018-09-06 2019-02-01 华泰证券股份有限公司 A kind of customer service public sentiment hot word data digging system and method
CN110377724A (en) * 2019-07-01 2019-10-25 厦门美域中央信息科技有限公司 A kind of corpus keyword Automatic algorithm based on data mining
CN111444712A (en) * 2020-03-25 2020-07-24 重庆邮电大学 Keyword extraction method, terminal and computer readable storage medium
CN111931501A (en) * 2020-09-22 2020-11-13 腾讯科技(深圳)有限公司 Text mining method based on artificial intelligence, related device and equipment

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113360613A (en) * 2021-05-31 2021-09-07 维沃移动通信有限公司 Text processing method and device and electronic equipment
WO2022253138A1 (en) * 2021-05-31 2022-12-08 维沃移动通信有限公司 Text processing method and apparatus, and electronic device
CN113468332A (en) * 2021-07-14 2021-10-01 广州华多网络科技有限公司 Classification model updating method and corresponding device, equipment and medium
CN116976351A (en) * 2023-09-22 2023-10-31 之江实验室 Language model construction method based on subject entity and subject entity recognition device
CN116976351B (en) * 2023-09-22 2024-01-23 之江实验室 Language model construction method based on subject entity and subject entity recognition device

Also Published As

Publication number Publication date
CN112784009B (en) 2023-08-18

Similar Documents

Publication Publication Date Title
US11455542B2 (en) Text processing method and device based on ambiguous entity words
CN112784009A (en) Subject term mining method and device, electronic equipment and storage medium
TW202020691A (en) Feature word determination method and device and server
CN110276023B (en) POI transition event discovery method, device, computing equipment and medium
US11907659B2 (en) Item recall method and system, electronic device and readable storage medium
CN106980664B (en) Bilingual comparable corpus mining method and device
US20160188569A1 (en) Generating a Table of Contents for Unformatted Text
CN107861948B (en) Label extraction method, device, equipment and medium
CN112183102A (en) Named entity identification method based on attention mechanism and graph attention network
US9575957B2 (en) Recognizing chemical names in a chinese document
CN110413998B (en) Self-adaptive Chinese word segmentation method oriented to power industry, system and medium thereof
US20190095525A1 (en) Extraction of expression for natural language processing
CN112765976A (en) Text similarity calculation method, device and equipment and storage medium
CN112434533A (en) Entity disambiguation method, apparatus, electronic device, and computer-readable storage medium
CN113033204A (en) Information entity extraction method and device, electronic equipment and storage medium
CN109902162B (en) Text similarity identification method based on digital fingerprints, storage medium and device
CN111310442B (en) Method for mining shape-word error correction corpus, error correction method, device and storage medium
CN114691907A (en) Cross-modal retrieval method, device and medium
JP6486789B2 (en) Speech recognition apparatus, speech recognition method, and program
CN112559739A (en) Method for processing insulation state data of power equipment
JP6805927B2 (en) Index generator, data search program, index generator, data search device, index generation method, and data search method
JP4985096B2 (en) Document analysis system, document analysis method, and computer program
CN113553410B (en) Long document processing method, processing device, electronic equipment and storage medium
CN111540363B (en) Keyword model and decoding network construction method, detection method and related equipment
KR102500106B1 (en) Apparatus and Method for construction of Acronym Dictionary

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant