CN112784009B - Method and device for mining subject term, electronic equipment and storage medium - Google Patents

Info

Publication number
CN112784009B
Authority
CN
China
Prior art keywords
importance
candidate
words
text data
representing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011580178.XA
Other languages
Chinese (zh)
Other versions
CN112784009A (en)
Inventor
熊永平
曹滔宇
朱承治
谷纪亭
徐翀
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
State Grid Zhejiang Electric Power Co Ltd
Beijing University of Posts and Telecommunications
Original Assignee
State Grid Zhejiang Electric Power Co Ltd
Beijing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by State Grid Zhejiang Electric Power Co Ltd and Beijing University of Posts and Telecommunications
Priority to CN202011580178.XA
Publication of CN112784009A
Application granted
Publication of CN112784009B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

One or more embodiments of the present application provide a method, an apparatus, an electronic device, and a storage medium for mining a subject term, including: acquiring text data; filtering the text data based on a language model to determine a candidate word set; screening the candidate word set based on an unsupervised algorithm and a prediction model to determine an importance result of the candidate word set; and determining the subject term according to the importance result of the candidate word set. According to the application, character strings with a low solidification degree in the text data are filtered out through the language model, which reduces the influence of loosely spliced characters on subject word mining. The degree of freedom of a word in the text data reflects the uncertainty of its left and right adjacent words, so that words that can be used freely and independently are found and the mining range of subject words is narrowed. The complex structure of Chinese text corpora is fully considered: while the subject words of the text data are identified through layer-by-layer screening, potential subject words formed by emerging professional vocabulary can also be mined according to the importance ranking.

Description

Method and device for mining subject term, electronic equipment and storage medium
Technical Field
One or more embodiments of the present application relate to the field of data mining technologies, and in particular, to a method, an apparatus, an electronic device, and a storage medium for mining a subject term.
Background
In the prior art, duplicate checking of project texts faces a large volume of text and a high granularity of project text, so quickly retrieving similar documents from a document database is the primary difficulty in improving the accuracy and efficiency of duplicate checking. Since technical projects and scientific research documents generally revolve around a number of keywords, and these keywords reflect the gist of the text to a certain extent, the similarity between texts can be checked simply by finding and comparing the industry subject words used in each text.
Chinese text has a more complex organizational structure than English text. In the prior art, however, keywords of Chinese text are generally mined with methods designed for English text or extracted manually, which results in low accuracy and an inability to accurately mine potential keywords composed of emerging professional vocabulary in Chinese text.
Disclosure of Invention
In view of the foregoing, it is an object of one or more embodiments of the present application to provide a subject word mining method, apparatus, electronic device and storage medium, so as to solve at least one of the above problems in the prior art.
In view of the above object, one or more embodiments of the present application provide a method for mining a subject term, including:
acquiring text data;
filtering the text data based on a language model to determine a set of candidate words;
screening the candidate word set based on an unsupervised algorithm and a predictive model to determine an importance result of the candidate word set;
and determining the subject term according to the importance result of the candidate word set.
Optionally, the acquiring text data specifically includes:
acquiring an industry text corpus to be processed;
preprocessing the industry text corpus to obtain the text data; the preprocessing operation includes: deleting redundant characters, determining text granularity, and performing line splitting.
Optionally, the filtering the text data based on the language model to determine a candidate word set specifically includes:
determining word lengths and word frequencies of words in the text data according to the text data based on the language model;
selecting words in the text data with the word length not greater than a word length threshold and the word frequency not less than a word frequency threshold by utilizing a data mining strategy to determine candidate words;
and determining the candidate word set according to the candidate words based on the solidification degree and the freedom degree screening strategy.
Optionally, the determining the candidate word set according to the candidate words based on the solidification degree and degree of freedom screening strategy specifically includes:
determining the solidification degree $SD(W_i)$ and the degree of freedom $FD(W_i)$ of the candidate word; the solidification degree $SD(W_i)$ is expressed as
where $W_i$ denotes the candidate word, $W_i = C_1 C_2 \ldots C_n$, $C_1 C_2 \ldots C_n$ denote characters, and $p(\cdot)$ denotes a probability function;
the degree of freedom $FD(W_i)$ is expressed as
$FD(W_i) = \min\{LE(W_i), RE(W_i)\}$
where $LE(W_i)$ denotes the left-neighbor entropy of the candidate word and $RE(W_i)$ denotes the right-neighbor entropy of the candidate word;
selecting the candidate words whose solidification degree $SD(W_i)$ is not less than a solidification degree threshold and whose degree of freedom $FD(W_i)$ is not less than a degree of freedom threshold to determine the candidate word set.
Optionally, the importance result of the candidate word set includes a first importance $EMS(W_i)$;
The screening the candidate word set based on an unsupervised algorithm and a prediction model to determine an importance result of the candidate word set specifically includes:
determining the first importance $EMS(W_i)$; the first importance $EMS(W_i)$ is expressed as
where $T_j$ denotes a text segment obtained by cutting the text data, and $r_i(\cdot)$ denotes an iterative function.
Optionally, the importance result of the candidate word set further includes a second importance $LCS(W_i)$;
The screening the candidate word set based on an unsupervised algorithm and a prediction model to determine an importance result of the candidate word set specifically includes:
determining the second importance $LCS(W_i)$ of each of the candidate words in the candidate word set with respect to the text data according to the prediction model.
Optionally, the determining the subject term according to the importance result of the candidate term set specifically includes:
determining importance scores of the candidate words according to importance results of the candidate word sets;
sorting the importance scores of the candidate words in descending order, selecting a preset number of candidate words starting from the candidate word with the highest importance score, and determining the selected candidate words as the subject words.
Based on the same inventive concept, one or more embodiments of the present application further provide a subject word mining apparatus, including:
an acquisition module configured to acquire text data;
a filtering module configured to filter the text data based on a language model to determine a set of candidate words;
a screening module configured to screen the set of candidate words based on an unsupervised algorithm and a predictive model to determine an importance result for the set of candidate words;
and the determining module is configured to determine the subject word according to the importance result of the candidate word set.
Based on the same inventive concept, one or more embodiments of the present application further provide an electronic device, including a memory, a processor, and a computer program stored on the memory and capable of running on the processor, where the processor implements the subject word mining method described in any one of the above when executing the program.
Based on the same inventive concept, one or more embodiments of the present application also provide a non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the subject word mining method described in any one of the above.
From the foregoing, it can be seen that one or more embodiments of the present application provide a method for mining a subject term, including: acquiring text data; filtering the text data based on a language model to determine a candidate word set; screening the candidate word set based on an unsupervised algorithm and a prediction model to determine an importance result of the candidate word set; and determining the subject term according to the importance result of the candidate word set. In the method provided by the application, the text corpus to be processed is cleaned by preprocessing to obtain the text data, which reduces the influence of redundant characters on the subject words in the text data. The text data is then filtered through a language model to determine a candidate word set: character strings with a low solidification degree are filtered out, which reduces the influence of loosely spliced characters on subject word mining, and the degree of freedom of a word in the text data, which reflects the uncertainty of its left and right adjacent words, is used to find words that can be used freely and independently, which narrows the mining range of subject words. The candidate word set is further screened according to an unsupervised algorithm and a supervised prediction model, and the subject words are mined by ranking the importance of each candidate word in the candidate word set. The complex structure of Chinese text corpora is fully considered: while the subject words of the text data are identified through layer-by-layer screening, potential subject words formed by emerging professional vocabulary can also be mined according to the importance ranking.
Drawings
In order to more clearly illustrate one or more embodiments of the present application or the prior art solutions, the following description will briefly explain the drawings used in the embodiments or the prior art descriptions, and it is apparent that the drawings in the following description are only one or more embodiments of the present application and that other drawings can be obtained according to these drawings without inventive effort to those skilled in the art.
FIG. 1 is a flow diagram of a subject word mining method in accordance with one or more embodiments of the present application;
FIG. 2 is a schematic diagram of a subject word mining apparatus according to one or more embodiments of the present application;
FIG. 3 is a schematic structural diagram of an electronic device according to one or more embodiments of the present application.
Detailed Description
For the purposes of promoting an understanding of the principles and advantages of the disclosure, reference will now be made to the embodiments illustrated in the drawings and specific language will be used to describe the same.
It is noted that unless otherwise defined, technical or scientific terms used in one or more embodiments of the present application should be given the ordinary meaning as understood by one of ordinary skill in the art to which this disclosure belongs. The terms "first," "second," and the like used in one or more embodiments of the present application do not denote any order, quantity, or importance, but are used only to distinguish one element from another. The word "comprising" or "comprises", and the like, means that the element or item preceding the word encompasses the elements or items listed after the word and their equivalents, without excluding other elements or items. Terms such as "connected" are not limited to physical or mechanical connections, but may include electrical connections, whether direct or indirect. "Upper", "lower", "left", "right", etc. are used merely to indicate relative positional relationships, which may change accordingly when the absolute position of the described object changes.
As described in the background, as texts in professional fields become increasingly abundant, the number of documents in document databases grows exponentially. To direct scientific research funding to research projects with high value density, a multi-stage review process needs to be set up when scientific research projects are established. In the prior art, however, duplicate checking of science and technology projects is generally performed by manually extracting keywords or key research content and comparing them with the historical data of completed or ongoing science and technology projects.
The applicant finds that, in the prior art, manual duplicate checking places high demands on the professional and technical level of staff and suffers from low efficiency, missed duplicates, errors, and the like. In addition, duplicate checking of research projects faces the challenges of a large text corpus and a high granularity of the project text corpus. A method is therefore needed to improve the accuracy and efficiency of duplicate checking for science and technology projects and research documents, that is, a method for quickly retrieving similar documents in a document database. The applicant finds that a research project or research document usually centers on a number of keywords, and these keywords reflect the gist of the text to a certain extent; therefore, for duplicate checking of science and technology projects and research documents in an industry, the similarity between texts can be checked simply by finding and comparing the industry subject words used in each text. Moreover, Chinese text has a more complex organizational structure than English text, so Chinese documents are harder to process than English documents. Manual duplicate checking, or applying subject word mining methods designed for English text, as done in the prior art, has low accuracy and cannot accurately mine potential subject words composed of emerging professional vocabulary in Chinese text.
Therefore, in the method provided by the application, the text corpus to be processed is cleaned by preprocessing to obtain the text data, which reduces the influence of redundant characters on the subject words in the text data. The text data is then filtered through a language model to determine a candidate word set: character strings with a low solidification degree are filtered out, which reduces the influence of loosely spliced characters on subject word mining, and the degree of freedom of a word in the text data, which reflects the uncertainty of its left and right adjacent words, is used to find words that can be used freely and independently, which narrows the mining range of subject words. The candidate word set is further screened according to an unsupervised algorithm and a supervised prediction model, and the subject words are mined by ranking the importance of each candidate word in the candidate word set. The complex structure of Chinese text corpora is fully considered: while the keywords of the text data are identified through layer-by-layer screening, potential keywords composed of emerging professional vocabulary can also be mined according to the importance ranking, which improves the accuracy and efficiency of keyword mining and extraction.
The technical scheme of the present disclosure is further described in detail below through specific examples.
Referring to FIG. 1, one or more embodiments of the present application provide a method for mining a subject term, which specifically includes the following steps:
S101: Text data is acquired.
In this embodiment, the text data to be mined for subject words is acquired. Specifically, an industry text corpus to be processed may be acquired, and a preprocessing operation may then be performed on it; the text data is obtained by cleaning the industry text corpus to be processed. The industry text corpus to be processed may be text-type data such as electronic books, news documents, research papers, digital libraries, web pages, emails, and database records.
In some alternative embodiments, the preprocessing operation may include: deleting redundant characters, determining text granularity, line splitting, Chinese word segmentation, deleting format markers, and the like. For example, for web page text data, the web page tags need to be removed to obtain plain text.
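By way of illustration only, the preprocessing of web page text described above may be sketched as follows. The function name `preprocess`, the regular expressions, and the choice of sentence-level granularity are assumptions made for this sketch and are not taken from the application; Chinese word segmentation is omitted.

```python
import html
import re

# Assumed, deliberately naive patterns; a production system would use an HTML parser.
TAG_RE = re.compile(r"<[^>]+>")            # format markers such as web page tags
SPLIT_RE = re.compile(r"[。！？!?\n]+")      # assumed sentence-level granularity


def preprocess(raw: str) -> list[str]:
    """Strip markup and redundant whitespace, then split into text fragments."""
    text = TAG_RE.sub(" ", raw)                        # delete format markers
    text = html.unescape(text)                         # decode entities such as &amp;
    text = re.sub(r"[ \t\u00a0\u3000]+", " ", text)    # collapse redundant spaces
    fragments = (frag.strip() for frag in SPLIT_RE.split(text))
    return [frag for frag in fragments if frag]        # drop empty fragments
```

For example, `preprocess("<p>变压器故障诊断。</p><p>配电网  规划！</p>")` yields two plain-text fragments, one per sentence.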
S102: the text data is filtered based on a language model to determine a set of candidate words.
In this embodiment, after the text data is obtained, because the vocabulary in the text data is huge, the text data may be filtered based on the language model to determine a candidate word set, whose words are representative of the vocabulary in the text data. Specifically, the word length and word frequency of words in the text data may be determined based on the language model; words whose word length is not greater than a word length threshold and whose word frequency is not less than a word frequency threshold are selected using a data mining strategy to determine candidate words; and after the candidate words are determined, the candidate word set is determined from the candidate words based on the solidification degree and degree of freedom screening strategy.
In some alternative embodiments, an N-gram model may be selected as the language model, where N takes values in the range [2, 10]. After the preprocessing operation is performed on the industry text corpus to be processed, the granularity of the text data may be determined and the text data may be segmented into a plurality of text fragments, where a text fragment may be a sentence, a paragraph, or even a whole document. The text data is arranged into a set of text fragments, and punctuation marks in the text fragments are filtered out.
In some alternative embodiments, for each word in the text data, the frequency of occurrence $F(W_i)$ and the word length may be determined, and a word length threshold TL = 10 and a word frequency threshold TF = 3 may be preset. According to the granularity of the text fragments, a data mining strategy (such as an Apriori strategy) is used to select the words in the text data whose word length is not greater than the word length threshold and whose word frequency is not less than the word frequency threshold, and the words meeting these conditions are determined to be candidate words. An overcomplete dictionary may be composed of the candidate words, and the normalized word frequencies (e.g., normalized to the interval [0, 1]) may be used to initialize the usage of each candidate word.
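The threshold-based selection of candidate words described above may be sketched as follows, as an illustration under stated assumptions. The sketch counts every character N-gram exhaustively rather than using an Apriori-style pruning strategy, and the function names `count_candidates` and `normalize`, as well as normalization by the maximum count, are assumptions of this sketch. The defaults TL = 10, TF = 3 and the minimum length 2 follow the values given in the text.

```python
from collections import Counter


def count_candidates(fragments, max_len=10, min_freq=3, min_len=2):
    """Count character n-grams and keep those within the length and frequency thresholds."""
    counts = Counter()
    for frag in fragments:
        for start in range(len(frag)):
            for n in range(min_len, max_len + 1):
                if start + n > len(frag):
                    break                      # n-gram would run past the fragment
                counts[frag[start:start + n]] += 1
    return {word: c for word, c in counts.items() if c >= min_freq}


def normalize(counts):
    """Scale counts into (0, 1] by the largest count (assumed normalization)."""
    top = max(counts.values(), default=1)
    return {word: c / top for word, c in counts.items()}
```

With the fragments `["变压器故障", "变压器检修", "变压器油"]`, only "变压", "压器" and "变压器" occur at least three times and are kept as candidate words.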
In some alternative embodiments, after the candidate words are determined, the candidate word set may be determined from the candidate words based on the solidification degree and degree of freedom screening strategy. Since each candidate word may be composed of a plurality of characters, a candidate word may be represented as $W_i = C_1 C_2 \ldots C_{n-1} C_n$, where $C_1 C_2 \ldots C_n$ denote characters. For any candidate word $W_i$, whether it is a word that can be used independently can be analyzed from two aspects, internal and external. Internally, the solidification degree is analyzed, that is, whether the characters are spliced together tightly enough; the greater the solidification degree, the more likely the candidate word is an independent word. Externally, it is analyzed whether the candidate word is used independently and freely across the whole text data, that is, how large the uncertainty (the entropy) of its left and right adjacent words is; the larger the entropy, the more independently and freely the candidate word is used.
It should be noted that determining the candidate word set from the candidate words based on the solidification degree and degree of freedom screening strategy may specifically include: determining the solidification degree $SD(W_i)$ and the degree of freedom $FD(W_i)$ of the candidate word, where the solidification degree $SD(W_i)$ is expressed as
where $p(\cdot)$ denotes a probability function. When the degree of freedom of the candidate word is determined, the left-neighbor entropy and the right-neighbor entropy of the candidate word may be determined first. Specifically, when the left-neighbor entropy is calculated, the set of all left adjacent words of any candidate word is defined as
$S_{left}(W_i) = \{C_{0,i},\ i = 1, 2, \ldots, m\}$
where $C_{0,i}$ denotes a left adjacent word. Further, the probability of occurrence of each left adjacent word can be expressed as
where $\mathrm{count}(\cdot)$ denotes a counting function. Combining the above parameters, the left-neighbor entropy of any candidate word can be expressed as
Similarly, when the right-neighbor entropy is calculated, the set of all right adjacent words of the candidate word is defined as
$S_{right}(W_i) = \{C_{n+1,i},\ i = 1, 2, \ldots, m\}$
where $C_{n+1,i}$ denotes a right adjacent word. Thus, the right-neighbor entropy of any candidate word can be expressed as
After the solidification degree and the degree of freedom of the candidate words are determined, the candidate words whose solidification degree is not less than a preset solidification degree threshold and whose degree of freedom is not less than a preset degree of freedom threshold are selected based on the solidification degree and degree of freedom screening strategy, and the candidate word set is obtained by this further screening. The preset solidification degree threshold may be 5.3 and the preset degree of freedom threshold may be 0.75; in actual application, both thresholds may be dynamically adjusted according to the condition of the corpus.
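The screening by solidification degree and degree of freedom may be sketched as follows. The exact expressions are assumptions of this sketch: the solidification degree is taken to be a pointwise-mutual-information style value, the logarithm of $p(W_i)$ divided by the largest product of the probabilities of any two-part split of $W_i$, and each neighbor entropy is taken to be the Shannon entropy of the adjacent-character distribution. Only $FD(W_i) = \min\{LE(W_i), RE(W_i)\}$ and the default thresholds 5.3 and 0.75 come from the text; all function names are hypothetical.

```python
import math
from collections import Counter


def occurrences(sub, fragments):
    """Count (possibly overlapping) occurrences of `sub` across all fragments."""
    total = 0
    for frag in fragments:
        pos = frag.find(sub)
        while pos != -1:
            total += 1
            pos = frag.find(sub, pos + 1)
    return total


def solidification_degree(word, fragments):
    """Assumed PMI-style SD: log of p(W) over the most probable two-part split."""
    n_chars = sum(len(frag) for frag in fragments)
    n_word = occurrences(word, fragments)
    if len(word) < 2 or n_word == 0:
        return float("-inf")
    p_split = max(
        occurrences(word[:k], fragments) * occurrences(word[k:], fragments)
        for k in range(1, len(word))
    ) / n_chars ** 2
    return math.log((n_word / n_chars) / p_split)


def entropy(neighbors):
    """Shannon entropy (natural log) of a Counter of adjacent characters."""
    total = sum(neighbors.values())
    if total == 0:
        return 0.0
    return -sum(c / total * math.log(c / total) for c in neighbors.values())


def freedom_degree(word, fragments):
    """FD(W) = min(LE(W), RE(W)) over left and right adjacent characters."""
    left, right = Counter(), Counter()
    for frag in fragments:
        pos = frag.find(word)
        while pos != -1:
            if pos > 0:
                left[frag[pos - 1]] += 1
            end = pos + len(word)
            if end < len(frag):
                right[frag[end]] += 1
            pos = frag.find(word, pos + 1)
    return min(entropy(left), entropy(right))


def screen(candidates, fragments, sd_threshold=5.3, fd_threshold=0.75):
    """Keep candidates meeting both thresholds (defaults follow the text)."""
    return [
        w for w in candidates
        if solidification_degree(w, fragments) >= sd_threshold
        and freedom_degree(w, fragments) >= fd_threshold
    ]
```

On a realistic corpus the thresholds would be tuned to the data, as the text notes; on very small inputs the default thresholds reject nearly everything.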
S103: the set of candidate words is filtered based on an unsupervised algorithm and a predictive model to determine importance results for the set of candidate words.
In this embodiment, after the candidate word set is determined, the candidate word set may be screened based on an unsupervised algorithm and in combination with a prediction model obtained by supervised training, so as to obtain an importance result of the candidate word set. Wherein, before screening the candidate word set by using an unsupervised algorithm, the following parameters can be defined: assuming that a sentence is made up of a set of words, each word is made up of a set of characters, a dictionary in text data is defined as
D={W 1 ,W 2 ,...,W N }。
It is assumed that each sentence in the text data constitutes a sampling probability for each candidate word by concatenating a set of randomly sampled candidate words from the dictionary DThe rate is theta i . Defining dictionary probability parameters as
For a sentence generated by K vocabularies, it can be expressed as
S=W i1 W i2 ...W iK
The probability of generating the sentence is expressed as
Wherein P () represents a probability matrix comprising a plurality of probability values; for a given piece of text T that is not cut, define C T For a set of all segmented sentences according to the dictionary D, the probability of an un-segmented text segment T can be expressed as
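The generative model described above may be illustrated by the following sketch. It assumes that the probability of a sentence is the product of the sampling probabilities $\theta_{i_k}$ of its words and that the probability of an uncut text segment is the sum of the sentence probabilities over all segmentations in $C_T$; these forms follow from the stated sampling assumption but are not quoted from the application. The function names and the dynamic-programming formulation are choices of this sketch.

```python
def sentence_probability(words, theta):
    """Probability of one segmented sentence: product of word sampling probabilities."""
    prob = 1.0
    for word in words:
        prob *= theta[word]
    return prob


def text_probability(text, theta, max_len=10):
    """Sum of sentence probabilities over every segmentation of `text`.

    forward[k] holds the total probability of all segmentations of text[:k],
    so the sum over C_T is computed without enumerating C_T explicitly.
    """
    forward = [0.0] * (len(text) + 1)
    forward[0] = 1.0                           # empty prefix has one (empty) segmentation
    for end in range(1, len(text) + 1):
        for start in range(max(0, end - max_len), end):
            word = text[start:end]
            if word in theta:
                forward[end] += forward[start] * theta[word]
    return forward[-1]
```

For instance, with `theta = {"电": 0.2, "网": 0.3, "电网": 0.5}`, the text "电网" has two segmentations, and `text_probability` returns the sum of their probabilities.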
It should be noted that the importance result of the candidate word set includes a first importance $EMS(W_i)$. Screening the candidate word set based on the unsupervised algorithm specifically includes: determining the first importance $EMS(W_i)$ of each candidate word in the candidate word set relative to the text data according to the unsupervised algorithm (which may also be referred to as the EMwords algorithm). To determine the first importance $EMS(W_i)$, $\theta^{r}$ is defined as the parameter estimated in the $r$-th iteration, and the iterative function is determined using the expectation step (E-step) of the unsupervised algorithm, expressed as
After the iterative function $Q(\theta, \theta^{r})$ is obtained, the maximization step (M-step) of the unsupervised algorithm updates the parameter $\theta^{r}$ using the following expression:
where $n_i(S)$ denotes the probability of occurrence of the candidate word in sentence $S$, and $n_i(T_j)$ denotes the probability of occurrence of the candidate word in the text segment $T_j$, which can be expressed as
The sum of the probabilities of occurrence of the candidate word over the text segments can be expressed as
Combining the above parameters, the first importance $EMS(W_i)$ is expressed as
where $r_i(\cdot)$ denotes an iterative function and $r_i(T_j)$ denotes an importance parameter, which can be expressed as
where $I(\cdot)$ denotes an indicator function, with $I = 1$ when the expression in brackets holds and $I = 0$ otherwise, and the expression is evaluated at the optimal iteration parameters. A negative mapping is applied to the importance parameter $r_i(T_j)$, and the first importance $EMS(W_i)$ is at the same time limited to the interval [0, 1]. For example, a word of high importance in a sentence generally occurs more frequently, but the importance parameter $r_i(T_j)$ takes a smaller value for words with a high frequency of occurrence; that is, the value of $r_i(T_j)$ is inversely related to the importance of the word. A logarithmic function can therefore be used to limit its interval while applying the negative mapping.
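One round of the iterative estimation may be sketched as follows, under assumptions that are not confirmed by the text: the E-step is taken to compute, for each text segment, the posterior-weighted expected count of each candidate word over all segmentations, and the M-step is taken to renormalize these expected counts into updated sampling probabilities. Segmentations are enumerated by brute force, which is only practical for short segments. The function names are hypothetical, and the final $EMS(W_i)$ score is not computed here.

```python
from collections import Counter


def segmentations(text, vocab, max_len=10):
    """Yield every way to cut `text` into words that all belong to `vocab`."""
    if not text:
        yield []
        return
    for n in range(1, min(max_len, len(text)) + 1):
        head = text[:n]
        if head in vocab:
            for tail in segmentations(text[n:], vocab, max_len):
                yield [head] + tail


def em_step(fragments, theta):
    """One assumed EM round: expected word counts (E), then renormalization (M)."""
    expected = Counter()
    for frag in fragments:
        weighted = []
        for seg in segmentations(frag, theta):
            prob = 1.0
            for word in seg:
                prob *= theta[word]
            weighted.append((seg, prob))
        total = sum(prob for _, prob in weighted)
        if total == 0.0:
            continue                           # fragment cannot be generated from the dictionary
        for seg, prob in weighted:
            for word in seg:
                expected[word] += prob / total  # posterior-weighted occurrence
    norm = sum(expected.values())
    if norm == 0.0:
        return dict(theta)
    return {word: expected[word] / norm for word in theta}
```

Calling `em_step` repeatedly until the values stop changing gives an estimate of the sampling probabilities; words whose probability shrinks toward zero contribute little to explaining the text.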
In some alternative embodiments, discovering subject words with the unsupervised algorithm is theoretically well founded, requires no labeled data, and does not depend on a knowledge base; the preliminary screening result obtained by determining the candidate word set from the text data and screening it by importance therefore has a high lower bound on quality. A prediction model obtained by supervised training can then be used for further screening. For text data in different fields, publicly available, high-quality, expert-labeled domain vocabulary can be used as a knowledge base or training set to train the supervised model, thereby improving the effect of the prediction model. For example, in the training data preparation stage, 110 domain-specific words are obtained by collecting authoritative domain dictionaries and journal paper keywords, and 1.2 million general words are obtained from the online Sogou 30 GB news corpus. The domain vocabulary and the general vocabulary are labeled separately, where B-COMMON and I-COMMON denote general vocabulary labels, B-ELEC and I-ELEC denote domain vocabulary labels, and the prefixes B and I denote the start mark and the internal mark of a label, respectively. The words in the existing professional word library and general word library are then split into single characters, which are input into a Word2vec network for training to generate m-dimensional character vectors.
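The character-level labeling described above may be illustrated as follows. The function name `bio_labels` is hypothetical; the label names ELEC and COMMON and the B/I prefixes follow the text. Training of the Word2vec character vectors is not shown.

```python
def bio_labels(word, kind):
    """Split a word into characters and tag the first with B-<kind>, the rest with I-<kind>."""
    return [
        (char, ("B-" if index == 0 else "I-") + kind)
        for index, char in enumerate(word)
    ]


# Example: a domain word and a general word, labeled character by character.
domain_example = bio_labels("变压器", "ELEC")
common_example = bio_labels("新闻", "COMMON")
```

Here `domain_example` pairs "变" with B-ELEC and the remaining characters with I-ELEC.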
In some alternative embodiments, the importance result of the candidate word set further includes a second importance $\mathrm{LCS}(W_i)$, and screening the candidate word set based on an unsupervised algorithm and a predictive model to determine an importance result of the candidate word set further includes: determining the second importance $\mathrm{LCS}(W_i)$. In a specific embodiment, the predictive model for subject term discovery can adopt a BiLSTM+CRF structure based on character vectors. Compared with a traditional BiLSTM model, a CRF layer is added to learn an optimal path. The dimension of the output vector of the BiLSTM layer is defined as tag-size, which is equivalent to mapping each candidate word $w_i$ to a label $tag_j$. Let the output matrix of the BiLSTM layer be P; then $P_{i,j}$ represents the non-normalized probability that candidate word $w_i$ is mapped to $tag_j$. For the CRF layer, it is assumed that there is a transfer matrix A, where $A_{i,j}$ represents the transition probability from $tag_i$ to $tag_j$. For the output tag sequence y corresponding to the input sequence X (the length of y being the number of tags), a score is defined as
Wherein $A_{y_i, y_{i+1}}$ represents the transition probability from the i-th tag to the (i+1)-th tag, and $P_{i, y_i}$ represents the non-normalized probability that candidate word $w_i$ is mapped to the i-th tag.
The score of the tag sequence y corresponding to the correct tags is normalized with a Softmax function to obtain a probability value, namely the likelihood probability, expressed as
Wherein $Y_X$ represents all tag sequences y corresponding to the input sequence X. Therefore, during training only the likelihood probability p(y|X) needs to be maximized; the log-likelihood is employed here, i.e.
Finally, the trained prediction model can be used to decode the data set formed from the test corpus to obtain a probability value for each candidate word, and this probability value can be recorded as the second importance $\mathrm{LCS}(W_i)$ of the corresponding candidate word.
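The scoring and likelihood steps can be illustrated with a small brute-force sketch. It assumes the standard linear-chain CRF form, which is consistent with the definitions of P and A given here but is not confirmed by the formulas of this application: $s(X, y) = \sum_i A_{y_i, y_{i+1}} + \sum_i P_{i, y_i}$ and $\log p(y \mid X) = s(X, y) - \log \sum_{\tilde{y} \in Y_X} e^{s(X, \tilde{y})}$. Start and end tags are omitted for brevity, `path_score` and `log_likelihood` are hypothetical names, and enumerating every tag sequence is only practical for toy inputs.

```python
import itertools
import math
from typing import Sequence

Matrix = Sequence[Sequence[float]]


def path_score(P: Matrix, A: Matrix, y: Sequence[int]) -> float:
    """Score of one tag sequence: emission terms plus transition terms."""
    # P[i][tag] is the non-normalized score of position i taking `tag`.
    emission = sum(P[i][tag] for i, tag in enumerate(y))
    # A[a][b] is the transition score from tag a to tag b.
    transition = sum(A[y[i]][y[i + 1]] for i in range(len(y) - 1))
    return emission + transition


def log_likelihood(P: Matrix, A: Matrix, y: Sequence[int]) -> float:
    """log p(y|X): the path score minus the log-sum-exp over all paths."""
    n_tags = len(A)
    all_scores = [
        path_score(P, A, candidate)
        for candidate in itertools.product(range(n_tags), repeat=len(P))
    ]
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(all_scores)
    log_z = top + math.log(sum(math.exp(s - top) for s in all_scores))
    return path_score(P, A, y) - log_z
```

A practical implementation would replace the enumeration with the forward algorithm for training and Viterbi decoding for prediction; the brute-force version is used here only because it makes the normalization over $Y_X$ explicit.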
S104: determining the subject term according to the importance result of the candidate word set.
In this embodiment, after the importance result of the candidate word set is determined, the subject term may be determined according to the first importance $\mathrm{EMS}(W_i)$ and the second importance $\mathrm{LCS}(W_i)$. Specifically, for the candidate word set screened by the unsupervised algorithm and the prediction model, an importance score is calculated for each candidate word $W_i$ in the set; that is, the importance score of each candidate word is determined according to the importance result of the candidate word set. The importance score is expressed as
$$S(W_i) = (1-\mu)\,\mathrm{EMS}(W_i) + \mu\,\mathrm{LCS}(W_i)$$
Where $\mu$ represents an assigned weight, e.g., $\mu$ takes a value of 0.3. The importance scores of the candidate words are arranged in descending order, a preset number (e.g., top N) of candidate words is selected in sequence starting from the candidate word with the largest importance score, and the selected candidate words are determined as the subject terms, which completes the mining of the subject terms.
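The weighted combination and top-N selection can be sketched in a few lines. The function name `rank_subject_terms`, the dictionary-based inputs, and the sample values are assumptions made for illustration; only the weighting formula and the example weight of 0.3 come from the description.

```python
from typing import Dict, List, Tuple


def rank_subject_terms(
    ems: Dict[str, float],
    lcs: Dict[str, float],
    mu: float = 0.3,
    top_n: int = 10,
) -> List[Tuple[str, float]]:
    """Combine both importances and return the top_n highest-scoring words."""
    # S(W_i) = (1 - mu) * EMS(W_i) + mu * LCS(W_i)
    scores = {word: (1 - mu) * ems[word] + mu * lcs[word] for word in ems}
    # Sort by score in descending order and keep the first top_n entries.
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_n]


# Hypothetical importance values for three candidate words.
ems_scores = {"a": 0.9, "b": 0.2, "c": 0.5}
lcs_scores = {"a": 0.1, "b": 0.9, "c": 0.5}
top_two = rank_subject_terms(ems_scores, lcs_scores, mu=0.3, top_n=2)
```

With $\mu$ = 0.3 the unsupervised score dominates, so a word ranked highly only by the prediction model (such as "b" in the sample data) can still fall outside the selection.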
From the foregoing, it can be seen that one or more embodiments of the present application provide a subject term mining method, including: acquiring text data; filtering the text data based on a language model to determine a candidate word set; screening the candidate word set based on an unsupervised algorithm and a predictive model to determine an importance result of the candidate word set; and determining the subject term according to the importance result of the candidate word set. In this method, the text data is obtained by preprocessing the acquired text corpus to be processed; cleaning the corpus reduces the influence of redundant characters on the subject terms in the text data. The text data is then filtered through a language model to determine the candidate word set: character strings with a low degree of solidification are filtered out, which reduces the influence of loosely spliced characters on subject term mining, and the degree of freedom of a word in the text data, which reflects the uncertainty of its left and right neighboring words, is used to find words that can be used freely and independently, which narrows the mining range of subject terms. The candidate word set is further screened according to the unsupervised algorithm and the supervised prediction model, and the subject terms are mined by ranking the importance of each candidate word in the candidate word set. The method takes full account of the complex structure of Chinese text corpora and identifies the keywords of the text data through layer-by-layer screening, so that potential keywords composed of emerging professional vocabulary can be mined according to the importance ranking, improving the accuracy and efficiency of keyword mining and extraction.
It is understood that the method may be performed by any apparatus, device, platform, or cluster of devices having computing and processing capabilities.
It should be noted that the methods of one or more embodiments of the present description may be performed by a single device, such as a computer or server. The method of this embodiment can also be applied in a distributed scenario, where it is completed by a plurality of devices cooperating with each other. In such a distributed scenario, one of the devices may perform only one or more steps of the methods of one or more embodiments of the present description, and the devices interact with each other to complete the methods.
It should be noted that the foregoing describes specific embodiments of the present application. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims can be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
Based on the same inventive concept, one or more embodiments of the present application further provide a subject term mining apparatus, referring to FIG. 2, including:
an acquisition module configured to acquire text data;
a filtering module configured to filter the text data based on a language model to determine a set of candidate words;
a screening module configured to screen the set of candidate words based on an unsupervised algorithm and a predictive model to determine an importance result for the set of candidate words;
and a determining module configured to determine the subject term according to the importance result of the candidate word set.
In some optional embodiments, the acquiring text data specifically includes:
acquiring an industry text corpus to be processed;
preprocessing the industry text corpus to obtain the text data; the preprocessing operation includes: deleting redundant characters, determining text granularity and performing line separation processing.
In some optional embodiments, the filtering the text data based on the language model to determine the candidate word set specifically includes:
determining word lengths and word frequencies of words in the text data according to the text data based on the language model;
selecting words in the text data with the word length not greater than a word length threshold and the word frequency not less than a word frequency threshold by utilizing a data mining strategy to determine candidate words;
and determining the candidate word set according to the candidate words based on the solidification degree and the freedom degree screening strategy.
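The length and frequency screening can be sketched as follows. This is a minimal illustration under stated assumptions: candidate strings are enumerated as contiguous character n-grams within each line, and the function name `candidate_ngrams`, its parameters and its default thresholds are hypothetical; the description does not specify how the language model enumerates words.

```python
from collections import Counter
from typing import Dict, Iterable


def candidate_ngrams(
    lines: Iterable[str], max_len: int = 4, min_freq: int = 2
) -> Dict[str, int]:
    """Count character n-grams and keep those passing both thresholds.

    A string is kept when its length is not greater than max_len and
    its frequency is not less than min_freq.
    """
    counts: Counter = Counter()
    for line in lines:
        # Enumerate every contiguous substring of length 1..max_len.
        for n in range(1, max_len + 1):
            for start in range(len(line) - n + 1):
                counts[line[start:start + n]] += 1
    return {gram: freq for gram, freq in counts.items() if freq >= min_freq}


# Toy input; real text data would be the preprocessed corpus lines.
candidates = candidate_ngrams(["abcab", "abd"], max_len=3, min_freq=2)
```

The strings surviving this step would then be passed to the solidification and freedom degree screening.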
In some optional embodiments, the determining the candidate word set according to the candidate word based on the solidification degree and the freedom degree screening policy specifically includes:
determining the degree of solidification $\mathrm{SD}(W_i)$ and the degree of freedom $\mathrm{FD}(W_i)$ of the candidate words; the degree of solidification $\mathrm{SD}(W_i)$ is expressed as
wherein $W_i$ represents the candidate word, $W_i = C_1 C_2 \ldots C_n$, $C_1, C_2, \ldots, C_n$ represent characters, and P() represents a probability function;
the degree of freedom $\mathrm{FD}(W_i)$ is expressed as
$$\mathrm{FD}(W_i) = \min\{\mathrm{LE}(W_i), \mathrm{RE}(W_i)\}$$
wherein $\mathrm{LE}(W_i)$ represents the left neighbor entropy of the candidate word, and $\mathrm{RE}(W_i)$ represents the right neighbor entropy of the candidate word;
selecting the candidate words whose degree of solidification $\mathrm{SD}(W_i)$ is not less than a solidification degree threshold and whose degree of freedom $\mathrm{FD}(W_i)$ is not less than a degree of freedom threshold to determine the set of candidate words.
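The degree of freedom can be sketched as the smaller of the two neighbor entropies. The sketch assumes Shannon entropy with the natural logarithm over the characters immediately to the left and right of each occurrence; the logarithm base and the function names `neighbor_entropy` and `freedom_degree` are assumptions, since the description only gives $\mathrm{FD}(W_i) = \min\{\mathrm{LE}(W_i), \mathrm{RE}(W_i)\}$.

```python
import math
from collections import Counter


def neighbor_entropy(neighbors: Counter) -> float:
    """Shannon entropy (natural log) of a neighbor-character distribution."""
    total = sum(neighbors.values())
    # An empty counter yields 0.0 because the sum has no terms.
    return sum(-(c / total) * math.log(c / total) for c in neighbors.values())


def freedom_degree(word: str, text: str) -> float:
    """FD(W) = min(left neighbor entropy, right neighbor entropy)."""
    left: Counter = Counter()
    right: Counter = Counter()
    start = text.find(word)
    while start != -1:
        if start > 0:
            left[text[start - 1]] += 1  # character just before the word
        end = start + len(word)
        if end < len(text):
            right[text[end]] += 1  # character just after the word
        start = text.find(word, start + 1)  # also finds overlapping matches
    return min(neighbor_entropy(left), neighbor_entropy(right))
```

A word that always appears after the same character has a left neighbor entropy of zero, so its degree of freedom is zero regardless of how varied its right neighbors are; this is why the minimum is taken rather than the average.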
In some alternative embodiments, the importance result of the candidate word set includes: a first importance $\mathrm{EMS}(W_i)$;
the screening the candidate word set based on an unsupervised algorithm and a prediction model to determine an importance result of the candidate word set specifically includes:
determining the first importance $\mathrm{EMS}(W_i)$; the first importance $\mathrm{EMS}(W_i)$ is expressed as
wherein $T_j$ represents a text segment obtained by cutting the text data, and $r_i()$ represents an iterative function.
In some optional embodiments, the importance result of the candidate word set further includes: a second importance $\mathrm{LCS}(W_i)$;
the screening the candidate word set based on an unsupervised algorithm and a prediction model to determine an importance result of the candidate word set specifically includes:
determining, according to the prediction model, the second importance $\mathrm{LCS}(W_i)$ of each candidate word in the candidate word set with respect to the text data.
In some optional embodiments, the determining the subject term according to the importance result of the candidate term set specifically includes:
determining the importance scores of the candidate words according to the importance result of the candidate word set;
arranging the importance scores of the candidate words in descending order, selecting a preset number of candidate words in sequence starting from the candidate word with the largest importance score, and determining the selected candidate words as the subject terms.
For convenience of description, the above devices are described as being functionally divided into various modules, respectively. Of course, the functions of each module may be implemented in one or more pieces of software and/or hardware when implementing one or more embodiments of the present description.
The device of the foregoing embodiment is configured to implement the corresponding method in the foregoing embodiment and has the beneficial effects of the corresponding method embodiment, which are not repeated here.
Based on the same inventive concept, one or more embodiments of the present disclosure further provide an electronic device, corresponding to the method of any of the embodiments, including a memory, a processor, and a computer program stored on the memory and capable of running on the processor, where the processor implements the subject term mining method of any of the embodiments when executing the program.
Fig. 3 shows a more specific hardware architecture of an electronic device according to this embodiment, where the device may include: a processor 310, a memory 320, an input/output interface 330, a communication interface 340, and a bus 350. Wherein the processor 310, the memory 320, the input/output interface 330 and the communication interface 340 are communicatively coupled to each other within the device via a bus 350.
The processor 310 may be implemented by a general-purpose CPU (Central Processing Unit), a microprocessor, an application specific integrated circuit (Application Specific Integrated Circuit, ASIC), or one or more integrated circuits, etc., for executing relevant programs to implement the technical solutions provided in the embodiments of the present disclosure.
The memory 320 may be implemented in the form of ROM (Read Only Memory), RAM (Random Access Memory), a static storage device, a dynamic storage device, or the like. The memory 320 may store an operating system and other application programs, and when the techniques provided by the embodiments of the present disclosure are implemented via software or firmware, the associated program code is stored in the memory 320 and invoked for execution by the processor 310.
The input/output interface 330 is used for connecting with an input/output module to realize information input and output. The input/output module may be configured as a component in a device (not shown) or may be external to the device to provide corresponding functionality. The input device may include a keyboard, a mouse, a touch screen, a microphone, various sensors, etc., and the output device may include a display, a speaker, a vibrator, an indicator light, etc.
The communication interface 340 is used to connect to a communication module (not shown in the figure) to enable communication interaction between the present device and other devices. The communication module may implement communication in a wired manner (such as USB, network cable, etc.) or in a wireless manner (such as mobile network, Wi-Fi, Bluetooth, etc.).
Bus 350 includes a path to transfer information between components of the device (e.g., processor 310, memory 320, input/output interface 330, and communication interface 340).
It should be noted that although the above device only shows the processor 310, the memory 320, the input/output interface 330, the communication interface 340, and the bus 350, in the implementation, the device may further include other components necessary to achieve normal operation. Furthermore, it will be understood by those skilled in the art that the above-described apparatus may include only the components necessary to implement the embodiments of the present description, and not all the components shown in the drawings.
The electronic device of the foregoing embodiment is configured to implement the corresponding method in the foregoing embodiment and has the beneficial effects of the corresponding method embodiment, which are not repeated here.
Based on the same inventive concept, and corresponding to the method of any of the embodiments above, one or more embodiments of the present disclosure also provide a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the subject term mining method described in any of the embodiments above.
The non-transitory computer readable storage media of the present embodiments, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of storage media for a computer include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read only memory (ROM), electrically erasable programmable read only memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device.
The storage medium of the foregoing embodiments stores computer instructions for causing the computer to execute the subject term mining method according to any one of the foregoing embodiments and has the beneficial effects of the corresponding method embodiments, which are not repeated here.
Those of ordinary skill in the art will appreciate that: the discussion of any of the embodiments above is merely exemplary and is not intended to suggest that the scope of the disclosure, including the claims, is limited to these examples; the technical features of the above embodiments or in the different embodiments may also be combined under the idea of the present disclosure, the steps may be implemented in any order, and many other variations exist in the different aspects of one or more embodiments of the present application as described above, which are not provided in detail for simplicity.
Additionally, well-known power/ground connections to Integrated Circuit (IC) chips and other components may or may not be shown within the provided figures, in order to simplify the illustration and discussion, and so as not to obscure one or more embodiments of the present description. Furthermore, the apparatus may be shown in block diagram form in order to avoid obscuring the one or more embodiments of the present description, and also in view of the fact that specifics with respect to implementation of such block diagram apparatus are highly dependent upon the platform within which the one or more embodiments of the present description are to be implemented (i.e., such specifics should be well within purview of one skilled in the art). Where specific details (e.g., circuits) are set forth in order to describe example embodiments of the disclosure, it should be apparent to one skilled in the art that one or more embodiments of the disclosure can be practiced without, or with variation of, these specific details. Accordingly, the description is to be regarded as illustrative in nature and not as restrictive.
While the present disclosure has been described in conjunction with specific embodiments thereof, many alternatives, modifications, and variations of those embodiments will be apparent to those skilled in the art in light of the foregoing description. For example, other memory architectures (e.g., dynamic RAM (DRAM)) may use the embodiments discussed.
The present application is intended to embrace all such alternatives, modifications and variances which fall within the broad scope of the appended claims. Accordingly, any omissions, modifications, equivalents, improvements, and the like, which are within the spirit and principles of the one or more embodiments of the application, are intended to be included within the scope of the present disclosure.

Claims (9)

1. A subject term mining method, comprising:
acquiring text data;
filtering the text data based on a language model to determine a set of candidate words;
screening the candidate word set based on an unsupervised algorithm and a predictive model to determine an importance result of the candidate word set; the importance result of the candidate word set comprises: a first importance $\mathrm{EMS}(W_i)$;
the screening the candidate word set based on an unsupervised algorithm and a prediction model to determine an importance result of the candidate word set specifically comprises:
determining the first importance $\mathrm{EMS}(W_i)$; the first importance $\mathrm{EMS}(W_i)$ is expressed as
wherein $T_j$ represents a text segment obtained by cutting the text data, and $r_i()$ represents an iterative function;
$r_i(T_j)$ represents an importance parameter, which can be expressed as
wherein I() represents a selection function; S represents a sentence generated by an arbitrary number of words; the segmentation term represents the sentence S segmented according to the dictionary D; $W_i$ represents a candidate word; P() represents a probability function; D represents a dictionary; and the remaining parameter represents the dictionary probability parameters;
and determining the subject term according to the importance result of the candidate word set.
2. The subject term mining method of claim 1, wherein the acquiring text data specifically comprises:
acquiring an industry text corpus to be processed;
preprocessing the industry text corpus to obtain the text data; the preprocessing operation includes: deleting redundant characters, determining text granularity and performing line separation processing.
3. The subject term mining method of claim 1, wherein the filtering the text data based on a language model to determine a set of candidate words specifically comprises:
determining word lengths and word frequencies of words in the text data according to the text data based on the language model;
selecting words in the text data with the word length not greater than a word length threshold and the word frequency not less than a word frequency threshold by utilizing a data mining strategy to determine candidate words;
and determining the candidate word set according to the candidate words based on the solidification degree and the freedom degree screening strategy.
4. The subject term mining method of claim 3, wherein the determining the set of candidate words from the candidate words based on a degree of solidification and a degree of freedom screening strategy specifically comprises:
determining the degree of solidification $\mathrm{SD}(W_i)$ and the degree of freedom $\mathrm{FD}(W_i)$ of the candidate words; the degree of solidification $\mathrm{SD}(W_i)$ is expressed as
wherein $W_i$ represents the candidate word, $W_i = C_1 C_2 \ldots C_n$, $C_1, C_2, \ldots, C_n$ represent characters, and P() represents a probability function;
the degree of freedom $\mathrm{FD}(W_i)$ is expressed as
$$\mathrm{FD}(W_i) = \min\{\mathrm{LE}(W_i), \mathrm{RE}(W_i)\}$$
wherein $\mathrm{LE}(W_i)$ represents the left neighbor entropy of the candidate word, and $\mathrm{RE}(W_i)$ represents the right neighbor entropy of the candidate word;
selecting the candidate words whose degree of solidification $\mathrm{SD}(W_i)$ is not less than a solidification degree threshold and whose degree of freedom $\mathrm{FD}(W_i)$ is not less than a degree of freedom threshold to determine the set of candidate words.
5. The subject term mining method of claim 1, wherein the importance result of the candidate word set further comprises: a second importance $\mathrm{LCS}(W_i)$;
the screening the candidate word set based on an unsupervised algorithm and a prediction model to determine an importance result of the candidate word set specifically comprises:
determining, according to the prediction model, the second importance $\mathrm{LCS}(W_i)$ of each candidate word in the candidate word set with respect to the text data.
6. The subject term mining method of claim 5, wherein the determining the subject term according to the importance result of the candidate word set specifically comprises:
determining importance scores of the candidate words according to importance results of the candidate word sets;
sequentially arranging importance scores of the candidate words from large to small, sequentially selecting a preset number of candidate words corresponding to the importance scores from the candidate words corresponding to the maximum importance scores, and determining the selected candidate words as the subject words.
7. A subject term mining apparatus, comprising:
an acquisition module configured to acquire text data;
a filtering module configured to filter the text data based on a language model to determine a set of candidate words;
a screening module configured to screen the set of candidate words based on an unsupervised algorithm and a predictive model to determine an importance result for the set of candidate words; the importance result of the candidate word set comprises: a first importance $\mathrm{EMS}(W_i)$;
the screening the candidate word set based on an unsupervised algorithm and a prediction model to determine an importance result of the candidate word set specifically comprises:
determining the first importance $\mathrm{EMS}(W_i)$; the first importance $\mathrm{EMS}(W_i)$ is expressed as
wherein $T_j$ represents a text segment obtained by cutting the text data, and $r_i()$ represents an iterative function;
$r_i(T_j)$ represents an importance parameter, which can be expressed as
wherein I() represents a selection function; S represents a sentence generated by an arbitrary number of words; the segmentation term represents the sentence S segmented according to the dictionary D; $W_i$ represents a candidate word; P() represents a probability function; D represents a dictionary; and the remaining parameter represents the dictionary probability parameters;
and a determining module configured to determine the subject term according to the importance result of the candidate word set.
8. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the subject term mining method of any one of claims 1 to 6 when executing the program.
9. A non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the subject term mining method of any one of claims 1 to 6.
CN202011580178.XA 2020-12-28 2020-12-28 Method and device for mining subject term, electronic equipment and storage medium Active CN112784009B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011580178.XA CN112784009B (en) 2020-12-28 2020-12-28 Method and device for mining subject term, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011580178.XA CN112784009B (en) 2020-12-28 2020-12-28 Method and device for mining subject term, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN112784009A CN112784009A (en) 2021-05-11
CN112784009B true CN112784009B (en) 2023-08-18

Family

ID=75752926

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011580178.XA Active CN112784009B (en) 2020-12-28 2020-12-28 Method and device for mining subject term, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112784009B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant