CN112784009B - Method and device for mining subject term, electronic equipment and storage medium

Info

Publication number: CN112784009B
Application number: CN202011580178.XA
Authority: CN
Inventors: 熊永平; 曹滔宇; 朱承治; 谷纪亭; 徐翀
Original assignee: State Grid Zhejiang Electric Power Co Ltd; Beijing University of Posts and Telecommunications
Current assignee: State Grid Zhejiang Electric Power Co Ltd; Beijing University of Posts and Telecommunications
Priority date: 2020-12-28
Filing date: 2020-12-28
Publication date: 2023-08-18
Anticipated expiration: 2040-12-28
Also published as: CN112784009A

Abstract

One or more embodiments of the present application provide a subject word mining method and apparatus, an electronic device, and a storage medium. The method includes: acquiring text data; filtering the text data based on a language model to determine a candidate word set; screening the candidate word set based on an unsupervised algorithm and a prediction model to determine an importance result of the candidate word set; and determining the subject words according to the importance result of the candidate word set. The language model filters out character strings with a low degree of solidification in the text data, which reduces the influence of loosely spliced characters on subject word mining. The degree of freedom of a word in the text data reflects the uncertainty of its left and right adjacent words, so that words that can be used freely and independently are found and the search range for subject words is narrowed. The method takes the complex structure of Chinese text corpora into account: layer-by-layer screening identifies the subject words of the text data, and importance ranking additionally allows potential subject words formed from emerging professional vocabulary to be mined.

Description

Method and device for mining subject term, electronic equipment and storage medium

Technical Field

One or more embodiments of the present application relate to the field of data mining technologies, and in particular, to a method, an apparatus, an electronic device, and a storage medium for mining a subject term.

Background

In the prior art, duplicate checking of project texts involves a large volume of text with high granularity, and quickly retrieving similar documents from a document database has become the primary difficulty in improving the accuracy and efficiency of duplicate checking. Since technical projects and scientific research documents generally revolve around several keywords, and these keywords reflect the gist of the text to a certain extent, the similarity between texts can be checked simply by finding and comparing the industry subject words that appear in each text.

Chinese text has a more complex organizational structure than English text. In the prior art, however, keywords of Chinese text are generally mined either with methods designed for English text or by manual extraction, which leads to low accuracy and an inability to accurately mine potential keywords composed of emerging professional vocabulary in Chinese text.

Disclosure of Invention

In view of the foregoing, it is an object of one or more embodiments of the present application to provide a subject word mining method, apparatus, electronic device and storage medium, so as to solve at least one of the above problems in the prior art.

In view of the above object, one or more embodiments of the present application provide a method for mining a subject term, including:

acquiring text data;

filtering the text data based on a language model to determine a set of candidate words;

screening the candidate word set based on an unsupervised algorithm and a predictive model to determine an importance result of the candidate word set;

and determining the subject term according to the importance result of the candidate word set.

Optionally, the acquiring text data specifically includes:

acquiring an industry text corpus to be processed;

preprocessing the industry text corpus to obtain the text data, where the preprocessing operation includes: deleting redundant characters, determining text granularity, and performing line separation processing.

Optionally, the filtering the text data based on the language model to determine a candidate word set specifically includes:

determining word lengths and word frequencies of words in the text data according to the text data based on the language model;

selecting words in the text data with the word length not greater than a word length threshold and the word frequency not less than a word frequency threshold by utilizing a data mining strategy to determine candidate words;

and determining the candidate word set according to the candidate words based on the solidification degree and the freedom degree screening strategy.

Optionally, the determining the candidate word set according to the candidate word based on the solidification degree and the freedom degree screening policy specifically includes:

determining the degree of solidification $SD(W_i)$ and the degree of freedom $FD(W_i)$, where the degree of solidification $SD(W_i)$ is expressed as

[equation image not reproduced]

where $W_i$ denotes the candidate word, $W_i = C_1 C_2 \ldots C_n$, $C_1, C_2, \ldots, C_n$ denote characters, and $p(\cdot)$ denotes a probability function;

the degree of freedom $FD(W_i)$ is expressed as

$$FD(W_i) = \min\{LE(W_i), RE(W_i)\}$$

where $LE(W_i)$ denotes the left-neighbor entropy of the candidate word and $RE(W_i)$ denotes the right-neighbor entropy of the candidate word;

selecting the candidate words whose degree of solidification $SD(W_i)$ is not less than a solidification degree threshold and whose degree of freedom $FD(W_i)$ is not less than a degree of freedom threshold to determine the candidate word set.

Optionally, the importance result of the candidate word set includes a first importance $EMS(W_i)$;

the screening the candidate word set based on an unsupervised algorithm and a prediction model to determine an importance result of the candidate word set specifically includes:

determining the first importance $EMS(W_i)$, where the first importance $EMS(W_i)$ is expressed as

[equation image not reproduced]

where $T_j$ denotes a text segment obtained by cutting the text data, and $r_i(\cdot)$ denotes an iterative function.

Optionally, the importance result of the candidate word set further includes a second importance $LCS(W_i)$;

determining the second importance $LCS(W_i)$ of each of the candidate words in the candidate word set with respect to the text data according to the prediction model.

Optionally, the determining the subject term according to the importance result of the candidate term set specifically includes:

determining importance scores of the candidate words according to importance results of the candidate word sets;

sorting the importance scores of the candidate words in descending order, selecting a preset number of candidate words starting from the candidate word with the highest importance score, and determining the selected candidate words as the subject words.

Based on the same inventive concept, one or more embodiments of the present application further provide a subject word mining apparatus, including:

an acquisition module configured to acquire text data;

a filtering module configured to filter the text data based on a language model to determine a set of candidate words;

a screening module configured to screen the set of candidate words based on an unsupervised algorithm and a predictive model to determine an importance result for the set of candidate words;

and the determining module is configured to determine the subject word according to the importance result of the candidate word set.

Based on the same inventive concept, one or more embodiments of the present application further provide an electronic device, including a memory, a processor, and a computer program stored on the memory and capable of running on the processor, where the processor implements the subject word mining method described in any one of the above when executing the program.

Based on the same inventive concept, one or more embodiments of the present application also provide a non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the subject word mining method described in any one of the above.

As can be seen from the foregoing, one or more embodiments of the present application provide a subject word mining method, including: acquiring text data; filtering the text data based on a language model to determine a candidate word set; screening the candidate word set based on an unsupervised algorithm and a prediction model to determine an importance result of the candidate word set; and determining the subject words according to the importance result of the candidate word set. In the method, the text corpus to be processed is preprocessed and cleaned to obtain the text data, which reduces the influence of redundant characters on the subject words in the text data. The text data is then filtered by the language model to determine the candidate word set: character strings with a low degree of solidification are filtered out, which reduces the influence of loosely spliced characters on subject word mining, and the degree of freedom of a word in the text data reflects the uncertainty of its left and right adjacent words, so that words that can be used freely and independently are found and the search range for subject words is narrowed. The candidate word set is further screened according to the unsupervised algorithm and the supervised prediction model, and the subject words are mined by ranking the importance of each candidate word in the candidate word set. The method takes the complex structure of Chinese text corpora into account: layer-by-layer screening identifies the subject words of the text data, and importance ranking additionally allows potential subject words formed from emerging professional vocabulary to be mined.

Drawings

In order to more clearly illustrate one or more embodiments of the present application or the prior art solutions, the drawings used in the description of the embodiments or the prior art are briefly introduced below. The drawings described below obviously show only one or more embodiments of the present application, and those of ordinary skill in the art can obtain other drawings from them without inventive effort.

FIG. 1 is a flow diagram of a subject word mining method according to one or more embodiments of the present application;

FIG. 2 is a schematic diagram of a subject word mining apparatus according to one or more embodiments of the present application;

FIG. 3 is a schematic structural diagram of an electronic device according to one or more embodiments of the present application.

Detailed Description

For the purposes of promoting an understanding of the principles and advantages of the disclosure, reference will now be made to the embodiments illustrated in the drawings and specific language will be used to describe the same.

It is noted that unless otherwise defined, technical or scientific terms used in one or more embodiments of the present application should be given the ordinary meaning as understood by one of ordinary skill in the art to which this disclosure belongs. The use of the terms "first," "second," and the like in one or more embodiments of the present application does not denote any order, quantity, or importance, but rather the terms "first," "second," and the like are used to distinguish one element from another. The word "comprising" or "comprises", and the like, means that elements or items preceding the word are included in the element or item listed after the word and equivalents thereof, but does not exclude other elements or items. The terms "connected" or "connected," and the like, are not limited to physical or mechanical connections, but may include electrical connections, whether direct or indirect. "upper", "lower", "left", "right", etc. are used merely to indicate relative positional relationships, which may also be changed when the absolute position of the object to be described is changed.

As described in the background, as texts in professional fields become increasingly abundant, the number of documents in document databases grows exponentially. To direct scientific research funding to research projects with high value density, a multi-stage review process needs to be set up during the approval of scientific research projects. In the prior art, however, duplicate checking of science and technology projects is generally performed by manually extracting keywords or key research content and comparing them with historical data of completed or ongoing science and technology projects.

The applicant has found that, in the prior art, manual duplicate checking places high demands on the professional and technical level of the staff and suffers from low checking efficiency, omissions, errors, and the like. In addition, duplicate checking of research projects faces the challenges of a large text corpus and high granularity of the project text corpus. A method is therefore needed that improves the accuracy and efficiency of duplicate checking for science and technology projects and scientific research documents, that is, a method for quickly retrieving similar documents from a document database. The applicant has found that a scientific research project or scientific research document usually centers on several keywords, and these keywords reflect the gist and meaning of the text to a certain extent; for duplicate checking of science and technology projects and research documents in an industry, the similarity between texts can therefore be checked simply by finding and comparing the industry subject words that appear in each text. Moreover, unlike English text, Chinese text has a more complex organizational structure, so Chinese documents are more difficult to process than English documents. Manual duplicate checking, or applying subject word mining methods designed for English text, as in the prior art, suffers from low accuracy and cannot accurately mine potential subject words composed of emerging professional vocabulary in Chinese text.

Therefore, in the method provided by the present application, the text corpus to be processed is preprocessed and cleaned to obtain the text data, which reduces the influence of redundant characters on the subject words in the text data. The text data is then filtered by the language model to determine the candidate word set: character strings with a low degree of solidification are filtered out, which reduces the influence of loosely spliced characters on subject word mining, and the degree of freedom of a word in the text data reflects the uncertainty of its left and right adjacent words, so that words that can be used freely and independently are found and the search range for subject words is narrowed. The candidate word set is further screened according to the unsupervised algorithm and the supervised prediction model, and the subject words are mined by ranking the importance of each candidate word in the candidate word set. The method takes the complex structure of Chinese text corpora into account: layer-by-layer screening identifies the subject words of the text data, and importance ranking additionally allows potential subject words formed from emerging professional vocabulary to be mined, which improves the accuracy and efficiency of subject word mining and extraction.

The technical scheme of the present disclosure is further described in detail below through specific examples.

Referring to FIG. 1, one or more embodiments of the present application provide a subject word mining method, which specifically includes the following steps:

S101: Text data is acquired.

In this embodiment, the text data on which subject word mining is to be performed is acquired. Specifically, an industry text corpus to be processed may be acquired, and a preprocessing operation is then performed on it; the text data is obtained by cleaning the industry text corpus to be processed. The industry text corpus to be processed may be text-type data such as electronic books, news documents, research papers, digital libraries, web pages, emails, database records, and the like.

In some alternative embodiments, the preprocessing operation may include: deleting redundant characters, determining text granularity, line separation processing, Chinese word segmentation, deleting format markers, and the like. For example, for web page text data, the web page tags need to be removed to obtain plain text.
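The preprocessing operations listed above can be sketched in Python as follows. This is a minimal illustration only: the function name `preprocess`, the regular expressions, and the choice of sentence-level granularity are assumptions made for the example and are not specified by the present application. Chinese word segmentation is omitted.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")            # crude matcher for web page tags (assumed format markers)
SPACE_RE = re.compile(r"[ \t\u3000]+")     # runs of ASCII and full-width spaces
SPLIT_RE = re.compile(r"[。！？!?；;\n]+")    # assumed sentence-level text granularity


def preprocess(raw: str) -> list[str]:
    """Clean a raw corpus string and return one text fragment per list item."""
    text = TAG_RE.sub(" ", raw)             # delete format markers such as HTML tags
    text = html.unescape(text)              # decode entities such as &amp;
    text = SPACE_RE.sub(" ", text)          # delete redundant whitespace characters
    fragments = (part.strip() for part in SPLIT_RE.split(text))
    return [part for part in fragments if part]  # line separation, dropping empty lines


if __name__ == "__main__":
    for line in preprocess("<p>电网 调度。</p><br>负荷预测！"):
        print(line)
```

Splitting on sentence-final punctuation is only one possible granularity; paragraph-level or document-level fragments would change `SPLIT_RE` but not the rest of the sketch.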

S102: the text data is filtered based on a language model to determine a set of candidate words.

In this embodiment, after the text data is obtained, because the vocabulary in the text data is very large, the text data may be filtered based on the language model to determine the candidate word set, whose words are representative of the vocabulary in the text data. Specifically, the word length and word frequency of the words in the text data can be determined based on the language model; words whose word length is not greater than a word length threshold and whose word frequency is not less than a word frequency threshold are selected using a data mining strategy to determine candidate words; and after the candidate words are determined, the candidate word set is determined from the candidate words based on the solidification degree and freedom degree screening strategy.

In some alternative embodiments, the language model may be an N-gram model, where N takes values in the range [2, 10]. After the preprocessing operation is performed on the industry text corpus to be processed, the granularity of the text data can be determined and the text data is segmented into a plurality of text fragments, where a text fragment may be a sentence, a paragraph, or even the whole document. The text data is organized into a set of text fragments, and the punctuation marks in the text fragments are filtered out.

In some alternative embodiments, for each word in the text data, the frequency of occurrence $F(W_i)$ and the word length of the word may be determined, and a word length threshold TL = 10 and a word frequency threshold TF = 3 may be preset. According to the granularity of the text fragments, a data mining strategy (such as the Apriori strategy) is used to select the words in the text data whose word length is not greater than the word length threshold and whose word frequency is not less than the word frequency threshold, and the words meeting these conditions are determined to be candidate words. An overcomplete dictionary may be composed from the candidate words, and the word frequencies may be normalized (e.g., to the interval (0, 1]) to initialize the usage rate of each candidate word.
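The selection of candidate words by word length and word frequency can be illustrated with the following Python sketch. It uses plain exhaustive n-gram counting in place of the Apriori strategy mentioned above, which is a simplification chosen for brevity; the function names `count_ngrams` and `select_candidates` are hypothetical, and the defaults mirror the example values N in [2, 10], TL = 10 and TF = 3.

```python
from collections import Counter


def count_ngrams(fragments, min_len=2, max_len=10):
    """Count every character n-gram with min_len <= n <= max_len in each fragment."""
    counts = Counter()
    for frag in fragments:
        for start in range(len(frag)):
            for n in range(min_len, max_len + 1):
                if start + n > len(frag):
                    break
                counts[frag[start:start + n]] += 1
    return counts


def select_candidates(counts, max_word_len=10, min_freq=3):
    """Keep n-grams within the length threshold and at or above the frequency threshold.

    Returns a dict mapping each candidate word to its frequency normalised so
    that all values lie in (0, 1] and sum to 1 (an assumed normalisation).
    """
    kept = {w: c for w, c in counts.items()
            if len(w) <= max_word_len and c >= min_freq}
    total = sum(kept.values())
    return {w: c / total for w, c in kept.items()}


if __name__ == "__main__":
    fragments = ["电力系统", "电力系统", "电力系统", "负荷"]
    print(select_candidates(count_ngrams(fragments)))
```

The resulting dictionary is deliberately overcomplete: it still contains fragments such as "力系" that are not words, which is why the solidification and freedom screening follows.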

In some alternative embodiments, after the candidate words are determined, the candidate word set may be determined from the candidate words based on the solidification degree and freedom degree screening strategy. Since each candidate word may be composed of a plurality of characters, a candidate word may be represented as $W_i = C_1 C_2 \ldots C_{n-1} C_n$, where $C_1, C_2, \ldots, C_n$ denote characters. For any candidate word $W_i$, whether it is a word that can be used independently can be analyzed from an internal and an external perspective. Internally, the degree of solidification is analyzed, that is, whether the characters are spliced together tightly enough; the greater the degree of solidification, the more likely the candidate word is an independent word. Externally, it is analyzed whether the candidate word is used independently and freely throughout the text data, that is, how large the uncertainty (entropy) of its left and right adjacent words is; the larger the entropy, the more independently and freely the candidate word is used.

It should be noted that determining the candidate word set from the candidate words based on the solidification degree and freedom degree screening strategy may specifically include: determining the degree of solidification $SD(W_i)$ and the degree of freedom $FD(W_i)$, where the degree of solidification $SD(W_i)$ is expressed as

[equation image not reproduced]

where $p(\cdot)$ denotes a probability function. When the degree of freedom of a candidate word is determined, the left-neighbor entropy and the right-neighbor entropy of the candidate word can be determined first. Specifically, when the left-neighbor entropy is calculated, the set of all left adjacent words of any candidate word is defined as

$$S_{left}(W_i) = \{C_{0,i},\ i = 1, 2, \ldots, m\}$$

where $C_{0,i}$ denotes a left adjacent word. Further, the probability of occurrence of each character among the left adjacent words can be expressed as

[equation image not reproduced]

where $\mathrm{count}(\cdot)$ denotes a counting function. Combining the above parameters, the left-neighbor entropy of any candidate word can be expressed as

$$LE(W_i) = -\sum_{i=1}^{m} p(C_{0,i}) \log p(C_{0,i})$$

Similarly, when the right-neighbor entropy is calculated, the set of all right adjacent words of the candidate word is defined as

$$S_{right}(W_i) = \{C_{n+1,i},\ i = 1, 2, \ldots, m\}$$

where $C_{n+1,i}$ denotes a right adjacent word. Thus, the right-neighbor entropy of any candidate word can be expressed as

$$RE(W_i) = -\sum_{i=1}^{m} p(C_{n+1,i}) \log p(C_{n+1,i})$$

After the degree of solidification and the degree of freedom of the candidate words are determined, the candidate words whose degree of solidification is not less than a preset solidification degree threshold and whose degree of freedom is not less than a preset degree of freedom threshold are selected based on the solidification degree and freedom degree screening strategy, and the candidate word set is obtained by this further screening. The preset solidification degree threshold may be 5.3 and the preset degree of freedom threshold may be 0.75; in practical applications, both thresholds can be adjusted dynamically according to the corpus.
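The screening by degree of solidification and degree of freedom can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the formula of the present application: the degree of solidification is assumed here to be the smallest ratio p(W) / (p(left part) · p(right part)) over all two-way splits of the word, probabilities are assumed to be counts divided by the total number of characters, and the neighbor entropy uses the natural logarithm. Only the relation FD = min{LE, RE} and the example thresholds 5.3 and 0.75 are taken from the description. All function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def ngram_stats(fragments, max_len):
    """Count n-grams (1..max_len) and the characters seen directly left and right of each."""
    counts = Counter()
    left = defaultdict(Counter)
    right = defaultdict(Counter)
    for frag in fragments:
        for i in range(len(frag)):
            for n in range(1, max_len + 1):
                j = i + n
                if j > len(frag):
                    break
                word = frag[i:j]
                counts[word] += 1
                if i > 0:
                    left[word][frag[i - 1]] += 1
                if j < len(frag):
                    right[word][frag[j]] += 1
    return counts, left, right


def entropy(neighbours):
    """Shannon entropy (natural log) of a neighbour-character count table."""
    total = sum(neighbours.values())
    if total == 0:
        return 0.0
    return -sum(c / total * math.log(c / total) for c in neighbours.values())


def solidification(word, counts, total_chars):
    """Assumed SD: minimum of p(word) / (p(prefix) * p(suffix)) over all splits."""
    def p(s):
        return counts[s] / total_chars
    return min(p(word) / (p(word[:k]) * p(word[k:])) for k in range(1, len(word)))


def freedom(word, left, right):
    """FD(W) = min{LE(W), RE(W)}, as stated in the description."""
    return min(entropy(left[word]), entropy(right[word]))


def screen(words, counts, left, right, total_chars, sd_min=5.3, fd_min=0.75):
    """Keep multi-character words meeting both thresholds."""
    return [w for w in words
            if len(w) > 1
            and solidification(w, counts, total_chars) >= sd_min
            and freedom(w, left, right) >= fd_min]
```

Because the scale of the solidification score depends on how probabilities are normalized, the threshold 5.3 is meaningful only relative to the formula actually used; with the assumed ratio form it should be treated as a tunable parameter.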

S103: the set of candidate words is filtered based on an unsupervised algorithm and a predictive model to determine importance results for the set of candidate words.

In this embodiment, after the candidate word set is determined, the candidate word set may be screened based on an unsupervised algorithm in combination with a prediction model obtained by supervised training, so as to obtain the importance result of the candidate word set. Before the candidate word set is screened using the unsupervised algorithm, the following parameters can be defined. Assuming that a sentence is made up of a set of words and each word is made up of a set of characters, the dictionary in the text data is defined as

$$D = \{W_1, W_2, \ldots, W_N\}.$$

It is assumed that each sentence in the text data is formed by concatenating a set of candidate words randomly sampled from the dictionary $D$, and that the sampling probability of each candidate word $W_i$ is $\theta_i$. The dictionary probability parameters are defined as

$$\theta = (\theta_1, \theta_2, \ldots, \theta_N).$$

A sentence generated from $K$ words can be expressed as

$$S = W_{i_1} W_{i_2} \ldots W_{i_K}.$$

The probability of generating the sentence is expressed as

$$P(S \mid D, \theta) = \prod_{k=1}^{K} \theta_{i_k}$$

where $P(\cdot)$ denotes a probability matrix comprising a plurality of probability values. For a given text segment $T$ that has not been cut, define $C_T$ as the set of all sentences obtained by segmenting $T$ according to the dictionary $D$; the probability of the unsegmented text segment $T$ can then be expressed as

$$P(T \mid D, \theta) = \sum_{S \in C_T} P(S \mid D, \theta)$$
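The two probabilities just described can be computed as in the following Python sketch, which assumes that a sentence probability is the product of the sampling probabilities of its words and that the probability of an unsegmented text is the sum over all segmentations permitted by the dictionary. The function names and the dynamic programming formulation are illustrative choices, not taken from the present application.

```python
import math


def sentence_probability(words, theta):
    """Probability of one segmented sentence: product of the word sampling probabilities."""
    return math.prod(theta[w] for w in words)


def text_probability(text, theta, max_len=10):
    """Probability of an unsegmented text: sum over every dictionary segmentation.

    forward[k] holds the total probability of all segmentations of text[:k],
    so the exponential number of segmentations is never enumerated explicitly.
    """
    forward = [0.0] * (len(text) + 1)
    forward[0] = 1.0
    for end in range(1, len(text) + 1):
        for start in range(max(0, end - max_len), end):
            word = text[start:end]
            if word in theta:
                forward[end] += forward[start] * theta[word]
    return forward[-1]


if __name__ == "__main__":
    theta = {"电": 0.2, "力": 0.3, "电力": 0.5}   # hypothetical dictionary parameters
    print(sentence_probability(["电", "力"], theta))
    print(text_probability("电力", theta))
```

For the two-character text in the usage example there are only two segmentations ("电" + "力" and "电力"), so the result can be checked by hand against the sum of the two sentence probabilities.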

It should be noted that the importance result of the candidate word set includes a first importance $EMS(W_i)$. Screening the candidate word set based on the unsupervised algorithm specifically includes: determining, according to the unsupervised algorithm (which may also be referred to as the EMwords algorithm), the first importance $EMS(W_i)$ of each candidate word in the candidate word set relative to the text data. To determine the first importance $EMS(W_i)$, $\theta^r$ is defined as the parameter estimated in the r-th iteration, and the iterative function is determined using the expectation step (E-step) of the unsupervised algorithm, expressed as

[equation image not reproduced]

Using the maximization step (M-step) of the unsupervised algorithm on the iterative function $Q(\theta, \theta^r)$, the parameter $\theta^r$ can be updated with the following expression:

[equation image not reproduced]

where $n_i(S)$ denotes the probability of occurrence of the candidate word in the sentence $S$, and $n_i(T_j)$ denotes the probability of occurrence of the candidate word in the text segment $T_j$, which can be expressed as

[equation image not reproduced]

The sum of the probabilities of the candidate word occurring in each text segment can be expressed as

[equation image not reproduced]

Combining the above parameters, the first importance $EMS(W_i)$ is expressed as

[equation image not reproduced]

where $r_i(\cdot)$ denotes an iterative function and $r_i(T_j)$ denotes an importance parameter, which can be expressed as

[equation image not reproduced]

where $I(\cdot)$ denotes a selection function, with $I = 1$ when the expression in brackets holds and $I = 0$ otherwise, and [symbol image not reproduced] denotes the optimal iteration parameters. A negative mapping is applied to the importance parameter $r_i(T_j)$ to obtain the first importance $EMS(W_i)$, which is at the same time limited to the interval [0, 1]. For example, a word of high importance in a sentence generally occurs more frequently, but the value of the importance parameter $r_i(T_j)$ for a frequently occurring word is smaller; that is, the value calculated for $r_i(T_j)$ is inversely proportional to the importance of the word, so a logarithmic function can be used to limit its interval while applying the negative mapping.
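One iteration of the E-step and M-step described above can be sketched in Python as follows. Because the update expressions are not available in this text, the sketch assumes the standard EM update for a unigram word dictionary model: the E-step computes the expected number of occurrences of each candidate word in each text segment with a forward-backward pass, and the M-step renormalizes the summed expected counts. The function names `expected_counts` and `em_step` are hypothetical, and the sketch stops short of computing the first importance itself.

```python
def expected_counts(text, theta, max_len=10):
    """E-step for one text segment: expected occurrences of each dictionary word."""
    n = len(text)
    fwd = [0.0] * (n + 1)   # fwd[k]: probability mass of all segmentations of text[:k]
    fwd[0] = 1.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            fwd[end] += fwd[start] * theta.get(text[start:end], 0.0)
    bwd = [0.0] * (n + 1)   # bwd[k]: probability mass of all segmentations of text[k:]
    bwd[n] = 1.0
    for start in range(n - 1, -1, -1):
        for end in range(start + 1, min(n, start + max_len) + 1):
            bwd[start] += theta.get(text[start:end], 0.0) * bwd[end]
    counts = {}
    if fwd[n] == 0.0:       # the dictionary cannot segment this text at all
        return counts
    for start in range(n):
        for end in range(start + 1, min(n, start + max_len) + 1):
            word = text[start:end]
            prob = theta.get(word, 0.0)
            if prob > 0.0:
                posterior = fwd[start] * prob * bwd[end] / fwd[n]
                counts[word] = counts.get(word, 0.0) + posterior
    return counts


def em_step(fragments, theta, max_len=10):
    """M-step: new theta proportional to the expected counts summed over all segments."""
    totals = {}
    for frag in fragments:
        for word, c in expected_counts(frag, theta, max_len).items():
            totals[word] = totals.get(word, 0.0) + c
    norm = sum(totals.values())
    if norm == 0.0:
        return dict(theta)  # nothing could be segmented; keep the old parameters
    return {word: c / norm for word, c in totals.items()}
```

Calling `em_step` repeatedly until the parameters stop changing gives the converged dictionary probabilities; words whose probability shrinks towards zero are those the model can explain as concatenations of other words.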

In some alternative embodiments, discovering subject words with the unsupervised algorithm is theoretically well founded, requires no labeled data, and does not depend on a knowledge base; the preliminary screening result obtained by determining the candidate word set from the text data and screening it by importance is a result with a high lower bound. A prediction model obtained by supervised training can then be used for further screening. For text data in different fields, publicly available, high-quality, expert-labeled domain vocabulary can be used as a knowledge base or training set to train the supervised model, thereby improving the effect of the prediction model. For example, in the training data preparation stage, 110 domain-specific professional words are obtained by collecting authoritative domain dictionaries and journal paper keywords, and in addition 1.2 million general words are obtained from the Sogou 30 GB online news corpus. The domain vocabulary and the general vocabulary are labeled separately, where B-COMMON and I-COMMON denote general vocabulary labels, B-ELEC and I-ELEC denote domain vocabulary labels, and the prefixes B and I denote the start mark and the internal mark of a vocabulary label, respectively. The words in the existing professional lexicon and general lexicon are then segmented into single characters, which are input into a Word2vec network for training to generate character vectors with m dimensions.
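The character-level labeling scheme described above can be illustrated with a short Python sketch. The label names B-ELEC, I-ELEC, B-COMMON and I-COMMON come from the description; the function names and the example words are hypothetical, and the Word2vec training step is not shown.

```python
def label_characters(word, tag):
    """Split a word into characters and attach B-/I- labels for the given tag."""
    return [(ch, ("B-" if i == 0 else "I-") + tag) for i, ch in enumerate(word)]


def build_training_rows(domain_words, common_words):
    """Label domain words with ELEC and general words with COMMON, one row per word."""
    rows = [label_characters(w, "ELEC") for w in domain_words]
    rows += [label_characters(w, "COMMON") for w in common_words]
    return rows


if __name__ == "__main__":
    for row in build_training_rows(["变压器"], ["今天"]):
        print(row)
```

Each row pairs the characters of one word with its tag sequence, which is the input shape a character-level sequence labeling model typically expects.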

In some alternative embodiments, the importance result of the candidate word set further includes a second importance $LCS(W_i)$, and screening the candidate word set based on the unsupervised algorithm and the prediction model to determine the importance result of the candidate word set further includes: determining the second importance $LCS(W_i)$. Specifically, the prediction model for subject word discovery may adopt a BiLSTM+CRF structure based on word vectors; compared with a conventional BiLSTM model, a CRF layer is added to learn the optimal path. The dimension of the output vector of the BiLSTM layer is defined as tag-size, which is equivalent to mapping each candidate word $w_i$ to the labeled $tag_j$. Let the output matrix of the BiLSTM layer be $P$; then $P_{i,j}$ denotes the non-normalized probability of mapping the candidate word $w_i$ to $tag_j$. For the CRF layer, a transition matrix $A$ is assumed, where $A_{i,j}$ denotes the transition probability from $tag_i$ to $tag_j$. For the output tag sequence $y$ corresponding to the input sequence $X$ (i.e. the number of tags), a score is defined as

$$s(X, y) = \sum_{i=0}^{n} A_{y_i, y_{i+1}} + \sum_{i=1}^{n} P_{i, y_i}$$

where $A_{y_i, y_{i+1}}$ denotes the transition probability from the i-th tag to the (i+1)-th tag, and $P_{i, y_i}$ denotes the non-normalized probability of the candidate word $w_i$ being mapped to the i-th tag.

A Softmax function is used to normalize over the tag sequences to obtain the probability value of the correct tag sequence $y$, i.e. the likelihood probability, expressed as

$$p(y \mid X) = \frac{e^{s(X, y)}}{\sum_{\tilde{y} \in Y_X} e^{s(X, \tilde{y})}}$$

where $Y_X$ denotes all tag sequences corresponding to the input sequence $X$. Therefore, in training, only the likelihood probability $p(y \mid X)$ needs to be maximized; the log-likelihood is used here, i.e.

$$\log p(y \mid X) = s(X, y) - \log \sum_{\tilde{y} \in Y_X} e^{s(X, \tilde{y})}$$
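The scoring and normalization performed by the CRF layer can be illustrated with the following pure Python sketch. It assumes the usual linear-chain form, in which the score of a tag sequence is the sum of its emission scores and tag-to-tag transition scores, and it omits start and end tags for brevity. The emission matrix stands in for the BiLSTM output, which is not modeled here; the function names are hypothetical.

```python
import math


def sequence_score(emissions, transitions, tags):
    """Score of one tag sequence: emission scores plus transition scores."""
    score = sum(emissions[i][t] for i, t in enumerate(tags))
    score += sum(transitions[a][b] for a, b in zip(tags, tags[1:]))
    return score


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))


def log_partition(emissions, transitions):
    """Log of the summed exponentiated scores of all tag sequences (forward algorithm)."""
    k = len(transitions)
    alpha = list(emissions[0])
    for row in emissions[1:]:
        alpha = [log_sum_exp([alpha[a] + transitions[a][b] for a in range(k)]) + row[b]
                 for b in range(k)]
    return log_sum_exp(alpha)


def log_likelihood(emissions, transitions, tags):
    """Log-probability of a tag sequence under Softmax over all sequences."""
    return sequence_score(emissions, transitions, tags) - log_partition(emissions, transitions)


if __name__ == "__main__":
    emissions = [[1.0, 0.5], [0.2, 1.5], [0.3, 0.1]]   # hypothetical scores, 3 positions x 2 tags
    transitions = [[0.5, -0.2], [0.1, 0.8]]            # hypothetical tag-to-tag scores
    print(math.exp(log_likelihood(emissions, transitions, [0, 1, 1])))
```

Training would maximize `log_likelihood` for the labeled tag sequences; the forward recursion keeps the normalizer tractable even though the number of tag sequences grows exponentially with sequence length.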

Finally, the trained prediction model can be used to decode the set formed from the test corpus to obtain a probability value for each candidate word, and this probability value can be recorded as the second importance $LCS(W_i)$ of the corresponding candidate word.

S104: The subject words are determined according to the importance result of the candidate word set.

In this embodiment, after the importance result of the candidate word set is determined, the subject words may be determined according to the first importance $EMS(W_i)$ and the second importance $LCS(W_i)$. Specifically, for the candidate word set screened by the unsupervised algorithm and the prediction model, an importance score is calculated for each candidate word $W_i$ in the set; that is, the importance score of each candidate word is determined from the importance result of the candidate word set and is expressed as

$$S(W_i) = (1 - \mu)\,EMS(W_i) + \mu\,LCS(W_i)$$

where $\mu$ denotes an assigned weight; for example, $\mu$ takes the value 0.3. The importance scores of the candidate words are sorted in descending order, a preset number (e.g., top N) of candidate words is selected starting from the candidate word with the highest importance score, and the selected candidate words are determined as the subject words, which completes the mining of the subject words.
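The weighted combination and top N selection can be sketched in Python as follows. The weight 0.3 is the example value given above; the function name `rank_subject_words`, the dictionary-based inputs, and the treatment of a missing second importance as 0 are assumptions made for illustration.

```python
def rank_subject_words(ems, lcs, mu=0.3, top_n=10):
    """Combine the two importance values and return the top_n (word, score) pairs.

    ems and lcs map candidate words to their first and second importance.
    A word missing from lcs is given a second importance of 0.0 (an assumption).
    """
    scores = {w: (1 - mu) * ems[w] + mu * lcs.get(w, 0.0) for w in ems}
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_n]


if __name__ == "__main__":
    ems = {"变压器": 0.9, "今天": 0.2, "负荷预测": 0.5}   # hypothetical first importance values
    lcs = {"变压器": 0.1, "今天": 0.9, "负荷预测": 0.5}   # hypothetical second importance values
    for word, score in rank_subject_words(ems, lcs, top_n=2):
        print(word, round(score, 2))
```

With a weight of 0.3 the unsupervised score dominates, so a word the prediction model rates highly but the unsupervised algorithm rates poorly can still fall outside the top N.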

As can be seen from the foregoing, one or more embodiments of the present application provide a subject word mining method, including: acquiring text data; filtering the text data based on a language model to determine a candidate word set; screening the candidate word set based on an unsupervised algorithm and a prediction model to determine an importance result of the candidate word set; and determining the subject words according to the importance result of the candidate word set. In the method, the text corpus to be processed is preprocessed and cleaned to obtain the text data, which reduces the influence of redundant characters on the subject words in the text data. The text data is then filtered by the language model to determine the candidate word set: character strings with a low degree of solidification are filtered out, which reduces the influence of loosely spliced characters on subject word mining, and the degree of freedom of a word in the text data reflects the uncertainty of its left and right adjacent words, so that words that can be used freely and independently are found and the search range for subject words is narrowed. The candidate word set is further screened according to the unsupervised algorithm and the supervised prediction model, and the subject words are mined by ranking the importance of each candidate word in the candidate word set. The method takes the complex structure of Chinese text corpora into account: layer-by-layer screening identifies the subject words of the text data, and importance ranking additionally allows potential subject words formed from emerging professional vocabulary to be mined, which improves the accuracy and efficiency of subject word mining and extraction.

It is understood that the method may be performed by any apparatus, device, platform, or cluster of devices having computing and processing capabilities.

It should be noted that the methods of one or more embodiments of the present description may be performed by a single device, such as a computer or a server. The method of the embodiments can also be applied in a distributed scenario, where it is completed by a plurality of devices cooperating with one another. In such a distributed scenario, one of the devices may perform only one or more steps of the method of one or more embodiments of the present description, and the devices interact with each other to complete the method.

It should be noted that the foregoing describes specific embodiments of the present application. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims can be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.

Based on the same inventive concept, one or more embodiments of the present application further provide a subject word mining apparatus which, referring to Fig. 2, includes:

an acquisition module configured to acquire text data;

In some optional embodiments, the acquiring text data specifically includes:

acquiring an industry text corpus to be processed;

In some optional embodiments, the filtering the text data based on the language model to determine the candidate word set specifically includes:

In some optional embodiments, the determining the candidate word set from the candidate words based on a solidification degree and degree of freedom screening strategy specifically includes:

the degree of freedom $\mathrm{FD}(W_i)$ is expressed as

$\mathrm{FD}(W_i) = \min\{\mathrm{LE}(W_i), \mathrm{RE}(W_i)\}$
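As a minimal sketch of the degree-of-freedom formula, the Python below assumes that LE(W_i) and RE(W_i) are the Shannon entropies of the characters immediately to the left and to the right of each occurrence of W_i in the text. The helper names `neighbor_entropy` and `degree_of_freedom`, and the choice of the natural logarithm, are hypothetical and not taken from the application.

```python
import math
from collections import Counter
from typing import Iterable


def neighbor_entropy(neighbors: Iterable[str]) -> float:
    """Shannon entropy (natural log) of the observed neighbor characters."""
    counts = Counter(neighbors)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # H = sum over neighbors of p * log(1 / p), with p = count / total.
    return sum((c / total) * math.log(total / c) for c in counts.values())


def degree_of_freedom(text: str, word: str) -> float:
    """FD(W) = min(LE(W), RE(W)) over all occurrences of word in text."""
    if not word:
        raise ValueError("word must be non-empty")
    left, right = [], []
    start = text.find(word)
    while start != -1:
        end = start + len(word)
        if start > 0:
            left.append(text[start - 1])  # character just before the word
        if end < len(text):
            right.append(text[end])       # character just after the word
        start = text.find(word, start + 1)
    return min(neighbor_entropy(left), neighbor_entropy(right))


# Toy text: "ab" is preceded by x, y and followed by x, y, z.
fd = degree_of_freedom("abxabyabz", "ab")
print(round(fd, 4))
```

Taking the minimum of the two entropies means a word is treated as freely usable only when both its left and right contexts vary; a string that always appears next to the same character on one side gets a degree of freedom of zero.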

In some optional embodiments, the importance result of the candidate word set includes: a first importance $\mathrm{EMS}(W_i)$;

determining the first importance $\mathrm{EMS}(W_i)$; the first importance $\mathrm{EMS}(W_i)$ is expressed as

In some optional embodiments, the importance result of the candidate word set further includes: a second importance $\mathrm{LCS}(W_i)$;

In some optional embodiments, the determining the subject term according to the importance result of the candidate term set specifically includes:

determining the importance scores of the subject terms according to the importance results of the candidate word sets;

For convenience of description, the above devices are described as being functionally divided into various modules, respectively. Of course, the functions of each module may be implemented in one or more pieces of software and/or hardware when implementing one or more embodiments of the present description.

The apparatus of the foregoing embodiment is configured to implement the corresponding method in the foregoing embodiments and has the beneficial effects of the corresponding method embodiments, which are not repeated here.

Based on the same inventive concept, one or more embodiments of the present disclosure further provide an electronic device, corresponding to the method of any of the embodiments, including a memory, a processor, and a computer program stored on the memory and capable of running on the processor, where the processor implements the subject term mining method of any of the embodiments when executing the program.

Fig. 3 shows a more specific hardware architecture of an electronic device according to this embodiment, where the device may include: a processor 310, a memory 320, an input/output interface 330, a communication interface 340, and a bus 350. Wherein the processor 310, the memory 320, the input/output interface 330 and the communication interface 340 are communicatively coupled to each other within the device via a bus 350.

The processor 310 may be implemented by a general-purpose CPU (Central Processing Unit), a microprocessor, an application specific integrated circuit (ASIC), or one or more integrated circuits, etc. for executing relevant programs to implement the technical solutions provided in the embodiments of the present disclosure.

The memory 320 may be implemented in the form of ROM (Read Only Memory), RAM (Random Access Memory), a static storage device, a dynamic storage device, or the like. The memory 320 may store an operating system and other application programs, and when the technical solutions provided by the embodiments of the present disclosure are implemented via software or firmware, the associated program code is stored in the memory 320 and invoked for execution by the processor 310.

The input/output interface 330 is used for connecting with an input/output module to realize information input and output. The input/output module may be configured as a component in a device (not shown) or may be external to the device to provide corresponding functionality. The input device may include a keyboard, a mouse, a touch screen, a microphone, various sensors, etc., and the output device may include a display, a speaker, a vibrator, an indicator light, etc.

The communication interface 340 is used to connect to a communication module (not shown in the figure) to enable communication interaction between the present device and other devices. The communication module may communicate in a wired manner (such as USB or a network cable) or in a wireless manner (such as a mobile network, Wi-Fi, or Bluetooth).

Bus 350 includes a path to transfer information between components of the device (e.g., processor 310, memory 320, input/output interface 330, and communication interface 340).

It should be noted that although the above device only shows the processor 310, the memory 320, the input/output interface 330, the communication interface 340, and the bus 350, in a specific implementation the device may further include other components necessary for normal operation. Furthermore, it will be understood by those skilled in the art that the above-described device may include only the components necessary to implement the embodiments of the present description, and not all the components shown in the drawings.

The electronic device of the foregoing embodiment is configured to implement the corresponding method in the foregoing embodiments and has the beneficial effects of the corresponding method embodiments, which are not repeated here.

Based on the same inventive concept, and corresponding to the method of any of the above embodiments, one or more embodiments of the present disclosure also provide a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the subject word mining method described in any of the above embodiments.

The non-transitory computer-readable storage media of the present embodiments include both permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of storage media for a computer include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read only memory (ROM), electrically erasable programmable read only memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device.

The storage medium of the foregoing embodiments stores computer instructions for causing a computer to execute the subject word mining method according to any one of the foregoing embodiments and has the beneficial effects of the corresponding method embodiments, which are not repeated here.

Those of ordinary skill in the art will appreciate that the discussion of any of the embodiments above is merely exemplary and is not intended to suggest that the scope of the disclosure, including the claims, is limited to these examples. Under the idea of the present disclosure, the technical features of the above embodiments or of different embodiments may also be combined, the steps may be implemented in any order, and many other variations of the different aspects of one or more embodiments of the present application exist as described above, which are not provided in detail for the sake of brevity.

Additionally, well-known power/ground connections to Integrated Circuit (IC) chips and other components may or may not be shown within the provided figures, in order to simplify the illustration and discussion, and so as not to obscure one or more embodiments of the present description. Furthermore, the apparatus may be shown in block diagram form in order to avoid obscuring the one or more embodiments of the present description, and also in view of the fact that specifics with respect to implementation of such block diagram apparatus are highly dependent upon the platform within which the one or more embodiments of the present description are to be implemented (i.e., such specifics should be well within purview of one skilled in the art). Where specific details (e.g., circuits) are set forth in order to describe example embodiments of the disclosure, it should be apparent to one skilled in the art that one or more embodiments of the disclosure can be practiced without, or with variation of, these specific details. Accordingly, the description is to be regarded as illustrative in nature and not as restrictive.

While the present disclosure has been described in conjunction with specific embodiments thereof, many alternatives, modifications, and variations of those embodiments will be apparent to those skilled in the art in light of the foregoing description. For example, other memory architectures (e.g., dynamic RAM (DRAM)) may use the embodiments discussed.

The present application is intended to embrace all such alternatives, modifications, and variations that fall within the broad scope of the appended claims. Accordingly, any omissions, modifications, equivalents, improvements, and the like that are within the spirit and principles of the one or more embodiments of the application are intended to be included within the scope of the present disclosure.

Claims

1. A subject word mining method, comprising:

acquiring text data;

screening the candidate word set based on an unsupervised algorithm and a predictive model to determine an importance result of the candidate word set; the importance result of the candidate word set comprises: a first importance $\mathrm{EMS}(W_i)$;

wherein $T_j$ represents a text segment obtained by segmenting the text data, and $r_i()$ represents an iterative function;

$r_i(T_j)$ represents an importance parameter, which can be expressed as

wherein $I()$ represents a selection function, $S$ represents a sentence generated by an arbitrary number of words, representing the sentence segmented in the dictionary $D$, $W_i$ represents a candidate word, $P()$ represents a probability function, $D$ represents a dictionary, representing dictionary probability parameters;

2. The subject word mining method of claim 1, wherein the acquiring text data specifically comprises:

acquiring an industry text corpus to be processed;

3. The subject word mining method of claim 1, wherein the filtering the text data based on a language model to determine a candidate word set comprises:

4. The subject word mining method of claim 3, wherein the determining the candidate word set from the candidate words based on a solidification degree and degree of freedom screening strategy specifically comprises:

wherein $W_i$ represents the candidate word, $W_i = C_1 C_2 \ldots C_n$, $C_1, C_2, \ldots, C_n$ represent characters, and $p()$ represents a probability function;

the degree of freedom $\mathrm{FD}(W_i)$ is expressed as

$\mathrm{FD}(W_i) = \min\{\mathrm{LE}(W_i), \mathrm{RE}(W_i)\}$

5. The subject word mining method of claim 1, wherein the importance result of the candidate word set further comprises: a second importance $\mathrm{LCS}(W_i)$;

6. The subject word mining method of claim 5, wherein the determining the subject word according to the importance result of the candidate word set comprises:

7. A subject word mining apparatus, comprising:

an acquisition module configured to acquire text data;

a screening module configured to screen the candidate word set based on an unsupervised algorithm and a predictive model to determine an importance result of the candidate word set; the importance result of the candidate word set comprises: a first importance $\mathrm{EMS}(W_i)$;

$r_i(T_j)$ represents an importance parameter, which can be expressed as

8. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the subject word mining method of any one of claims 1 to 6 when executing the program.

9. A non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the subject word mining method of any one of claims 1 to 6.