CN112784009A - Subject term mining method and device, electronic equipment and storage medium - Google Patents

Subject term mining method and device, electronic equipment and storage medium

Info

Publication number
CN112784009A
CN112784009A (application CN202011580178.XA)
Authority
CN
China
Prior art keywords
candidate
word
importance
text data
words
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011580178.XA
Other languages
Chinese (zh)
Other versions
CN112784009B (en)
Inventor
熊永平
曹滔宇
朱承治
谷纪亭
徐翀
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
State Grid Zhejiang Electric Power Co Ltd
Beijing University of Posts and Telecommunications
Original Assignee
State Grid Zhejiang Electric Power Co Ltd
Beijing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by State Grid Zhejiang Electric Power Co Ltd, Beijing University of Posts and Telecommunications filed Critical State Grid Zhejiang Electric Power Co Ltd
Priority to CN202011580178.XA priority Critical patent/CN112784009B/en
Publication of CN112784009A publication Critical patent/CN112784009A/en
Application granted granted Critical
Publication of CN112784009B publication Critical patent/CN112784009B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

One or more embodiments of the present application provide a topic word mining method, apparatus, electronic device, and storage medium, including: acquiring text data; filtering the text data based on a language model to determine a set of candidate words; screening the candidate word set based on an unsupervised algorithm and a prediction model to determine an importance result of the candidate word set; and determining the topic words according to the importance result of the candidate word set. The method uses the language model to filter out character sequences with low internal cohesion in the text data, reducing the influence of loosely joined characters on topic word mining; it uses the degree of freedom of a word in the text data to reflect the uncertainty of the characters adjacent to it on the left and right, finding words that can be used freely and independently and narrowing the mining range. The complex structure of Chinese text corpora is fully considered, and through layer-by-layer screening the method identifies the topic words of the text data while also mining, by importance ranking, potential topic words formed from emerging professional vocabulary.

Description

Subject term mining method and device, electronic equipment and storage medium
Technical Field
One or more embodiments of the present application relate to the field of data mining technologies, and in particular, to a topic word mining method and apparatus, an electronic device, and a storage medium.
Background
In the prior art, duplicate checking of project texts involves a large volume of text and high text granularity, so rapidly retrieving similar documents from a document database has become the primary problem in improving the accuracy and efficiency of duplicate checking. Since scientific and technological projects and research documents usually revolve around several keywords, and these keywords reflect the gist of the text to a certain extent, the similarity between texts can be checked simply by finding and comparing the industry topic words found in each text.
Chinese text has a more complex organization structure than English text. In the prior art, topic words in Chinese text are generally mined by directly applying methods designed for English text or are extracted manually, which yields low accuracy and cannot accurately mine the potential topic words formed from emerging professional vocabulary in Chinese text.
Disclosure of Invention
In view of the above, one or more embodiments of the present application are directed to a topic word mining method, apparatus, electronic device, and storage medium, so as to solve at least one of the above problems in the prior art.
In view of the above, one or more embodiments of the present application provide a topic word mining method, including:
acquiring text data;
filtering the text data based on a language model to determine a set of candidate words;
screening the candidate word set based on an unsupervised algorithm and a prediction model to determine an importance result of the candidate word set;
and determining the subject word according to the importance result of the candidate word set.
Optionally, the acquiring text data specifically includes:
acquiring an industry text corpus to be processed;
preprocessing the industry text corpus to obtain the text data; the preprocessing operation comprises: deleting redundant characters, determining text granularity and performing line division processing.
Optionally, the filtering the text data based on the language model to determine a candidate word set specifically includes:
determining word length and word frequency of words in the text data according to the text data based on the language model;
selecting the vocabulary in the text data with the word length not greater than a word length threshold and the word frequency not less than a word frequency threshold by using a data mining strategy to determine candidate words;
and determining the candidate word set from the candidate words based on a solidity and degree-of-freedom screening strategy.
Optionally, the determining the candidate word set from the candidate words based on the solidity and degree-of-freedom screening strategy specifically includes:
determining a solidity SD(Wi) and a degree of freedom FD(Wi) of the candidate word; the solidity SD(Wi) of the candidate word is expressed as
SD(Wi) = min over 1 ≤ k < n of p(C1C2...Cn) / ( p(C1...Ck) · p(Ck+1...Cn) )
wherein Wi represents the candidate word, Wi = C1C2...Cn, C1C2...Cn represent characters, and p() represents a probability function;
the degree of freedom FD(Wi) of the candidate word is expressed as
FD(Wi) = min{ LE(Wi), RE(Wi) }
wherein LE(Wi) represents the left-adjacent entropy of the candidate word and RE(Wi) represents the right-adjacent entropy of the candidate word;
and selecting, based on the solidity and degree-of-freedom screening strategy, the candidate words whose solidity SD(Wi) is not less than a solidity threshold and whose degree of freedom FD(Wi) is not less than a degree-of-freedom threshold, to determine the candidate word set.
Optionally, the importance result of the candidate word set includes a first importance EMS(Wi);
the screening of the candidate word set based on the unsupervised algorithm and the prediction model to determine the importance result of the candidate word set specifically includes:
determining, according to the unsupervised algorithm, the first importance EMS(Wi) of each candidate word in the candidate word set with respect to the text data; the first importance EMS(Wi) is expressed as
[formula given only as an image in the original publication]
wherein Tj represents a text segment obtained after segmenting the text data and ri() represents an iteration function.
Optionally, the importance result of the candidate word set further includes a second importance LCS(Wi);
the screening of the candidate word set based on the unsupervised algorithm and the prediction model to determine the importance result of the candidate word set specifically includes:
determining, according to the prediction model, the second importance LCS(Wi) of each candidate word in the candidate word set with respect to the text data.
Optionally, the determining a subject word according to the result of the importance of the candidate word set specifically includes:
determining the importance scores of the candidate words according to the importance results of the candidate word set;
and arranging the importance scores of the candidate words in descending order, selecting a preset number of candidate words in order starting from the candidate word with the largest importance score, and determining the selected candidate words as the topic words.
Based on the same inventive concept, one or more embodiments of the present application further provide a topic word mining apparatus, including:
an acquisition module configured to acquire text data;
a filtering module configured to filter the textual data based on a language model to determine a set of candidate words;
a screening module configured to screen the set of candidate words based on an unsupervised algorithm and a predictive model to determine an importance result of the set of candidate words;
and the determining module is configured to determine the subject word according to the importance result of the candidate word set.
Based on the same inventive concept, one or more embodiments of the present application further provide an electronic device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and when the processor executes the computer program, the processor implements the topic word mining method described in any one of the above items.
Based on the same inventive concept, one or more embodiments of the present application further propose a non-transitory computer-readable storage medium storing computer instructions for causing the computer to execute the topic word mining method described in any one of the above.
As can be seen from the above description, one or more embodiments of the present application provide a topic word mining method, including: acquiring text data; filtering the text data based on a language model to determine a set of candidate words; screening the candidate word set based on an unsupervised algorithm and a prediction model to determine an importance result of the candidate word set; and determining the topic words according to the importance result of the candidate word set. The method obtains the text data by preprocessing the acquired text corpus to be processed, cleaning it and reducing the influence of redundant characters on the topic words in the text data. It filters the text data through the language model to determine a candidate word set, removing character sequences with low internal cohesion and reducing the influence of loosely joined characters on topic word mining; the degree of freedom of a word in the text data reflects the uncertainty of its left and right adjacent characters, which is used to find words that can be used freely and independently and to narrow the mining range. The candidate word set is then further screened by an unsupervised algorithm and a supervised prediction model, and the topic words are mined by ranking the importance of each candidate word in the set. The complex structure of Chinese text corpora is fully considered; through layer-by-layer screening, the method identifies the topic words of the text data while also mining, by importance ranking, potential topic words formed from emerging professional vocabulary.
Drawings
In order to more clearly illustrate one or more embodiments or prior art solutions in the present application, the drawings that are needed in the description of the embodiments or prior art will be briefly described below, and it is obvious that the drawings in the description below are only one or more embodiments in the present application, and that other drawings can be obtained by those skilled in the art without inventive effort from these drawings.
FIG. 1 is a flow diagram of a topic word mining method in one or more embodiments of the application;
fig. 2 is a schematic structural diagram of a topic word mining apparatus according to one or more embodiments of the present application;
fig. 3 is a schematic structural diagram of an electronic device in one or more embodiments of the present application.
Detailed Description
For the purpose of promoting a better understanding of the objects, aspects and advantages of the present disclosure, reference is made to the following detailed description taken in conjunction with the accompanying drawings.
It is to be noted that unless otherwise defined, technical or scientific terms used in one or more embodiments of the present application shall have the ordinary meaning as understood by one of ordinary skill in the art to which this disclosure belongs. The use of "first," "second," and similar terms in one or more embodiments of the present application does not denote any order, quantity, or importance, but rather the terms are used to distinguish one element from another. The word "comprising" or "comprises", and the like, means that the element or item listed before the word covers the element or item listed after the word and its equivalents, but does not exclude other elements or items. The terms "connected" or "coupled" and the like are not restricted to physical or mechanical connections, but may include electrical connections, whether direct or indirect. "upper", "lower", "left", "right", and the like are used merely to indicate relative positional relationships, and when the absolute position of the object being described is changed, the relative positional relationships may also be changed accordingly.
As described in the background section, in the prior art, as texts in professional fields become increasingly rich, the number of documents in document databases grows exponentially, and in order to direct scientific research funding toward research projects with high value density, a multi-stage review process needs to be set up when scientific research projects are established.
The applicant has found through research that, in the prior art, the above manual duplicate-checking approach places high demands on the professional skill of the staff and suffers from low query efficiency, missed duplicates, and errors. In addition, the duplicate checking of scientific research projects currently faces the challenges of a larger text corpus volume and higher granularity of project text corpora. Therefore, a method is needed that can improve the accuracy and efficiency of duplicate checking for scientific and technological projects and research documents, that is, one that can quickly retrieve similar documents from a document database. The applicant has found that scientific research projects and research documents usually center on several keywords that reflect, to a certain extent, the gist of the text, so for industry duplicate checking the similarity between texts can be checked simply by finding and comparing the industry topic words in each text. Moreover, unlike English text, Chinese text has a more complex organization structure, so Chinese documents are more difficult to process than English documents, and prior-art methods based on manual duplicate searching or on topic word mining methods designed for English text suffer from low accuracy and cannot accurately mine the potential topic words formed from emerging professional vocabulary in Chinese text.
Therefore, the method provided by the application obtains the text data by preprocessing the acquired text corpus to be processed, cleaning it and reducing the influence of redundant characters on the topic words in the text data. It filters the text data through the language model to determine a candidate word set, removing character sequences with low internal cohesion and reducing the influence of loosely joined characters on topic word mining; the degree of freedom of a word in the text data reflects the uncertainty of its left and right adjacent characters, which is used to find words that can be used freely and independently and to narrow the mining range. The candidate word set is then further screened by an unsupervised algorithm and a supervised prediction model, and the topic words are mined by ranking the importance of each candidate word in the set. The complex structure of Chinese text corpora is fully considered; through layer-by-layer screening, the method identifies the topic words of the text data while also mining, by importance ranking, potential topic words formed from emerging professional vocabulary, improving the accuracy and efficiency of topic word mining and extraction.
Hereinafter, the technical means of the present disclosure will be described in further detail with reference to specific examples.
Referring to fig. 1, a topic word mining method provided in one or more embodiments of the present application specifically includes the following steps:
s101: text data is acquired.
In this embodiment, the text data on which topic word mining is to be performed is obtained. Specifically, the text data is obtained by acquiring an industry text corpus to be processed, performing a preprocessing operation on it, and cleaning it. The industry text corpus to be processed may be text-type data such as electronic books, news documents, research papers, digital libraries, web pages, e-mails, and database records.
In some optional embodiments, the preprocessing operation may include: deleting redundant characters, determining text granularity, line-division processing, Chinese word segmentation, deleting format marks, and so on. For example, for web page text data, the web page tags need to be removed to obtain plain text.
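To make the step concrete, the following is a minimal Python sketch of such a preprocessing pass; the regular expressions, the sentence-level granularity, and the function name are illustrative assumptions, not the patent's own code.

    import re

    def preprocess(raw_corpus):
        """Clean an industry text corpus and split it into text segments."""
        text = re.sub(r"<[^>]+>", "", raw_corpus)    # delete markup remnants
        text = re.sub(r"[ \t\u3000]+", "", text)     # delete redundant whitespace
        # Determine granularity: here one segment per sentence, splitting on
        # Chinese sentence-final punctuation (line-division processing).
        segments = re.split(r"[。！？；\n]+", text)
        return [seg for seg in segments if seg]      # drop empty segments

    segments = preprocess("主题词挖掘方法。<b>基于语言模型</b>过滤文本数据！")
    print(segments)  # ['主题词挖掘方法', '基于语言模型过滤文本数据']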
S102: filtering the text data based on a language model to determine a set of candidate words.
In this embodiment, after the text data is obtained, because the vocabulary in the text data is huge, the text data may be filtered based on the language model to determine the candidate word set, whose members represent the vocabulary of the text data. Specifically, the word length and word frequency of words in the text data can be determined based on the language model; a data mining strategy is then used to select the words in the text data whose word length is not greater than a word length threshold and whose word frequency is not less than a word frequency threshold as candidate words; after the candidate words are determined, the candidate word set is determined from them based on a solidity and degree-of-freedom screening strategy.
In some alternative embodiments, an N-gram model may be selected as the language model, where N is in the range [2, 10]. After the industry text corpus to be processed has been preprocessed, the granularity of the text data can be determined and the text data segmented into several text segments, where a text segment may be a sentence, a paragraph, or even a whole document; the text data is arranged into a set of text segments, and punctuation marks in the text segments are filtered out.
In some alternative embodiments, for each vocabulary item in the text data, the frequency F(Wi) with which it appears in the text data and its word length are calculated; a word length threshold TL of 10 and a word frequency threshold TF of 3 may be set in advance. According to the granularity of the text segments, a data mining strategy (such as an Apriori strategy) is used to select the words in the text data whose word length is not greater than the word length threshold and whose word frequency is not less than the word frequency threshold, and the words meeting these conditions are determined as candidate words. An overcomplete dictionary may be composed from the candidate words, and the usage of each candidate word is initialized with its normalized word frequency (e.g., in the interval [0, 1]).
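A minimal sketch of this candidate-generation step, assuming the thresholds quoted above (word length TL = 10, word frequency TF = 3); single characters are also counted so that the later solidity computation has their probabilities available, and the function name is an assumption.

    from collections import Counter

    def candidate_ngrams(segments, max_len=10, min_freq=3):
        """Count n-grams (1 <= n <= max_len) and keep frequent ones.

        Returns an overcomplete dictionary mapping each kept n-gram to its
        normalized word frequency in [0, 1]."""
        counts = Counter()
        for seg in segments:
            for n in range(1, max_len + 1):
                for i in range(len(seg) - n + 1):
                    counts[seg[i:i + n]] += 1
        total = sum(counts.values())
        return {w: c / total for w, c in counts.items() if c >= min_freq}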
In some alternative embodiments, after the candidate words are determined, a candidate word set may be determined from them based on a solidity and degree-of-freedom screening strategy. Since each candidate word may be composed of several characters, a candidate word may be represented as Wi = C1C2...Cn-1Cn, where C1C2...Cn represent characters. For any candidate word Wi, internally, its degree of cohesion can be analyzed, i.e., whether its characters are joined together tightly enough; the higher the cohesion, the more likely the candidate word is an independent word. Externally, it can be analyzed whether the candidate word can be used independently and freely across the whole text data, i.e., whether the uncertainty of the characters adjacent to it on the left and right is large (large entropy); the larger the entropy value, the more freely and independently the candidate word can be used.
It should be noted that determining the candidate word set from the candidate words based on the solidity and degree-of-freedom screening strategy may specifically include: determining a solidity SD(Wi) and a degree of freedom FD(Wi) of each candidate word, where the solidity SD(Wi) of a candidate word is expressed as

SD(Wi) = min over 1 ≤ k < n of p(C1C2...Cn) / ( p(C1...Ck) · p(Ck+1...Cn) )

where p() represents a probability function. When determining the degree of freedom of a candidate word, its left-adjacent entropy and right-adjacent entropy may be determined first. Specifically, when calculating the left-adjacent entropy, the set of all left-adjacent characters of a candidate word is defined as

Sleft(Wi) = { C0,i , i = 1, 2, ..., m }

where C0,i represents a left-adjacent character. Further, the probability of occurrence of each left-adjacent character may be expressed as

p(C0,i | Wi) = count(C0,i Wi) / count(Wi)

where count() represents a counting function. Combining the above parameters, the left-adjacent entropy of any candidate word can be expressed as

LE(Wi) = − Σ (i = 1 to m) p(C0,i | Wi) · log p(C0,i | Wi)

Similarly, when calculating the right-adjacent entropy, the set of all right-adjacent characters of a candidate word is defined as

Sright(Wi) = { Cn+1,i , i = 1, 2, ..., m }

where Cn+1,i represents a right-adjacent character. Thus, the right-adjacent entropy of any candidate word can be expressed as

RE(Wi) = − Σ (i = 1 to m) p(Cn+1,i | Wi) · log p(Cn+1,i | Wi)

After the solidity and degree of freedom of the candidate words are determined, the candidate words whose solidity is not less than a preset solidity threshold and whose degree of freedom is not less than a preset degree-of-freedom threshold are selected based on the solidity and degree-of-freedom screening strategy, yielding the candidate word set. The preset solidity threshold may be 5.3 and the preset degree-of-freedom threshold 0.75; in practice, both thresholds may be adjusted dynamically according to the corpus.
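The screen described above can be sketched as follows. It assumes the reconstructed ratio form of the solidity (minimum over binary splits, without a log scale, which the image-only formula in the source does not confirm) and the thresholds 5.3 and 0.75 quoted in the text; the dictionary `p` is the normalized-frequency dictionary from the previous step.

    import math
    from collections import Counter

    def solidity(word, p):
        """SD(Wi): minimum over binary splits of p(word) / (p(left) * p(right))."""
        ratios = [p[word] / (p[word[:k]] * p[word[k:]])
                  for k in range(1, len(word))
                  if word[:k] in p and word[k:] in p]
        return min(ratios) if ratios else 0.0

    def entropy(counter):
        total = sum(counter.values())
        return -sum((c / total) * math.log(c / total) for c in counter.values())

    def freedom(word, segments):
        """FD(Wi) = min{LE(Wi), RE(Wi)} over left/right neighboring characters."""
        left, right = Counter(), Counter()
        for seg in segments:
            pos = seg.find(word)
            while pos != -1:
                if pos > 0:
                    left[seg[pos - 1]] += 1
                end = pos + len(word)
                if end < len(seg):
                    right[seg[end]] += 1
                pos = seg.find(word, pos + 1)
        if not left or not right:
            return 0.0
        return min(entropy(left), entropy(right))

    def screen(p, segments, sd_threshold=5.3, fd_threshold=0.75):
        """Keep multi-character candidates meeting both thresholds."""
        return {w for w in p
                if len(w) > 1
                and solidity(w, p) >= sd_threshold
                and freedom(w, segments) >= fd_threshold}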
S103: screening the set of candidate words based on an unsupervised algorithm and a predictive model to determine an importance result for the set of candidate words.
In this embodiment, after the candidate word set is determined, it may be screened based on an unsupervised algorithm combined with a prediction model obtained by supervised training, so as to obtain the importance result of the candidate word set. Before the unsupervised algorithm is used to screen the candidate word set, the following parameters may be defined: assuming that a sentence is composed of a set of words and each word is composed of a set of characters, the dictionary over the text data is defined as

D = { W1, W2, ..., WN }

Suppose that each sentence in the text data is constructed by concatenating a set of candidate words randomly sampled from the dictionary D, where word Wi is sampled with probability θi. The dictionary probability parameter is defined as

θ = ( θ1, θ2, ..., θN )

A sentence generated from K vocabulary items can be expressed as

S = Wi1 Wi2 ... WiK

and the probability of generating this sentence is expressed as

P(S | θ) = ∏ (k = 1 to K) θik

where P() represents a probability function built from the individual sampling probabilities. For a given unsegmented text segment T, define CT as the set of all sentences obtained by segmenting T according to the dictionary D; the probability of the unsegmented text segment T can then be expressed as

P(T | θ) = Σ (S ∈ CT) P(S | θ)
It should be noted that the importance result of the candidate word set includes a first importance EMS(Wi). Screening the candidate word set based on the unsupervised algorithm specifically includes: determining, according to the unsupervised algorithm (also called the EMwords algorithm), the first importance EMS(Wi) of each candidate word in the candidate word set with respect to the text data. To determine the first importance EMS(Wi), define θr as the parameter estimate of the r-th iteration round; the expectation step (E-step) of the unsupervised algorithm then determines an iterative function, expressed as

Q(θ, θr) = Σj Σ (S ∈ CTj) P(S | Tj, θr) · log P(S | θ)

With the iterative function Q(θ, θr) obtained, the maximization step (M-step) of the unsupervised algorithm updates θr using the following expression:

θi(r+1) = Σj ni(Tj) / ( Σi′ Σj ni′(Tj) )

where ni(S) represents the number of occurrences of the candidate word in sentence S, and ni(Tj), the expected count of the candidate word in text segment Tj, can be expressed as

ni(Tj) = Σ (S ∈ CTj) P(S | Tj, θr) · ni(S)

The sum of these expected counts over all text segments can be expressed as

Σj ni(Tj)

Integrating the above parameters, the first importance EMS(Wi) is expressed as

[formula given only as an image in the original publication]

where ri() represents an iteration function and ri(Tj) represents an importance parameter, which can be expressed as

[formula given only as an image in the original publication]

where I() denotes an indicator function that is 1 when the expression in parentheses is true and 0 otherwise, and θ̂ represents the optimal iteration parameters. Note that ri(Tj) is an importance parameter, and the first importance EMS(Wi) is defined by a negative mapping of it while being limited to the interval [0, 1]. For example, a word with high importance in a sentence usually appears frequently, but the importance parameter ri(Tj) takes smaller values for high-frequency words, i.e., the value computed by ri(Tj) is inversely related to the importance of the word; therefore a logarithmic function can be used to limit the interval while the negative mapping is performed.
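A compact sketch of one E-step/M-step round under the model above. Enumerating the segmentation set CT is exponential, so this is only workable for short text segments, and since the exact EMS(Wi) and ri(Tj) formulas are images in the source, the sketch stops at the expected counts ni(Tj) and the updated θ.

    import math
    from collections import defaultdict

    def segmentations(text, theta, max_len=10):
        """Enumerate every segmentation of `text` into dictionary words
        (falling back to single characters), i.e. the set C_T."""
        if not text:
            yield []
            return
        for n in range(1, min(max_len, len(text)) + 1):
            w = text[:n]
            if w in theta or n == 1:
                for rest in segmentations(text[n:], theta, max_len):
                    yield [w] + rest

    def em_round(segments, theta):
        """One E-step/M-step pass: expected word counts, then renormalized theta."""
        expected = defaultdict(float)
        for T in segments:
            segs = list(segmentations(T, theta))
            probs = [math.prod(theta.get(w, 1e-8) for w in S) for S in segs]
            Z = sum(probs)
            if Z == 0:
                continue
            for S, pr in zip(segs, probs):   # E-step: P(S | T, theta_r) * n_i(S)
                for w in S:
                    expected[w] += pr / Z
        total = sum(expected.values())
        return {w: c / total for w, c in expected.items()}  # M-step update

    # theta can be initialized with the candidates' normalized frequencies and
    # refined iteratively, e.g.: for _ in range(5): theta = em_round(segments, theta)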
In some optional embodiments, using the unsupervised algorithm for topic word discovery has the advantages of a solid theoretical basis, no need for labeled data, and no dependence on a knowledge base; after the candidate word set is determined from the text data and screened by importance, the preliminary screening result has a high lower bound. A well-trained supervised prediction model can then be used for further screening. For text data in different fields, publicly available, high-quality expert annotations of domain vocabulary can be collected and used as a knowledge base or training set for the supervised model, improving the prediction model's effect. For example, in the training data preparation stage, relevant authoritative domain dictionaries and journal paper keywords were collected to obtain 110 domain professional words, and over 1.2 million general words were obtained from the 30 GB Sogou news corpus. The domain words and general words are labeled separately: B-COMMON and I-COMMON denote general-vocabulary labels, B-ELEC and I-ELEC denote domain-vocabulary labels, and the prefixes B and I denote the start mark and the internal mark of a vocabulary label, respectively. The words in the existing professional lexicon and general lexicon are then split into single characters, preserving word structure, and fed into a Word2vec network for training, producing character vectors of dimension m.
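A sketch of the character-vector pretraining just described, assuming the gensim Word2Vec API; the toy lexicons and the dimension m are placeholders, not the patent's training data.

    from gensim.models import Word2Vec

    # Toy stand-ins for the labeled domain / general lexicons described above.
    domain_words = ["变压器", "断路器"]
    common_words = ["我们", "今天"]
    m = 100  # character-vector dimension (placeholder)

    # Each lexicon word is split into single characters, preserving word structure.
    char_sentences = [list(word) for word in domain_words + common_words]

    w2v = Word2Vec(sentences=char_sentences, vector_size=m, window=5,
                   min_count=1, sg=1)  # skip-gram character embeddings
    char_vec = w2v.wv["变"]            # m-dimensional vector for one character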
In some optional embodiments, the importance result of the candidate word set further includes a second importance LCS(Wi). Screening the candidate word set based on the unsupervised algorithm and the prediction model to determine the importance result of the candidate word set further includes: determining, according to the prediction model, the second importance LCS(Wi) of each candidate word in the candidate word set with respect to the text data. Specifically, the prediction model for topic word discovery may adopt a character-vector-based BiLSTM + CRF structure; compared with the traditional BiLSTM model, a CRF layer is added to learn an optimal path. The output vector dimension of the BiLSTM layer is defined as tag-size, which amounts to mapping each candidate word wi to a label tagj. Let P be the output matrix of the BiLSTM layer; then Pi,j represents the non-normalized probability of mapping candidate word wi to tagj. For the CRF layer, assume there is a transition matrix A, where Ai,j represents the transition probability from tagi to tagj. For the tag sequence y output for an input sequence X, the score is defined as

score(X, y) = Σ (i = 0 to n) A(yi, yi+1) + Σ (i = 1 to n) P(i, yi)

where A(yi, yi+1) denotes the transition probability from the i-th label to the (i+1)-th label, and P(i, yi) denotes the non-normalized probability of mapping candidate word wi to the i-th label.

The score of each tag sequence y is normalized with a Softmax function to obtain a probability value, i.e., the likelihood, expressed as

p(y | X) = exp( score(X, y) ) / Σ (y′ ∈ YX) exp( score(X, y′) )

where YX represents the set of all tag sequences y corresponding to the input sequence X. Thus, in training it is only necessary to maximize the likelihood p(y | X); the log-likelihood is used here, i.e.

log p(y | X) = score(X, y) − log Σ (y′ ∈ YX) exp( score(X, y′) )

Finally, the trained prediction model can be used to decode the data set composed of the test corpus, obtaining a probability value for each candidate word; this probability value can be recorded as the second importance LCS(Wi) of the corresponding candidate word.
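The path score and log-likelihood reconstructed above can be sketched as follows (PyTorch assumed; real BiLSTM + CRF layers also add start/stop tags, which this sketch omits).

    import torch

    def path_score(P, A, y):
        """P: (seq_len, tag_size) BiLSTM emissions; A: (tag_size, tag_size)
        transitions; y: (seq_len,) gold tag indices. Returns score(X, y)."""
        emit = P[torch.arange(len(y)), y].sum()   # sum of P(i, y_i)
        trans = A[y[:-1], y[1:]].sum()            # sum of A(y_i, y_{i+1})
        return emit + trans

    def log_likelihood(P, A, y):
        """log p(y|X) = score(X, y) - log-sum-exp over all tag sequences,
        computed with the forward algorithm in O(seq_len * tag_size^2)."""
        alpha = P[0]
        for t in range(1, P.size(0)):
            alpha = torch.logsumexp(alpha.unsqueeze(1) + A, dim=0) + P[t]
        log_Z = torch.logsumexp(alpha, dim=0)
        return path_score(P, A, y) - log_Z

    P = torch.randn(6, 4); A = torch.randn(4, 4); y = torch.randint(0, 4, (6,))
    loss = -log_likelihood(P, A, y)   # minimized during training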
S104: and determining the subject word according to the importance result of the candidate word set.
In this embodiment, after the importance result of the candidate word set is determined, the topic words may be determined according to the importance result by combining the first importance EMS(Wi) and the second importance LCS(Wi). Specifically, for the candidate word set screened by the unsupervised algorithm and the prediction model, an importance score is calculated for each candidate word Wi in the set, i.e., the importance score is determined according to the importance result of the candidate word set and is expressed as

S(Wi) = (1 − μ) · EMS(Wi) + μ · LCS(Wi)

where μ denotes a weighting coefficient; for example, μ takes the value 0.3. The importance scores of the candidate words are arranged in descending order, a preset number (for example, top-N) of candidate words are selected in order starting from the candidate word with the largest importance score, and the selected candidate words are determined as the topic words, completing the topic word mining.
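A sketch of this final ranking step, with the weight μ = 0.3 quoted above; the top-N count is a placeholder.

    def rank_topic_words(ems, lcs, mu=0.3, top_n=20):
        """ems, lcs: dicts mapping candidate word -> EMS(Wi), LCS(Wi)."""
        scores = {w: (1 - mu) * ems[w] + mu * lcs.get(w, 0.0) for w in ems}
        # Sort importance scores from large to small and keep the top-N words.
        return sorted(scores, key=scores.get, reverse=True)[:top_n]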
As can be seen from the above description, one or more embodiments of the present application provide a topic word mining method, including: acquiring text data; filtering the text data based on a language model to determine a set of candidate words; screening the candidate word set based on an unsupervised algorithm and a prediction model to determine an importance result of the candidate word set; and determining the topic words according to the importance result of the candidate word set. The method obtains the text data by preprocessing the acquired text corpus to be processed, cleaning it and reducing the influence of redundant characters on the topic words in the text data. It filters the text data through the language model to determine a candidate word set, removing character sequences with low internal cohesion and reducing the influence of loosely joined characters on topic word mining; the degree of freedom of a word in the text data reflects the uncertainty of its left and right adjacent characters, which is used to find words that can be used freely and independently and to narrow the mining range. The candidate word set is then further screened by an unsupervised algorithm and a supervised prediction model, and the topic words are mined by ranking the importance of each candidate word in the set. The complex structure of Chinese text corpora is fully considered; through layer-by-layer screening, the method identifies the topic words of the text data while also mining, by importance ranking, potential topic words formed from emerging professional vocabulary, improving the accuracy and efficiency of topic word mining and extraction.
It is to be appreciated that the method can be performed by any apparatus, device, platform, cluster of devices having computing and processing capabilities.
It should be noted that the method of one or more embodiments of the present disclosure may be performed by a single device, such as a computer or server. The method of the embodiment can also be applied to a distributed scene and completed by the mutual cooperation of a plurality of devices. In such a distributed scenario, one of the devices may perform only one or more steps of the method of one or more embodiments of the present disclosure, and the devices may interact with each other to complete the method.
It should be noted that the above description describes certain embodiments of the present disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.
Based on the same inventive concept, one or more embodiments of the present application further provide a topic word mining apparatus, which, with reference to fig. 2, includes:
an acquisition module configured to acquire text data;
a filtering module configured to filter the textual data based on a language model to determine a set of candidate words;
a screening module configured to screen the set of candidate words based on an unsupervised algorithm and a predictive model to determine an importance result of the set of candidate words;
and the determining module is configured to determine the subject word according to the importance result of the candidate word set.
In some optional embodiments, the acquiring text data specifically includes:
acquiring a text corpus of an industry to be processed;
preprocessing the industry text corpus to obtain the text data; the preprocessing operation comprises: deleting redundant characters, determining text granularity and performing line division processing.
In some optional embodiments, the filtering the text data based on the language model to determine the candidate word set specifically includes:
determining word length and word frequency of words in the text data according to the text data based on the language model;
selecting the vocabulary in the text data with the word length not greater than a word length threshold and the word frequency not less than a word frequency threshold by using a data mining strategy to determine candidate words;
and determining the candidate word set from the candidate words based on a solidity and degree-of-freedom screening strategy.
In some optional embodiments, the determining the candidate word set from the candidate words based on the solidity and degree-of-freedom screening strategy specifically includes:
determining a solidity SD(Wi) and a degree of freedom FD(Wi) of the candidate word; the solidity SD(Wi) of the candidate word is expressed as
SD(Wi) = min over 1 ≤ k < n of p(C1C2...Cn) / ( p(C1...Ck) · p(Ck+1...Cn) )
wherein Wi represents the candidate word, Wi = C1C2...Cn, C1C2...Cn represent characters, and p() represents a probability function;
the degree of freedom FD(Wi) of the candidate word is expressed as
FD(Wi) = min{ LE(Wi), RE(Wi) }
wherein LE(Wi) represents the left-adjacent entropy of the candidate word and RE(Wi) represents the right-adjacent entropy of the candidate word;
and selecting, based on the solidity and degree-of-freedom screening strategy, the candidate words whose solidity SD(Wi) is not less than a solidity threshold and whose degree of freedom FD(Wi) is not less than a degree-of-freedom threshold, to determine the candidate word set.
In some optional embodiments, the importance result of the candidate word set includes a first importance EMS(Wi);
the screening of the candidate word set based on the unsupervised algorithm and the prediction model to determine the importance result of the candidate word set specifically includes:
determining, according to the unsupervised algorithm, the first importance EMS(Wi) of each candidate word in the candidate word set with respect to the text data; the first importance EMS(Wi) is expressed as
[formula given only as an image in the original publication]
wherein Tj represents a text segment obtained after segmenting the text data and ri() represents an iteration function.
In some optional embodiments, the importance result of the candidate word set further includes a second importance LCS(Wi);
the screening of the candidate word set based on the unsupervised algorithm and the prediction model to determine the importance result of the candidate word set specifically includes:
determining, according to the prediction model, the second importance LCS(Wi) of each candidate word in the candidate word set with respect to the text data.
In some optional embodiments, the determining a subject word according to the result of the importance of the candidate word set specifically includes:
determining the importance scores of the candidate words according to the importance result of the candidate word set;
and arranging the importance scores of the candidate words in descending order, selecting a preset number of candidate words in order starting from the candidate word with the largest importance score, and determining the selected candidate words as the topic words.
For convenience of description, the above devices are described as being divided into various modules by functions, and are described separately. Of course, the functionality of the modules may be implemented in the same one or more software and/or hardware implementations in implementing one or more embodiments of the present description.
The apparatus of the foregoing embodiment is used to implement the corresponding method in the foregoing embodiment, and has the beneficial effects of the corresponding method embodiment, which are not described herein again.
Based on the same inventive concept, corresponding to any of the above embodiments, one or more embodiments of the present specification further provide an electronic device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and when the processor executes the computer program, the processor implements the topic word mining method according to any of the above embodiments.
Fig. 3 is a schematic diagram illustrating a more specific hardware structure of an electronic device according to this embodiment, where the electronic device may include: a processor 310, a memory 320, an input/output interface 330, a communication interface 340, and a bus 350. Wherein the processor 310, memory 320, input/output interface 330, and communication interface 340 are communicatively coupled to each other within the device via bus 350.
The processor 310 may be implemented by a general-purpose CPU (Central Processing Unit), a microprocessor, an Application Specific Integrated Circuit (ASIC), or one or more Integrated circuits, and is configured to execute related programs to implement the technical solutions provided in the embodiments of the present specification.
The Memory 320 may be implemented in the form of a ROM (Read Only Memory), a RAM (Random Access Memory), a static storage device, a dynamic storage device, or the like. The memory 320 may store an operating system and other application programs, and when the technical solution provided by the embodiments of the present specification is implemented by software or firmware, the relevant program codes are stored in the memory 320 and called to be executed by the processor 310.
The input/output interface 330 is used for connecting an input/output module to realize information input and output. The i/o module may be configured as a component in a device (not shown) or may be external to the device to provide a corresponding function. The input device may include a keyboard, a mouse, a touch screen, a microphone, various sensors, etc., and the output device may include a display, a speaker, a vibrator, an indicator light, etc.
The communication interface 340 is used for connecting a communication module (not shown in the figure) to implement communication interaction between the present device and other devices. The communication module can realize communication in a wired mode (such as USB, network cable and the like) and also can realize communication in a wireless mode (such as mobile network, WIFI, Bluetooth and the like).
Bus 350 includes a path that transfers information between the various components of the device, such as processor 310, memory 320, input/output interface 330, and communication interface 340.
It should be noted that although the above-mentioned device only shows the processor 310, the memory 320, the input/output interface 330, the communication interface 340 and the bus 350, in a specific implementation, the device may also include other components necessary for normal operation. In addition, those skilled in the art will appreciate that the above-described apparatus may also include only those components necessary to implement the embodiments of the present description, and not necessarily all of the components shown in the figures.
The electronic device of the foregoing embodiment is used to implement the corresponding method in the foregoing embodiment, and has the beneficial effects of the corresponding method embodiment, which are not described herein again.
Based on the same inventive concept, corresponding to any of the above-described embodiment methods, one or more embodiments of the present specification further provide a non-transitory computer-readable storage medium storing computer instructions for causing the computer to perform the subject word mining method according to any of the above-described embodiment.
The non-transitory computer-readable storage medium of the present embodiments, including permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device.
The computer instructions stored in the storage medium of the above embodiment are used to enable the computer to execute the topic word mining method according to any one of the above embodiments, and have the beneficial effects of the corresponding method embodiment, which are not described herein again.
Those of ordinary skill in the art will understand that: the discussion of any embodiment above is meant to be exemplary only, and is not intended to intimate that the scope of the disclosure, including the claims, is limited to these examples; within the spirit of the present disclosure, features from the above embodiments or from different embodiments may also be combined, steps may be implemented in any order, and there are many other variations of different aspects of one or more embodiments in this application as described above, which are not provided in detail for the sake of brevity.
In addition, well-known power/ground connections to Integrated Circuit (IC) chips and other components may or may not be shown in the provided figures, for simplicity of illustration and discussion, and so as not to obscure one or more embodiments of the disclosure. Furthermore, devices may be shown in block diagram form in order to avoid obscuring the understanding of one or more embodiments of the present description, and this also takes into account the fact that specifics with respect to implementation of such block diagram devices are highly dependent upon the platform within which the one or more embodiments of the present description are to be implemented (i.e., specifics should be well within purview of one skilled in the art). Where specific details (e.g., circuits) are set forth in order to describe example embodiments of the disclosure, it should be apparent to one skilled in the art that one or more embodiments of the disclosure can be practiced without, or with variation of, these specific details. Accordingly, the description is to be regarded as illustrative instead of restrictive.
While the present disclosure has been described in conjunction with specific embodiments thereof, many alternatives, modifications, and variations of these embodiments will be apparent to those of ordinary skill in the art in light of the foregoing description. For example, other memory architectures (e.g., dynamic ram (dram)) may use the discussed embodiments.
It is intended that the one or more embodiments of the present application embrace all such alternatives, modifications and variations as fall within the broad scope of the appended claims. Therefore, any omissions, modifications, substitutions, improvements, and the like that may be made without departing from the spirit and principles of one or more embodiments of the present disclosure are intended to be included within the scope of the present disclosure.

Claims (10)

1. A topic word mining method, comprising:
acquiring text data;
filtering the text data based on a language model to determine a set of candidate words;
screening the candidate word set based on an unsupervised algorithm and a prediction model to determine an importance result of the candidate word set;
and determining the subject word according to the importance result of the candidate word set.
2. The topic word mining method according to claim 1, wherein the obtaining text data specifically comprises:
acquiring an industry text corpus to be processed;
preprocessing the industry text corpus to obtain the text data; the preprocessing operation comprises: deleting redundant characters, determining text granularity and performing line division processing.
3. The topic word mining method of claim 1, wherein filtering the textual data based on a language model to determine a set of candidate words comprises:
determining word length and word frequency of words in the text data according to the text data based on the language model;
selecting the vocabulary in the text data with the word length not greater than a word length threshold and the word frequency not less than a word frequency threshold by using a data mining strategy to determine candidate words;
and determining the candidate word set from the candidate words based on a solidity and degree-of-freedom screening strategy.
4. The topic word mining method according to claim 3, wherein the determining the candidate word set according to the candidate words based on the solidity and freedom degree screening strategy specifically comprises:
determining a solidity SD(Wi) and a degree of freedom FD(Wi) of the candidate word; the solidity SD(Wi) of the candidate word is expressed as
SD(Wi) = min over 1 ≤ k < n of p(C1C2...Cn) / ( p(C1...Ck) · p(Ck+1...Cn) )
wherein Wi represents the candidate word, Wi = C1C2...Cn, C1C2...Cn represent characters, and p() represents a probability function;
the degree of freedom FD(Wi) of the candidate word is expressed as
FD(Wi) = min{ LE(Wi), RE(Wi) }
wherein LE(Wi) represents the left-adjacent entropy of the candidate word and RE(Wi) represents the right-adjacent entropy of the candidate word;
and selecting, based on the solidity and degree-of-freedom screening strategy, the candidate words whose solidity SD(Wi) is not less than a solidity threshold and whose degree of freedom FD(Wi) is not less than a degree-of-freedom threshold, to determine the candidate word set.
5. The topic word mining method of claim 3, wherein the importance result of the candidate word set comprises a first importance EMS(Wi);
the screening of the candidate word set based on the unsupervised algorithm and the prediction model to determine the importance result of the candidate word set specifically includes:
determining, according to the unsupervised algorithm, the first importance EMS(Wi) of each candidate word in the candidate word set with respect to the text data; the first importance EMS(Wi) is expressed as
[formula given only as an image in the original publication]
wherein Tj represents a text segment obtained after segmenting the text data and ri() represents an iteration function.
6. The topic word mining method of claim 5, wherein the importance result of the candidate word set further comprises a second importance LCS(Wi);
the screening of the candidate word set based on the unsupervised algorithm and the prediction model to determine the importance result of the candidate word set specifically includes:
determining, according to the prediction model, the second importance LCS(Wi) of each candidate word in the candidate word set with respect to the text data.
7. The method according to claim 6, wherein the determining the topic word according to the importance result of the candidate word set specifically comprises:
determining the importance scores of the candidate words according to the importance results of the candidate word set;
and arranging the importance scores of the candidate words in descending order, selecting a preset number of candidate words in order starting from the candidate word with the largest importance score, and determining the selected candidate words as the topic words.
8. A topic word mining device, comprising:
an acquisition module configured to acquire text data;
a filtering module configured to filter the textual data based on a language model to determine a set of candidate words;
a screening module configured to screen the set of candidate words based on an unsupervised algorithm and a predictive model to determine an importance result of the set of candidate words;
and the determining module is configured to determine the subject word according to the importance result of the candidate word set.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the topic word mining method of any one of claims 1 to 7 when executing the program.
10. A non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the topic word mining method of any one of claims 1 to 7.
CN202011580178.XA 2020-12-28 2020-12-28 Method and device for mining subject term, electronic equipment and storage medium Active CN112784009B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011580178.XA CN112784009B (en) 2020-12-28 2020-12-28 Method and device for mining subject term, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011580178.XA CN112784009B (en) 2020-12-28 2020-12-28 Method and device for mining subject term, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN112784009A true CN112784009A (en) 2021-05-11
CN112784009B CN112784009B (en) 2023-08-18

Family

ID=75752926

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011580178.XA Active CN112784009B (en) 2020-12-28 2020-12-28 Method and device for mining subject term, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112784009B (en)


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190095526A1 (en) * 2017-09-22 2019-03-28 Druva Technologies Pte. Ltd. Keyphrase extraction system and method
CN109299230A (en) * 2018-09-06 2019-02-01 华泰证券股份有限公司 A kind of customer service public sentiment hot word data digging system and method
CN110377724A (en) * 2019-07-01 2019-10-25 厦门美域中央信息科技有限公司 A kind of corpus keyword Automatic algorithm based on data mining
CN111444712A (en) * 2020-03-25 2020-07-24 重庆邮电大学 Keyword extraction method, terminal and computer readable storage medium
CN111931501A (en) * 2020-09-22 2020-11-13 腾讯科技(深圳)有限公司 Text mining method based on artificial intelligence, related device and equipment

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113360613A (en) * 2021-05-31 2021-09-07 维沃移动通信有限公司 Text processing method and device and electronic equipment
WO2022253138A1 (en) * 2021-05-31 2022-12-08 维沃移动通信有限公司 Text processing method and apparatus, and electronic device
CN113468332A (en) * 2021-07-14 2021-10-01 广州华多网络科技有限公司 Classification model updating method and corresponding device, equipment and medium
CN116976351A (en) * 2023-09-22 2023-10-31 之江实验室 Language model construction method based on subject entity and subject entity recognition device
CN116976351B (en) * 2023-09-22 2024-01-23 之江实验室 Language model construction method based on subject entity and subject entity recognition device

Also Published As

Publication number Publication date
CN112784009B (en) 2023-08-18

Similar Documents

Publication Publication Date Title
US11455542B2 (en) Text processing method and device based on ambiguous entity words
CN112784009A (en) Subject term mining method and device, electronic equipment and storage medium
TW202020691A (en) Feature word determination method and device and server
CN110276023B (en) POI transition event discovery method, device, computing equipment and medium
US11907659B2 (en) Item recall method and system, electronic device and readable storage medium
CN106980664B (en) Bilingual comparable corpus mining method and device
US20160188569A1 (en) Generating a Table of Contents for Unformatted Text
CN107861948B (en) Label extraction method, device, equipment and medium
CN112183102A (en) Named entity identification method based on attention mechanism and graph attention network
US9575957B2 (en) Recognizing chemical names in a chinese document
CN110413998B (en) Self-adaptive Chinese word segmentation method oriented to power industry, system and medium thereof
US20190095525A1 (en) Extraction of expression for natural language processing
CN112765976A (en) Text similarity calculation method, device and equipment and storage medium
CN112434533A (en) Entity disambiguation method, apparatus, electronic device, and computer-readable storage medium
CN113033204A (en) Information entity extraction method and device, electronic equipment and storage medium
CN109902162B (en) Text similarity identification method based on digital fingerprints, storage medium and device
CN111310442B (en) Method for mining shape-word error correction corpus, error correction method, device and storage medium
CN114691907A (en) Cross-modal retrieval method, device and medium
JP6486789B2 (en) Speech recognition apparatus, speech recognition method, and program
CN112559739A (en) Method for processing insulation state data of power equipment
JP6805927B2 (en) Index generator, data search program, index generator, data search device, index generation method, and data search method
JP4985096B2 (en) Document analysis system, document analysis method, and computer program
CN113553410B (en) Long document processing method, processing device, electronic equipment and storage medium
CN111540363B (en) Keyword model and decoding network construction method, detection method and related equipment
KR102500106B1 (en) Apparatus and Method for construction of Acronym Dictionary

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant