CN113609844B - Electric power professional word stock construction method based on hybrid model and clustering algorithm - Google Patents

Info

Publication number
CN113609844B
CN113609844B (application CN202110874173.6A)
Authority
CN
China
Prior art keywords
word
words
model
text
electric
Prior art date
Legal status
Active
Application number
CN202110874173.6A
Other languages
Chinese (zh)
Other versions
CN113609844A (en)
Inventor
陈文刚
宰洪涛
刘建国
张轲
许泳涛
何洪英
罗滇生
尹希浩
奚瑞瑶
符芳育
方杰
Current Assignee
Jincheng Power Supply Co of State Grid Shanxi Electric Power Co Ltd
Original Assignee
Jincheng Power Supply Co of State Grid Shanxi Electric Power Co Ltd
Priority date
Filing date
Publication date
Application filed by Jincheng Power Supply Co of State Grid Shanxi Electric Power Co Ltd
Priority claimed from CN202110874173.6A
Publication of CN113609844A
Application granted
Publication of CN113609844B
Legal status: Active
Anticipated expiration

Classifications

    • G06F40/242 Dictionaries (G06F40/20 Natural language analysis; G06F40/237 Lexical tools)
    • G06F16/35 Clustering; Classification (G06F16/30 Information retrieval of unstructured textual data)
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates (G06F40/279 Recognition of textual entities)
    • G06Q50/06 Energy or water supply (G06Q50/00 ICT specially adapted for specific business sectors)

Abstract

The invention relates to the field of artificial intelligence, and in particular to a method for constructing an electric power professional word stock based on a hybrid model and a clustering algorithm. An electric power text and a parallel corpus are preprocessed and then segmented by a word segmentation model: mutual information, a left-right entropy algorithm and the TextRank algorithm combine words from the Jieba segmentation result; the TF-IDF algorithm and a Word2Vec word clustering algorithm extract text keywords from the segmentation result; an information-entropy word segmentation algorithm segments the text directly; and the results are summarized and compared to obtain feature corpus words. Electric power professional vocabulary is selected from the feature corpus words as seed words. Meanwhile, the derived electric power text word stock is used as candidate words to segment the electric power text, and the word2vec algorithm turns the words into word vectors. Clustering yields similar words, which are then filtered to obtain the electric power professional word stock. With one clustering model, most professional words outside the power field can be filtered out, and the resulting professional vocabulary is complete.

Description

Electric power professional word stock construction method based on hybrid model and clustering algorithm
Technical Field
The invention relates to the field of artificial intelligence, in particular to a method for constructing an electric power professional word stock based on a hybrid model and a clustering algorithm.
Background
In Chinese, a single character has weak ideographic ability and dispersed meaning, whereas a word has stronger ideographic ability and can describe a thing more accurately; therefore, in natural language processing, words (including single-character words) are generally the most basic processing units. For Latin-alphabet languages such as English, words can be extracted simply and accurately because the spaces between words serve as word-boundary markers. Written Chinese, by contrast, connects characters closely and, apart from punctuation marks, has no obvious boundaries, so words are difficult to extract. Chinese word segmentation methods fall roughly into three types: dictionary-based, statistical-model-based and rule-based segmentation. Dictionary-based segmentation is a relatively common and efficient mode, but its premise is that a word stock is available.
The power professional field has not yet established a complete power professional word stock. With increasing demands for semantic understanding of power texts, the need to construct a word stock for the power professional field is more and more urgent. The field has accumulated a large amount of text data, including power science and technology papers, project reports, power regulations, power operation manuals and the like. Developing vocabulary-discovery research on this data with natural language processing technology, and thereby constructing a dictionary for the power professional field, is of great significance for text understanding, mining and information management in the power field. However, since text mining is a new artificial-intelligence technology that has appeared only in recent years, word discovery and word stock construction are likewise an emerging frontier in the domestic power professional field; most research is still at the experimental stage, and the application effect has not yet been demonstrated.
Chinese differs from most Western languages: written Chinese has no obvious space marks between words and sentences appear as character strings, so the first step in Chinese processing is automatic word segmentation, i.e. converting character strings into word strings. The language is complex and changeable; Chinese has crossing ambiguity, combination ambiguity, ambiguities that cannot be resolved within a sentence, and unregistered words, all of which make segmentation difficult. To complete language processing tasks well, word segmentation must be performed first when mining text data. Existing common segmentation methods are based on a manually built word stock: common words can be collected into it by hand, but this cannot cope with the endless stream of new words, especially new Internet words, which are often the crux of the segmentation task. Therefore, one core task of Chinese word segmentation is to perfect new-word discovery algorithms. New word discovery means automatically finding language fragments that can form words directly from a large-scale corpus, without adding any prior materials.
Disclosure of Invention
The invention aims to provide a method for constructing an electric power professional word stock based on a hybrid model and a clustering algorithm. The method can overcome the shortcomings of word segmentation algorithms in existing word stock construction technology for the electric power professional field, and can mine new words from electric power text data.
The scheme of the invention comprises the following steps:
step one, preprocessing the electric power text and the parallel corpus, removing spaces, punctuation marks and words without entity meaning, to obtain qualified input text data;
step two, segmenting the electric power text and the non-power parallel corpus with a word segmentation model to obtain an electric power text word stock and a parallel corpus word stock, and comparing the two to obtain feature corpus words;
step three, selecting electric power professional vocabulary from the feature corpus words as seed words; meanwhile, using the electric power text word stock derived in step two as candidate words to segment the electric power text, and then turning the words into word vectors with the word2vec algorithm;
step four, inputting the word vectors and seed words into a clustering model, clustering to obtain words of the electric power professional field, filtering out non-power professional words according to rules, and finally obtaining the electric power professional word stock.
In step one, the electric power text includes electric power science and technology papers, project reports, electric power regulations, electric power operation manuals and the like, and the parallel corpus can be a crawled Wikipedia corpus.
In the word segmentation model, word set 1 is obtained from the Jieba segmentation through the TF-IDF statistical model, the Word2Vec word clustering model, the TextRank model, and the left-right information entropy and mutual information entropy models; word set 2 is built from frequency, solidification degree (cohesion) and degree of freedom; finally, the two word sets are combined to obtain the final word stock.
Word set 1 is established as follows: keywords are extracted from the Jieba segmentation result by the TF-IDF model and the Word2Vec word clustering model, words are combined by the TextRank model and the left-right information entropy and mutual information entropy models, and the results are merged to obtain word set 1.
Word set 2 is established as follows:
(1) Statistics: count the frequency P_a of each character from the corpus, and count the co-occurrence frequency P_ab of every pair of adjacent characters a, b;
(2) Cutting: set a frequency threshold min_prob and a mutual-information threshold min_pmi, then cut between adjacent characters in the corpus wherever P_ab < min_prob or PMI(a, b) = log(P_ab / (P_a × P_b)) < min_pmi;
(3) Filtering: after the cutting in step (2), count the frequency P_w' of each candidate word and keep only the candidates with P_w' > min_prob;
(4) Redundancy removal: sort the candidate words obtained in step (3) by length in descending order; delete each candidate from the word stock in turn, re-segment it with the remaining words and word frequencies, and compute the mutual information between the original word and its sub-words; if the mutual information is greater than 1, restore the word, otherwise keep it deleted and update the frequencies of the segmented sub-words;
(5) Statistics: for each word w retained after step (4), compute the left information entropy H_l(w) = -Σ_{i=1}^{n} P(a_i w | w) log P(a_i w | w) and the right information entropy H_r(w) = -Σ_{i=1}^{m} P(w b_i | w) log P(w b_i | w), where n and m are the numbers of distinct left-adjacent and right-adjacent characters of w. The degree of freedom of a text fragment is defined as the smaller of the two entropies; a threshold min_pdof of the degree of freedom is set, and a fragment whose degree of freedom exceeds the threshold is considered an independent word.
According to the invention, the text of power-field literature is segmented based on the hybrid model, and the segmented fragments accord with Chinese semantics, so the segmentation task can be completed effectively. Using the hybrid model rather than a single model makes the word stock more complete and the vocabulary richer. The words extracted by the hybrid model contain some non-power-field words; the clustering model is then used to cluster the power-field professional words, and the clustering result shows that most non-power professional words can be filtered out, so the clustered power professional vocabulary is complete and the effect is good.
Drawings
FIG. 1 is a schematic diagram of a text preprocessing process.
FIG. 2 is a schematic diagram of a process of extracting feature corpus words.
FIG. 3, word segmentation model.
Fig. 4 is a schematic diagram of a word stock construction process in the power professional field.
FIG. 5, word combination schematic.
Detailed Description
The invention discloses a method for constructing an electric power professional word stock based on a hybrid model and a clustering algorithm, comprising the following steps:
step one, preprocessing the electric power text and the parallel corpus, including deleting spaces, punctuation marks, special characters and characters or words without entity meaning from the initial text data, to obtain qualified input text data;
step two, segmenting the electric power text and the non-power parallel corpus with a word segmentation model to obtain an electric power text word stock and a parallel corpus word stock, and comparing the two to obtain feature corpus words of the electric power field;
step three, since the feature corpus words still contain non-power professional vocabulary, selecting electric power professional vocabulary from them as seed words; meanwhile, using the electric power text word stock derived in step two as candidate words to segment the electric power text, and then turning the words into word vectors with the word2vec algorithm;
step four, inputting the word vectors and seed words into a clustering model, clustering to obtain words of the electric power professional field, filtering out non-power professional words according to rules, and finally obtaining the electric power professional word stock.
In the text data preprocessing shown in fig. 1, the initial power-domain text data and the parallel corpus text data contain a large number of spaces, punctuation marks, special characters such as %, and words without entity meaning such as "and", "it", etc. To obtain acceptable input text, the text must be processed accordingly. The text of the electric power professional field includes electric power science and technology papers, project reports, electric power regulations, electric power operation manuals and the like; the parallel corpus can be the Wikipedia corpus, the People's Daily corpus and the like, and should be distinct from the electric power text data. In addition, if the electric power text data and the parallel corpus are large enough, the constructed word stock can also be large enough.
To extract the feature corpus words shown in fig. 2, the text of the electric power professional field and the parallel corpus are segmented by the word segmentation model to obtain two word stocks, which are then compared to obtain the feature corpus words.
In the word segmentation model, word set 1 is obtained from the Jieba segmentation through the TF-IDF statistical model, the Word2Vec word clustering model, the TextRank model, and the left-right information entropy and mutual information entropy models; word set 2 is built from frequency, solidification degree and degree of freedom; finally, the two word sets are combined to obtain the final word stock.
The word set 1 is established as follows:
(1) Jieba segmentation: Jieba is a good text segmentation tool and can segment text fairly accurately, but the resulting word granularity is small, so most words of the electric power professional field are split apart and the extracted power-domain vocabulary is not rich enough. These small-granularity words therefore need to be combined to enrich the whole word stock, as shown in fig. 5.
(2) Combination: because the granularity of the Jieba segmentation result is small and most electric power professional words are split apart, the final result is obtained through word combination.
a. Extracting keywords with the TF-IDF model
The statistical model selected is the TF-IDF model, where TF-IDF is the product of two statistics. The specific values of the statistics can be determined in a number of ways.
Term frequency (TF): the frequency of word w in document d, i.e. the ratio of the number of occurrences count(w, d) of word w to the total number of words size(d) in document d: tf = count(w, d) / size(d)
Inverse document frequency (IDF): the inverse document frequency of word w over the whole document collection D, i.e. the logarithm of the ratio of the total number of documents n to the number of documents df(w, D) in which w appears: idf = log(n / df(w, D))
Therefore: w_i = tf_i × idf_i
However, the invention uses an improved TF-IDF model as the evaluation standard; the improved model increases the DF penalty.
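The standard TF and IDF formulas above can be sketched in a few lines of Python. This is an illustrative implementation of the plain formulas only; the patent's improved DF penalty is not specified, so it is omitted here, and the function name `tf_idf` is hypothetical.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Compute TF-IDF weights for a list of tokenised documents.

    tf  = count(w, d) / size(d)
    idf = log(n / df(w, D))
    """
    n = len(docs)
    df = Counter()                      # document frequency of each word
    for doc in docs:
        df.update(set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        size = len(doc)
        weights.append({w: (c / size) * math.log(n / df[w])
                        for w, c in counts.items()})
    return weights
```

Note that a word appearing in every document gets idf = log(1) = 0, which is exactly why common corpus-wide words are suppressed.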
b. Extracting keywords by Word2Vec word clustering
1) Word2Vec Word vector representation:
the occurrence of words in a corpus is automatically learned by using a shallow neural network model, and the words are embedded into a high-dimensional space, usually in 100-500 dimensions, in which the words are expressed in terms of word vectors. The extraction of feature word vectors is based on a word vector model that has been trained.
2) K-means clustering algorithm:
the clustering algorithm aims to find the relationship between data objects in the data, and group the data so that the similarity in the groups is as large as possible and the similarity between the groups is as small as possible.
The algorithm idea is as follows: firstly, randomly selecting K points as initial centroids, wherein K is the number of expected clusters designated by a user, assigning each point to the nearest centroid to form K clusters by calculating the distance between each point and each centroid, then recalculating the centroid of each cluster according to the point assigned to the cluster, and repeating the operations of assigning and updating the centroids until the clusters are unchanged or the maximum iteration number is reached.
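The K-means procedure just described can be sketched directly with NumPy. This is a generic textbook implementation for illustration, not code from the patent; the fixed random seed and default iteration cap are illustrative choices.

```python
import numpy as np

def kmeans(points, k, max_iter=100, seed=0):
    """Plain K-means: random initial centroids, assign-then-update loop,
    stop when the assignment no longer changes or max_iter is reached."""
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    labels = None
    for _ in range(max_iter):
        # distance of every point to every centroid
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # assignment unchanged: converged
        labels = new_labels
        for j in range(k):
            members = points[labels == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
    return labels, centroids
```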
3) The implementation process of the Word2Vec Word clustering keyword extraction method comprises the following steps:
the main idea is that for words represented by Word vectors, the words in the articles are clustered through a K-Means algorithm, a clustering center is selected as a main keyword of a text, distances between other words and the clustering center, namely similarity, are calculated, the K words closest to the clustering center are selected as keywords, and the similarity among the words can be calculated by using vectors generated by Word2 Vec. The method comprises the following specific steps:
i. Train a Word2Vec model on the corpus to obtain a word vector file;
ii. Preprocess the text to obtain N candidate keywords;
iii. Traverse the candidate keywords and extract their word vector representations from the word vector file;
iv. Apply K-Means clustering to the candidate keywords to obtain the cluster centre of each category;
v. Under each category, compute the distance (Euclidean or Manhattan) between the words in the group and the cluster centre, and sort in ascending order of distance;
vi. Rank the candidate keywords and take the top K words as text keywords.
c. Extracting keywords with the TextRank model
The TextRank model regards words as "nodes" and constructs word relationships, computing the importance of each word from the co-occurrence relationships between words. The TextRank keyword-extraction algorithm is as follows:
TextRank model:
WS(V_i) = (1 - d) + d × Σ_{V_j ∈ In(V_i)} [ w_ji / Σ_{V_k ∈ Out(V_j)} w_jk ] × WS(V_j)
where d is a damping factor, typically 0.85; In(V_i) is the set of nodes pointing to V_i; Out(V_j) is the set of nodes V_j points to; w_ji is the weight of edge V_j → V_i; WS(V_i) and WS(V_j) are the weights of nodes V_i and V_j.
1) Segment the given text T into complete sentences, i.e. T = [S_1, S_2, …, S_m].
2) For each sentence S_i, perform word segmentation and part-of-speech tagging, filter out stop words, and keep only words of specified parts of speech, e.g. nouns, verbs and adjectives, i.e. S_i = [t_{i,1}, t_{i,2}, …, t_{i,n}], where the t_{i,j} are the retained candidate keywords.
3) Construct a candidate keyword graph G = (V, E), where V is the node set composed of the candidate keywords generated in step 2); build edges between nodes using the co-occurrence relation: an edge exists between two nodes only when the corresponding words co-occur within a window of length K, where K is the window size, i.e. at most K words co-occur.
4) Iteratively propagate the node weights according to the above formula until convergence.
5) Sort the node weights in descending order to obtain the most important T words as candidate keywords.
6) Mark the most important T words from step 5) in the original text; if they form adjacent phrases, combine them into multi-word keywords.
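The co-occurrence graph and the weight-propagation loop of steps 1)-5) can be sketched as below. This is an illustrative simplification: part-of-speech filtering is skipped, the window size and iteration count are arbitrary defaults, and the function name `textrank` is hypothetical.

```python
from collections import defaultdict

def textrank(tokens, window=3, d=0.85, iters=50):
    """Rank words by the TextRank update
    WS(Vi) = (1-d) + d * sum_{Vj in In(Vi)} w_ji / sum_k w_jk * WS(Vj)
    on an undirected co-occurrence graph."""
    # edge weight = number of times two distinct words co-occur
    # within `window` tokens of each other
    weight = defaultdict(float)
    for i in range(len(tokens)):
        for j in range(i + 1, min(i + window, len(tokens))):
            if tokens[i] != tokens[j]:
                weight[(tokens[i], tokens[j])] += 1.0
                weight[(tokens[j], tokens[i])] += 1.0
    out_sum = defaultdict(float)
    neighbours = defaultdict(set)
    for (a, b), w in weight.items():
        out_sum[a] += w
        neighbours[a].add(b)
    ws = {v: 1.0 for v in sorted(set(tokens))}
    for _ in range(iters):
        ws = {v: (1 - d) + d * sum(weight[(u, v)] / out_sum[u] * ws[u]
                                   for u in neighbours[v])
              for v in ws}
    return sorted(ws, key=ws.get, reverse=True)
```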
d. Extracting multi-word keywords with left-right information entropy and mutual information entropy.
1) Compute mutual information: find first-order co-occurrences and return word frequencies, then find second-order co-occurrences and return the mutual information and word frequencies. The larger the pointwise mutual information PMI(a, b) = log( p(a, b) / (p(a) p(b)) ), the stronger the association between words a and b.
2) Compute left and right entropy: find the left-neighbour frequencies, compute the left entropy H_left(x) and return it; then find the right-neighbour frequencies, compute the right entropy H_right(x) and return it.
3) Compute the result: score = PMI + min(H_left(x), H_right(x)); the larger the score, the greater the probability that the words should be combined.
Finally, the words obtained in the steps a, b, c and d are combined to obtain a word set 1.
In the establishment of word set 2, the mutual information is the logarithm of the solidification degree (cohesion), and the degree of freedom is the minimum of the left and right information entropies. The process is as follows:
(1) Statistics: count the frequency P_a of each character from the corpus, and count the co-occurrence frequency P_ab of every pair of adjacent characters a, b;
(2) Cutting: set a frequency threshold min_prob and a mutual-information threshold min_pmi, then cut between adjacent characters in the corpus wherever P_ab < min_prob or PMI(a, b) = log(P_ab / (P_a × P_b)) < min_pmi;
(3) Filtering: after the cutting in step (2), count the frequency P_w' of each candidate word and keep only the candidates with P_w' > min_prob;
(4) Redundancy removal: sort the candidate words obtained in step (3) by length in descending order; delete each candidate from the word stock in turn, re-segment it with the remaining words and word frequencies, and compute the mutual information between the original word and its sub-words; if the mutual information is greater than 1, restore the word, otherwise keep it deleted and update the frequencies of the segmented sub-words;
(5) Statistics: for each word w retained after step (4), compute the left information entropy H_l(w) = -Σ_{i=1}^{n} P(a_i w | w) log P(a_i w | w) and the right information entropy H_r(w) = -Σ_{i=1}^{m} P(w b_i | w) log P(w b_i | w), where n and m are the numbers of distinct left-adjacent and right-adjacent characters of w. The degree of freedom of a text fragment is defined as the smaller of the two entropies; a threshold min_pdof of the degree of freedom is set, and a fragment whose degree of freedom exceeds the threshold is considered an independent word. Word set 2 is obtained through the above steps.
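The frequency / cohesion / freedom filter of steps (1)-(5) can be sketched as below. For brevity this sketch scores only two-character candidates and skips the redundancy-removal step; the thresholds are illustrative defaults, not values from the patent.

```python
import math
from collections import Counter

def new_words(text, min_prob=0.002, min_pmi=1.0, min_ent=0.5):
    """Keep character bigrams that pass the frequency (min_prob),
    cohesion (min_pmi) and freedom (min_ent) thresholds."""
    n = len(text)
    chars = Counter(text)
    pairs = Counter(text[i:i + 2] for i in range(n - 1))
    found = []
    for w, c in pairs.items():
        p_ab = c / (n - 1)
        if p_ab < min_prob:
            continue                              # frequency threshold
        pmi = math.log(p_ab / (chars[w[0]] / n * chars[w[1]] / n))
        if pmi < min_pmi:
            continue                              # cohesion threshold
        left = Counter(text[i - 1] for i in range(1, n - 1) if text[i:i + 2] == w)
        right = Counter(text[i + 2] for i in range(n - 2) if text[i:i + 2] == w)
        def H(cnt):
            t = sum(cnt.values())
            return -sum(v / t * math.log(v / t) for v in cnt.values()) if t else 0.0
        if min(H(left), H(right)) >= min_ent:     # freedom threshold
            found.append(w)
    return found
```

A fragment like "ab" below survives because it is frequent, cohesive, and occurs in varied contexts, while one-off pairs fail the freedom test.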
Seed words are selected from the feature corpus words obtained in step two. The word set obtained by segmenting the text of the electric power professional field serves as the candidate words; the electric power text is segmented, and a word vector model (the Word2Vec model) is trained to obtain word vectors. The words are then clustered according to the word vectors obtained from the word vector model: clustering starts from several selected seed words and finds a batch of similar words. The algorithm uses similarity transitivity (somewhat like a connectivity-based clustering algorithm): if A and B are similar, and B and C are similar, then A, B and C are grouped into one class (even if A and C are not similar by the index). Of course, such propagation could traverse the entire vocabulary, so the similarity constraint is progressively tightened. For example, A is a seed word and B, C are not; a similarity of 0.6 between A and B defines them as similar, while B and C are considered similar only if their similarity is greater than 0.7. The similarity threshold is computed as sim_i = k + d × (1 − e^(−d×i)), where i is the number of passes, k is the initial similarity threshold, and d is typically 0.2-0.5.
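The transitive seed-word expansion with the tightening threshold sim_i = k + d × (1 − e^(−d×i)) can be sketched as follows. The function and parameter names are hypothetical, and `sim` stands in for any word-similarity function (e.g. cosine similarity of Word2Vec vectors).

```python
import math

def expand_seed(seed, words, sim, k=0.6, d=0.3, max_pass=5):
    """Grow a cluster from seed words by similarity transitivity:
    a word joins if it is similar to any word already in the cluster,
    and the threshold is tightened on every pass."""
    cluster = set(seed)
    for i in range(1, max_pass + 1):
        # sim_i = k + d * (1 - e^(-d*i)): rises toward k + d with each pass
        thr = k + d * (1 - math.exp(-d * i))
        added = {w for w in words
                 if w not in cluster and any(sim(w, c) >= thr for c in cluster)}
        if not added:
            break  # no new similar words: stop propagating
        cluster |= added
    return cluster
```

In the toy run below, C joins only through B on the second pass, while the weakly related D is never admitted, mirroring how the rising threshold limits runaway propagation.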
Since the preceding procedure is purely unsupervised, even with semantic clustering some non-power professional vocabulary is produced, and even some non-words are retained, so rule-based filtering is needed. The clustered words are finally filtered through rules to obtain the final result.
Partial results from each stage of the method are shown below:
(1) Based on the frequency, the solidification degree and the partial word result of the degree of freedom word segmentation: consistent, one, three-phase, expert, insulating pad, fuse blow, capacitor core, winding capacitance, series reactor, switch core, relay baffle, medium voltage side, laboratory, protection action trip, majority, solid insulating material, on-line monitoring system, differential protection action, oilpaper capacitive bushing, insulating oil chromatography, frequency response analysis, light gas protection action, etc.
(2) Extracting a keyword part result based on a statistical model of the Jieba segmentation: coordination, safety measures, oil machines, appropriateness, tripping, operators on duty, power lines, equivalent circuits, inflammability and explosiveness, notes, spirals, transformer plants, national power grids, detection technologies, models, oil stains, gears, magnetic fluxes, insulating papers, plastic cloths, firm, switch cabinets, direct current power supplies, power companies, fire protection measures and the like.
(3) Keyword part results are extracted based on Word2Vec clustering model of Jieba segmentation: specification, leakage, data, teaching, main parameters, ammeter, grounding device, high energy, control measures, filter, computer, carbon monoxide, inspection, operation, equipment, circuit breaker, gas, bus, process, personnel, body, operation, tripping, direct current, responsible person, interior, iron core, etc
(4) Extracting a keyword group part result based on a TextRank model of the Jieba segmentation: power-off protection personnel, in-hole, national grid company, gear box indication, meter chromatographic analysis data, load operation, transformer maintenance, outlet short circuit impact, grounding knife switch, discharge burning trace, action transformer, manufacturing quality problem, linear algorithm, electric energy loss, sleeve external insulation, contact iron core, dendritic discharge trace and the like.
(5) Extracting a keyword group part result based on a mutual information entropy and a left-right information entropy model of the Jieba word segmentation: tap changers, number main transformers, low-voltage windings, winding deformation, low-voltage circuit breakers, hanging cover inspection, oil substitution pumps, transposed conductors, loop disconnection, gas production rates, immersed transformers, busbar voltages, neutral point bushings, preventive power equipment, vacuum oil filters, dry reactors, cylindrical windings and the like.
(6) Based on the clustering model final partial results: the device comprises a main magnetic flux, a sensor, an oil storage tank, a charger, a cooler, a manufacturing plant, a double bus, a transformer, a respirator, a main transformer, # main transformer, a circuit breaker, a low-voltage circuit breaker, a circuit breaker tripping operation, a low voltage, a low-voltage impedance, a high-voltage winding, a winding, winding deformation, a three-phase winding, arc discharge, an oil replacement pump, an on-load voltage regulation switch, bus voltage, transformer fault diagnosis, a bus protection device, a secondary main switch, an insulating cushion block, a fuse fusing, a capacitor core, winding capacitance, a series reactor, a switch core, a relay baffle, a power line electric tool, a charger alternating current power supply, magnetic current, interference pulses, parallel wires, bus voltage, characteristic gases, electrical equipment, a power switch, a direct current power supply, an air switch, a voltage regulation switch, a vacuum oil filter, an oil paper capacitor sleeve, an insulating oil chromatographic analysis, a frequency response analysis method, a light gas protection action, a heavy gas protection action, infrared thermal imaging detection, a position alarm lamp, a voltage pulse analysis method, a non-excitation tapping switch, inflammable and infrared thermal imaging and the like.
The TF-IDF statistical model and the Word2Vec word clustering model extract keywords from the Jieba segmentation result, and those results can be seen to contain fine-grained non-electric words. The mutual information and left-right information entropy models and the TextRank model extract key word groups from the Jieba segmentation result, which alleviates the problems of overly fine Jieba granularity and incorrectly split domain words. The word segmentation model based on information entropy yields accurate segmentation results with good effect. The clustering algorithm gathers the segmentation results of the models and then clusters the electric power domain words; the clustering result screens out the non-electric words, so the clustering effect is evident. The models complement one another, and the finally constructed word stock is complete.
The invention provides a word segmentation method based on information entropy. The electric power text is segmented according to the minimum information entropy principle, achieving a more accurate segmentation effect. The method first processes the electric power text using frequency and solidification degree to screen out quasi-words, then screens the quasi-words again using degree of freedom, initially building a word stock and improving segmentation accuracy.
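The frequency/solidification/degree-of-freedom screening described above can be sketched in Python. This is an illustrative toy, not the patented implementation: the function names, the character-level granularity, and the thresholds are all assumptions for demonstration.

```python
import math
from collections import Counter

def cohesion(text, seg):
    """PMI-style solidification degree of a two-character segment:
    log( P(seg) / (P(seg[0]) * P(seg[1])) ), from character frequencies."""
    n = len(text)
    chars = Counter(text)
    pairs = Counter(text[i:i + 2] for i in range(n - 1))
    p_seg = pairs[seg] / (n - 1)
    return math.log(p_seg / ((chars[seg[0]] / n) * (chars[seg[1]] / n)))

def freedom(text, seg):
    """Degree of freedom: the smaller of the left- and right-neighbor
    entropies of seg, as defined in the description."""
    lefts, rights = Counter(), Counter()
    start = text.find(seg)
    while start != -1:
        if start > 0:
            lefts[text[start - 1]] += 1          # character just left of seg
        end = start + len(seg)
        if end < len(text):
            rights[text[end]] += 1               # character just right of seg
        start = text.find(seg, start + 1)
    def entropy(counts):
        total = sum(counts.values())
        if total == 0:
            return 0.0
        return -sum(v / total * math.log(v / total) for v in counts.values())
    return min(entropy(lefts), entropy(rights))
```

On a toy string such as "abxabyabz", the segment "ab" has positive cohesion, and its degree of freedom equals ln 2: the right neighbors {x, y, z} are diverse, but the left neighbors {x, y} are the limiting side.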
The invention provides a word recombination method. Because the Jieba segmentation result has small word granularity, some electric power domain terms are split apart during segmentation and some non-professional words remain. Keyword extraction and word combination are therefore performed on the segmentation result using a statistical model, a Word2Vec clustering model, a left-right information entropy and mutual information entropy model, and a TextRank model, enriching and perfecting the word stock.
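A toy illustration of recombining fine-grained segmentation output by mutual information follows; it is a simplified stand-in for the mutual-information/left-right-entropy combination models, and the function name, token lists, and thresholds are assumptions:

```python
import math
from collections import Counter

def merge_bigrams(token_seqs, min_count=2, min_pmi=1.0):
    """Re-combine adjacent tokens whose pointwise mutual information
    exceeds min_pmi, mimicking the word-combination step on a
    segmentation result that split a domain term into pieces."""
    unigrams = Counter(t for seq in token_seqs for t in seq)
    bigrams = Counter((a, b) for seq in token_seqs for a, b in zip(seq, seq[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    merged = set()
    for (a, b), c in bigrams.items():
        if c < min_count:
            continue
        pmi = math.log((c / n_bi) /
                       ((unigrams[a] / n_uni) * (unigrams[b] / n_uni)))
        if pmi > min_pmi:
            merged.add(a + b)  # Chinese tokens concatenate without spaces
    return merged
```

For example, if "power" and "grid" co-occur repeatedly across token sequences, the pair is recombined into a single candidate word, while one-off neighbors are left alone.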
The invention provides a word clustering method. The clustering rule is to find a batch of similar words from a number of selected seed words, so that words belonging to the electric power professional field can be clustered out of the word stock, enriching and perfecting the electric power professional word stock.
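The seed-word clustering rule can be sketched as follows, assuming word vectors have already been trained (e.g. by Word2Vec). The two-dimensional toy vectors, the threshold value, and the function name are illustrative assumptions, not values from the patent:

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def expand_seeds(vectors, seeds, threshold=0.8):
    """Collect candidate words whose similarity to ANY seed word exceeds
    the threshold -- a minimal stand-in for the clustering step that
    grows an electric-power vocabulary from a few seed terms."""
    cluster = set(seeds)
    for word, vec in vectors.items():
        if word in cluster:
            continue
        if any(cosine(vec, vectors[s]) >= threshold for s in seeds):
            cluster.add(word)
    return cluster
```

With "transformer" as the seed, a nearby vector such as "winding" joins the cluster while an unrelated word such as "weather" is filtered out, which is the behavior claimed for the rule-based non-electric-word filter.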
Compared with existing word stock construction methods, the method has the following advantages:
1. Aiming at the defects that Jieba segmentation granularity is small and domain words are split by mistake, a mutual information and left-right information entropy word combination algorithm and a TextRank algorithm are used to combine words from the Jieba segmentation result, discovering more electric power domain words and resolving these problems. A joint mutual-information and left-right-entropy judgment is designed, making the decision on whether to combine words stricter and improving combination accuracy; the TextRank algorithm extracts keywords and combines them by ranking, making word combination more accurate.
2. The TF-IDF algorithm and the Word2Vec word clustering algorithm extract weighted keywords from the Jieba segmentation result and can pick out important words in the text, i.e., some electric power domain words. An improved TF-IDF algorithm is designed that increases the penalty applied during keyword extraction, making keyword extraction more accurate.
3. The information entropy word segmentation algorithm designs three thresholds, on word frequency, mutual information, and left-right information entropy, making the word-forming judgment strict and improving segmentation accuracy. The segmentation results of the models are summarized and complement one another, making the candidate word stock more complete.
4. The word clustering algorithm clusters domain words: it can cluster the electric power domain words in the segmentation results and filter out non-electric-power words, reducing manual workload and making word stock construction simpler, more convenient, and more complete.
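The starting point of advantage 2 is standard TF-IDF weighting over tokenized documents, which can be computed as below. The patent does not specify its "improved" penalty term, so this sketch shows only the textbook formulation; the function name and toy documents are assumptions:

```python
import math
from collections import Counter

def tf_idf(docs):
    """Textbook TF-IDF: term frequency within each document times the
    log inverse document frequency across the corpus."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))          # document frequency: one count per doc
    scores = []
    for doc in docs:
        tf = Counter(doc)
        total = len(doc)
        scores.append({w: (c / total) * math.log(n / df[w])
                       for w, c in tf.items()})
    return scores
```

Note that a word appearing in every document gets a score of zero (log 1 = 0); an "improved" variant would further penalize such corpus-wide words so that only domain-specific keywords survive.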

Claims (4)

1. The electric power professional word stock construction method based on the hybrid model and the clustering algorithm is characterized by comprising the following steps:
step one, preprocessing electric power texts and a parallel corpus outside the electric power domain, removing spaces, punctuation marks, and words without substantive meaning to obtain qualified input text data;
step two, word segmentation is carried out on the electric text and the parallel corpus through a word segmentation model, so that an electric text word stock and a parallel corpus word stock are obtained, and the electric text word stock is compared with the parallel corpus word stock to obtain characteristic corpus words;
step three, selecting electric professional vocabulary from the feature corpus words as seed words; meanwhile, the electric text word stock derived in step two is used to segment the electric text, and then the Word2Vec algorithm is used to convert the words into word vectors;
step four, inputting the word vectors and seed words into a clustering model, clustering to obtain similar words, and then filtering out non-electric professional words by rules to finally obtain the electric power professional word stock;
in the word segmentation model, word set 1 is obtained from the Jieba segmentation result through a TF-IDF statistical model, a Word2Vec word clustering model, a TextRank model, and a left-right information entropy and mutual information entropy model; word set 2 is built through frequency, solidification degree, and degree of freedom; finally the two word sets are combined to obtain the final word stock.
2. The method according to claim 1, wherein in step one the power text includes power science and technology papers, project reports, power procedures, or power operation manuals, and the parallel corpus is a crawled Wikipedia corpus.
3. The method of claim 1, wherein the word set 1 creation process is as follows: keywords are extracted through Jieba segmentation, the TF-IDF model, and the Word2Vec word clustering model; word combination is carried out through the TextRank model and the left-right information entropy and mutual information entropy model; the resulting words are then merged to obtain word set 1.
4. The method of claim 1, wherein the word set 2 creation process is as follows:
(1) Statistics: count the frequency of each word in the corpus, and count the co-occurrence frequency P_ab of each pair of adjacent words a and b;
(2) Cutting: set a threshold min_prob on occurrence frequency and a threshold min_pmi on mutual information, then cut between any adjacent characters a and b in the corpus for which P_ab < min_prob or the mutual information PMI(a,b) = log(P_ab / (P_a · P_b)) < min_pmi;
(3) Cutting: after the cutting of step (2), count the frequency P_w′ of each quasi-word w′ and keep only the quasi-words with P_w′ > min_prob;
(4) Redundancy removal: arrange the candidate words obtained in step (3) from longest to shortest by word count; delete each candidate word from the word stock in turn, segment it using the remaining words and word frequencies, and calculate the mutual information between the original word and its sub-words; if the mutual information is greater than 1 the word is restored, otherwise the deletion is kept and the frequencies of the segmented sub-words are updated;
(5) Statistics: for each word w obtained after the redundancy removal of step (4), compute from the word set its left information entropy E_L(w) = -Σ_{i=1..n} P(a_i|w)·log P(a_i|w) and right information entropy E_R(w) = -Σ_{j=1..m} P(b_j|w)·log P(b_j|w), where n and m are the numbers of distinct left-adjacent and right-adjacent words of w, respectively; the degree of freedom of a text fragment is defined as the smaller of the left-adjacent and right-adjacent information entropies; a threshold min_pdof on the degree of freedom is set, and if the degree of freedom is larger than the threshold, the fragment is considered to form an independent word.
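Steps (1) to (3) of claim 4 can be sketched in miniature as follows. The character-level granularity, the threshold values, and the function name are illustrative assumptions, and reusing min_prob for the final fragment filter mirrors the claim only loosely:

```python
import math
from collections import Counter

def segment_by_entropy(text, min_prob=0.01, min_pmi=0.5):
    """Claim 4 steps (1)-(3) in miniature: count adjacent-character
    co-occurrence, cut wherever frequency or PMI falls below its
    threshold, then keep fragments frequent enough to be quasi-words."""
    n = len(text)
    chars = Counter(text)
    pairs = Counter(text[i:i + 2] for i in range(n - 1))
    cuts = []
    for i in range(n - 1):                       # step (2): decide each boundary
        pair = text[i:i + 2]
        p_ab = pairs[pair] / (n - 1)
        pmi = math.log(p_ab / ((chars[pair[0]] / n) * (chars[pair[1]] / n)))
        if p_ab < min_prob or pmi < min_pmi:
            cuts.append(i + 1)
    frags, prev = [], 0
    for c in cuts + [n]:                         # split text at the weak boundaries
        frags.append(text[prev:c])
        prev = c
    counts = Counter(frags)                      # step (3): keep frequent quasi-words
    return [w for w, c in counts.items() if c / len(frags) > min_prob]
```

On the toy string "ababXabab", the weakly bound "ba" boundaries are cut while the cohesive "ab" pairs stay together, yielding the quasi-words "ab" and "abXab".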
CN202110874173.6A 2021-07-30 2021-07-30 Electric power professional word stock construction method based on hybrid model and clustering algorithm Active CN113609844B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110874173.6A CN113609844B (en) 2021-07-30 2021-07-30 Electric power professional word stock construction method based on hybrid model and clustering algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110874173.6A CN113609844B (en) 2021-07-30 2021-07-30 Electric power professional word stock construction method based on hybrid model and clustering algorithm

Publications (2)

Publication Number Publication Date
CN113609844A CN113609844A (en) 2021-11-05
CN113609844B true CN113609844B (en) 2024-03-08

Family

ID=78338878

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110874173.6A Active CN113609844B (en) 2021-07-30 2021-07-30 Electric power professional word stock construction method based on hybrid model and clustering algorithm

Country Status (1)

Country Link
CN (1) CN113609844B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114154484B (en) * 2021-11-12 2023-01-06 中国长江三峡集团有限公司 Construction professional term library intelligent construction method based on mixed depth semantic mining
CN114168731B (en) * 2021-11-29 2024-06-28 北京国瑞数智技术有限公司 Internet media flow safety protection method and system
CN117952089B (en) * 2024-03-26 2024-06-14 广州源高网络科技有限公司 Intelligent data processing method and system for real-world clinical research
CN117953875B (en) * 2024-03-27 2024-06-28 成都启英泰伦科技有限公司 Offline voice command word storage method based on semantic understanding
CN118535739A (en) * 2024-06-26 2024-08-23 上海建朗信息科技有限公司 Data classification method and system based on keyword weight matching

Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2000035963A (en) * 1998-07-17 2000-02-02 Nec Corp Automatic sentence classification device and method
CN104199972A (en) * 2013-09-22 2014-12-10 中科嘉速(北京)并行软件有限公司 Named entity relation extraction and construction method based on deep learning
CN105243129A (en) * 2015-09-30 2016-01-13 清华大学深圳研究生院 Commodity property characteristic word clustering method
CN106095736A (en) * 2016-06-07 2016-11-09 华东师范大学 A kind of method of field neologisms extraction
CN106649666A (en) * 2016-11-30 2017-05-10 浪潮电子信息产业股份有限公司 Left-right recursion-based new word discovery method
WO2017185674A1 (en) * 2016-04-29 2017-11-02 乐视控股(北京)有限公司 Method and apparatus for discovering new word
CN110334345A (en) * 2019-06-17 2019-10-15 首都师范大学 New word discovery method
CN110457708A (en) * 2019-08-16 2019-11-15 腾讯科技(深圳)有限公司 Vocabulary mining method, apparatus, server and storage medium based on artificial intelligence
CN110990567A (en) * 2019-11-25 2020-04-10 国家电网有限公司 Electric power audit text classification method for enhancing domain features
WO2020073523A1 (en) * 2018-10-12 2020-04-16 平安科技(深圳)有限公司 New word recognition method and apparatus, computer device, and computer readable storage medium
CN111931491A (en) * 2020-08-14 2020-11-13 工银科技有限公司 Domain dictionary construction method and device
CN112732934A (en) * 2021-01-11 2021-04-30 国网山东省电力公司电力科学研究院 Power grid equipment word segmentation dictionary and fault case library construction method
CN113033183A (en) * 2021-03-03 2021-06-25 西北大学 Network new word discovery method and system based on statistics and similarity
CN113157903A (en) * 2020-12-28 2021-07-23 国网浙江省电力有限公司信息通信分公司 Multi-field-oriented electric power word stock construction method

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN100530171C * 2005-01-31 2009-08-19 NEC (China) Co., Ltd. Dictionary learning method and device


Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
New word discovery algorithm based on mutual information and adjacency entropy; Liu Weitong, Liu Peiyu, Liu Wenfeng, Li Nana; Application Research of Computers; 2018-03-14 (No. 05); full text *
New word discovery and sentiment orientation determination based on deep structure models; Sun Xiao, Sun Chongyuan, Ren Fuji; Computer Science (09); full text *
Missing data imputation algorithm for dispatching and control systems based on genetic optimization; Wang Yirong, Wang Ruijie, Chen Wengang, Wu Runze; Power System Protection and Control (21); full text *
Actual-range index tree indexing method for multi-source heterogeneous data sources; Wu Runze, Cai Yongtao, Chen Wenwei, Chen Wengang, Wang Yirong; Automation of Electric Power Systems (11); full text *

Also Published As

Publication number Publication date
CN113609844A (en) 2021-11-05

Similar Documents

Publication Publication Date Title
CN113609844B (en) Electric power professional word stock construction method based on hybrid model and clustering algorithm
Peng et al. Mathbert: A pre-trained model for mathematical formula understanding
Ruder et al. A hierarchical model of reviews for aspect-based sentiment analysis
CN111950264B (en) Text data enhancement method and knowledge element extraction method
CN112699246A (en) Domain knowledge pushing method based on knowledge graph
CN110377901B (en) Text mining method for distribution line trip filling case
CN110888973B (en) Method for automatically structuring and carding monitoring information table
CN108197175A (en) The treating method and apparatus of technical supervision data, storage medium, processor
CN112926340B (en) Semantic matching model for knowledge point positioning
Wang et al. Short text mining framework with specific design for operation and maintenance of power equipment
CN109101483A (en) A kind of wrong identification method for electric inspection process text
CN105955960B (en) Grounding grid defect text mining method based on semantic frame
Yang et al. Ontology generation for large email collections.
Trappey et al. Intelligent RFQ summarization using natural language processing, text mining, and machine learning techniques
Alsubhi et al. Pre-trained transformer-based approach for arabic question answering: A comparative study
CN114676698A (en) Equipment fault key information extraction method and system based on knowledge graph
Zhu et al. A Text Classification Algorithm for Power Equipment Defects Based on Random Forest
Zhang et al. A machine learning-based approach for building code requirement hierarchy extraction
Ravi et al. Substation transformer failure analysis through text mining
Wan et al. Evaluation model of power operation and maintenance based on text emotion analysis
Wei et al. Short text data model of secondary equipment faults in power grids based on LDA topic model and convolutional neural network
CN114283030A (en) Power distribution scheme recommendation method and device based on knowledge graph
Yamamoto Acquisition of lexical paraphrases from texts
Zhang et al. A construction method of electric power professional domain Corpus based on multi-model collaboration
Klubička et al. English wordnet random walk pseudo-corpora

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant