CN113609844A - Electric power professional word bank construction method based on hybrid model and clustering algorithm - Google Patents


Info

Publication number: CN113609844A (application CN202110874173.6A; granted as CN113609844B)
Authority: CN (China)
Prior art keywords: word, words, electric power, text, model
Legal status: Granted; currently Active
Other languages: Chinese (zh)
Inventors: 陈文刚, 宰洪涛, 刘建国, 张轲, 许泳涛, 何洪英, 罗滇生, 尹希浩, 奚瑞瑶, 符芳育, 方杰
Current and original assignee: Jincheng Power Supply Co of State Grid Shanxi Electric Power Co Ltd
Application filed by: Jincheng Power Supply Co of State Grid Shanxi Electric Power Co Ltd
Priority: CN202110874173.6A

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 — Handling natural language data
    • G06F40/20 — Natural language analysis
    • G06F40/237 — Lexical tools
    • G06F40/242 — Dictionaries
    • G06F16/00 — Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 — Information retrieval of unstructured textual data
    • G06F16/35 — Clustering; Classification
    • G06F40/279 — Recognition of textual entities
    • G06F40/284 — Lexical analysis, e.g. tokenisation or collocates
    • G06Q — INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES
    • G06Q50/00 — ICT specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/06 — Energy or water supply

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Business, Economics & Management (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Artificial Intelligence (AREA)
  • Economics (AREA)
  • Human Resources & Organizations (AREA)
  • Water Supply & Treatment (AREA)
  • Public Health (AREA)
  • Marketing (AREA)
  • Primary Health Care (AREA)
  • Strategic Management (AREA)
  • Tourism & Hospitality (AREA)
  • General Business, Economics & Management (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to the field of artificial intelligence, and in particular to a method for constructing a power professional word bank based on a hybrid model and a clustering algorithm. Power texts and a parallel corpus are preprocessed and then segmented by a word segmentation model: mutual information, a left-right entropy algorithm and the TextRank algorithm recombine the Jieba segmentation results into longer words; the TF-IDF algorithm and a Word2Vec word-clustering algorithm extract text keywords from the Jieba segmentation results; an information-entropy segmentation algorithm segments the text directly; and the results are merged and compared to obtain the feature corpus words. Power professional vocabulary is selected from the feature corpus words as seed words; meanwhile, the derived power-text word bank is used as the candidate vocabulary to segment the power text, and the word2vec algorithm turns the words into word vectors. Clustering then yields similar words, which are filtered by rules to obtain the power professional word bank. With a single clustering model, the invention can filter out most professional words from non-power fields, and the resulting professional vocabulary is complete.

Description

Electric power professional word bank construction method based on hybrid model and clustering algorithm
Technical Field
The invention relates to the field of artificial intelligence, in particular to a method for constructing a power professional word bank based on a hybrid model and a clustering algorithm.
Background
In Chinese, a single character has weak expressive power and diffuse meaning, while a word expresses meaning precisely and can describe an object more accurately; for this reason, the word (including multi-character compounds) is usually the most basic processing unit in natural language processing. For Latin-script languages such as English, words can be extracted simply and accurately because spaces mark the word boundaries. In Chinese, apart from punctuation, characters run together with no obvious boundaries, so extracting words is difficult. Chinese word segmentation methods fall roughly into three categories: dictionary-based, statistical-model-based and rule-based segmentation. Dictionary-based segmentation is a common and efficient approach, but it presupposes that a word bank is available.
At present, no reasonably complete professional word bank has been established for the electric power field. As the demand for semantic understanding of power texts grows, the need to construct a power professional word bank becomes ever more urgent. The power industry has accumulated a large amount of text data, including power science and technology papers, project reports, power regulations, power operation manuals and the like. Using natural language processing on this data to carry out vocabulary discovery for the power field, and further to construct a power-domain dictionary, is of great significance for subsequent text understanding, mining and information management in the power field. However, since text mining is a relatively new branch of artificial intelligence, word discovery and word bank construction are likewise a new frontier in the domestic power field; most research is still at the experimental stage, and practical results have yet to appear.
Unlike most Western languages, written Chinese has no space marks between words: sentences appear as character strings, so the first step in processing Chinese is automatic word segmentation, i.e. converting character strings into word strings. Chinese is complex and changeable, with crossing ambiguity, combination ambiguity, ambiguity unresolvable within a sentence, out-of-vocabulary words and so on, which makes Chinese word segmentation difficult. To perform language processing tasks well, Chinese data mining must begin with word segmentation. Existing common word segmentation methods all rely on a manually built word bank; common words can be collected into it by hand, but such methods cannot cope with the endless stream of new words, especially new Internet words, which are often exactly where segmentation tasks fail. One core task of Chinese word segmentation is therefore to perfect new word discovery algorithms. New word discovery means automatically finding, directly from a large-scale corpus and without any prior material, the segments that can stand as words.
Disclosure of Invention
The technical problem addressed by the invention is to provide a method for constructing a power professional word bank based on a hybrid model and a clustering algorithm. The method overcomes the shortcomings of word segmentation algorithms in existing power-domain word bank construction and can mine new words from power text data.
The scheme of the invention comprises the following steps:
Step one, preprocess the power texts and the parallel corpus, removing spaces, punctuation and words without entity meaning, to obtain qualified input text data;
Step two, segment the power texts and the non-power parallel corpus with the word segmentation model to obtain a power-text word bank and a parallel-corpus word bank, and compare the two to obtain the feature corpus words;
Step three, select power professional vocabulary from the feature corpus words as seed words; meanwhile, use the power-text word bank derived in step two as the candidate vocabulary to segment the power text, then turn the words into word vectors with the word2vec algorithm;
Step four, input the word vectors and seed words into a clustering model, cluster to obtain power-domain words, then filter out non-power professional words by rules, and finally obtain the power professional word bank.
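The comparison of the two word banks in step two can be sketched as a simple set difference. This is a minimal illustration of the idea, not the patent's exact comparison rule (the text does not specify it beyond "comparing"); the example words are hypothetical.

```python
def feature_corpus_words(power_lexicon, parallel_lexicon):
    """Keep words that appear in the power-text word bank but not in the
    parallel-corpus word bank; these are the feature corpus words."""
    return set(power_lexicon) - set(parallel_lexicon)

# Toy example with hypothetical entries: domain terms survive, common words drop out.
power = {"变压器", "断路器", "运行", "的"}
general = {"运行", "的"}
print(feature_corpus_words(power, general))
```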
In step one, the power texts include power science and technology papers, project reports, power regulations, power operation manuals and the like, and the parallel corpus may be a corpus crawled from Wikipedia.
In the word segmentation model, word set 1 is obtained from the Jieba segmentation results via a TF-IDF statistical model, a Word2Vec word-clustering model, a TextRank model, and a left-right information entropy and mutual information model; word set 2 is built from frequency, solidification degree and degree of freedom; finally the two word sets are merged into the final word bank.
Word set 1 is built as follows: word combination is performed through the TextRank model and the left-right information entropy and mutual information model, and the resulting words are merged to obtain word set 1.
Word set 2 is built as follows:
(1) Counting: count the frequency of each character, P_a and P_b, in the corpus, and the co-occurrence frequency P_ab of each pair of adjacent characters;
(2) Cutting: set a frequency threshold min_prob and a mutual-information threshold min_pmi, then cut between adjacent characters wherever P_ab < min_prob or

    PMI(a, b) = log( P_ab / (P_a × P_b) ) < min_pmi;

(3) Cutting: after the cutting in step (2), count the frequency P_w' of each quasi-word obtained and retain only those with P_w' > min_prob;
(4) Redundancy removal: sort the candidate words of step (3) from most characters to fewest, then delete each candidate from the word bank in turn, re-segment it using the remaining words and their frequencies, and compute the mutual information between the original word w and its sub-words w_1, …, w_k:

    MI(w) = P_w / (P_{w_1} × P_{w_2} × … × P_{w_k});

if MI(w) > 1, restore the word; otherwise keep it deleted and update the frequencies of the split sub-words;
(5) Counting: for each word w kept after step (4), count over the corpus the left information entropy

    H_L(w) = − Σ_{i=1}^{n} P(a_i | w) log P(a_i | w)

and the right information entropy

    H_R(w) = − Σ_{j=1}^{m} P(b_j | w) log P(b_j | w),

where n and m are the numbers of distinct characters a_i and b_j adjacent to w on the left and right. The degree of freedom of a text fragment is defined as the smaller of its left and right entropies; set a threshold min_pdof, and if the degree of freedom exceeds it, the fragment is considered able to stand alone as a word.
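The frequency / solidification / freedom pipeline above can be sketched compactly. This is an illustrative simplification restricted to two-character candidates, with an occurrence-count threshold standing in for min_prob; the threshold values are hypothetical, not the patent's.

```python
import math
from collections import Counter

def discover_words(text, min_count=3, min_pmi=1.0, min_ent=0.3):
    """Two-character new word discovery: frequency (min_count),
    solidification via PMI (min_pmi), freedom via boundary entropy (min_ent)."""
    n = len(text)
    uni = Counter(text)                                  # character frequencies
    bi = Counter(text[i:i + 2] for i in range(n - 1))    # adjacent-pair frequencies
    total_uni = sum(uni.values())
    total_bi = sum(bi.values())

    def entropy(counts):
        t = sum(counts.values())
        return -sum(c / t * math.log(c / t) for c in counts.values()) if t else 0.0

    words = []
    for w, c in bi.items():
        if c < min_count:
            continue
        # solidification: PMI(a, b) = log(P_ab / (P_a * P_b))
        pmi = math.log((c / total_bi) /
                       ((uni[w[0]] / total_uni) * (uni[w[1]] / total_uni)))
        if pmi < min_pmi:
            continue
        # freedom: entropy of left / right neighbouring characters
        left = Counter(text[i - 1] for i in range(1, n - 1) if text[i:i + 2] == w)
        right = Counter(text[i + 2] for i in range(n - 2) if text[i:i + 2] == w)
        if min(entropy(left), entropy(right)) >= min_ent:
            words.append(w)
    return words

# "电力" recurs with varied neighbours, so it is both solid and free.
print(discover_words("甲电力乙丙电力丁戊电力己庚电力辛"))  # → ['电力']
```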
The method segments power-domain document text with the hybrid model; the resulting segmentation conforms to Chinese semantics and completes the segmentation task effectively. Compared with any single model, the hybrid model yields a more complete word bank with richer vocabulary. The words extracted by the hybrid model include some from outside the power field; a clustering model is therefore used to cluster the power-domain professional words. The clustering results show that most non-power professional words can be filtered out while the power-domain vocabulary remains complete, with good effect.
Drawings
Fig. 1, a schematic diagram of a text preprocessing process.
Fig. 2 is a schematic diagram of a process of extracting a feature corpus word.
FIG. 3, word segmentation model.
Fig. 4 is a schematic diagram of a process of constructing a word stock in the electric power professional field.
Fig. 5, schematic view of word combination.
Detailed Description
The invention discloses a method for constructing a power professional word bank based on a hybrid model and a clustering algorithm, comprising the following steps:
Step one, preprocess the power texts and the parallel corpus: delete spaces, punctuation, special characters and characters or words without entity meaning from the initial text data, to obtain qualified input text data;
Step two, segment the power texts and the non-power parallel corpus with the word segmentation model to obtain a power-text word bank and a parallel-corpus word bank, and compare the two to obtain the power-domain feature corpus words;
Step three, select power professional vocabulary from the feature corpus words (which still contain non-power vocabulary) as seed words; meanwhile, use the power-text word bank derived in step two as the candidate vocabulary to segment the power text, then turn the words into word vectors with the word2vec algorithm;
Step four, input the word vectors and seed words into a clustering model, cluster to obtain power-domain words, then filter out non-power professional words by rules, and finally obtain the power professional word bank.
In the text preprocessing shown in Fig. 1, the initial power-domain text data and the parallel text data contain many spaces and punctuation marks, special characters such as %, and words without entity meaning such as "and", "the same", etc. To obtain qualified input text, these must be removed. The power-domain texts include power science articles, project reports, power regulations, power operation manuals and the like; the parallel corpus may be material such as Wikipedia or People's Daily, and should be distinct from the power text data. In addition, if the power text data and parallel corpus are large enough, the constructed word bank can be correspondingly large.
In the feature-corpus-word extraction shown in Fig. 2, the power-domain text and the parallel corpus are segmented by the word segmentation model to obtain two word banks, which are compared to obtain the feature corpus words.
In the word segmentation model, word set 1 is obtained from the Jieba segmentation results via a TF-IDF statistical model, a Word2Vec word-clustering model, a TextRank model, and a left-right information entropy and mutual information model; word set 2 is built from frequency, solidification degree and degree of freedom; finally the two word sets are merged into the final word bank.
Word set 1 is built as follows:
(1) Jieba segmentation: Jieba is a good text segmentation tool that segments text accurately, but the words it produces are fine-grained, so most power-domain terms are split apart and the extracted power vocabulary is not rich enough. These fine-grained words are therefore combined to enrich the whole word bank, as shown in Fig. 5.
(2) Combination: because the granularity of the Jieba results is small and most power-domain terms are split apart, the final result is obtained by recombining the words.
a. Keyword extraction with the TF-IDF model
The statistical model is the TF-IDF model, whose score is the product of two statistics; there are several ways to define each.
Term frequency (TF): the frequency of word w in document d, i.e. the ratio of the number of occurrences count(w, d) to the total number of words size(d) in d:

    tf(w, d) = count(w, d) / size(d)

Inverse document frequency (IDF): the inverse document frequency of word w over the whole document set, i.e. the logarithm of the ratio of the total number of documents n to the number of documents df(w, D) containing w:

    idf(w) = log( n / df(w, D) )

Hence: w_i = tf_i × idf_i.
In practice an improved TF-IDF model is used as the evaluation criterion; the improved model increases the penalty on DF (its formula appears only as an image in the original document).
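The plain tf × idf weighting above (not the image-only improved variant) can be sketched as follows; the toy documents are hypothetical.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of tokenized documents.
    Returns {(doc_index, word): weight} with tf = count/size, idf = log(n/df)."""
    n = len(docs)
    df = Counter(w for d in docs for w in set(d))   # document frequency per word
    scores = {}
    for i, d in enumerate(docs):
        counts = Counter(d)
        for w, c in counts.items():
            scores[(i, w)] = (c / len(d)) * math.log(n / df[w])
    return scores

# A word in every document gets idf = log(1) = 0; rarer words score higher.
s = tf_idf([["电力", "变压器"], ["电力", "线路"]])
print(s[(0, "电力")], s[(0, "变压器")])
```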
b. Keyword extraction by Word2Vec word clustering.
1) Word2Vec word vector representation:
The method uses a shallow neural network to learn automatically how words occur in the corpus, embedding the words in a vector space, usually of 100-500 dimensions, where each word is expressed as a word vector. Feature word vectors are extracted from the trained word vector model.
2) K-means clustering algorithm:
the clustering algorithm aims at finding the relationship among data objects in data and grouping the data, so that the similarity in groups is as large as possible and the similarity among the groups is as small as possible.
The algorithm idea is as follows: firstly, randomly selecting K points as initial centroids, wherein K is the expected number of clusters specified by a user, assigning each point to the nearest centroid to form K clusters by calculating the distance from each point to each centroid, then recalculating the centroid of each cluster according to the points assigned to the clusters, and repeating the operation of assigning and updating the centroids until the clusters are not changed or the maximum iteration number is reached.
3) The Word2Vec word-clustering keyword extraction procedure:
The main idea: with words represented as word vectors, cluster the words of an article with K-Means, take the cluster centers as the core of the text's keywords, compute the distance (i.e. similarity) between the other words and the cluster centers, and select the top K words closest to the centers as keywords; the similarity between words can be computed from the vectors produced by Word2Vec. The specific steps:
i. train a Word2Vec model on the corpus to obtain a word vector file;
ii. preprocess the text to obtain N candidate keywords;
iii. traverse the candidate keywords and look up their word vector representations in the word vector file;
iv. run K-Means clustering on the candidate keywords to obtain the cluster center of each category;
v. within each category, compute the distance (Euclidean or Manhattan) between each word and the cluster center, and sort in descending order;
vi. rank the candidate keywords by these scores and take the top K words as the text keywords.
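Steps i-vi can be illustrated in miniature. For brevity this sketch uses a single centroid (the K = 1 special case of K-Means) and hand-made 2-D "word vectors" instead of trained Word2Vec vectors; the vectors and words are hypothetical.

```python
import math

def keywords_by_centroid(word_vecs, top_k=2):
    """Rank words by Euclidean distance to the mean vector (the cluster
    center) and return the top_k closest, as in steps v and vi."""
    dims = len(next(iter(word_vecs.values())))
    n = len(word_vecs)
    center = [sum(v[d] for v in word_vecs.values()) / n for d in range(dims)]
    return sorted(word_vecs, key=lambda w: math.dist(word_vecs[w], center))[:top_k]

# Two in-cluster words and one outlier: the outlier drags the centroid,
# but the close pair still ranks first.
vecs = {"断路器": (1.0, 0.0), "变压器": (0.0, 0.0), "天气": (10.0, 10.0)}
print(keywords_by_centroid(vecs))  # → ['断路器', '变压器']
```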
c. Keyword extraction with the TextRank model
In the TextRank model, words are treated as graph nodes; word relations are built from the co-occurrence of words, and the importance of each word is computed iteratively. The TextRank scoring formula is:

    WS(V_i) = (1 − d) + d × Σ_{V_j ∈ In(V_i)} [ w_ji / Σ_{V_k ∈ Out(V_j)} w_jk ] × WS(V_j)

where d is the damping factor, usually 0.85; In(V_i) is the set of nodes pointing to V_i; Out(V_j) is the set of nodes V_j points to; w_ji is the weight of the edge from V_j to V_i; and WS(V_i), WS(V_j) are the weights of nodes i and j. The algorithm for keyword extraction:
1) Split the given text T into complete sentences, T = [S_1, S_2, …, S_m].
2) For each sentence, perform word segmentation and part-of-speech tagging, filter out stop words, and retain only words of specified parts of speech such as nouns, verbs and adjectives; the retained words are the candidate keywords.
3) Build the candidate keyword graph G = (V, E), where V is the node set formed from the candidates of step 2); add an edge between two nodes whenever the corresponding words co-occur within a window of length K, where K is the window size (i.e. at most K words co-occur).
4) Iteratively propagate each node's weight according to the formula until convergence.
5) Sort the node weights in descending order and take the most important T words as candidate keywords.
6) Mark the words of step 5) in the original text; if adjacent words form a phrase, merge them into a multi-word keyword.
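Steps 1)-5) can be sketched on an unweighted co-occurrence graph (all w_ij = 1), which reduces the formula to a PageRank-style update; the toy graph is hypothetical.

```python
def textrank(neighbors, d=0.85, iters=50):
    """neighbors: dict word -> set of co-occurring words (undirected graph).
    Iterates WS(Vi) = (1 - d) + d * sum over neighbours Vj of WS(Vj)/deg(Vj)."""
    ws = {v: 1.0 for v in neighbors}
    for _ in range(iters):
        ws = {v: (1 - d) + d * sum(ws[u] / len(neighbors[u])
                                   for u in neighbors[v])
              for v in neighbors}
    return ws

# Hypothetical toy graph: one hub word co-occurring with two others.
g = {"电力": {"系统", "负荷"}, "系统": {"电力"}, "负荷": {"电力"}}
scores = textrank(g)
print(max(scores, key=scores.get))  # the hub word ranks highest
```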
d. Multi-word keyword extraction using left-right information entropy and mutual information.
1) Compute mutual information: first find first-order co-occurrences and return word frequencies; then find second-order co-occurrences and return the mutual information and frequencies. The larger the mutual information PMI(a, b), the more strongly words a and b are associated.
2) Compute left and right entropy: find the left-neighbor frequencies, compute the left entropy H_L(x) and return it; then find the right-neighbor frequencies, compute the right entropy H_R(x) and return it.
3) Compute the result: score = PMI + min(H_L(x), H_R(x)); a larger score indicates a larger probability that the characters combine into a word.
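The score of step 3) — mutual information plus the smaller boundary entropy — can be written directly; the probabilities and neighbor counts in the example are hypothetical.

```python
import math
from collections import Counter

def boundary_entropy(neighbor_counts):
    """Shannon entropy of the neighbouring-character distribution."""
    total = sum(neighbor_counts.values())
    return -sum(c / total * math.log(c / total)
                for c in neighbor_counts.values()) if total else 0.0

def combine_score(p_ab, p_a, p_b, left_counts, right_counts):
    """score = PMI(a, b) + min(H_left, H_right)."""
    pmi = math.log(p_ab / (p_a * p_b))
    return pmi + min(boundary_entropy(Counter(left_counts)),
                     boundary_entropy(Counter(right_counts)))

# One right neighbour only -> right entropy 0, so the score is just the PMI.
print(combine_score(0.1, 0.2, 0.2, {"x": 1, "y": 1}, {"z": 2}))
```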
Finally, the words obtained in steps a, b, c and d are merged to obtain word set 1.
In building word set 2, the mutual information is the logarithm of the solidification degree, and the degree of freedom is given by the left and right information entropy. Word set 2 is built as follows:
(1) Counting: count the frequency of each character, P_a and P_b, in the corpus, and the co-occurrence frequency P_ab of each pair of adjacent characters;
(2) Cutting: set a frequency threshold min_prob and a mutual-information threshold min_pmi, then cut between adjacent characters wherever P_ab < min_prob or

    PMI(a, b) = log( P_ab / (P_a × P_b) ) < min_pmi;

(3) Cutting: after the cutting in step (2), count the frequency P_w' of each quasi-word obtained and retain only those with P_w' > min_prob;
(4) Redundancy removal: sort the candidate words of step (3) from most characters to fewest, then delete each candidate from the word bank in turn, re-segment it using the remaining words and their frequencies, and compute the mutual information between the original word w and its sub-words w_1, …, w_k:

    MI(w) = P_w / (P_{w_1} × P_{w_2} × … × P_{w_k});

if MI(w) > 1, restore the word; otherwise keep it deleted and update the frequencies of the split sub-words;
(5) Counting: for each word w kept after step (4), count over the corpus the left information entropy

    H_L(w) = − Σ_{i=1}^{n} P(a_i | w) log P(a_i | w)

and the right information entropy

    H_R(w) = − Σ_{j=1}^{m} P(b_j | w) log P(b_j | w),

where n and m are the numbers of distinct characters adjacent to w on the left and right. The degree of freedom of a text fragment is defined as the smaller of its left and right entropies; set a threshold min_pdof, and if the degree of freedom exceeds it, the fragment is considered able to stand alone as a word. Word set 2 is obtained through the above steps.
Seed words are selected from the feature corpus words obtained in step two. The power-domain text is segmented using the word set obtained as candidates, and the words are then trained with a word vector model (the Word2Vec model) to obtain word vectors. The words are then clustered according to these word vectors: starting from the picked seed words, a batch of similar words is found. The algorithm uses similarity transitivity (somewhat like a connectivity-based clustering algorithm): if A is similar to B and B is similar to C, then A, B and C are grouped together, even if A and C are not directly similar. Of course, such transfer could eventually sweep up the entire vocabulary, so the similarity requirement is progressively tightened. For example, if A is a seed word and B, C are not, A and B may be considered similar at threshold 0.6, while B and C must exceed 0.7 to be considered similar. The similarity threshold is computed as

    sim_i = k + d × (1 − e^(−d × i)),

where i is the number of transfer steps, k is the initial similarity threshold, and d is generally 0.2-0.5.
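The transfer-with-tightening-threshold procedure can be sketched as follows. The similarity function here is a hypothetical lookup table standing in for word2vec cosine similarity, and k = 0.6, d = 0.3 are assumed parameter values.

```python
import math

def sim_threshold(i, k=0.6, d=0.3):
    # sim_i = k + d * (1 - e^(-d * i)): the bar rises with each transfer step
    return k + d * (1 - math.exp(-d * i))

def expand_seeds(seeds, vocab, sim, k=0.6, d=0.3, max_rounds=10):
    """Grow the seed set by similarity transitivity, tightening the
    threshold each round so the expansion cannot sweep the whole vocabulary."""
    found = set(seeds)
    for i in range(1, max_rounds + 1):
        thr = sim_threshold(i, k, d)
        new = {w for w in vocab - found
               if any(sim(w, s) >= thr for s in found)}
        if not new:
            break
        found |= new
    return found

# Hypothetical similarities: B is close to seed A; C is close only to B.
pairs = {frozenset({"A", "B"}): 0.9, frozenset({"B", "C"}): 0.95}
sim = lambda x, y: pairs.get(frozenset({x, y}), 0.0)
print(expand_seeds({"A"}, {"A", "B", "C"}, sim))  # C joins via B in round 2
```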
Since the foregoing process is purely unsupervised, even after semantic clustering some non-power professional words are extracted, and even some "non-words" are retained, so rule-based filtering is needed. Finally, the clustered words are filtered by rules to obtain the final result.
Sample results at each stage of the method are shown below:
(1) Partial word results from frequency, solidification-degree and degree-of-freedom segmentation (the word list survives only as garbled machine translation in this text).
(2) Extracting partial results of the keywords based on a Jieba word segmentation statistical model: coordination, safety measures, oil engine, appropriateness, trip, person on duty, power line, equivalent circuit, flammability, explosiveness, caution, spirality, transformer factory, national grid, detection technology, model, oil stain, gear, magnetic flux, insulation paper, plastic cloth, firm, switch cabinet, dc power supply, electric power company, fire-fighting measures, etc.
(3) Partial keyword extraction results from the Word2Vec clustering model based on Jieba segmentation: specification, leakage, data, teaching, primary parameters, ammeter, grounding device, high energy, control measures, filter, computer, carbon monoxide, inspection, operation, equipment, circuit breaker, gas, bus, process, personnel, body, operation, trip, direct current, responsible person, interior, iron core, etc.
(4) Extracting a key phrase part result based on a TextRank model of the Jieba participle: power-off protection personnel, in-hole, national grid company, blocking mechanism box indication, meter chromatographic analysis data, load operation, transformer overhaul, outlet short circuit impact, grounding disconnecting link, discharge burning trace, action transformer, manufacturing quality problem, linear algorithm, electric energy loss, sleeve external insulation, contact iron core, dendritic discharge trace and the like.
(5) Extracting a key phrase part result based on a mutual information entropy and left and right information entropy model of the Jieba participle: the transformer comprises a tap changer, a main transformer, a low-voltage winding, a winding deformation, a low-voltage circuit breaker, a hanging cover inspection, an oil replacement pump, a transposed conductor, a circuit disconnection, a gas production rate, a dip transformer, a bus voltage, a neutral point sleeve, power equipment preventability, a vacuum oil filter, a dry reactor, a cylindrical winding and the like.
(6) Final partial results from the clustering model: main magnetic flux, sensor, oil conservator, charger, cooler, manufacturer, double bus, transformer, respirator, main transformer, # main transformer, circuit breaker, low-voltage circuit breaker, circuit breaker trip, low-voltage impedance, high-voltage winding, winding deformation, three-phase winding, arc discharge, oil replacement pump, on-load tap changer, bus voltage, transformer fault diagnosis, bus protection device, secondary main switch, insulation pad, fuse cutout, capacitor core, winding capacitance, series reactor, switch core, relay baffle, power cord power tool, AC charger, magnetic current, interference pulse, parallel conductor, characteristic gas, electrical equipment, power switch, DC power, air switch, tap changer, vacuum oil filter, oil-paper capacitor sleeve, insulating oil chromatographic analysis, frequency response analysis, light gas protection action, heavy gas protection action, infrared thermal imaging detection, position alarm lamp, pressure pulse analysis, non-excitation tap switch, flammable and explosive articles, infrared thermal imaging, etc.
The TF-IDF statistical model and the Word2Vec word clustering model extract keywords from the Jieba segmentation result; their output still contains non-power words, and the word granularity is fine. The mutual information and left-right information entropy model and the TextRank model extract key phrases from the Jieba segmentation result, remedying Jieba's fine granularity and its incorrect splitting of domain words. The word segmentation model based on information entropy yields accurate segmentation with good results. The clustering algorithm pools the segmentation results of these models and then clusters the power-domain words; the clustering result screens out non-power words, and the clustering effect is evident. The models complement one another, and the word bank finally built is complete.
The invention provides a word segmentation method based on information entropy. The power text is segmented according to the minimum information entropy principle, yielding more accurate segmentation. The power text is first processed with frequency and solidity (internal cohesion) to screen out candidate words; the candidates are then further screened by degree of freedom, preliminarily establishing a word bank and improving segmentation accuracy.
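As an illustration of this screening, a minimal Python sketch of the solidity (internal cohesion) and degree-of-freedom statistics follows. The toy corpus, function names and use of raw substring counts are illustrative assumptions, not part of the claimed method:

```python
import math
from collections import Counter

def cohesion(text, word):
    """Solidity of `word`: the minimum pointwise mutual information
    over all binary splits, with probabilities estimated from the
    toy corpus `text`."""
    n = len(text)
    def p(s):
        return max(text.count(s), 1) / n
    return min(
        math.log(p(word) / (p(word[:i]) * p(word[i:])))
        for i in range(1, len(word))
    )

def freedom(text, word):
    """Degree of freedom: the smaller of the left and right
    neighbour-character entropies of `word` in `text`."""
    def entropy(neigh):
        total = sum(neigh.values())
        return -sum(c / total * math.log(c / total) for c in neigh.values())
    left, right = Counter(), Counter()
    start = text.find(word)
    while start != -1:
        if start > 0:                       # character to the left
            left[text[start - 1]] += 1
        end = start + len(word)
        if end < len(text):                 # character to the right
            right[text[end]] += 1
        start = text.find(word, start + 1)
    if not left or not right:
        return 0.0
    return min(entropy(left), entropy(right))
```

A word that occurs often, holds together internally (high cohesion) and appears in varied contexts (high freedom) is kept as a candidate word.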
The invention provides a word recombination method. Because the Jieba segmentation result has fine word granularity, some power-domain terms are split apart, and non-professional words remain, keyword extraction and word combination are applied to the segmentation result using a statistical model, a Word2Vec clustering model, a left-right information entropy and mutual information entropy model, and a TextRank model, enriching and perfecting the word bank.
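The recombination step can be illustrated with a minimal Python sketch that merges adjacent tokens by pointwise mutual information. The function name and thresholds are assumptions for illustration; the patent's full method additionally gates merging on the left-right information entropy of the token pair:

```python
import math
from collections import Counter

def merge_candidates(tokens, min_pmi=1.0, min_count=2):
    """Merge adjacent tokens into phrase candidates when their
    pointwise mutual information reaches `min_pmi` and the pair
    occurs at least `min_count` times."""
    n = len(tokens)
    uni = Counter(tokens)                    # unigram counts
    bi = Counter(zip(tokens, tokens[1:]))    # adjacent-pair counts
    out = []
    for (a, b), c in bi.items():
        if c < min_count:
            continue
        pmi = math.log((c / (n - 1)) / ((uni[a] / n) * (uni[b] / n)))
        if pmi >= min_pmi:
            out.append(a + b)                # recombine the split term
    return out
```

For example, if Jieba splits "有载分接" (on-load tap) into "有载" and "分接", the pair's high PMI across the corpus causes the two tokens to be rejoined.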
The invention provides a word clustering method. The clustering rule finds groups of similar words around a set of hand-picked seed words; this clusters the words belonging to the electric power field out of the word bank, enriching and perfecting the power professional word bank.
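A minimal sketch of seed-word clustering by cosine similarity follows, with toy two-dimensional vectors standing in for word2vec embeddings. The function name, vectors and similarity threshold are illustrative assumptions:

```python
import math

def seed_cluster(vectors, seeds, threshold=0.8):
    """Group candidate words around seed words by cosine similarity;
    words whose best-seed similarity falls below `threshold` are
    filtered out as non-power-domain words."""
    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        nu = math.sqrt(sum(a * a for a in u))
        nv = math.sqrt(sum(b * b for b in v))
        return dot / (nu * nv)
    kept = {}
    for word, vec in vectors.items():
        if word in seeds:
            continue
        best = max(seeds, key=lambda s: cos(vec, vectors[s]))
        if cos(vec, vectors[best]) >= threshold:
            kept.setdefault(best, []).append(word)
    return kept
```

With seed "变压器" (transformer), a near vector such as "绕组" (winding) is clustered in, while an unrelated word such as "苹果" (apple) is filtered out.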
Compared with the existing word stock construction method, the invention has the advantages that:
1. To address Jieba's fine segmentation granularity and incorrect splitting of domain words, the mutual information and left-right information entropy word combination algorithm and the TextRank algorithm recombine the Jieba segmentation result, discovering more power-domain words and solving these problems. A combined mutual information and left-right entropy judgment is designed, making the decision to merge words stricter and improving the accuracy of word combination.
2. The TF-IDF algorithm and the Word2Vec word clustering algorithm weight keywords in the Jieba segmentation result and can extract the important words in the text, i.e. part of the power-domain words. An improved TF-IDF algorithm is designed that increases the penalty applied during keyword extraction, making keyword extraction more accurate.
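A hedged sketch of the TF-IDF weighting follows; the `penalty` exponent is a hypothetical stand-in for the patent's increased punishment of common words, since the exact modification is not spelled out in this excerpt:

```python
import math
from collections import Counter

def tfidf(docs, penalty=1.0):
    """TF-IDF weighting over tokenised documents.  Setting
    `penalty` > 1 raises the IDF factor to a higher power,
    punishing corpus-wide common words more heavily."""
    n = len(docs)
    df = Counter(w for doc in docs for w in set(doc))  # document frequency
    scores = []
    for doc in docs:
        tf = Counter(doc)
        total = len(doc)
        scores.append({
            w: (c / total) * (math.log((n + 1) / (df[w] + 1)) + 1) ** penalty
            for w, c in tf.items()
        })
    return scores
```

Words that appear in every document (low IDF) lose weight fastest as the penalty grows, so document-specific power-domain terms rise to the top of the keyword ranking.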
3. The information entropy word segmentation algorithm sets three thresholds (word frequency, mutual information, and left-right information entropy), making word-formation judgment stricter and improving segmentation accuracy; it pools and supplements the segmentation results of the other models, establishing a more complete candidate word bank.
4. The word clustering algorithm clusters domain words: it clusters the power-domain words in the segmentation result and filters out non-power-domain words, reducing manual workload and making word bank construction simpler and the resulting word bank more complete.

Claims (5)

1. A method for constructing a power professional word bank based on a hybrid model and a clustering algorithm is characterized by comprising the following steps:
step one, preprocessing an electric power text and a non-power professional parallel corpus, removing spaces, punctuation and words without entity meaning to obtain qualified input text data;
step two, segmenting the electric power text and the parallel corpus with a word segmentation model to obtain an electric power text word bank and a parallel corpus word bank, and comparing the two word banks to obtain characteristic corpus words;
step three, selecting electric power professional vocabulary from the characteristic corpus words as seed words; meanwhile, segmenting the electric power text with the electric power text word bank derived in step two, and converting the words into word vectors with the word2vec algorithm;
and step four, inputting the word vectors and the seed words into a clustering model, clustering to obtain similar words, filtering out non-power professional words by rule, and finally obtaining the electric power professional word bank.
2. The construction method according to claim 1, wherein in step one the electric power text includes electric power science papers, project reports, electric power regulations or electric power operation manuals, and the parallel corpus is a crawled Wikipedia corpus.
3. The construction method according to claim 1, wherein in step two the word segmentation model obtains word set 1 through the TF-IDF statistical model, the Word2Vec word clustering model, the TextRank model and the left-right information entropy and mutual information entropy model applied to the Jieba segmentation result, establishes word set 2 through frequency, solidity and degree of freedom, and finally merges the two word sets to obtain the final word bank.
4. The method of claim 3, wherein word set 1 is established as follows: word combination is performed through the TextRank model and the left-right information entropy and mutual information entropy model, and the resulting words are merged to obtain word set 1.
5. The method of claim 3, wherein the word set 2 is established as follows:
(1) counting: counting from the corpus the frequency P_a, P_b of each character and the co-occurrence frequency P_ab of each pair of adjacent characters;
(2) Cutting: respectively setting a threshold value min _ prob of the occurrence frequency and a threshold value min _ pmi of mutual information, and then setting P in the corpusab<min _ prob or
Figure FDA0003190061430000011
Cutting the adjacent characters;
(3) screening: after the cutting in step (2), counting the frequency P_w' of each candidate word obtained in step (2) and retaining only those with P_w' > min_prob;
(4) redundancy removal: arranging the candidate words obtained in step (3) from most characters to fewest, then deleting each candidate word from the word bank in turn, re-segmenting it into sub-words using the remaining words and their frequencies, and computing the mutual information between the original word and its sub-words according to
log(P_w / (P_w1 · P_w2 · ... · P_wk));
if the mutual information is greater than 1, the word is restored; otherwise it stays deleted and the frequencies of its sub-words are updated;
(5) counting adjacent characters: for the words retained after the redundancy removal in step (4), counting over the corpus the left information entropy of each word
E_L(w) = -Σ_{i=1..n} P(a_i | w) log P(a_i | w)
and the right information entropy
E_R(w) = -Σ_{j=1..m} P(b_j | w) log P(b_j | w),
where n and m are respectively the numbers of distinct characters a_i adjacent to the word on the left and characters b_j adjacent on the right; the degree of freedom of a text fragment is defined as the smaller of its left and right adjacent-character information entropies; a degree-of-freedom threshold min_pdof is set, and a fragment whose degree of freedom exceeds the threshold is considered an independent word.
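Steps (1) and (2) of this procedure can be sketched in Python as follows; the thresholds and toy input are illustrative, and the full method continues with the screening, redundancy-removal and entropy steps described above:

```python
import math
from collections import Counter

def cut_corpus(text, min_prob=0.001, min_pmi=0.5):
    """Steps (1)-(2) of claim 5: estimate character and adjacent-pair
    frequencies, then cut the text between every adjacent pair whose
    co-occurrence frequency or mutual information falls below its
    threshold, yielding candidate word fragments."""
    n = len(text)
    p1 = Counter(text)                 # character frequencies
    p2 = Counter(zip(text, text[1:]))  # adjacent-pair frequencies
    frags, start = [], 0
    for i in range(n - 1):
        pab = p2[(text[i], text[i + 1])] / (n - 1)
        pa = p1[text[i]] / n
        pb = p1[text[i + 1]] / n
        if pab < min_prob or math.log(pab / (pa * pb)) < min_pmi:
            frags.append(text[start:i + 1])   # cut between i and i+1
            start = i + 1
    frags.append(text[start:])
    return [f for f in frags if f]
```

Pairs that recur together (high co-occurrence and PMI) stay joined as candidate words, while incidental neighbours are cut apart.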
CN202110874173.6A 2021-07-30 2021-07-30 Electric power professional word stock construction method based on hybrid model and clustering algorithm Active CN113609844B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110874173.6A CN113609844B (en) 2021-07-30 2021-07-30 Electric power professional word stock construction method based on hybrid model and clustering algorithm


Publications (2)

Publication Number Publication Date
CN113609844A true CN113609844A (en) 2021-11-05
CN113609844B CN113609844B (en) 2024-03-08

Family

ID=78338878

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110874173.6A Active CN113609844B (en) 2021-07-30 2021-07-30 Electric power professional word stock construction method based on hybrid model and clustering algorithm

Country Status (1)

Country Link
CN (1) CN113609844B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114154484A (en) * 2021-11-12 2022-03-08 中国长江三峡集团有限公司 Construction professional term library intelligent construction method based on mixed depth semantic mining
CN114168731A (en) * 2021-11-29 2022-03-11 北京智美互联科技有限公司 Internet media flow safety protection method and system
CN117952089A (en) * 2024-03-26 2024-04-30 广州源高网络科技有限公司 Intelligent data processing method and system for real-world clinical research
CN117953875A (en) * 2024-03-27 2024-04-30 成都启英泰伦科技有限公司 Offline voice command word storage method based on semantic understanding

Citations (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2000035963A (en) * 1998-07-17 2000-02-02 Nec Corp Automatic sentence classification device and method
US20060206313A1 (en) * 2005-01-31 2006-09-14 Nec (China) Co., Ltd. Dictionary learning method and device using the same, input method and user terminal device using the same
CN104199972A (en) * 2013-09-22 2014-12-10 中科嘉速(北京)并行软件有限公司 Named entity relation extraction and construction method based on deep learning
CN105243129A (en) * 2015-09-30 2016-01-13 清华大学深圳研究生院 Commodity property characteristic word clustering method
CN106095736A (en) * 2016-06-07 2016-11-09 华东师范大学 A kind of method of field neologisms extraction
CN106649666A (en) * 2016-11-30 2017-05-10 浪潮电子信息产业股份有限公司 Left-right recursion-based new word discovery method
WO2017185674A1 (en) * 2016-04-29 2017-11-02 乐视控股(北京)有限公司 Method and apparatus for discovering new word
CN110334345A (en) * 2019-06-17 2019-10-15 首都师范大学 New word discovery method
CN110457708A (en) * 2019-08-16 2019-11-15 腾讯科技(深圳)有限公司 Vocabulary mining method, apparatus, server and storage medium based on artificial intelligence
CN110990567A (en) * 2019-11-25 2020-04-10 国家电网有限公司 Electric power audit text classification method for enhancing domain features
WO2020073523A1 (en) * 2018-10-12 2020-04-16 平安科技(深圳)有限公司 New word recognition method and apparatus, computer device, and computer readable storage medium
CN111931491A (en) * 2020-08-14 2020-11-13 工银科技有限公司 Domain dictionary construction method and device
CN112732934A (en) * 2021-01-11 2021-04-30 国网山东省电力公司电力科学研究院 Power grid equipment word segmentation dictionary and fault case library construction method
CN113033183A (en) * 2021-03-03 2021-06-25 西北大学 Network new word discovery method and system based on statistics and similarity
CN113157903A (en) * 2020-12-28 2021-07-23 国网浙江省电力有限公司信息通信分公司 Multi-field-oriented electric power word stock construction method


Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
LIU Weitong; LIU Peiyu; LIU Wenfeng; LI Nana: "New word discovery algorithm based on mutual information and adjacency entropy", Application Research of Computers, no. 05, 14 March 2018 (2018-03-14) *
WU Runze; CAI Yongtao; CHEN Wenwei; CHEN Wengang; WANG Yirong: "Actual-range index tree indexing method for multi-source heterogeneous data sources", Automation of Electric Power Systems, no. 11 *
SUN Xiao; SUN Chongyuan; REN Fuji: "New word discovery and sentiment orientation determination based on deep structure models", Computer Science, no. 09 *
WANG Yirong; WANG Ruijie; CHEN Wengang; WU Runze: "Genetic-optimization-based missing data imputation algorithm for dispatching and control systems", Power System Protection and Control, no. 21 *


Also Published As

Publication number Publication date
CN113609844B (en) 2024-03-08

Similar Documents

Publication Publication Date Title
CN110717047B (en) Web service classification method based on graph convolution neural network
Li et al. A co-attention neural network model for emotion cause analysis with emotional context awareness
CN111950264B (en) Text data enhancement method and knowledge element extraction method
CN108763204A (en) A kind of multi-level text emotion feature extracting method and model
CN109376352B (en) Patent text modeling method based on word2vec and semantic similarity
CN106776562A (en) A kind of keyword extracting method and extraction system
CN106598940A (en) Text similarity solution algorithm based on global optimization of keyword quality
CN113609844B (en) Electric power professional word stock construction method based on hybrid model and clustering algorithm
Huang et al. New word detection for sentiment analysis
CN109815400A Personage's interest extracting method based on long text
CN108197175B (en) Processing method and device of technical supervision data, storage medium and processor
Rahimi et al. An overview on extractive text summarization
CN105930509A (en) Method and system for automatic extraction and refinement of domain concept based on statistics and template matching
Nabil et al. Labr: A large scale arabic sentiment analysis benchmark
CN110399606A (en) A kind of unsupervised electric power document subject matter generation method and system
CN104317783B (en) The computational methods that a kind of semantic relation is spent closely
CN106598941A (en) Algorithm for globally optimizing quality of text keywords
Keikha et al. Rich document representation and classification: An analysis
Hassan et al. Automatic document topic identification using wikipedia hierarchical ontology
Zhang et al. A machine learning-based approach for building code requirement hierarchy extraction
Zhang et al. Domain-specific term extraction from free texts
CN112926340B (en) Semantic matching model for knowledge point positioning
Meng et al. Research on Short Text Similarity Calculation Method for Power Intelligent Question Answering
Tohalino et al. Using virtual edges to extract keywords from texts modeled as complex networks
Feng et al. Research on Barrage Sentiment Ontology Construction Based on SO-PMI Algorithm

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant