CN113609844B - Electric power professional word stock construction method based on hybrid model and clustering algorithm - Google Patents

Info

Publication number
CN113609844B
CN113609844B (application CN202110874173.6A)
Authority
CN
China
Prior art keywords
word
words
model
text
electric
Prior art date
Legal status
Active
Application number
CN202110874173.6A
Other languages
Chinese (zh)
Other versions
CN113609844A (en)
Inventor
陈文刚
宰洪涛
刘建国
张轲
许泳涛
何洪英
罗滇生
尹希浩
奚瑞瑶
符芳育
方杰
Current Assignee
Jincheng Power Supply Co of State Grid Shanxi Electric Power Co Ltd
Original Assignee
Jincheng Power Supply Co of State Grid Shanxi Electric Power Co Ltd
Priority date
Filing date
Publication date
Application filed by Jincheng Power Supply Co of State Grid Shanxi Electric Power Co Ltd
Priority claimed from CN202110874173.6A
Publication of CN113609844A
Application granted
Publication of CN113609844B
Legal status: Active
Anticipated expiration

Classifications

    • G06F40/242 Dictionaries (G06F40/20 Natural language analysis; G06F40/237 Lexical tools)
    • G06F16/35 Clustering; Classification (G06F16/30 Information retrieval of unstructured textual data)
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates (G06F40/279 Recognition of textual entities)
    • G06Q50/06 Energy or water supply (G06Q50/00 ICT specially adapted for specific business sectors)

Abstract

The invention relates to the field of artificial intelligence, and in particular to a method for constructing an electric power professional word stock based on a hybrid model and a clustering algorithm. An electric power text and a parallel corpus are preprocessed and then segmented by a word segmentation model: mutual information, a left-right entropy algorithm and the TextRank algorithm combine words from the Jieba segmentation result; the TF-IDF algorithm and a Word2Vec word clustering algorithm extract text keywords from the segmentation result; an information-entropy word segmentation algorithm segments the text directly; and the results are summarized and compared to obtain feature corpus words. Electric power professional vocabulary is selected from the feature corpus words as seed words. Meanwhile, the derived electric power text word stock is used as candidate words to segment the electric power text, and the word2vec algorithm turns the words into word vectors. Clustering yields similar words, which are then filtered to obtain the electric power professional word stock. With one clustering model, most professional words outside the power field can be filtered out, and the resulting professional vocabulary is complete.

Description

Electric power professional word stock construction method based on hybrid model and clustering algorithm
Technical Field
The invention relates to the field of artificial intelligence, in particular to a method for constructing an electric power professional word stock based on a hybrid model and a clustering algorithm.
Background
In Chinese, a single character has weak ideographic ability and dispersed meaning, whereas a word has stronger ideographic ability and can describe a thing more accurately; therefore, in natural language processing, words (including single-character words) are generally the most basic processing units. For Latin-alphabet languages such as English, words can be extracted simply and accurately because the spaces between words serve as word-boundary markers. Written Chinese, by contrast, connects characters closely and, apart from punctuation marks, has no obvious boundaries, so words are difficult to extract. Chinese word segmentation methods fall roughly into three types: dictionary-based, statistical-model-based and rule-based segmentation. Dictionary-based segmentation is a relatively common and efficient mode, but its premise is that a word stock is available.
The power professional field has not yet established a complete power professional word stock. With increasing demands for semantic understanding of power texts, the need to construct a word stock for the power professional field is more and more urgent. The field has accumulated a large amount of text data, including power science and technology papers, project reports, power regulations, power operation manuals and the like. Developing vocabulary-discovery research on this data with natural language processing technology, and thereby constructing a dictionary for the power professional field, is of great significance for text understanding, mining and information management in the power field. However, since text mining is a new artificial-intelligence technology that has appeared only in recent years, word discovery and word stock construction are likewise an emerging frontier in the domestic power professional field; most research is still at the experimental stage, and the application effect has not yet been demonstrated.
Chinese differs from most Western languages: written Chinese has no obvious space marks between words and sentences appear as character strings, so the first step in Chinese processing is automatic word segmentation, i.e. converting character strings into word strings. The language is complex and changeable; Chinese has crossing ambiguity, combination ambiguity, ambiguities that cannot be resolved within a sentence, and unregistered words, all of which make segmentation difficult. To complete language processing tasks well, word segmentation must be performed first when mining text data. Existing common segmentation methods are based on a manually built word stock: common words can be collected into it by hand, but this cannot cope with the endless stream of new words, especially new Internet words, which are often the crux of the segmentation task. Therefore, one core task of Chinese word segmentation is to perfect new-word discovery algorithms. New word discovery means automatically finding language fragments that can form words directly from a large-scale corpus, without adding any prior materials.
Disclosure of Invention
The invention aims to provide a method for constructing an electric power professional word stock based on a hybrid model and a clustering algorithm. The method can overcome the shortcomings of word segmentation algorithms in existing word stock construction technology for the electric power professional field, and can mine new words from electric power text data.
The scheme of the invention comprises the following steps:
step one, preprocessing the electric power text and the parallel corpus, removing spaces, punctuation marks and words without entity meaning, to obtain qualified input text data;
step two, segmenting the electric power text and the non-power parallel corpus with a word segmentation model to obtain an electric power text word stock and a parallel corpus word stock, and comparing the two to obtain feature corpus words;
step three, selecting electric power professional vocabulary from the feature corpus words as seed words; meanwhile, using the electric power text word stock derived in step two as candidate words to segment the electric power text, and then turning the words into word vectors with the word2vec algorithm;
step four, inputting the word vectors and seed words into a clustering model, clustering to obtain words of the electric power professional field, filtering out non-power professional words according to rules, and finally obtaining the electric power professional word stock.
In step one, the electric power text includes electric power science and technology papers, project reports, electric power regulations, electric power operation manuals and the like, and the parallel corpus can be a crawled Wikipedia corpus.
In the word segmentation model, word set 1 is obtained from the Jieba segmentation through the TF-IDF statistical model, the Word2Vec word clustering model, the TextRank model, and the left-right information entropy and mutual information entropy models; word set 2 is built from frequency, solidification degree (cohesion) and degree of freedom; finally, the two word sets are combined to obtain the final word stock.
Word set 1 is established as follows: keywords are extracted from the Jieba segmentation result by the TF-IDF model and the Word2Vec word clustering model, words are combined by the TextRank model and the left-right information entropy and mutual information entropy models, and the results are merged to obtain word set 1.
Word set 2 is established as follows:
(1) Statistics: count the frequency P_a of each character from the corpus, and count the co-occurrence frequency P_ab of every pair of adjacent characters a, b;
(2) Cutting: set a frequency threshold min_prob and a mutual-information threshold min_pmi, then cut between adjacent characters in the corpus wherever P_ab < min_prob or PMI(a, b) = log(P_ab / (P_a × P_b)) < min_pmi;
(3) Filtering: after the cutting in step (2), count the frequency P_w' of each candidate word and keep only the candidates with P_w' > min_prob;
(4) Redundancy removal: sort the candidate words obtained in step (3) by length in descending order; delete each candidate from the word stock in turn, re-segment it with the remaining words and word frequencies, and compute the mutual information between the original word and its sub-words; if the mutual information is greater than 1, restore the word, otherwise keep it deleted and update the frequencies of the segmented sub-words;
(5) Statistics: for each word w retained after step (4), compute the left information entropy H_l(w) = -Σ_{i=1}^{n} P(a_i w | w) log P(a_i w | w) and the right information entropy H_r(w) = -Σ_{i=1}^{m} P(w b_i | w) log P(w b_i | w), where n and m are the numbers of distinct left-adjacent and right-adjacent characters of w. The degree of freedom of a text fragment is defined as the smaller of the two entropies; a threshold min_pdof of the degree of freedom is set, and a fragment whose degree of freedom exceeds the threshold is considered an independent word.
According to the invention, the text of power-field literature is segmented based on the hybrid model, and the segmented fragments accord with Chinese semantics, so the segmentation task can be completed effectively. Using the hybrid model rather than a single model makes the word stock more complete and the vocabulary richer. The words extracted by the hybrid model contain some non-power-field words; the clustering model is then used to cluster the power-field professional words, and the clustering result shows that most non-power professional words can be filtered out, so the clustered power professional vocabulary is complete and the effect is good.
Drawings
FIG. 1 is a schematic diagram of a text preprocessing process.
FIG. 2 is a schematic diagram of a process of extracting feature corpus words.
FIG. 3, word segmentation model.
Fig. 4 is a schematic diagram of a word stock construction process in the power professional field.
FIG. 5, word combination schematic.
Detailed Description
The invention discloses a method for constructing an electric power professional word stock based on a hybrid model and a clustering algorithm, comprising the following steps:
step one, preprocessing the electric power text and the parallel corpus, including deleting spaces, punctuation marks, special characters and characters or words without entity meaning from the initial text data, to obtain qualified input text data;
step two, segmenting the electric power text and the non-power parallel corpus with a word segmentation model to obtain an electric power text word stock and a parallel corpus word stock, and comparing the two to obtain feature corpus words of the electric power field;
step three, since the feature corpus words still contain non-power professional vocabulary, selecting electric power professional vocabulary from them as seed words; meanwhile, using the electric power text word stock derived in step two as candidate words to segment the electric power text, and then turning the words into word vectors with the word2vec algorithm;
step four, inputting the word vectors and seed words into a clustering model, clustering to obtain words of the electric power professional field, filtering out non-power professional words according to rules, and finally obtaining the electric power professional word stock.
In the text data preprocessing shown in fig. 1, the initial power-domain text data and the parallel corpus text data contain a large number of spaces, punctuation marks, special characters such as %, and words without entity meaning such as "and", "it", etc. To obtain acceptable input text, the text must be processed accordingly. The text of the electric power professional field includes electric power science and technology papers, project reports, electric power regulations, electric power operation manuals and the like; the parallel corpus can be the Wikipedia corpus, the People's Daily corpus and the like, and should be distinct from the electric power text data. In addition, if the electric power text data and the parallel corpus are large enough, the constructed word stock can also be large enough.
To extract the feature corpus words shown in fig. 2, the text of the electric power professional field and the parallel corpus are segmented by the word segmentation model to obtain two word stocks, which are then compared to obtain the feature corpus words.
In the word segmentation model, word set 1 is obtained from the Jieba segmentation through the TF-IDF statistical model, the Word2Vec word clustering model, the TextRank model, and the left-right information entropy and mutual information entropy models; word set 2 is built from frequency, solidification degree and degree of freedom; finally, the two word sets are combined to obtain the final word stock.
The word set 1 is established as follows:
(1) Jieba segmentation: Jieba is a good text segmentation tool and can segment text fairly accurately, but the resulting word granularity is small, so most words of the electric power professional field are split apart and the extracted power-domain vocabulary is not rich enough. These small-granularity words therefore need to be combined to enrich the whole word stock, as shown in fig. 5.
(2) Combination: because the granularity of the Jieba segmentation result is small and most electric power professional words are split apart, the final result is obtained through word combination.
a. Extracting keywords with the TF-IDF model
The statistical model selected is the TF-IDF model, where TF-IDF is the product of two statistics. The specific values of the statistics can be determined in a number of ways.
Term frequency (TF): the frequency of word w in document d, i.e. the ratio of the number of occurrences count(w, d) of word w to the total number of words size(d) in document d: tf = count(w, d) / size(d)
Inverse document frequency (IDF): the inverse document frequency of word w over the whole document collection D, i.e. the logarithm of the ratio of the total number of documents n to the number of documents df(w, D) in which w appears: idf = log(n / df(w, D))
Therefore: w_i = tf_i × idf_i
However, the invention uses an improved TF-IDF model as the evaluation standard; the improved model increases the DF penalty.
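The standard TF and IDF formulas above can be sketched in a few lines of Python. This is an illustrative implementation of the plain formulas only; the patent's improved DF penalty is not specified, so it is omitted here, and the function name `tf_idf` is hypothetical.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Compute TF-IDF weights for a list of tokenised documents.

    tf  = count(w, d) / size(d)
    idf = log(n / df(w, D))
    """
    n = len(docs)
    df = Counter()                      # document frequency of each word
    for doc in docs:
        df.update(set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        size = len(doc)
        weights.append({w: (c / size) * math.log(n / df[w])
                        for w, c in counts.items()})
    return weights
```

Note that a word appearing in every document gets idf = log(1) = 0, which is exactly why common corpus-wide words are suppressed.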
b. Extracting keywords by Word2Vec word clustering
1) Word2Vec Word vector representation:
the occurrence of words in a corpus is automatically learned by using a shallow neural network model, and the words are embedded into a high-dimensional space, usually in 100-500 dimensions, in which the words are expressed in terms of word vectors. The extraction of feature word vectors is based on a word vector model that has been trained.
2) K-means clustering algorithm:
the clustering algorithm aims to find the relationship between data objects in the data, and group the data so that the similarity in the groups is as large as possible and the similarity between the groups is as small as possible.
The algorithm idea is as follows: firstly, randomly selecting K points as initial centroids, wherein K is the number of expected clusters designated by a user, assigning each point to the nearest centroid to form K clusters by calculating the distance between each point and each centroid, then recalculating the centroid of each cluster according to the point assigned to the cluster, and repeating the operations of assigning and updating the centroids until the clusters are unchanged or the maximum iteration number is reached.
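The K-means procedure just described can be sketched directly with NumPy. This is a generic textbook implementation for illustration, not code from the patent; the fixed random seed and default iteration cap are illustrative choices.

```python
import numpy as np

def kmeans(points, k, max_iter=100, seed=0):
    """Plain K-means: random initial centroids, assign-then-update loop,
    stop when the assignment no longer changes or max_iter is reached."""
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    labels = None
    for _ in range(max_iter):
        # distance of every point to every centroid
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # assignment unchanged: converged
        labels = new_labels
        for j in range(k):
            members = points[labels == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
    return labels, centroids
```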
3) The implementation process of the Word2Vec Word clustering keyword extraction method comprises the following steps:
the main idea is that for words represented by Word vectors, the words in the articles are clustered through a K-Means algorithm, a clustering center is selected as a main keyword of a text, distances between other words and the clustering center, namely similarity, are calculated, the K words closest to the clustering center are selected as keywords, and the similarity among the words can be calculated by using vectors generated by Word2 Vec. The method comprises the following specific steps:
i. Train a Word2Vec model on the corpus to obtain a word vector file;
ii. Preprocess the text to obtain N candidate keywords;
iii. Traverse the candidate keywords and extract their word vector representations from the word vector file;
iv. Apply K-Means clustering to the candidate keywords to obtain the cluster centre of each category;
v. Under each category, compute the distance (Euclidean or Manhattan) between the words in the group and the cluster centre, and sort in ascending order of distance;
vi. Rank the candidate keywords and take the top K words as text keywords.
c. Extracting keywords with the TextRank model
The TextRank model regards words as "nodes" and constructs word relationships, computing the importance of each word from the co-occurrence relationships between words. The TextRank keyword-extraction algorithm is as follows:
TextRank model:
WS(V_i) = (1 - d) + d × Σ_{V_j ∈ In(V_i)} [ w_ji / Σ_{V_k ∈ Out(V_j)} w_jk ] × WS(V_j)
where d is a damping factor, typically 0.85; In(V_i) is the set of nodes pointing to V_i; Out(V_j) is the set of nodes V_j points to; w_ji is the weight of edge V_j → V_i; WS(V_i) and WS(V_j) are the weights of nodes V_i and V_j.
1) Segment the given text T into complete sentences, i.e. T = [S_1, S_2, …, S_m].
2) For each sentence S_i, perform word segmentation and part-of-speech tagging, filter out stop words, and keep only words of specified parts of speech, e.g. nouns, verbs and adjectives, i.e. S_i = [t_{i,1}, t_{i,2}, …, t_{i,n}], where the t_{i,j} are the retained candidate keywords.
3) Construct a candidate keyword graph G = (V, E), where V is the node set composed of the candidate keywords generated in step 2); build edges between nodes using the co-occurrence relation: an edge exists between two nodes only when the corresponding words co-occur within a window of length K, where K is the window size, i.e. at most K words co-occur.
4) Iteratively propagate the node weights according to the above formula until convergence.
5) Sort the node weights in descending order to obtain the most important T words as candidate keywords.
6) Mark the most important T words from step 5) in the original text; if they form adjacent phrases, combine them into multi-word keywords.
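The co-occurrence graph and the weight-propagation loop of steps 1)-5) can be sketched as below. This is an illustrative simplification: part-of-speech filtering is skipped, the window size and iteration count are arbitrary defaults, and the function name `textrank` is hypothetical.

```python
from collections import defaultdict

def textrank(tokens, window=3, d=0.85, iters=50):
    """Rank words by the TextRank update
    WS(Vi) = (1-d) + d * sum_{Vj in In(Vi)} w_ji / sum_k w_jk * WS(Vj)
    on an undirected co-occurrence graph."""
    # edge weight = number of times two distinct words co-occur
    # within `window` tokens of each other
    weight = defaultdict(float)
    for i in range(len(tokens)):
        for j in range(i + 1, min(i + window, len(tokens))):
            if tokens[i] != tokens[j]:
                weight[(tokens[i], tokens[j])] += 1.0
                weight[(tokens[j], tokens[i])] += 1.0
    out_sum = defaultdict(float)
    neighbours = defaultdict(set)
    for (a, b), w in weight.items():
        out_sum[a] += w
        neighbours[a].add(b)
    ws = {v: 1.0 for v in sorted(set(tokens))}
    for _ in range(iters):
        ws = {v: (1 - d) + d * sum(weight[(u, v)] / out_sum[u] * ws[u]
                                   for u in neighbours[v])
              for v in ws}
    return sorted(ws, key=ws.get, reverse=True)
```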
d. Extracting multi-word keywords with left-right information entropy and mutual information entropy.
1) Compute mutual information: find first-order co-occurrences and return word frequencies, then find second-order co-occurrences and return the mutual information and word frequencies. The larger the pointwise mutual information PMI(a, b) = log( p(a, b) / (p(a) p(b)) ), the stronger the association between words a and b.
2) Compute left and right entropy: find the left-neighbour frequencies, compute the left entropy H_left(x) and return it; then find the right-neighbour frequencies, compute the right entropy H_right(x) and return it.
3) Compute the result: score = PMI + min(H_left(x), H_right(x)); the larger the score, the greater the probability that the words should be combined.
Finally, the words obtained in the steps a, b, c and d are combined to obtain a word set 1.
In the establishment of word set 2, the mutual information is the logarithm of the solidification degree (cohesion), and the degree of freedom is the minimum of the left and right information entropies. The process is as follows:
(1) Statistics: count the frequency P_a of each character from the corpus, and count the co-occurrence frequency P_ab of every pair of adjacent characters a, b;
(2) Cutting: set a frequency threshold min_prob and a mutual-information threshold min_pmi, then cut between adjacent characters in the corpus wherever P_ab < min_prob or PMI(a, b) = log(P_ab / (P_a × P_b)) < min_pmi;
(3) Filtering: after the cutting in step (2), count the frequency P_w' of each candidate word and keep only the candidates with P_w' > min_prob;
(4) Redundancy removal: sort the candidate words obtained in step (3) by length in descending order; delete each candidate from the word stock in turn, re-segment it with the remaining words and word frequencies, and compute the mutual information between the original word and its sub-words; if the mutual information is greater than 1, restore the word, otherwise keep it deleted and update the frequencies of the segmented sub-words;
(5) Statistics: for each word w retained after step (4), compute the left information entropy H_l(w) = -Σ_{i=1}^{n} P(a_i w | w) log P(a_i w | w) and the right information entropy H_r(w) = -Σ_{i=1}^{m} P(w b_i | w) log P(w b_i | w), where n and m are the numbers of distinct left-adjacent and right-adjacent characters of w. The degree of freedom of a text fragment is defined as the smaller of the two entropies; a threshold min_pdof of the degree of freedom is set, and a fragment whose degree of freedom exceeds the threshold is considered an independent word. Word set 2 is obtained through the above steps.
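The frequency / cohesion / freedom filter of steps (1)-(5) can be sketched as below. For brevity this sketch scores only two-character candidates and skips the redundancy-removal step; the thresholds are illustrative defaults, not values from the patent.

```python
import math
from collections import Counter

def new_words(text, min_prob=0.002, min_pmi=1.0, min_ent=0.5):
    """Keep character bigrams that pass the frequency (min_prob),
    cohesion (min_pmi) and freedom (min_ent) thresholds."""
    n = len(text)
    chars = Counter(text)
    pairs = Counter(text[i:i + 2] for i in range(n - 1))
    found = []
    for w, c in pairs.items():
        p_ab = c / (n - 1)
        if p_ab < min_prob:
            continue                              # frequency threshold
        pmi = math.log(p_ab / (chars[w[0]] / n * chars[w[1]] / n))
        if pmi < min_pmi:
            continue                              # cohesion threshold
        left = Counter(text[i - 1] for i in range(1, n - 1) if text[i:i + 2] == w)
        right = Counter(text[i + 2] for i in range(n - 2) if text[i:i + 2] == w)
        def H(cnt):
            t = sum(cnt.values())
            return -sum(v / t * math.log(v / t) for v in cnt.values()) if t else 0.0
        if min(H(left), H(right)) >= min_ent:     # freedom threshold
            found.append(w)
    return found
```

A fragment like "ab" below survives because it is frequent, cohesive, and occurs in varied contexts, while one-off pairs fail the freedom test.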
Seed words are selected from the feature corpus words obtained in step two. The word set obtained by segmenting the text of the electric power professional field serves as the candidate words; the electric power text is segmented, and a word vector model (the Word2Vec model) is trained to obtain word vectors. The words are then clustered according to the word vectors obtained from the word vector model: clustering starts from several selected seed words and finds a batch of similar words. The algorithm uses similarity transitivity (somewhat like a connectivity-based clustering algorithm): if A and B are similar, and B and C are similar, then A, B and C are grouped into one class (even if A and C are not similar by the index). Of course, such propagation could traverse the entire vocabulary, so the similarity constraint is progressively tightened. For example, A is a seed word and B, C are not; a similarity of 0.6 between A and B defines them as similar, while B and C are considered similar only if their similarity is greater than 0.7. The similarity threshold is computed as sim_i = k + d × (1 − e^(−d×i)), where i is the number of passes, k is the initial similarity threshold, and d is typically 0.2-0.5.
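The transitive seed-word expansion with the tightening threshold sim_i = k + d × (1 − e^(−d×i)) can be sketched as follows. The function and parameter names are hypothetical, and `sim` stands in for any word-similarity function (e.g. cosine similarity of Word2Vec vectors).

```python
import math

def expand_seed(seed, words, sim, k=0.6, d=0.3, max_pass=5):
    """Grow a cluster from seed words by similarity transitivity:
    a word joins if it is similar to any word already in the cluster,
    and the threshold is tightened on every pass."""
    cluster = set(seed)
    for i in range(1, max_pass + 1):
        # sim_i = k + d * (1 - e^(-d*i)): rises toward k + d with each pass
        thr = k + d * (1 - math.exp(-d * i))
        added = {w for w in words
                 if w not in cluster and any(sim(w, c) >= thr for c in cluster)}
        if not added:
            break  # no new similar words: stop propagating
        cluster |= added
    return cluster
```

In the toy run below, C joins only through B on the second pass, while the weakly related D is never admitted, mirroring how the rising threshold limits runaway propagation.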
Since the preceding procedure is purely unsupervised, even with semantic clustering some non-power professional vocabulary is produced, and even some non-words are retained, so rule-based filtering is needed. The clustered words are finally filtered through rules to obtain the final result.
Partial results from each stage of the method are shown below:
(1) Based on the frequency, the solidification degree and the partial word result of the degree of freedom word segmentation: consistent, one, three-phase, expert, insulating pad, fuse blow, capacitor core, winding capacitance, series reactor, switch core, relay baffle, medium voltage side, laboratory, protection action trip, majority, solid insulating material, on-line monitoring system, differential protection action, oilpaper capacitive bushing, insulating oil chromatography, frequency response analysis, light gas protection action, etc.
(2) Extracting a keyword part result based on a statistical model of the Jieba segmentation: coordination, safety measures, oil machines, appropriateness, tripping, operators on duty, power lines, equivalent circuits, inflammability and explosiveness, notes, spirals, transformer plants, national power grids, detection technologies, models, oil stains, gears, magnetic fluxes, insulating papers, plastic cloths, firm, switch cabinets, direct current power supplies, power companies, fire protection measures and the like.
(3) Keyword part results are extracted based on Word2Vec clustering model of Jieba segmentation: specification, leakage, data, teaching, main parameters, ammeter, grounding device, high energy, control measures, filter, computer, carbon monoxide, inspection, operation, equipment, circuit breaker, gas, bus, process, personnel, body, operation, tripping, direct current, responsible person, interior, iron core, etc
(4) Extracting a keyword group part result based on a TextRank model of the Jieba segmentation: power-off protection personnel, in-hole, national grid company, gear box indication, meter chromatographic analysis data, load operation, transformer maintenance, outlet short circuit impact, grounding knife switch, discharge burning trace, action transformer, manufacturing quality problem, linear algorithm, electric energy loss, sleeve external insulation, contact iron core, dendritic discharge trace and the like.
(5) Extracting a keyword group part result based on a mutual information entropy and a left-right information entropy model of the Jieba word segmentation: tap changers, number main transformers, low-voltage windings, winding deformation, low-voltage circuit breakers, hanging cover inspection, oil substitution pumps, transposed conductors, loop disconnection, gas production rates, immersed transformers, busbar voltages, neutral point bushings, preventive power equipment, vacuum oil filters, dry reactors, cylindrical windings and the like.
(6) Based on the clustering model final partial results: the device comprises a main magnetic flux, a sensor, an oil storage tank, a charger, a cooler, a manufacturing plant, a double bus, a transformer, a respirator, a main transformer, # main transformer, a circuit breaker, a low-voltage circuit breaker, a circuit breaker tripping operation, a low voltage, a low-voltage impedance, a high-voltage winding, a winding, winding deformation, a three-phase winding, arc discharge, an oil replacement pump, an on-load voltage regulation switch, bus voltage, transformer fault diagnosis, a bus protection device, a secondary main switch, an insulating cushion block, a fuse fusing, a capacitor core, winding capacitance, a series reactor, a switch core, a relay baffle, a power line electric tool, a charger alternating current power supply, magnetic current, interference pulses, parallel wires, bus voltage, characteristic gases, electrical equipment, a power switch, a direct current power supply, an air switch, a voltage regulation switch, a vacuum oil filter, an oil paper capacitor sleeve, an insulating oil chromatographic analysis, a frequency response analysis method, a light gas protection action, a heavy gas protection action, infrared thermal imaging detection, a position alarm lamp, a voltage pulse analysis method, a non-excitation tapping switch, inflammable and infrared thermal imaging and the like.
The TF-IDF statistical model and the Word2Vec word clustering model extract keywords from the Jieba segmentation result, and those results can be seen to contain fine-grained non-electric words. The mutual information and left-right information entropy models and the TextRank model extract key word groups from the Jieba segmentation result, which alleviates the problems of overly fine Jieba granularity and incorrectly split domain words. The word segmentation model based on information entropy yields accurate segmentation results with good effect. The clustering algorithm gathers the segmentation results of the models and then clusters the electric power domain words; the clustering result screens out the non-electric words, so the clustering effect is evident. The models complement one another, and the finally constructed word stock is complete.
The invention provides a word segmentation method based on information entropy. The electric power text is segmented according to the minimum information entropy principle, achieving a more accurate segmentation effect. The method first processes the electric power text using frequency and solidification degree to screen out quasi-words, then screens the quasi-words again using degree of freedom, initially building a word stock and improving segmentation accuracy.
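The frequency/solidification/degree-of-freedom screening described above can be sketched in Python. This is an illustrative toy, not the patented implementation: the function names, the character-level granularity, and the thresholds are all assumptions for demonstration.

```python
import math
from collections import Counter

def cohesion(text, seg):
    """PMI-style solidification degree of a two-character segment:
    log( P(seg) / (P(seg[0]) * P(seg[1])) ), from character frequencies."""
    n = len(text)
    chars = Counter(text)
    pairs = Counter(text[i:i + 2] for i in range(n - 1))
    p_seg = pairs[seg] / (n - 1)
    return math.log(p_seg / ((chars[seg[0]] / n) * (chars[seg[1]] / n)))

def freedom(text, seg):
    """Degree of freedom: the smaller of the left- and right-neighbor
    entropies of seg, as defined in the description."""
    lefts, rights = Counter(), Counter()
    start = text.find(seg)
    while start != -1:
        if start > 0:
            lefts[text[start - 1]] += 1          # character just left of seg
        end = start + len(seg)
        if end < len(text):
            rights[text[end]] += 1               # character just right of seg
        start = text.find(seg, start + 1)
    def entropy(counts):
        total = sum(counts.values())
        if total == 0:
            return 0.0
        return -sum(v / total * math.log(v / total) for v in counts.values())
    return min(entropy(lefts), entropy(rights))
```

On a toy string such as "abxabyabz", the segment "ab" has positive cohesion, and its degree of freedom equals ln 2: the right neighbors {x, y, z} are diverse, but the left neighbors {x, y} are the limiting side.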
The invention provides a word recombination method. Because the Jieba segmentation result has small word granularity, some electric power domain terms are split apart during segmentation and some non-professional words remain. Keyword extraction and word combination are therefore performed on the segmentation result using a statistical model, a Word2Vec clustering model, a left-right information entropy and mutual information entropy model, and a TextRank model, enriching and perfecting the word stock.
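A toy illustration of recombining fine-grained segmentation output by mutual information follows; it is a simplified stand-in for the mutual-information/left-right-entropy combination models, and the function name, token lists, and thresholds are assumptions:

```python
import math
from collections import Counter

def merge_bigrams(token_seqs, min_count=2, min_pmi=1.0):
    """Re-combine adjacent tokens whose pointwise mutual information
    exceeds min_pmi, mimicking the word-combination step on a
    segmentation result that split a domain term into pieces."""
    unigrams = Counter(t for seq in token_seqs for t in seq)
    bigrams = Counter((a, b) for seq in token_seqs for a, b in zip(seq, seq[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    merged = set()
    for (a, b), c in bigrams.items():
        if c < min_count:
            continue
        pmi = math.log((c / n_bi) /
                       ((unigrams[a] / n_uni) * (unigrams[b] / n_uni)))
        if pmi > min_pmi:
            merged.add(a + b)  # Chinese tokens concatenate without spaces
    return merged
```

For example, if "power" and "grid" co-occur repeatedly across token sequences, the pair is recombined into a single candidate word, while one-off neighbors are left alone.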
The invention provides a word clustering method. The clustering rule is to find a batch of similar words from a number of selected seed words, so that words belonging to the electric power professional field can be clustered out of the word stock, enriching and perfecting the electric power professional word stock.
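The seed-word clustering rule can be sketched as follows, assuming word vectors have already been trained (e.g. by Word2Vec). The two-dimensional toy vectors, the threshold value, and the function name are illustrative assumptions, not values from the patent:

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def expand_seeds(vectors, seeds, threshold=0.8):
    """Collect candidate words whose similarity to ANY seed word exceeds
    the threshold -- a minimal stand-in for the clustering step that
    grows an electric-power vocabulary from a few seed terms."""
    cluster = set(seeds)
    for word, vec in vectors.items():
        if word in cluster:
            continue
        if any(cosine(vec, vectors[s]) >= threshold for s in seeds):
            cluster.add(word)
    return cluster
```

With "transformer" as the seed, a nearby vector such as "winding" joins the cluster while an unrelated word such as "weather" is filtered out, which is the behavior claimed for the rule-based non-electric-word filter.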
Compared with existing word stock construction methods, the method has the following advantages:
1. Aiming at the defects that Jieba segmentation granularity is small and domain words are split by mistake, a mutual information and left-right information entropy word combination algorithm and a TextRank algorithm are used to combine words from the Jieba segmentation result, discovering more electric power domain words and resolving these problems. A joint mutual-information and left-right-entropy judgment is designed, making the decision on whether to combine words stricter and improving combination accuracy; the TextRank algorithm extracts keywords and combines them by ranking, making word combination more accurate.
2. The TF-IDF algorithm and the Word2Vec word clustering algorithm extract weighted keywords from the Jieba segmentation result and can pick out important words in the text, i.e., some electric power domain words. An improved TF-IDF algorithm is designed that increases the penalty applied during keyword extraction, making keyword extraction more accurate.
3. The information entropy word segmentation algorithm designs three thresholds, on word frequency, mutual information, and left-right information entropy, making the word-forming judgment strict and improving segmentation accuracy. The segmentation results of the models are summarized and complement one another, making the candidate word stock more complete.
4. The word clustering algorithm clusters domain words: it can cluster the electric power domain words in the segmentation results and filter out non-electric-power words, reducing manual workload and making word stock construction simpler, more convenient, and more complete.
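The starting point of advantage 2 is standard TF-IDF weighting over tokenized documents, which can be computed as below. The patent does not specify its "improved" penalty term, so this sketch shows only the textbook formulation; the function name and toy documents are assumptions:

```python
import math
from collections import Counter

def tf_idf(docs):
    """Textbook TF-IDF: term frequency within each document times the
    log inverse document frequency across the corpus."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))          # document frequency: one count per doc
    scores = []
    for doc in docs:
        tf = Counter(doc)
        total = len(doc)
        scores.append({w: (c / total) * math.log(n / df[w])
                       for w, c in tf.items()})
    return scores
```

Note that a word appearing in every document gets a score of zero (log 1 = 0); an "improved" variant would further penalize such corpus-wide words so that only domain-specific keywords survive.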

Claims (4)

1. The electric power professional word stock construction method based on the hybrid model and the clustering algorithm is characterized by comprising the following steps:
step one, preprocessing electric power texts and a parallel corpus outside the electric power domain, removing spaces, punctuation marks, and words without substantive meaning to obtain qualified input text data;
step two, word segmentation is carried out on the electric text and the parallel corpus through a word segmentation model, so that an electric text word stock and a parallel corpus word stock are obtained, and the electric text word stock is compared with the parallel corpus word stock to obtain characteristic corpus words;
step three, selecting electric professional vocabulary from the feature corpus words as seed words; meanwhile, the electric text word stock derived in step two is used to segment the electric text, and then the Word2Vec algorithm is used to convert the words into word vectors;
step four, inputting the word vectors and seed words into a clustering model, clustering to obtain similar words, and then filtering out non-electric professional words by rules to finally obtain the electric power professional word stock;
in the word segmentation model, word set 1 is obtained from the Jieba segmentation result through a TF-IDF statistical model, a Word2Vec word clustering model, a TextRank model, and a left-right information entropy and mutual information entropy model; word set 2 is built through frequency, solidification degree, and degree of freedom; finally the two word sets are combined to obtain the final word stock.
2. The method according to claim 1, wherein in step one the power text includes power science and technology papers, project reports, power procedures, or power operation manuals, and the parallel corpus is a crawled Wikipedia corpus.
3. The method of claim 1, wherein the word set 1 creation process is as follows: keywords are extracted through Jieba segmentation, the TF-IDF model, and the Word2Vec word clustering model; word combination is carried out through the TextRank model and the left-right information entropy and mutual information entropy model; the resulting words are then merged to obtain word set 1.
4. The method of claim 1, wherein the word set 2 creation process is as follows:
(1) Statistics: count the frequency of each word in the corpus, and count the co-occurrence frequency P_ab of each pair of adjacent words a and b;
(2) Cutting: set a threshold min_prob on occurrence frequency and a threshold min_pmi on mutual information, then cut between any adjacent characters a and b in the corpus for which P_ab < min_prob or the mutual information PMI(a,b) = log(P_ab / (P_a · P_b)) < min_pmi;
(3) Cutting: after the cutting of step (2), count the frequency P_w′ of each quasi-word w′ and keep only the quasi-words with P_w′ > min_prob;
(4) Redundancy removal: arrange the candidate words obtained in step (3) from longest to shortest by word count; delete each candidate word from the word stock in turn, segment it using the remaining words and word frequencies, and calculate the mutual information between the original word and its sub-words; if the mutual information is greater than 1 the word is restored, otherwise the deletion is kept and the frequencies of the segmented sub-words are updated;
(5) Statistics: for each word w obtained after the redundancy removal of step (4), compute from the word set its left information entropy E_L(w) = -Σ_{i=1..n} P(a_i|w)·log P(a_i|w) and right information entropy E_R(w) = -Σ_{j=1..m} P(b_j|w)·log P(b_j|w), where n and m are the numbers of distinct left-adjacent and right-adjacent words of w, respectively; the degree of freedom of a text fragment is defined as the smaller of the left-adjacent and right-adjacent information entropies; a threshold min_pdof on the degree of freedom is set, and if the degree of freedom is larger than the threshold, the fragment is considered to form an independent word.
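Steps (1) to (3) of claim 4 can be sketched in miniature as follows. The character-level granularity, the threshold values, and the function name are illustrative assumptions, and reusing min_prob for the final fragment filter mirrors the claim only loosely:

```python
import math
from collections import Counter

def segment_by_entropy(text, min_prob=0.01, min_pmi=0.5):
    """Claim 4 steps (1)-(3) in miniature: count adjacent-character
    co-occurrence, cut wherever frequency or PMI falls below its
    threshold, then keep fragments frequent enough to be quasi-words."""
    n = len(text)
    chars = Counter(text)
    pairs = Counter(text[i:i + 2] for i in range(n - 1))
    cuts = []
    for i in range(n - 1):                       # step (2): decide each boundary
        pair = text[i:i + 2]
        p_ab = pairs[pair] / (n - 1)
        pmi = math.log(p_ab / ((chars[pair[0]] / n) * (chars[pair[1]] / n)))
        if p_ab < min_prob or pmi < min_pmi:
            cuts.append(i + 1)
    frags, prev = [], 0
    for c in cuts + [n]:                         # split text at the weak boundaries
        frags.append(text[prev:c])
        prev = c
    counts = Counter(frags)                      # step (3): keep frequent quasi-words
    return [w for w, c in counts.items() if c / len(frags) > min_prob]
```

On the toy string "ababXabab", the weakly bound "ba" boundaries are cut while the cohesive "ab" pairs stay together, yielding the quasi-words "ab" and "abXab".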
CN202110874173.6A 2021-07-30 2021-07-30 Electric power professional word stock construction method based on hybrid model and clustering algorithm Active CN113609844B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110874173.6A CN113609844B (en) 2021-07-30 2021-07-30 Electric power professional word stock construction method based on hybrid model and clustering algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110874173.6A CN113609844B (en) 2021-07-30 2021-07-30 Electric power professional word stock construction method based on hybrid model and clustering algorithm

Publications (2)

Publication Number Publication Date
CN113609844A CN113609844A (en) 2021-11-05
CN113609844B true CN113609844B (en) 2024-03-08

Family

ID=78338878

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110874173.6A Active CN113609844B (en) 2021-07-30 2021-07-30 Electric power professional word stock construction method based on hybrid model and clustering algorithm

Country Status (1)

Country Link
CN (1) CN113609844B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114154484B (en) * 2021-11-12 2023-01-06 中国长江三峡集团有限公司 Construction professional term library intelligent construction method based on mixed depth semantic mining
CN114168731B (en) * 2021-11-29 2024-06-28 北京国瑞数智技术有限公司 Internet media flow safety protection method and system
CN117952089B (en) * 2024-03-26 2024-06-14 广州源高网络科技有限公司 Intelligent data processing method and system for real-world clinical research
CN117953875B (en) * 2024-03-27 2024-06-28 成都启英泰伦科技有限公司 Offline voice command word storage method based on semantic understanding
CN118535739A (en) * 2024-06-26 2024-08-23 上海建朗信息科技有限公司 Data classification method and system based on keyword weight matching

Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2000035963A (en) * 1998-07-17 2000-02-02 Nec Corp Automatic sentence classification device and method
CN104199972A (en) * 2013-09-22 2014-12-10 中科嘉速(北京)并行软件有限公司 Named entity relation extraction and construction method based on deep learning
CN105243129A (en) * 2015-09-30 2016-01-13 清华大学深圳研究生院 Commodity property characteristic word clustering method
CN106095736A (en) * 2016-06-07 2016-11-09 华东师范大学 A kind of method of field neologisms extraction
CN106649666A (en) * 2016-11-30 2017-05-10 浪潮电子信息产业股份有限公司 Left-right recursion-based new word discovery method
WO2017185674A1 (en) * 2016-04-29 2017-11-02 乐视控股(北京)有限公司 Method and apparatus for discovering new word
CN110334345A (en) * 2019-06-17 2019-10-15 首都师范大学 New word discovery method
CN110457708A (en) * 2019-08-16 2019-11-15 腾讯科技(深圳)有限公司 Vocabulary mining method, apparatus, server and storage medium based on artificial intelligence
CN110990567A (en) * 2019-11-25 2020-04-10 国家电网有限公司 Electric power audit text classification method for enhancing domain features
WO2020073523A1 (en) * 2018-10-12 2020-04-16 平安科技(深圳)有限公司 New word recognition method and apparatus, computer device, and computer readable storage medium
CN111931491A (en) * 2020-08-14 2020-11-13 工银科技有限公司 Domain dictionary construction method and device
CN112732934A (en) * 2021-01-11 2021-04-30 国网山东省电力公司电力科学研究院 Power grid equipment word segmentation dictionary and fault case library construction method
CN113033183A (en) * 2021-03-03 2021-06-25 西北大学 Network new word discovery method and system based on statistics and similarity
CN113157903A (en) * 2020-12-28 2021-07-23 国网浙江省电力有限公司信息通信分公司 Multi-field-oriented electric power word stock construction method

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN100530171C * 2005-01-31 2009-08-19 NEC (China) Co., Ltd. Dictionary learning method and device


Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
New word discovery algorithm based on mutual information and adjacency entropy; Liu Weitong, Liu Peiyu, Liu Wenfeng, Li Nana; Application Research of Computers; 2018-03-14 (No. 05); full text *
New word discovery and sentiment orientation determination based on deep structure models; Sun Xiao, Sun Chongyuan, Ren Fuji; Computer Science (09); full text *
Missing data imputation algorithm for dispatching and control systems based on genetic optimization; Wang Yirong, Wang Ruijie, Chen Wengang, Wu Runze; Power System Protection and Control (21); full text *
Actual-range index tree indexing method for multi-source heterogeneous data sources; Wu Runze, Cai Yongtao, Chen Wenwei, Chen Wengang, Wang Yirong; Automation of Electric Power Systems (11); full text *

Also Published As

Publication number Publication date
CN113609844A (en) 2021-11-05

Similar Documents

Publication Publication Date Title
CN113609844B (en) Electric power professional word stock construction method based on hybrid model and clustering algorithm
Peng et al. Mathbert: A pre-trained model for mathematical formula understanding
Ruder et al. A hierarchical model of reviews for aspect-based sentiment analysis
CN111950264B (en) Text data enhancement method and knowledge element extraction method
CN112699246A (en) Domain knowledge pushing method based on knowledge graph
CN110377901B (en) Text mining method for distribution line trip filling case
CN110888973B (en) Method for automatically structuring and carding monitoring information table
CN108197175A (en) The treating method and apparatus of technical supervision data, storage medium, processor
CN112926340B (en) Semantic matching model for knowledge point positioning
Wang et al. Short text mining framework with specific design for operation and maintenance of power equipment
CN109101483A (en) A kind of wrong identification method for electric inspection process text
CN105955960B (en) Grounding grid defect text mining method based on semantic frame
Yang et al. Ontology generation for large email collections.
Trappey et al. Intelligent RFQ summarization using natural language processing, text mining, and machine learning techniques
Alsubhi et al. Pre-trained transformer-based approach for arabic question answering: A comparative study
CN114676698A (en) Equipment fault key information extraction method and system based on knowledge graph
Zhu et al. A Text Classification Algorithm for Power Equipment Defects Based on Random Forest
Zhang et al. A machine learning-based approach for building code requirement hierarchy extraction
Ravi et al. Substation transformer failure analysis through text mining
Wan et al. Evaluation model of power operation and maintenance based on text emotion analysis
Wei et al. Short text data model of secondary equipment faults in power grids based on LDA topic model and convolutional neural network
CN114283030A (en) Power distribution scheme recommendation method and device based on knowledge graph
Yamamoto Acquisition of lexical paraphrases from texts
Zhang et al. A construction method of electric power professional domain Corpus based on multi-model collaboration
Klubička et al. English wordnet random walk pseudo-corpora

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant