CN109582787B

CN109582787B - Entity classification method and device for corpus data in thermal power generation field

Info

Publication number: CN109582787B
Application number: CN201811311803.3A
Authority: CN
Inventors: 唐静; 彭一轩; 解来甲
Original assignee: Yuanguang Software Co Ltd
Current assignee: Yuanguang Software Co Ltd
Priority date: 2018-11-05
Filing date: 2018-11-05
Publication date: 2020-10-20
Anticipated expiration: 2038-11-05
Also published as: CN109582787A

Abstract

The invention relates to a method and a device for entity classification of corpus data in the field of thermal power generation, belonging to the technical field of thermal power generation, wherein the method comprises the steps of carrying out primary classification on a text set S to be classified containing the corpus data in the field of thermal power generation to obtain a successfully classified text set S1 and an unsuccessfully classified text set S2; extracting entity new words in the unsuccessfully classified text set S2, and establishing an entity new word list E; and aligning the entity new words in the entity new word list with the successfully classified text set S1 one by one to confirm the entity category of the entity new words. The invention utilizes text data in the thermal power generation field, comprehensively adopts an unsupervised professional vocabulary discovery algorithm and a text classification algorithm, realizes the entity classification of the power generation corpus data, and the constructed thermal power generation professional lexicon can also be used for corpus support of text data mining in the field.

Description

Entity classification method and device for corpus data in thermal power generation field

Technical Field

The invention relates to the technical field of thermal power generation, in particular to an entity classification method and device for corpus data in the thermal power generation field.

Background

Text is a typical form of unstructured or semi-structured data, and its processing has long been one of the hot topics in data mining.

Text data analysis and mining in the thermal power generation field is of great significance for the periodic defect inventory of thermal power generation enterprises and for the construction of enterprise knowledge graphs as part of long-term enterprise informatization. It helps enterprises understand the operation and health of production equipment at a global level, fuse multi-dimensional data, and mine deeper knowledge.

At present, the text data analysis and mining in the field of thermal power generation are still in the beginning stage. The main reason is that a complete corpus has not been established for the document data accumulated in the thermal power generation field, and many statistical machine learning methods are difficult to work under the condition that the corpus is insufficient. It is difficult to mine significant results from text using natural language processing.

The daily operation records of a power generation enterprise mainly consist of on-duty logs and defect records. When entity classification is performed on power generation corpus data, the same device may be named differently in daily records because of personal habits, so records cannot be correctly classified when classification relies on standard device names alone.

Disclosure of Invention

In view of the above analysis, the present invention aims to provide a method and an apparatus for entity classification of corpus data in the field of thermal power generation, which combine a new word recognition method based on statistics with a classification algorithm to realize entity classification of corpus data of a power generation text.

The purpose of the invention is mainly realized by the following technical scheme:

an entity classification method for corpus data in the field of thermal power generation comprises the following steps:

performing primary classification on a text set S to be classified containing corpus data in the thermal power generation field to obtain a successfully classified text set S1 and an unsuccessfully classified text set S2;

extracting entity new words in the unsuccessfully classified text set S2 through the established alternative new word library, and establishing an entity new word list E;

carrying out entity alignment on the entity new words in the entity new word list E and the successfully classified text set S1 one by one to obtain an entity alignment result;

and determining the entity category of the entity new word according to the obtained entity alignment result.

Further, the method for constructing the alternative new word library comprises the following steps:

establishing a field word library candidate word set;

quantizing the candidate words in the field word bank candidate word set;

threshold value screening is carried out on the quantized candidate words to form a domain word bank;

and removing the general words from the domain word bank to form the alternative new word library.

Further, the establishing of the domain thesaurus candidate word set includes:

preprocessing the corpus data in the thermal power generation field;

carrying out substring segmentation on the preprocessed corpus data to obtain substrings;

and performing word segmentation on the obtained substrings to form a candidate word set of a field word library.

Further, the quantification of the candidate words comprises quantification of word frequency, internal solidification degree, degree of freedom and position word forming probability.

Further, the thresholds set in the threshold screening comprise a word frequency threshold, an internal solidification degree threshold, a left connecting word information entropy threshold, a right connecting word information entropy threshold and a position word forming probability threshold.

Further, the primary classification comprises:

establishing a text set S to be classified: $S = \{s_1, s_2, \ldots, s_i, \ldots, s_m\}$, where $s_i$ is a text record in the set;

establishing a logged entity equipment list N: $N = \{n_1, n_2, \ldots, n_j, \ldots, n_k\}$, where $n_j$ is the category number of an entity;

preprocessing the text to be classified, including removing numbers and letters and splitting records;

classifying the preprocessed text set S according to the entity equipment list N to obtain the document sample space of the successfully classified text set S1: $\{Sn_1: s_{11}, s_{12}, \ldots;\ Sn_j: s_{j1}, s_{j2}, \ldots;\ \ldots;\ Sn_k: s_{k1}, s_{k2}, \ldots\}$, where $k$ is the total number of entity classes in S1 and $Sn_j$ is the subset of documents of entity class $n_j$.

Further, performing entity alignment between the entity new words and the successfully classified text set S1 comprises:

establishing a document subset Se containing the entity new words of the entity new word list E, where Se belongs to S2;

calculating the distance $d(e, n_j)$ from the document subset Se to each document subset $Sn_j$ in the successfully classified text set S1, where $e$ is an entity new word in the new word list E and $n_j$ is an entity category of the successfully classified text set S1;

selecting the document subset $Sn_j$ for which the maximum distance $d$ occurs most frequently, and classifying the entity new word $e$ into the entity class to which the document subset $Sn_j$ belongs.

Further, entity new words that cannot be aligned are classified by creating new entity categories, and each created entity category is added to the logged entity equipment list N.

Further, the entity new word list E containing the entity new word e, and the entity category to which the entity new word e belongs, are finally confirmed by the user through human-computer interaction.

An entity classification device for corpus data in the field of thermal power generation comprises a primary classification module, an alternative new word library, a new word extraction module and an entity alignment module;

the primary classification module is used for carrying out primary classification on an input text set S to be classified containing corpus data in the thermal power generation field to obtain a successfully classified text set S1 and an unsuccessfully classified text set S2;

the alternative new word library is used for storing entity new words in the field of thermal power generation;

the new word extraction module is connected with the primary classification module and with the alternative new word library, and is used for receiving the unsuccessfully classified text set S2 output by the primary classification module, extracting the entity new words in the unsuccessfully classified text set S2 according to the content of the alternative new word library, and establishing an entity new word list E;

the entity alignment module is respectively connected with the primary classification module and the new word extraction module, and is configured to receive a successfully classified text set S1 output by the primary classification module and an entity new word list E output by the new word extraction module, and perform entity alignment on entity new words in the entity new word list E and the successfully classified text set S1 one by one to obtain an entity alignment result; and determining the entity category of the entity new word according to the obtained entity alignment result.
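The module connections described above can be sketched as a small composition of callables. This is only an illustration under stated assumptions: the class name `EntityClassificationDevice`, its field names, and the stub functions passed in are hypothetical and do not come from the patent, which does not prescribe any programming interface.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Set, Tuple


@dataclass
class EntityClassificationDevice:
    # Primary classification module: S -> (S1, S2)
    primary_classifier: Callable[[List[str]], Tuple[Dict[str, List[str]], List[str]]]
    # Alternative new word library: stored domain entity words
    alternative_bank: Set[str]
    # New word extraction module: (S2, library) -> entity new word list E
    new_word_extractor: Callable[[List[str], Set[str]], List[str]]
    # Entity alignment module: (E, S1) -> {new word: entity category}
    entity_aligner: Callable[[List[str], Dict[str, List[str]]], Dict[str, str]]

    def run(self, texts: List[str]) -> Dict[str, str]:
        s1, s2 = self.primary_classifier(texts)
        new_words = self.new_word_extractor(s2, self.alternative_bank)
        return self.entity_aligner(new_words, s1)


# Trivial stand-in modules, used only to show how data flows between modules.
device = EntityClassificationDevice(
    primary_classifier=lambda texts: (
        {"n1": [t for t in texts if "过热器" in t]},
        [t for t in texts if "过热器" not in t],
    ),
    alternative_bank={"高过"},
    new_word_extractor=lambda s2, bank: sorted(
        w for w in bank if any(w in t for t in s2)
    ),
    entity_aligner=lambda words, s1: {w: "n1" for w in words},
)
result = device.run(["高温过热器泄漏", "高过超温"])
```

In this sketch the extraction module reads S2 and the library, and the alignment module reads S1 and E, mirroring the connections listed for the device.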

The invention has the following beneficial effects:

the method is characterized in that text data in the thermal power generation field is utilized, an unsupervised professional vocabulary discovery algorithm and a text classification algorithm are comprehensively adopted, entity classification of the power generation corpus data is achieved, and the constructed thermal power generation professional lexicon can also be used for corpus support of text data mining in the field.

Drawings

The drawings are only for purposes of illustrating particular embodiments and are not to be construed as limiting the invention, wherein like reference numerals are used to designate like parts throughout.

FIG. 1 is a flowchart of an entity classification method according to an embodiment of the present invention;

FIG. 2 is a schematic diagram of the connections of an entity classification device according to an embodiment of the present invention.

Detailed Description

The preferred embodiments of the present invention will now be described in detail with reference to the accompanying drawings, which form a part hereof, and which together with the embodiments of the invention serve to explain the principles of the invention.

The embodiment of the invention discloses an entity classification method of corpus data in the field of thermal power generation, which comprises the following steps as shown in figure 1:

step S1, performing primary classification on a text set S to be classified containing corpus data in the thermal power generation field;

1) establishing input data for classification;

the input data specifically includes:

a text set S to be classified: $S = \{s_1, s_2, \ldots, s_i, \ldots, s_m\}$, where $s_i$ is a text record in the set that corresponds to one of the equipment entities, and $m$ is the number of text records;

a logged entity equipment list N: $N = \{n_1, n_2, \ldots, n_j, \ldots, n_k\}$, where $n_j$ is the category number of an entity, each category consisting of one or more names of a piece of equipment, and $k$ is the total number of entries in the entity equipment list;

2) preprocessing a text to be classified in the classified text set S;

in order to eliminate unnecessary information useless for classification, preprocessing measures such as removing numbers and letters and splitting records are carried out on the text to be classified, so that the text to be classified is simpler;

3) classifying the preprocessed text set S according to an entity equipment list N;

classifying the text set $S = \{s_1, s_2, \ldots, s_i, \ldots, s_m\}$ yields a successfully classified text set S1 and an unsuccessfully classified text set S2;

the document sample space of the successfully classified text set S1 is $\{Sn_1: s_{11}, s_{12}, \ldots;\ Sn_j: s_{j1}, s_{j2}, \ldots;\ \ldots;\ Sn_k: s_{k1}, s_{k2}, \ldots\}$, where $k$ is the total number of entity classes in S1 and $Sn_j$ is the subset of documents of entity class $n_j$.
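As a rough illustration of step S1, the sketch below preprocesses records and matches them against the device names in the list N. It assumes that "classifying according to the entity equipment list" means substring matching of standard device names, and that records are split on common punctuation; neither detail is specified in the text. The function names and the sample records are hypothetical.

```python
import re


def preprocess(record):
    """Remove ASCII digits and letters, then split the record into pieces."""
    cleaned = re.sub(r"[0-9A-Za-z]+", "", record)
    # Splitting on punctuation is an assumption; the text only says "splitting records".
    return [p.strip() for p in re.split(r"[;；。,，\n]+", cleaned) if p.strip()]


def primary_classify(records, device_list):
    """Return (S1, S2): S1 maps category number -> matched texts, S2 lists unmatched texts."""
    s1 = {category: [] for category in device_list}
    s2 = []
    for record in records:
        for piece in preprocess(record):
            matched = [cat for cat, names in device_list.items()
                       if any(name in piece for name in names)]
            if matched:
                for cat in matched:
                    s1[cat].append(piece)
            else:
                s2.append(piece)
    return s1, s2


# Hypothetical entity equipment list N: category number -> standard device names.
N = {"n1": ["高温过热器"], "n2": ["给水泵"]}
S = ["2号高温过热器泄漏；高过出口管壁超温", "1A给水泵振动大"]
S1, S2 = primary_classify(S, N)
```

Here the piece that uses the informal abbreviation "高过" ends up in S2, which is exactly the kind of record the later new word extraction step is meant to handle.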

Step S2, extracting entity new words in the unsuccessfully classified text set S2 through the established alternative new word library, and establishing an entity new word list E;

the method for establishing the alternative new word stock in the step comprises the following steps:

1) establishing a field word library candidate word set;

The field word library candidate word set can be built from the corpus texts accumulated by a thermal power generation enterprise; this corpus mainly comprises on-duty logs, defect lists and the like.

The accumulated corpus texts are first preprocessed. The preprocessing comprises deduplicating the data and eliminating invalid characters, such as letters, symbols and numbers, that are certainly not part of entity words, so that the corpus data used in subsequent processing is more concise.

For the preprocessed corpus data text, dividing sentences in the text into substrings by using signs such as spaces, line feed symbols and the like;
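A minimal sketch of the preprocessing and substring splitting described above follows. It assumes the corpus is Chinese text and treats every character outside the basic CJK ideograph range as invalid; it also splits at invalid characters rather than deleting them, so that unrelated characters are not joined together. Both choices, and the name `corpus_substrings`, are assumptions of this sketch.

```python
import re


def corpus_substrings(lines):
    """Deduplicate lines, drop letters/digits/symbols, and split into substrings."""
    seen = set()
    substrings = []
    for line in lines:
        if line in seen:  # deduplication of repeated records
            continue
        seen.add(line)
        # Split on whitespace (spaces, line feeds) first.
        for chunk in line.split():
            # Assumption: any non-CJK character is invalid and acts as a split point.
            substrings.extend(s for s in re.split(r"[^\u4e00-\u9fff]+", chunk) if s)
    return substrings


logs = ["高温过热器后 壁温超限", "高温过热器后 壁温超限", "#2炉B磨煤机堵煤,已处理"]
subs = corpus_substrings(logs)
```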

performing word segmentation on the sub-strings to form a candidate word set of a field word library;

Specifically, an N-gram algorithm can be used to perform N-element segmentation on the substrings. The words obtained include names of power generation equipment in the thermal power generation field, terms used by technicians in the field, and professional descriptions of equipment faults, and together they form the field word library candidate word set.

For example, performing N-element segmentation on a corpus substring with the N-gram algorithm yields a candidate word set that includes:

high temperature

High temperature treatment

High temperature superheating

High-temperature superheater

After the high-temperature superheater

Over-temperature

Warm and hot water

Warm superheater

After the warm superheater

Rear pair of temperature superheater
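The N-element segmentation can be sketched as enumerating every character n-gram of each substring and counting it. The range of n (here `min_n` to `max_n`) is an assumption, since the text does not state which lengths are used.

```python
from collections import Counter


def ngram_candidates(substrings, min_n=2, max_n=6):
    """Count every character n-gram with min_n <= n <= max_n in each substring."""
    counts = Counter()
    for s in substrings:
        for n in range(min_n, max_n + 1):
            for i in range(len(s) - n + 1):
                counts[s[i:i + n]] += 1
    return counts


candidates = ngram_candidates(["高温过热器后"], min_n=2, max_n=4)
```

The counts double as the word frequency used in the quantification step; most n-grams produced this way are not real words, which is why the later screening is needed.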

2) Quantizing the candidate words in the field word bank candidate word set;

the quantification criteria for the candidate words comprise word frequency, internal solidification degree, degree of freedom and position word forming probability;

the internal solidification degree is expressed by the pointwise mutual information

$$\mathrm{pmi}(x, y) = \log \frac{p(xy)}{p(x)\,p(y)}$$

where $x$ and $y$ represent two different words in the corpus; $p(xy)$ is the probability that $x$ and $y$ appear together in the corpus; $p(x)$ is the probability that $x$ appears alone in the corpus; and $p(y)$ is the probability that $y$ appears alone in the corpus. When $\mathrm{pmi}(x, y) \gg 0$, $x$ and $y$ are highly related, that is, they often occur together, and the string $xy$ is more likely to constitute a new word.
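To make the internal solidification degree concrete, the sketch below computes pmi(x, y) from raw counts, assuming the usual definition log(p(xy) / (p(x) p(y))) with a natural logarithm and probabilities estimated as count divided by corpus size. The counts in the example are made up for illustration.

```python
import math


def pmi(count_xy, count_x, count_y, total):
    """Pointwise mutual information of x and y from occurrence counts."""
    p_xy = count_xy / total
    p_x = count_x / total
    p_y = count_y / total
    return math.log(p_xy / (p_x * p_y))


# x and y almost always occur together: strongly positive value.
strong = pmi(count_xy=50, count_x=60, count_y=80, total=10000)
# x and y co-occur only as often as chance predicts: value close to zero.
chance = pmi(count_xy=6, count_x=60, count_y=1000, total=10000)
```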

The degree of freedom is measured by the information entropy of the left connecting words and of the right connecting words, that is, degree of freedom = min(left connecting word information entropy, right connecting word information entropy), with

$$H_l(w) = -\sum_{w_l \in s_l} p(w_l \mid w)\,\log p(w_l \mid w), \qquad H_r(w) = -\sum_{w_r \in s_r} p(w_r \mid w)\,\log p(w_r \mid w)$$

where $s_l$ is the set of left-adjacent connecting words of the candidate word $w$; $s_r$ is the set of right-adjacent connecting words of the candidate word $w$; $p(w_l \mid w)$ is the conditional probability that $w_l$ is the left-adjacent connecting word given that the candidate word $w$ occurs; and $p(w_r \mid w)$ is the conditional probability that $w_r$ is the right-adjacent connecting word given that the candidate word $w$ occurs.
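The degree of freedom can be sketched as follows, assuming that a "connecting word" is the single character immediately to the left or right of each occurrence of the candidate, and that the entropy uses a natural logarithm. The toy corpus string is invented for the example.

```python
import math
from collections import Counter


def neighbor_entropy(neighbors):
    """Shannon entropy of the empirical distribution of neighboring characters."""
    counts = Counter(neighbors)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def degree_of_freedom(corpus, word):
    """min(left entropy, right entropy) over all occurrences of word in corpus."""
    left, right = [], []
    start = corpus.find(word)
    while start != -1:
        if start > 0:
            left.append(corpus[start - 1])
        end = start + len(word)
        if end < len(corpus):
            right.append(corpus[end])
        start = corpus.find(word, start + 1)
    return min(neighbor_entropy(left), neighbor_entropy(right))


corpus = "甲过热器乙,丙过热器乙,丁过热器戊"
dof = degree_of_freedom(corpus, "过热器")
```

In the toy corpus the candidate has three distinct left neighbors but only two distinct right neighbors, so the right entropy is the smaller value and determines the result.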

The position word forming probability is expressed as

$$P(c_i, i) = \frac{n(c_i, i)}{n(c_i)}$$

where $i$ is the position at which the character $c_i$ appears in a word; $n(c_i, i)$ is the frequency with which $c_i$ appears at position $i$ across all words; and $n(c_i)$ is the total frequency of $c_i$ in the corpus.
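One possible reading of the position word forming probability is the ratio n(c, i) / n(c) for each character c and position i. The sketch below follows that reading and, as a simplifying assumption, takes both counts from the same list of words instead of a full corpus; positions are 0-based character indices. The word list is hypothetical.

```python
from collections import Counter


def position_probabilities(words):
    """Return {(char, position): n(char, position) / n(char)} over a word list."""
    at_position = Counter()
    total = Counter()
    for word in words:
        for i, ch in enumerate(word):
            at_position[(ch, i)] += 1  # n(c, i)
            total[ch] += 1             # n(c)
    return {key: count / total[key[0]] for key, count in at_position.items()}


probs = position_probabilities(["过热器", "再热器", "过滤器", "热器"])
```

In this toy list the character "器" appears at index 2 in three of its four occurrences, so its probability at that position is 0.75.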

3) Threshold value screening is carried out on the quantized candidate words to form a domain word bank;

the thresholds set in the threshold screening comprise a word frequency threshold, an internal solidification degree threshold, a left connecting word information entropy threshold, a right connecting word information entropy threshold and a position word forming probability threshold;

the degree of freedom threshold is determined by setting the left and right connecting word information entropy thresholds;

the words in the candidate word set are judged and screened against the internal solidification degree threshold and the degree of freedom threshold to obtain the words used in the field;

with the word frequency threshold, a candidate word whose word frequency is greater than the threshold is regarded as a word commonly used in the field, and such words are retained to form the domain word bank;

with the position word forming probability threshold, the word forming positions in the generated domain word bank are evaluated, which improves word forming accuracy.
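The threshold screening can be sketched as a conjunction of five comparisons. The record type `CandidateStats`, the strict "greater than" comparisons, and all numeric values below are illustrative assumptions; the text does not give threshold values.

```python
from dataclasses import dataclass


@dataclass
class CandidateStats:
    word: str
    frequency: int
    solidification: float
    left_entropy: float
    right_entropy: float
    position_probability: float


def screen(candidates, min_freq, min_solid, min_left, min_right, min_pos):
    """Keep candidates whose every quantified value exceeds its threshold."""
    return [c.word for c in candidates
            if c.frequency > min_freq
            and c.solidification > min_solid
            and c.left_entropy > min_left
            and c.right_entropy > min_right
            and c.position_probability > min_pos]


stats = [
    CandidateStats("过热器", 120, 6.2, 2.1, 1.8, 0.7),
    CandidateStats("热器后", 15, 1.1, 0.3, 1.9, 0.2),
]
domain_words = screen(stats, min_freq=20, min_solid=3.0,
                      min_left=1.0, min_right=1.0, min_pos=0.5)
```

Requiring both the left and the right entropy to pass their thresholds is equivalent to thresholding min(left, right) when the two thresholds are equal.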

4) Comparing the domain word bank with a general word bank, and removing the general words from the domain word bank to form the alternative new word library.

The domain word bank formed in the previous step has not yet been screened for professional words: it still contains general words that are used in the field but are unrelated to equipment and need not be classified. Therefore, the domain word bank is compared with a general word bank (a power plant professional word bank dating from the 1980s is available, and the general word bank is a national-standard general vocabulary), and the general words are removed to form the alternative new word library.

And comparing and extracting the words in the unsuccessfully classified text set S2 through the established alternative new word library, extracting entity new words which belong to the established alternative new word library and are contained in the unsuccessfully classified text set S2, and establishing an entity new word list E.
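The last two operations, removing general words and then extracting entity new words from S2, can be sketched with plain set and substring operations. The function names and sample words are hypothetical, and substring matching is an assumption about how words in S2 are "compared" with the library.

```python
def build_alternative_library(domain_words, general_words):
    """Domain word bank minus general words."""
    return set(domain_words) - set(general_words)


def extract_new_words(s2_texts, library):
    """Return library words that occur in at least one unclassified text (list E)."""
    # Longest words first, ties broken alphabetically, for a deterministic order.
    ordered = sorted(library, key=lambda w: (-len(w), w))
    return [w for w in ordered if any(w in text for text in s2_texts)]


library = build_alternative_library(["高过", "出口", "磨煤机"], ["出口"])
E = extract_new_words(["高过出口管壁超温"], library)
```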

Particularly, in order to establish the entity new word list E more accurately, the classification result is finally confirmed by the user through human-computer interaction.

Step S3, aligning the entity new words in the entity new word list with the successfully classified text set S1 one by one; and confirming the entity category of the entity new word.

The specific alignment process comprises:

1) establishing a document subset Se containing the entity new words of the entity new word list E, where Se belongs to S2;

2) calculating the distance $d(e, n_j)$ from the document subset Se to each document subset $Sn_j$ in the successfully classified text set S1, where $e$ is an entity new word in the new word list E and $n_j$ is an entity category of the successfully classified text set S1;

3) selecting the document subset $Sn_j$ for which the maximum distance $d$ occurs most frequently, and classifying the entity new word $e$ into the entity class to which the document subset $Sn_j$ belongs;

4) updating the document subset $Sn_j$ of the successfully classified text set S1, and repeating the above process until the document subset Se is merged into the document subset $Sn_j$.

Particularly, due to the updating of the thermal power equipment, new equipment which is not logged in the entity equipment list N exists, and entity new words related to the new equipment cannot be aligned by the aligning process;

for entity new words which cannot be aligned, new entity categories are required to be created for classification; and adding the created entity category into the logged entity equipment list N.
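The alignment and fallback steps can be sketched as below. The text does not define the distance d, so this sketch substitutes a Jaccard similarity over character bigrams and interprets "maximum occurrence frequency of the maximum distance" as a majority vote: each document in Se votes for its best-scoring category. The cutoff `min_score`, the helper names and the naming scheme for new categories are all hypothetical.

```python
from collections import Counter


def bigrams(text):
    return {text[i:i + 2] for i in range(len(text) - 1)}


def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0


def align(se, s1, min_score=0.05):
    """Return the category winning most per-document votes, or None if unaligned."""
    profiles = {cat: set().union(*(bigrams(d) for d in docs))
                for cat, docs in s1.items() if docs}
    votes = Counter()
    for text in se:
        scores = {cat: jaccard(bigrams(text), p) for cat, p in profiles.items()}
        if not scores:
            continue
        best = max(scores, key=scores.get)
        if scores[best] >= min_score:  # assumed cutoff for "cannot be aligned"
            votes[best] += 1
    return votes.most_common(1)[0][0] if votes else None


def classify_new_word(new_word, s2_texts, s1, device_list, min_score=0.05):
    se = [t for t in s2_texts if new_word in t]  # document subset Se
    category = align(se, s1, min_score)
    if category is None:  # no alignment: create a new entity category
        category = f"n{len(device_list) + 1}"
        device_list[category] = []
        s1[category] = []
    device_list[category].append(new_word)
    s1[category].extend(se)  # merge Se into the chosen subset
    return category


N = {"n1": ["高温过热器"], "n2": ["给水泵"]}
S1 = {"n1": ["高温过热器泄漏", "高温过热器出口管壁超温"], "n2": ["给水泵振动大"]}
S2 = ["高过出口管壁超温", "高过泄漏", "脱硝喷氨"]
first = classify_new_word("高过", S2, S1, N)
second = classify_new_word("脱硝", S2, S1, N)
```

In this toy run the abbreviation "高过" is aligned with the existing superheater category, while "脱硝" shares no bigrams with any classified record and therefore receives a newly created category. A human confirmation step, as the description suggests, would sit after this function.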

Particularly, in order to enable the classification of the entity new word e to be more accurate, the classification result is finally confirmed by the user through human-computer interaction.

The embodiment of the invention also discloses an entity classification device for corpus data in the thermal power generation field, which comprises a primary classification module, an alternative new word library, a new word extraction module and an entity alignment module, as shown in FIG. 2;

the alternative new word library is used for storing entity new words in the field of thermal power generation;

Optionally, the method for constructing the alternative new word library includes:

1) establishing a field word library candidate word set;

2) Quantizing the candidate words in the field word bank candidate word set;

the quantification includes the internal solidification degree and the position word forming probability, as in the method embodiment;

The domain word bank formed in this way has not yet been screened for professional words: it still contains general words that are used in the field but are unrelated to equipment and need not be classified. Therefore, through comparison with the general word bank, the general words in the domain word bank are removed to form the alternative new word library.

In summary, the method and the device for entity classification of the linguistic data in the thermal power generation field provided by the embodiment of the invention utilize the text data in the thermal power generation field, and comprehensively adopt the unsupervised professional vocabulary discovery algorithm and the text classification algorithm to realize entity classification of the linguistic data in the power generation field, and the constructed professional lexicon of the thermal power generation field can also be used for linguistic support of text data mining in the field.

The above description is only for the preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention.

Claims

1. An entity classification method for corpus data in the field of thermal power generation is characterized by comprising the following steps:

the primary classification includes:

1) establishing input data for classification;

the input data specifically includes:

a text set S to be classified: $S = \{s_1, s_2, \ldots, s_i, \ldots, s_m\}$, where $s_i$ is a text record in the set that corresponds to one of the equipment entities, and $m$ is the number of text records;

a logged entity equipment list N: $N = \{n_1, n_2, \ldots, n_j, \ldots, n_K\}$, where $n_j$ is the category number of an entity, each category consisting of one or more names of a piece of equipment, and $K$ is the total number of entries in the entity equipment list;

2) preprocessing a text to be classified in the classified text set S;

in order to eliminate information that is useless for classification, preprocessing measures including removing numbers and letters and splitting records are applied to the text to be classified, so that the text to be classified is simpler;

the document sample space of the successfully classified text set S1 is $\{Sn_1: s_{11}, s_{12}, \ldots;\ Sn_j: s_{j1}, s_{j2}, \ldots;\ \ldots;\ Sn_k: s_{k1}, s_{k2}, \ldots\}$, where $k$ is the total number of entity classes in S1 and $Sn_j$ is the subset of documents of entity class $n_j$;

the specific alignment process comprises:

2) calculating the distance $d$ from an entity new word $e$ in the document subset Se to each document subset $Sn_j$ in the successfully classified text set S1;

3) selecting the document subset $Sn_j$ for which the maximum distance $d$ occurs most frequently, and classifying the corresponding entity new word $e$ into the entity class to which the document subset $Sn_j$ belongs;

4) updating the document subset $Sn_j$ of the successfully classified text set S1, and repeating the above process until the document subset Se is merged into the document subset $Sn_j$;

2. The entity classification method according to claim 1, wherein the construction method of the alternative new word library comprises:

establishing a field word library candidate word set;

quantizing the candidate words in the field word bank candidate word set;

3. The entity classification method according to claim 2, wherein the establishing of the domain thesaurus candidate word set comprises:

preprocessing the corpus data in the thermal power generation field;

4. The entity classification method according to claim 2, characterized in that the quantification of the candidate words comprises quantification of word frequency, internal solidification degree, degree of freedom and position word forming probability.

5. The entity classification method according to claim 4, wherein the thresholds set in the threshold screening include a word frequency threshold, an internal solidification degree threshold, left and right connecting word information entropy thresholds, and a position word forming probability threshold.

6. The entity classification method according to claim 1, characterized in that, for entity new words that cannot be entity aligned, classification is performed by creating a new entity class; and adding the new entity category into the logged entity equipment list N.

7. The entity classification method according to claim 6, wherein the entity new word list E containing the entity new word e and the entity category to which the entity new word e belongs are finally confirmed by the user through human-computer interaction.

8. An entity classification device for corpus data in the field of thermal power generation is characterized by comprising a primary classification module, an alternative new word library, a new word extraction module and an entity alignment module;

the primary classification includes:

1) establishing input data for classification;

the input data specifically includes:

2) preprocessing a text to be classified in the classified text set S;

the entity alignment module is respectively connected with the primary classification module and the new word extraction module, and is configured to receive a successfully classified text set S1 output by the primary classification module and an entity new word list E output by the new word extraction module, and perform entity alignment on entity new words in the entity new word list E and the successfully classified text set S1 one by one to obtain an entity alignment result; determining the entity category of the entity new word according to the obtained entity alignment result;

the specific alignment process comprises:

3) selecting the document subset $Sn_j$ for which the maximum distance $d$ occurs most frequently, and classifying the corresponding entity new word $e$ into the entity class to which the document subset $Sn_j$ belongs;

4) updating the document subset $Sn_j$ of the successfully classified text set S1, and repeating the above process until the document subset Se is merged into the document subset $Sn_j$.