CN109582787B - Entity classification method and device for corpus data in thermal power generation field - Google Patents

Publication number
CN109582787B
Authority
CN
China
Prior art keywords
entity
word
classified
text
new
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811311803.3A
Other languages
Chinese (zh)
Other versions
CN109582787A (en)
Inventor
唐静
彭一轩
解来甲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yuanguang Software Co Ltd
Original Assignee
Yuanguang Software Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yuanguang Software Co Ltd
Priority to CN201811311803.3A
Publication of CN109582787A
Application granted
Publication of CN109582787B
Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/279: Recognition of textual entities
    • G06F40/289: Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295: Named entity recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a method and a device for entity classification of corpus data in the field of thermal power generation, belonging to the technical field of thermal power generation. The method comprises: performing primary classification on a text set S to be classified, which contains corpus data in the field of thermal power generation, to obtain a successfully classified text set S1 and an unsuccessfully classified text set S2; extracting entity new words from the unsuccessfully classified text set S2 and establishing an entity new word list E; and aligning the entity new words in the entity new word list E with the successfully classified text set S1 one by one to confirm the entity category of each entity new word. The invention uses text data from the thermal power generation field and combines an unsupervised professional vocabulary discovery algorithm with a text classification algorithm to achieve entity classification of power generation corpus data; the constructed thermal power generation professional lexicon can also serve as corpus support for text data mining in the field.

Description

Entity classification method and device for corpus data in thermal power generation field
Technical Field
The invention relates to the technical field of thermal power generation, in particular to an entity classification method and device for corpus data in the thermal power generation field.
Background
As typical unstructured or semi-structured data, text data has long been one of the hot spots of data mining.
Analysis and mining of text data in the thermal power generation field are of great significance for the regular defect inventory of thermal power generation enterprises and for the construction of enterprise knowledge graphs as part of long-term enterprise informatization. They help enterprises understand the operation and health of production equipment at a global level, fuse multi-dimensional data, and mine deeper knowledge.
At present, text data analysis and mining in the field of thermal power generation are still at an early stage. The main reason is that no complete corpus has been established for the document data accumulated in the field, and many statistical machine learning methods work poorly when the corpus is insufficient, so it is difficult to mine meaningful results from the text using natural language processing.
The daily operation records of a power generation enterprise consist mainly of on-duty logs and defect records. When entity classification is performed on power generation corpus data, the same device may be named differently in daily records because of differing personal habits, so the corresponding records cannot be classified correctly when classification relies on the standard device names.
Disclosure of Invention
In view of the above analysis, the present invention aims to provide a method and an apparatus for entity classification of corpus data in the field of thermal power generation, which combine a new word recognition method based on statistics with a classification algorithm to realize entity classification of corpus data of a power generation text.
The purpose of the invention is mainly realized by the following technical scheme:
an entity classification method for corpus data in the field of thermal power generation comprises the following steps:
performing primary classification on a text set S to be classified containing corpus data in the thermal power generation field to obtain a successfully classified text set S1 and an unsuccessfully classified text set S2;
extracting entity new words in the unsuccessfully classified text set S2 through the established alternative new word library, and establishing an entity new word list E;
carrying out entity alignment on the entity new words in the entity new word list E and the successfully classified text set S1 one by one to obtain an entity alignment result;
and determining the entity category of the entity new word according to the obtained entity alignment result.
Further, the method for constructing the alternative new word library comprises the following steps:
establishing a domain word bank candidate word set;
quantizing the candidate words in the domain word bank candidate word set;
performing threshold screening on the quantized candidate words to form a domain word bank;
and removing the general words from the domain word bank to form the alternative new word library.
Further, establishing the domain word bank candidate word set comprises:
preprocessing the corpus data in the thermal power generation field;
performing substring segmentation on the preprocessed corpus data to obtain substrings;
and performing word segmentation on the obtained substrings to form the domain word bank candidate word set.
Further, the quantization of the candidate words comprises quantization of word frequency, internal solidification degree, degree of freedom and position word forming probability.
Further, the thresholds set in the threshold screening include a word frequency threshold, an internal solidification degree threshold, a left connecting word information entropy threshold, a right connecting word information entropy threshold and a position word forming probability threshold.
Further, the primary classification comprises:
establishing a text set to be classified $S = \{s_1, s_2, \dots, s_i, \dots, s_m\}$, where $s_i$ is a text record in the set;
establishing a logged entity equipment list $N = \{n_1, n_2, \dots, n_j, \dots, n_k\}$, where $n_j$ is the category number of an entity;
preprocessing the text to be classified, including removing numbers and letters and splitting records;
classifying the preprocessed text set S according to the entity equipment list N to obtain the document sample space of the successfully classified text set S1, $\{S_{n_1}: s_{11}, s_{12}, \dots;\ \dots;\ S_{n_j}: s_{j1}, s_{j2}, \dots;\ \dots;\ S_{n_k}: s_{k1}, s_{k2}, \dots\}$, where k is the total number of entity classes in S1 and $S_{n_j}$ is the subset of documents belonging to entity class $n_j$.
Further, entity alignment of the entity new words with the successfully classified text set S1 comprises:
establishing a document subset Se containing the entity new words in the entity new word list E, wherein Se belongs to S2;
calculating the distance $d(e, n_j)$ from the document subset Se to each document subset $S_{n_j}$ in the successfully classified text set S1, where e is an entity new word in the entity new word list E and $n_j$ is an entity category of the successfully classified text set S1;
selecting the document subset $S_{n_j}$ for which the maximum distance d occurs most frequently, and classifying the entity new word e into the entity class to which the document subset $S_{n_j}$ belongs.
Further, entity new words that cannot be aligned are classified by creating new entity categories, and each created entity category is added to the logged entity equipment list N.
Further, the user finally confirms, through human-computer interaction, the entity new word list E containing the entity new word e and the entity category to which the entity new word e belongs.
An entity classification device for corpus data in the field of thermal power generation comprises a primary classification module, an alternative new word library, a new word extraction module and an entity alignment module;
the primary classification module is used for carrying out primary classification on an input text set S to be classified containing corpus data in the thermal power generation field to obtain a successfully classified text set S1 and an unsuccessfully classified text set S2;
the alternative new word library is used for storing entity new words in the field of thermal power generation;
the new word extraction module is respectively connected with the primary classification module and the alternative new word library, and is used for receiving the unsuccessfully classified text set S2 output by the primary classification module, extracting entity new words in the unsuccessfully classified text set S2 according to the content of the alternative new word library, and establishing an entity new word list E;
the entity alignment module is respectively connected with the primary classification module and the new word extraction module, and is configured to receive a successfully classified text set S1 output by the primary classification module and an entity new word list E output by the new word extraction module, and perform entity alignment on entity new words in the entity new word list E and the successfully classified text set S1 one by one to obtain an entity alignment result; and determining the entity category of the entity new word according to the obtained entity alignment result.
The invention has the following beneficial effects:
Text data in the thermal power generation field is used, and an unsupervised professional vocabulary discovery algorithm is combined with a text classification algorithm to achieve entity classification of power generation corpus data; the constructed thermal power generation professional lexicon can also serve as corpus support for text data mining in the field.
Drawings
The drawings are only for purposes of illustrating particular embodiments and are not to be construed as limiting the invention, wherein like reference numerals are used to designate like parts throughout.
FIG. 1 is a flowchart of an entity classification method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of the connections of the entity classification device according to an embodiment of the present invention.
Detailed Description
The preferred embodiments of the present invention will now be described in detail with reference to the accompanying drawings, which form a part hereof, and which together with the embodiments of the invention serve to explain the principles of the invention.
The embodiment of the invention discloses an entity classification method of corpus data in the field of thermal power generation, which comprises the following steps as shown in figure 1:
step S1, performing primary classification on a text set S to be classified containing corpus data in the thermal power generation field;
1) establishing input data for classification;
the input data specifically includes:
a text set to be classified $S = \{s_1, s_2, \dots, s_i, \dots, s_m\}$, where $s_i$ is a text record in the set that corresponds to one of the equipment entities, and m is the number of text records;
a registered entity equipment list $N = \{n_1, n_2, \dots, n_j, \dots, n_k\}$, where $n_j$ is the category number of an entity, each category being formed by one or more names of a piece of equipment, and k is the total number of entries in the entity equipment list;
2) preprocessing the texts in the text set S to be classified;
to eliminate information that is useless for classification, preprocessing measures such as removing numbers and letters and splitting records are applied to the texts to be classified, making them more concise;
3) classifying the preprocessed text set S according to the entity equipment list N;
classifying the text set $S = \{s_1, s_2, \dots, s_i, \dots, s_m\}$ yields a successfully classified text set S1 and an unsuccessfully classified text set S2;
the document sample space of the successfully classified text set S1 is $\{S_{n_1}: s_{11}, s_{12}, \dots;\ \dots;\ S_{n_j}: s_{j1}, s_{j2}, \dots;\ \dots;\ S_{n_k}: s_{k1}, s_{k2}, \dots\}$, where k is the total number of entity classes in S1 and $S_{n_j}$ is the subset of documents belonging to entity class $n_j$.
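The primary classification of step S1 can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names, the delimiter set used to split records, the rule of taking the first matching category, and the sample device names are all assumptions made for the example.

```python
import re

def preprocess(record: str) -> list[str]:
    """Remove digits and Latin letters, then split a record into sub-records."""
    cleaned = re.sub(r"[0-9A-Za-z]+", "", record)
    # Assumption: sub-records are separated by semicolons, full stops or newlines.
    parts = re.split(r"[;；。\n]+", cleaned)
    return [p.strip() for p in parts if p.strip()]

def primary_classify(records: list[str], device_list: dict[int, list[str]]):
    """Split records into S1 (grouped by category number) and S2 (unmatched).

    device_list plays the role of N: category number n_j -> device names.
    """
    s1 = {category: [] for category in device_list}
    s2 = []
    for record in records:
        for text in preprocess(record):
            matches = [category for category, names in device_list.items()
                       if any(name in text for name in names)]
            if matches:
                s1[matches[0]].append(text)  # assumption: first match wins
            else:
                s2.append(text)              # left for new word extraction
    return s1, s2

# Hypothetical sample data: a superheater category and a feed pump category.
N = {1: ["高温过热器"], 2: ["给水泵", "给泵"]}
S = ["1号高温过热器泄漏；2A给泵振动大", "高过后壁温高"]
s1, s2 = primary_classify(S, N)
```

Here the record that uses the shorthand "高过" instead of the standard name falls into S2, which is the situation the later steps are meant to handle.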
Step S2, extracting entity new words in the unsuccessfully classified text set S2 through the established alternative new word library, and establishing an entity new word list E;
the alternative new word library used in this step is established as follows:
1) establishing a field word library candidate word set;
The domain word bank candidate word set can be established from the thermal power generation corpus data texts accumulated by a thermal power generation enterprise; the corpus data mainly comprises on-duty logs, defect lists and the like.
The accumulated corpus data texts are preprocessed; the preprocessing specifically comprises deduplicating the data and eliminating invalid characters, such as letters, symbols and numbers, that are definitely not entity words, so that the corpus data for subsequent processing is more concise.
For the preprocessed corpus data text, sentences in the text are divided into substrings at delimiters such as spaces and line feeds;
word segmentation is performed on the substrings to form the domain word bank candidate word set;
specifically, an N-gram algorithm can be used to perform N-element segmentation on the substrings; the segments obtained include names of power generation equipment in the thermal power generation field, terms used by technicians in the field and professional descriptions of equipment faults in the field, and together they form the domain word bank candidate word set.
For example, performing N-element segmentation on a corpus substring with the N-gram algorithm yields a candidate word set that includes:
high temperature
High temperature treatment
High temperature superheating
High-temperature superheater
After the high-temperature superheater
Over-temperature
Warm and hot water
Warm superheater
After the warm superheater
Rear pair of temperature superheater
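The N-element segmentation described above can be illustrated with a short sketch. It assumes character-level n-grams of length 2 up to a hypothetical `max_n`, which suits Chinese text; the function name and the sample substring are illustrative only.

```python
from collections import Counter

def ngram_candidates(substrings, max_n=6):
    """Count every character n-gram (2 <= n <= max_n) found in the substrings."""
    counts = Counter()
    for s in substrings:
        for n in range(2, max_n + 1):
            for i in range(len(s) - n + 1):
                counts[s[i:i + n]] += 1
    return counts

# Hypothetical substring: "after the high-temperature superheater".
candidates = ngram_candidates(["高温过热器后"])
```

The resulting counter holds both real terms such as "高温过热器" and fragments such as "温过热", which is why the later quantization and threshold screening are needed.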
2) Quantizing the candidate words in the field word bank candidate word set;
the quantization criteria for the candidate words comprise word frequency, internal solidification degree, degree of freedom and position word forming probability;
the internal solidification degree is expressed by
$$\mathrm{pmi}(x, y) = \log \frac{p(xy)}{p(x)\,p(y)}$$
where x and y represent two different words in the corpus; p(xy) is the probability that x and y appear together in the corpus; p(x) is the probability that x appears alone in the corpus; and p(y) is the probability that y appears alone in the corpus. When pmi(x, y) >> 0, x and y are highly correlated, that is, x and y often occur together, and the string xy is more likely to constitute a new word.
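As a rough numerical illustration of the internal solidification degree, the pointwise mutual information can be computed from raw counts. The sketch below assumes probabilities are estimated as count divided by corpus size and uses a base-2 logarithm; both choices, and the function name, are assumptions rather than details given in the source.

```python
import math

def pmi(count_xy: int, count_x: int, count_y: int, total: int) -> float:
    """Pointwise mutual information of x and y from occurrence counts."""
    p_xy = count_xy / total
    p_x = count_x / total
    p_y = count_y / total
    return math.log2(p_xy / (p_x * p_y))  # assumption: log base 2

# Hypothetical counts: x and y each occur 10 times in 1000 tokens, 8 times together.
strong = pmi(8, 10, 10, 1000)
# x and y co-occur exactly as often as independence would predict.
neutral = pmi(1, 10, 10, 100)
```

A value well above zero, as in the first call, indicates that the two parts occur together far more often than chance, so the combined string is a plausible word.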
The degree of freedom is measured using the information entropy of the left and right connecting words; that is, degree of freedom = min(left connecting word information entropy, right connecting word information entropy):
$$H_L(w) = -\sum_{w_l \in s_l} p(w_l \mid w) \log p(w_l \mid w)$$
$$H_R(w) = -\sum_{w_r \in s_r} p(w_r \mid w) \log p(w_r \mid w)$$
where $s_l$ is the set of left adjacent connecting words of the candidate word w; $s_r$ is the set of right adjacent connecting words of the candidate word w; $p(w_l \mid w)$ is the conditional probability that $w_l$ is the left adjacent connecting word given that the candidate word w occurs; and $p(w_r \mid w)$ is the conditional probability that $w_r$ is the right adjacent connecting word given that the candidate word w occurs.
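The degree of freedom can be sketched by collecting the characters that appear immediately to the left and right of a candidate word and taking the smaller of the two entropies. The function names, the base-2 logarithm, and the rule of returning 0 when one side has no observed neighbor are assumptions made for this example.

```python
import math
from collections import Counter

def neighbor_entropy(neighbors):
    """Shannon entropy (base 2) of the observed adjacent characters."""
    counts = Counter(neighbors)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def degree_of_freedom(substrings, word):
    """min(left entropy, right entropy) of `word` over the given substrings."""
    left, right = [], []
    for s in substrings:
        start = s.find(word)
        while start != -1:
            if start > 0:
                left.append(s[start - 1])
            end = start + len(word)
            if end < len(s):
                right.append(s[end])
            start = s.find(word, start + 1)
    if not left or not right:
        return 0.0  # assumption: no neighbor on one side means no freedom
    return min(neighbor_entropy(left), neighbor_entropy(right))

# Hypothetical contexts: varied neighbors on both sides versus a fixed left neighbor.
varied = degree_of_freedom(["甲高温乙", "丙高温丁"], "高温")
fixed_left = degree_of_freedom(["甲高温乙", "甲高温丁"], "高温")
```

A candidate whose left neighbor never changes gets a degree of freedom of zero, suggesting it is a fragment of a longer word rather than a word in its own right.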
The position word forming probability is
$$P(c_i, i) = \frac{n(c_i, i)}{n(c_i)}$$
where i is the position at which the character $c_i$ appears in the word; $n(c_i, i)$ is the frequency with which $c_i$ appears at position i across all words; and $n(c_i)$ is the total frequency of $c_i$ in the corpus.
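One possible reading of the position word forming probability is the share of a character's occurrences that fall at a given position within words. The sketch below follows that reading; it approximates the corpus-wide frequency of a character by its frequency in a word list, uses zero-based absolute positions, and leaves open how per-character values are combined into a word-level score. All of these are assumptions.

```python
from collections import Counter

def position_probabilities(words):
    """Return {(char, i): n(char, i) / n(char)} computed over a word list."""
    at_position = Counter()
    overall = Counter()
    for word in words:
        for i, ch in enumerate(word):
            at_position[(ch, i)] += 1
            overall[ch] += 1
    return {key: n / overall[key[0]] for key, n in at_position.items()}

# Hypothetical word list: "高" starts two words and ends one.
probs = position_probabilities(["高温", "高压", "超高"])
```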
3) Threshold screening is performed on the quantized candidate words to form a domain word bank;
the thresholds set in the threshold screening comprise a word frequency threshold, an internal solidification degree threshold, a left connecting word information entropy threshold, a right connecting word information entropy threshold and a position word forming probability threshold;
the degree of freedom threshold is determined by setting the left and right connecting word information entropy thresholds;
the words in the candidate word set are judged and screened against the internal solidification degree threshold and the degree of freedom threshold to obtain the words used in the field;
with the word frequency threshold set, a candidate word whose word frequency is greater than the threshold is taken to be a common word in the field, and such words are selected to form the domain word bank;
with the position word forming probability threshold set, the word forming positions in the generated domain word bank are evaluated and judged, which improves word forming accuracy.
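The threshold screening can be pictured as a conjunction of comparisons, one per quantity. In the sketch below the threshold values are hypothetical (the source gives no numbers), and requiring every quantity to exceed its threshold is an assumption; the source states the direction only for word frequency.

```python
from dataclasses import dataclass

@dataclass
class Stats:
    frequency: int
    cohesion: float        # internal solidification degree
    left_entropy: float
    right_entropy: float
    position_prob: float

# Hypothetical threshold values, chosen only for illustration.
THRESHOLDS = Stats(frequency=5, cohesion=3.0, left_entropy=0.5,
                   right_entropy=0.5, position_prob=0.1)

def passes(stats: Stats, t: Stats = THRESHOLDS) -> bool:
    """True when every quantity is strictly greater than its threshold."""
    return (stats.frequency > t.frequency
            and stats.cohesion > t.cohesion
            and stats.left_entropy > t.left_entropy
            and stats.right_entropy > t.right_entropy
            and stats.position_prob > t.position_prob)

def screen(candidates: dict[str, Stats]) -> set[str]:
    """Keep the candidate words that pass all thresholds."""
    return {word for word, stats in candidates.items() if passes(stats)}

# Hypothetical statistics for a real term and for an n-gram fragment.
candidates = {
    "高温过热器": Stats(40, 6.2, 1.8, 2.1, 0.4),
    "温过": Stats(45, 1.1, 0.2, 0.3, 0.3),
}
domain_words = screen(candidates)
```

The fragment is frequent but fails the cohesion and entropy checks, so only the full device name survives.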
4) The domain word bank is compared with a general word bank, and the alternative new word library is formed after the general words are removed from the domain word bank.
The domain word bank formed in the previous step does not distinguish professional words: it still contains general words used in the field, which are unrelated to equipment and do not need to be classified. Therefore, the domain word bank is compared with a general word bank (a power plant professional word stock is available in 80 years of a power plant, and the universal word stock is a national standard universal word version), and the general words are removed to form the alternative new word library.
The words in the unsuccessfully classified text set S2 are then compared against the established alternative new word library; the entity new words that belong to the library and are contained in S2 are extracted, and the entity new word list E is established.
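Removing general words and extracting entity new words from S2 both reduce to simple set and substring operations. The sketch below is one way to express them; the function names and the sample words are assumptions, and plain substring matching stands in for whatever comparison the actual system uses.

```python
def build_alternative_library(domain_words, general_words):
    """Domain word bank minus the general word bank."""
    return set(domain_words) - set(general_words)

def extract_new_words(s2_texts, library):
    """Map each library word found in S2 to the S2 texts that contain it."""
    found = {}
    for text in s2_texts:
        for word in library:
            if word in text:
                found.setdefault(word, []).append(text)
    return found

# Hypothetical data: "泄漏" (leak) is a general word and is removed.
library = build_alternative_library({"高过", "泄漏", "给泵"}, {"泄漏"})
new_words = extract_new_words(["高过后壁温高", "阀门卡涩"], library)
```

The keys of `new_words` correspond to the entity new word list E, and each value is the document subset of S2 in which that word occurs.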
Particularly, in order to establish the entity new word list E more accurately, the classification result is finally confirmed by the user through human-computer interaction.
Step S3, aligning the entity new words in the entity new word list E with the successfully classified text set S1 one by one, and confirming the entity category of each entity new word.
The specific alignment process comprises:
1) establishing a document subset Se containing the entity new words in the entity new word list E, wherein Se belongs to S2;
2) calculating the distance $d(e, n_j)$ from the document subset Se to each document subset $S_{n_j}$ in the successfully classified text set S1, where e is an entity new word in the entity new word list E and $n_j$ is an entity category of the successfully classified text set S1;
3) selecting the document subset $S_{n_j}$ for which the maximum distance d occurs most frequently, and classifying the entity new word e into the entity class to which the document subset $S_{n_j}$ belongs;
4) updating the document subset $S_{n_j}$ of the successfully classified text set S1, and repeating the above process until the document subset Se has been merged into the document subsets $S_{n_j}$.
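The alignment step can be read as a vote: each document in Se picks the category whose documents score highest under d, and the category picked most often wins. The source does not define d, so the sketch below substitutes a hypothetical score, the best Jaccard overlap of character bigrams; the function names and sample texts are likewise assumptions.

```python
from collections import Counter

def bigrams(text):
    """Set of adjacent character pairs in a text."""
    return {text[i:i + 2] for i in range(len(text) - 1)}

def score(doc, class_docs):
    """Hypothetical stand-in for d: best bigram Jaccard overlap with a class."""
    a = bigrams(doc)
    best = 0.0
    for other in class_docs:
        b = bigrams(other)
        if a | b:
            best = max(best, len(a & b) / len(a | b))
    return best

def align(se_docs, s1):
    """Return the category chosen most often by the documents in Se, or None."""
    votes = Counter()
    for doc in se_docs:
        scores = {category: score(doc, docs) for category, docs in s1.items()}
        if not scores:
            continue
        best_category = max(scores, key=scores.get)
        if scores[best_category] > 0:  # ignore documents with no overlap at all
            votes[best_category] += 1
    return votes.most_common(1)[0][0] if votes else None

# Hypothetical data: S1 grouped by category, Se holds texts containing "高过".
s1 = {1: ["高温过热器后壁温高", "高温过热器泄漏"], 2: ["给水泵振动大"]}
category = align(["高过后壁温高", "高过泄漏"], s1)
unaligned = align(["阀门卡涩"], s1)
```

Returning `None` marks a new word that could not be aligned with any existing category.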
In particular, because thermal power equipment is updated over time, there may be new equipment that is not yet logged in the entity equipment list N, and entity new words related to such equipment cannot be aligned by the above alignment process;
for entity new words that cannot be aligned, new entity categories need to be created for classification, and each created entity category is added to the logged entity equipment list N.
In particular, to make the classification of the entity new word e more accurate, the classification result is finally confirmed by the user through human-computer interaction.
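Creating a category for an unaligned new word can be as small as appending an entry to the equipment list. The sketch below assumes that categories are numbered with consecutive integers and that the new word itself becomes the first name of the new category; neither detail is specified in the source.

```python
def register_new_category(device_list: dict[int, list[str]], new_word: str) -> int:
    """Add a category for new_word to the equipment list and return its number."""
    new_id = max(device_list, default=0) + 1  # assumption: integer numbering
    device_list[new_id] = [new_word]
    return new_id

# Hypothetical equipment list and a new device name not covered by it.
N = {1: ["高温过热器"], 2: ["给水泵", "给泵"]}
new_id = register_new_category(N, "脱硝装置")
```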
The embodiment of the invention also discloses an entity classification device for corpus data in the thermal power generation field which, as shown in FIG. 2, comprises a primary classification module, an alternative new word library, a new word extraction module and an entity alignment module;
the primary classification module is used for carrying out primary classification on an input text set S to be classified containing corpus data in the thermal power generation field to obtain a successfully classified text set S1 and an unsuccessfully classified text set S2;
the alternative new word library is used for storing entity new words in the field of thermal power generation;
the new word extraction module is respectively connected with the primary classification module and the alternative new word library, and is used for receiving the unsuccessfully classified text set S2 output by the primary classification module, extracting entity new words in the unsuccessfully classified text set S2 according to the content of the alternative new word library, and establishing an entity new word list E;
the entity alignment module is respectively connected with the primary classification module and the new word extraction module, and is configured to receive a successfully classified text set S1 output by the primary classification module and an entity new word list E output by the new word extraction module, and perform entity alignment on entity new words in the entity new word list E and the successfully classified text set S1 one by one to obtain an entity alignment result; and determining the entity category of the entity new word according to the obtained entity alignment result.
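The module connections described for the device can be sketched as a small container that wires three callables and a word library together. The class name, the field names and the callable signatures are assumptions; the stubs in the usage lines only demonstrate the data flow.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class EntityClassificationDevice:
    primary_classifier: Callable  # (S, N) -> (S1, S2)
    library: set                  # alternative new word library
    extractor: Callable           # (S2, library) -> {new word: [documents]}
    aligner: Callable             # (documents, S1) -> category or None

    def run(self, records, device_list):
        """Pass S through the modules and return {new word: category}."""
        s1, s2 = self.primary_classifier(records, device_list)
        new_words = self.extractor(s2, self.library)
        return {word: self.aligner(docs, s1) for word, docs in new_words.items()}

# Stub modules, used only to show how the outputs feed into each other.
device = EntityClassificationDevice(
    primary_classifier=lambda records, n: ({1: ["classified text"]}, ["new text"]),
    library={"new"},
    extractor=lambda s2, lib: {w: [t for t in s2 if w in t] for w in lib},
    aligner=lambda docs, s1: 1,
)
result = device.run([], {})
```

Keeping each module behind a plain callable mirrors the description, in which the new word extraction module and the entity alignment module only consume the outputs of the modules they are connected to.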
Optionally, the method for constructing the alternative new word library comprises:
1) establishing a domain word bank candidate word set;
The domain word bank candidate word set can be established from the thermal power generation corpus data texts accumulated by a thermal power generation enterprise; the corpus data mainly comprises on-duty logs, defect lists and the like.
The accumulated corpus data texts are preprocessed; the preprocessing specifically comprises deduplicating the data and eliminating invalid characters, such as letters, symbols and numbers, that are definitely not entity words, so that the corpus data for subsequent processing is more concise.
For the preprocessed corpus data text, sentences in the text are divided into substrings at delimiters such as spaces and line feeds;
word segmentation is performed on the substrings to form the domain word bank candidate word set;
specifically, an N-gram algorithm can be used to perform N-element segmentation on the substrings; the segments obtained include names of power generation equipment in the thermal power generation field, terms used by technicians in the field and professional descriptions of equipment faults in the field, and together they form the domain word bank candidate word set.
2) Quantizing the candidate words in the domain word bank candidate word set;
the quantization criteria for the candidate words comprise word frequency, internal solidification degree, degree of freedom and position word forming probability;
the internal solidification degree is expressed by
$$\mathrm{pmi}(x, y) = \log \frac{p(xy)}{p(x)\,p(y)}$$
where x and y represent two different words in the corpus; p(xy) is the probability that x and y appear together in the corpus; p(x) is the probability that x appears alone in the corpus; and p(y) is the probability that y appears alone in the corpus. When pmi(x, y) >> 0, x and y are highly correlated, that is, x and y often occur together, and the string xy is more likely to constitute a new word.
The degree of freedom is measured using the information entropy of the left and right connecting words; that is, degree of freedom = min(left connecting word information entropy, right connecting word information entropy):
$$H_L(w) = -\sum_{w_l \in s_l} p(w_l \mid w) \log p(w_l \mid w)$$
$$H_R(w) = -\sum_{w_r \in s_r} p(w_r \mid w) \log p(w_r \mid w)$$
where $s_l$ is the set of left adjacent connecting words of the candidate word w; $s_r$ is the set of right adjacent connecting words of the candidate word w; $p(w_l \mid w)$ is the conditional probability that $w_l$ is the left adjacent connecting word given that the candidate word w occurs; and $p(w_r \mid w)$ is the conditional probability that $w_r$ is the right adjacent connecting word given that the candidate word w occurs.
The position word forming probability is
$$P(c_i, i) = \frac{n(c_i, i)}{n(c_i)}$$
where i is the position at which the character $c_i$ appears in the word; $n(c_i, i)$ is the frequency with which $c_i$ appears at position i across all words; and $n(c_i)$ is the total frequency of $c_i$ in the corpus.
3) Threshold screening is performed on the quantized candidate words to form a domain word bank;
the thresholds set in the threshold screening comprise a word frequency threshold, an internal solidification degree threshold, a left connecting word information entropy threshold, a right connecting word information entropy threshold and a position word forming probability threshold;
the degree of freedom threshold is determined by setting the left and right connecting word information entropy thresholds;
the words in the candidate word set are judged and screened against the internal solidification degree threshold and the degree of freedom threshold to obtain the words used in the field;
with the word frequency threshold set, a candidate word whose word frequency is greater than the threshold is taken to be a common word in the field, and such words are selected to form the domain word bank;
with the position word forming probability threshold set, the word forming positions in the generated domain word bank are evaluated and judged, which improves word forming accuracy.
4) The domain word bank is compared with a general word bank, and the alternative new word library is formed after the general words are removed from the domain word bank.
The domain word bank formed in the previous step does not distinguish professional words: it still contains general words used in the field, which are unrelated to equipment and do not need to be classified. Therefore, the domain word bank is compared with the general word bank, and the general words are removed to form the alternative new word library.
In summary, the method and the device for entity classification of the linguistic data in the thermal power generation field provided by the embodiment of the invention utilize the text data in the thermal power generation field, and comprehensively adopt the unsupervised professional vocabulary discovery algorithm and the text classification algorithm to realize entity classification of the linguistic data in the power generation field, and the constructed professional lexicon of the thermal power generation field can also be used for linguistic support of text data mining in the field.
The above description is only for the preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention.

Claims (8)

1. An entity classification method for corpus data in the field of thermal power generation is characterized by comprising the following steps:
performing primary classification on a text set S to be classified containing corpus data in the thermal power generation field to obtain a successfully classified text set S1 and an unsuccessfully classified text set S2;
the primary classification includes:
1) establishing input data for classification;
the input data specifically includes:
a text set to be classified $S = \{s_1, s_2, \dots, s_i, \dots, s_m\}$, where $s_i$ is a text record in the set corresponding to one of the equipment entities, and m is the number of text records;
a registered entity equipment list $N = \{n_1, n_2, \dots, n_j, \dots, n_K\}$, where $n_j$ is the category number of an entity, each category being formed by one or more names of a piece of equipment, and K is the total number of entries in the entity equipment list;
2) preprocessing the texts in the text set S to be classified;
to eliminate information that is useless for classification, preprocessing measures including removal of numbers and letters and record splitting are applied to the texts to be classified, making them more concise;
3) classifying the preprocessed text set S according to the entity equipment list N;
classifying the text set $S = \{s_1, s_2, \dots, s_i, \dots, s_m\}$ yields a successfully classified text set S1 and an unsuccessfully classified text set S2;
the document sample space of the successfully classified text set S1 is $\{S_{n_1}: s_{11}, s_{12}, \dots;\ \dots;\ S_{n_j}: s_{j1}, s_{j2}, \dots;\ \dots;\ S_{n_K}: s_{K1}, s_{K2}, \dots\}$, where K is the total number of entity classes in S1 and $S_{n_j}$ is the subset of documents belonging to entity class $n_j$;
extracting entity new words in the unsuccessfully classified text set S2 through the established alternative new word library, and establishing an entity new word list E;
carrying out entity alignment on the entity new words in the entity new word list E and the successfully classified text set S1 one by one to obtain an entity alignment result;
the specific alignment process comprises:
1) establishing a document subset Se containing the entity new words in the entity new word list E, wherein Se belongs to S2;
2) calculating the distance d from an entity new word e in the document subset Se to each document subset Sn_j in the successfully classified text set S1;
3) selecting the document subset Sn_j for which the maximum distance d occurs most frequently, and classifying the corresponding entity new word e into the entity class to which the document subset Sn_j belongs;
4) updating the document subset Sn_j of the successfully classified text set S1, and repeating the above process until the document subset Se has been merged into the document subsets Sn_j;
and determining the entity category of the entity new word according to the obtained entity alignment result.
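The alignment steps 1) to 4) of claim 1 can be illustrated with a minimal Python sketch. The claim does not define the distance d, so the sketch assumes a token-level Jaccard overlap as the score (larger means closer, which matches selecting the maximum). The per-document vote is one possible reading of "occurs most frequently". The function names, the English placeholder records, and the single pass per new word are all assumptions for illustration only.

```python
from collections import Counter


def token_jaccard(a, b):
    """Assumed stand-in for the distance d: Jaccard overlap of whitespace tokens."""
    ta, tb = set(a.split()), set(b.split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def score_to_subset(doc, subset):
    """Score of one document against a class subset Sn_j (best match in the subset)."""
    return max((token_jaccard(doc, other) for other in subset), default=0.0)


def align_new_words(new_words, s2, s1):
    """Assign each new word e to an entity class of S1, or None if nothing matches.

    s1 maps an entity class to its document subset Sn_j and is updated in place;
    s2 is the list of unsuccessfully classified documents.
    """
    result = {}
    for e in new_words:
        se = [doc for doc in s2 if e in doc]          # step 1: subset Se for e
        votes = Counter()
        for doc in se:                                 # step 2: score against every Sn_j
            scores = {label: score_to_subset(doc, docs) for label, docs in s1.items()}
            if scores:
                best = max(scores, key=scores.get)
                if scores[best] > 0.0:
                    votes[best] += 1                   # step 3: count where the maximum falls
        if not votes:
            result[e] = None                           # cannot be aligned
            continue
        label = votes.most_common(1)[0][0]
        result[e] = label
        s1[label].extend(se)                           # step 4: merge Se into Sn_j
    return result


s1 = {
    "feedwater_pump": ["feedwater pump bearing temperature high",
                       "feedwater pump outlet pressure low"],
    "coal_mill": ["coal mill current fluctuation",
                  "coal mill outlet temperature high"],
}
s2 = ["turbopump bearing temperature high", "turbopump outlet pressure low",
      "desulfurizer slurry leak"]
print(align_new_words(["turbopump", "desulfurizer"], s2, s1))
```

Because the merged documents are added to Sn_j before the next new word is processed, later words are scored against the updated subsets, which loosely mirrors the update-and-repeat wording of step 4).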
2. The entity classification method according to claim 1, wherein the method for constructing the alternative new word library comprises:
establishing a candidate word set for a domain word library;
quantifying the candidate words in the candidate word set;
performing threshold screening on the quantified candidate words to form the domain word library;
and removing general words from the domain word library to form the alternative new word library.
3. The entity classification method according to claim 2, wherein establishing the candidate word set for the domain word library comprises:
preprocessing the corpus data in the thermal power generation field;
carrying out substring segmentation on the preprocessed corpus data to obtain substrings;
and performing word segmentation on the obtained substrings to form the candidate word set for the domain word library.
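The candidate generation of claim 3 can be sketched as follows. The claim does not say how substrings are cut or how they are segmented into words. This sketch assumes that substrings are split at punctuation and whitespace, and it substitutes plain character n-gram enumeration for the word segmentation step, a common choice in new-word discovery but not necessarily the patented one. `split_substrings`, `candidate_ngrams` and `max_len` are hypothetical names.

```python
import re
from collections import Counter


def split_substrings(text):
    """Cut text at runs of non-word characters (punctuation, whitespace)."""
    return [part for part in re.split(r"\W+", text) if part]


def candidate_ngrams(substrings, max_len=4):
    """Count every character n-gram of length 1..max_len as a candidate word."""
    counts = Counter()
    for s in substrings:
        for n in range(1, max_len + 1):
            for i in range(len(s) - n + 1):
                counts[s[i:i + n]] += 1
    return counts


substrings = split_substrings("abc, ab。")
print(substrings)
print(candidate_ngrams(substrings, max_len=2))
```

Enumerating n-grams never crosses a substring boundary, so punctuation acts as a hard limit on candidate words.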
4. The entity classification method according to claim 2, characterized in that the quantification of the candidate words comprises quantification of word frequency, internal cohesion (solidification degree), degree of freedom and positional word-formation probability.
5. The entity classification method according to claim 4, wherein the thresholds set in the threshold screening include a word frequency threshold, a cohesion threshold, left and right adjacent-word information entropy thresholds, and a positional word-formation probability threshold.
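The quantification and threshold screening of claims 4 and 5 can be sketched with the usual new-word-discovery statistics. The patent text does not give the formulas, so the sketch makes three assumptions: cohesion is the minimum, over all two-way splits, of p(word) / (p(left part) · p(right part)); the degree of freedom is the Shannon entropy of the left and right adjacent characters; and the threshold values are arbitrary. The positional word-formation probability is omitted here because it needs a reference lexicon that the claims do not describe. All names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def ngram_stats(substrings, max_len):
    """Count n-grams and record which characters appear to their left and right."""
    counts, left, right = Counter(), defaultdict(Counter), defaultdict(Counter)
    for s in substrings:
        for n in range(1, max_len + 1):
            for i in range(len(s) - n + 1):
                w = s[i:i + n]
                counts[w] += 1
                if i > 0:
                    left[w][s[i - 1]] += 1
                if i + n < len(s):
                    right[w][s[i + n]] += 1
    return counts, left, right


def entropy(neighbours):
    """Shannon entropy (natural log) of a neighbour-character distribution."""
    total = sum(neighbours.values())
    if total == 0:
        return 0.0
    return -sum(c / total * math.log(c / total) for c in neighbours.values())


def cohesion(word, counts, total_chars):
    """Minimum over splits of p(word) / (p(left part) * p(right part))."""
    p_word = counts[word] / total_chars
    return min(
        p_word / ((counts[word[:k]] / total_chars) * (counts[word[k:]] / total_chars))
        for k in range(1, len(word))
    )


def screen(substrings, max_len=3, min_freq=2, min_cohesion=1.5, min_entropy=0.5):
    """Keep multi-character candidates that pass every (assumed) threshold."""
    counts, left, right = ngram_stats(substrings, max_len)
    total_chars = sum(c for w, c in counts.items() if len(w) == 1)
    kept = {}
    for w, freq in counts.items():
        if len(w) < 2 or freq < min_freq:
            continue
        coh = cohesion(w, counts, total_chars)
        h_left, h_right = entropy(left[w]), entropy(right[w])
        if coh >= min_cohesion and h_left >= min_entropy and h_right >= min_entropy:
            kept[w] = (freq, coh, h_left, h_right)
    return kept


print(screen(["xaby", "zabw", "qabk"]))
```

In the toy input, "ab" recurs with three different characters on each side, so it passes the frequency, cohesion and entropy thresholds, while every other n-gram occurs only once and is filtered out.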
6. The entity classification method according to claim 1, characterized in that entity new words that cannot be entity-aligned are classified by creating a new entity category, and the new entity category is added to the registered entity equipment list N.
7. The entity classification method according to claim 6, wherein the entity new word list E containing the entity new word e, and the entity category to which the entity new word e belongs, are finally confirmed by the user through human-computer interaction.
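Claims 6 and 7 can be read as a small registration step. The sketch below is illustrative only and rests on three assumptions: the registered list N is modelled as a dictionary from category numbers to equipment names, new category numbers follow an "n<index>" pattern, and the user's confirmation is represented by a hypothetical `confirm` callback.

```python
def register_new_category(new_word, registry, confirm=lambda word: True):
    """Add an unaligned new word to the registered list N as its own category.

    registry maps category numbers to lists of equipment names.
    confirm stands in for the user's decision in the human-computer interaction.
    Returns the category number, or None if the user rejects the word.
    """
    for label, names in registry.items():
        if new_word in names:
            return label                     # already registered
    if not confirm(new_word):
        return None
    index = len(registry) + 1
    while f"n{index}" in registry:           # avoid clashing with an existing number
        index += 1
    label = f"n{index}"
    registry[label] = [new_word]
    return label


registry = {"n1": ["feedwater pump"], "n2": ["coal mill"]}
print(register_new_category("desulfurization tower", registry))
```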
8. An entity classification device for corpus data in the field of thermal power generation, characterized by comprising a primary classification module, an alternative new word library, a new word extraction module and an entity alignment module;
the primary classification module is used for carrying out primary classification on an input text set S to be classified containing corpus data in the thermal power generation field to obtain a successfully classified text set S1 and an unsuccessfully classified text set S2;
the primary classification includes:
1) establishing input data for classification;
the input data specifically includes:
a text set S to be classified: S = {s_1, s_2, ···, s_i, ···, s_m}, where s_i is a text record in the set corresponding to one of the equipment entities, and m is the number of text records;
a registered entity equipment list N: N = {n_1, n_2, ···, n_j, ···, n_K}, where n_j is the category number of an entity, each category being formed by one or more equipment names, and K is the total number of entries in the entity equipment list;
2) preprocessing the texts to be classified in the text set S to be classified;
in order to eliminate information that is useless for classification, preprocessing measures including removal of numbers and letters and record splitting are applied to the texts to be classified, so that the texts to be classified become simpler;
3) classifying the preprocessed text set S according to the entity equipment list N;
by classifying the text set S = {s_1, s_2, ···, s_i, ···, s_m}, a successfully classified text set S1 and an unsuccessfully classified text set S2 are obtained;
the document sample space of the successfully classified text set S1 is {Sn_1: s_11, s_12, ···; Sn_j: s_j1, s_j2, ···; ···; Sn_k: s_k1, s_k2, ···}, where k is the total number of entity classes in S1 and Sn_j is the subset of documents belonging to entity class n_j;
the alternative new word library is used for storing entity new words in the field of thermal power generation;
the new word extraction module is connected with the primary classification module and the alternative new word library respectively, and is used for receiving the unsuccessfully classified text set S2 output by the primary classification module, extracting entity new words from the unsuccessfully classified text set S2 according to the content of the alternative new word library, and establishing an entity new word list E;
the entity alignment module is connected with the primary classification module and the new word extraction module respectively, and is configured to receive the successfully classified text set S1 output by the primary classification module and the entity new word list E output by the new word extraction module, perform entity alignment on the entity new words in the entity new word list E and the successfully classified text set S1 one by one to obtain an entity alignment result, and determine the entity category of each entity new word according to the obtained entity alignment result;
the specific alignment process comprises:
1) establishing a document subset Se containing the entity new words in the entity new word list E, wherein Se belongs to S2;
2) calculating the distance d from an entity new word e in the document subset Se to each document subset Sn_j in the successfully classified text set S1;
3) selecting the document subset Sn_j for which the maximum distance d occurs most frequently, and classifying the corresponding entity new word e into the entity class to which the document subset Sn_j belongs;
4) updating the document subset Sn_j of the successfully classified text set S1, and repeating the above process until the document subset Se has been merged into the document subsets Sn_j.
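The primary classification module of claim 8 (steps 1) to 3) of the primary classification) can be sketched in Python. The claim names the preprocessing measures but not the matching rule. This sketch assumes that records are split at common Chinese and ASCII separators and that a preprocessed text belongs to the first category whose equipment name it contains. The function names, the separator set and the sample records are illustrative assumptions.

```python
import re


def preprocess(record):
    """Split a raw record into sub-records, then strip digits and Latin letters."""
    parts = re.split(r"[;；,，。\n]+", record)
    cleaned = (re.sub(r"[0-9A-Za-z]+", "", part).strip() for part in parts)
    return [text for text in cleaned if text]


def primary_classify(records, registry):
    """Return (S1, S2): S1 maps each category number to its matched texts,
    S2 lists the texts that matched no registered equipment name."""
    s1 = {label: [] for label in registry}
    s2 = []
    for record in records:
        for text in preprocess(record):
            label = next(
                (lab for lab, names in registry.items()
                 if any(name in text for name in names)),
                None,
            )
            if label is None:
                s2.append(text)
            else:
                s1[label].append(text)
    return s1, s2


# Registered list N: category number -> equipment names (sample values only).
registry = {"n1": ["给水泵", "电动给水泵"], "n2": ["磨煤机"]}
records = ["1号机组A给水泵轴承温度高；2B磨煤机电流波动", "汽泵前置泵振动大"]
s1, s2 = primary_classify(records, registry)
print(s1)
print(s2)
```

Texts that end up in S2, such as the record mentioning an unregistered pump name, are the input to the new word extraction module.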
CN201811311803.3A 2018-11-05 2018-11-05 Entity classification method and device for corpus data in thermal power generation field Active CN109582787B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811311803.3A CN109582787B (en) 2018-11-05 2018-11-05 Entity classification method and device for corpus data in thermal power generation field

Publications (2)

Publication Number Publication Date
CN109582787A CN109582787A (en) 2019-04-05
CN109582787B CN109582787B (en) 2020-10-20

Family

ID=65921571

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811311803.3A Active CN109582787B (en) 2018-11-05 2018-11-05 Entity classification method and device for corpus data in thermal power generation field

Country Status (1)

Country Link
CN (1) CN109582787B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110852109A (en) * 2019-11-11 2020-02-28 腾讯科技(深圳)有限公司 Corpus generating method, corpus generating device, and storage medium
CN112948570A (en) * 2019-12-11 2021-06-11 复旦大学 Unsupervised automatic domain knowledge map construction system
CN111177403B (en) * 2019-12-16 2023-06-23 恩亿科(北京)数据科技有限公司 Sample data processing method and device
CN112597760A (en) * 2020-12-04 2021-04-02 光大科技有限公司 Method and device for extracting domain words in document
CN113157903A (en) * 2020-12-28 2021-07-23 国网浙江省电力有限公司信息通信分公司 Multi-field-oriented electric power word stock construction method
CN113468332A (en) * 2021-07-14 2021-10-01 广州华多网络科技有限公司 Classification model updating method and corresponding device, equipment and medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6138087A (en) * 1994-09-30 2000-10-24 Budzinski; Robert L. Memory system for storing and retrieving experience and knowledge with natural language utilizing state representation data, word sense numbers, function codes and/or directed graphs
CN106095736A (en) * 2016-06-07 2016-11-09 华东师范大学 A kind of method of field neologisms extraction
CN106447346A (en) * 2016-08-29 2017-02-22 北京中电普华信息技术有限公司 Method and system for construction of intelligent electric power customer service system
CN107748799A (en) * 2017-11-08 2018-03-02 四川长虹电器股份有限公司 A kind of method of multi-data source movie data entity alignment
CN108363691A (en) * 2018-02-09 2018-08-03 国网江苏省电力有限公司电力科学研究院 A kind of field term identifying system and method for 95598 work order of electric power

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
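Research on Chinese Text Data Mining Technology for Electric Power and Its Application in Reliability (电力中文文本数据挖掘技术及其在可靠性中的应用研究); Qiu Jian (邱剑); China Doctoral Dissertations Full-text Database, Engineering Science and Technology II; 2017-07-15 (No. 7); p. C042-37 *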

Also Published As

Publication number Publication date
CN109582787A (en) 2019-04-05

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant