CN111831788A - Electric power corpus marking model construction method and system - Google Patents

Electric power corpus marking model construction method and system

Info

Publication number
CN111831788A
CN111831788A
Authority
CN
China
Prior art keywords
sequence
electric power
training set
corpus
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010547313.4A
Other languages
Chinese (zh)
Inventor
田然
苏杨
庞渊源
陈轩
顾彬
王磊
吴子辰
李霁轩
陈鑫
朱晓鸿
王鑫
于广荣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Information and Telecommunication Branch of State Grid Jiangsu Electric Power Co Ltd
Original Assignee
Information and Telecommunication Branch of State Grid Jiangsu Electric Power Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Information and Telecommunication Branch of State Grid Jiangsu Electric Power Co Ltd
Priority to CN202010547313.4A
Publication of CN111831788A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3344Query execution using natural language analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/237Lexical tools
    • G06F40/247Thesauruses; Synonyms
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities

Abstract

The invention discloses a method for constructing an electric power corpus labeling model. The method expands the labeled electric power corpora in a training set and trains an initial labeling model; it then judges whether the trained labeling model meets a preset requirement and, if not, uses the model to label unlabeled electric power corpora, puts the newly labeled corpora into the training set, expands the labeled corpora in the training set, and retrains the model, repeating this step until the trained model meets the preset requirement. A corresponding system is also disclosed. By expanding the labeled electric power corpora in the training set and feeding corpora labeled by the not-yet-satisfactory model back into it, the method expands the training set in a snowballing fashion and greatly reduces labor cost.

Description

Electric power corpus marking model construction method and system
Technical Field
The invention belongs to the field of text corpus labeling in the electric power domain, and relates to a method and system for constructing an electric power corpus labeling model.
Background
In the construction of text corpora for the electric power domain, the most critical part is the labeling of real corpora. Early on, all data had to be labeled manually, which was costly and produced inconsistent labels. To address these problems of the traditional approach, semi-automatic labeling methods are often used, the most common being active learning: a learner labels corpora automatically, while a selector uses some algorithm to pick a subset of the unlabeled samples for manual labeling by experts, and these samples are then added to the corpus to train the learner. However, for complex problems such as knowledge extraction there are few simple corpus samples, so the samples selected this way still require considerable time and labor to label, and the learner's training set cannot be expanded at scale.
Disclosure of Invention
In view of this, the invention provides a method and system for constructing an electric power corpus labeling model, which solve the problem that existing methods cannot expand the training set at scale.
To solve this technical problem, the invention adopts the following technical scheme:
A method for constructing an electric power corpus labeling model comprises:
expanding the labeled electric power corpora in a training set, and training an initial labeling model;
judging whether the trained labeling model meets a preset requirement; if not, labeling unlabeled electric power corpora with the labeling model, putting the newly labeled corpora into the training set, expanding the labeled electric power corpora in the training set, and retraining the labeling model, repeating this step until the trained labeling model meets the preset requirement.
The labeled electric power corpora in the training set are expanded as follows:
dividing the labeled electric power corpora into sequences, one sentence per sequence;
replacing synonyms in the sequences based on a synonym dictionary and an electric-power-domain professional dictionary, and adding the replaced sequences to the training set;
back-translating the original and replaced sequences, and adding the back-translated sequences to the training set;
cutting the back-translated sequences whose length exceeds a maximum threshold, and adding the cut sequences to the training set;
splicing together multiple back-translated sequences whose length is below a minimum threshold, and adding the spliced sequences to the training set.
The sequences are preserved before cutting and splicing.
The labeling model is retrained by incremental learning.
A system for constructing an electric power corpus labeling model comprises:
an initial module, which expands the labeled electric power corpora in the training set and trains an initial labeling model;
an incremental training module, which judges whether the trained labeling model meets a preset requirement and, if not, labels unlabeled electric power corpora with the labeling model, puts the newly labeled corpora into the training set, expands the labeled electric power corpora in the training set, and retrains the labeling model, repeating this step until the trained labeling model meets the preset requirement.
The initial module comprises a training sample expansion module for expanding the labeled electric power corpora in the training set.
The training sample expansion module comprises:
a sequence division module, which divides the labeled electric power corpora into sequences, one sentence per sequence;
a synonym replacement module, which replaces synonyms in the sequences based on a synonym dictionary and an electric-power-domain professional dictionary and adds the replaced sequences to the training set;
a back-translation module, which back-translates the original and replaced sequences and adds the back-translated sequences to the training set;
a cutting module, which cuts the back-translated sequences whose length exceeds a maximum threshold and adds the cut sequences to the training set;
a splicing module, which splices together multiple back-translated sequences whose length is below a minimum threshold and adds the spliced sequences to the training set.
The cutting module retains each sequence before cutting, and the splicing module retains each sequence before splicing.
The incremental training module retrains the labeling model by incremental learning.
A computer-readable storage medium stores one or more programs, the one or more programs comprising instructions which, when executed by a computing device, cause the computing device to perform the electric power corpus labeling model construction method.
A computing device comprises one or more processors, a memory, and one or more programs stored in the memory and configured to be executed by the one or more processors, the one or more programs including instructions for performing the electric power corpus labeling model construction method.
The invention achieves the following beneficial effects: the method expands the labeled electric power corpora in the training set, uses the labeling model that does not yet meet the preset requirement to label further electric power corpora, and adds the newly labeled corpora to the training set; expanding the training set in this snowballing fashion greatly reduces labor cost.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
Detailed Description
The invention is further described below with reference to the accompanying drawings. The following examples are intended only to illustrate the technical solutions of the invention more clearly, and do not limit its protection scope.
As shown in FIG. 1, a method for constructing an electric power corpus labeling model includes the following steps.
Step 1: expand the labeled electric power corpora in the training set, and train an initial labeling model. The electric power corpora here are unstructured text.
The labeled electric power corpora in the training set are expanded as follows.
S1) Divide the labeled electric power corpora into sequences, taking one complete sentence as a sequence.
S2) Replace synonyms in the sequences based on a synonym dictionary and an electric-power-domain professional dictionary, and add the replaced sequences to the training set.
Synonym substitution preserves the original labels of a sequence, and each sequence yields 2-6 new sequences. For example, in the sentence "Wang Xu, director of the development department of the Jiangsu power grid, stated that large-scale storage and flexible, stable dispatch of electricity has always been the direction of the power system's efforts.", the term "Jiangsu power grid" is replaced by its synonym "State Grid Jiangsu Electric Power Co., Ltd.", producing the new labeled sequence "Wang Xu, director of the development department of State Grid Jiangsu Electric Power Co., Ltd., stated that large-scale storage and flexible, stable dispatch of electricity has always been the direction of the power system's efforts."
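A sketch of step S2 under the same representation. The patent names a synonym dictionary and a power-domain professional dictionary but not their format; the flat term-to-synonyms mapping and the label-inheritance rule below are assumptions.

```python
def synonym_variants(chars, tags, synonyms, max_variants=6):
    """Step S2: generate labeled variants by dictionary-based synonym
    replacement. `synonyms` maps a surface term to alternative terms
    (assumed merged from both dictionaries). The substituted span
    inherits the label of the original term, so the sequence stays
    labeled; up to max_variants new sequences are returned, matching
    the 2-6 variants mentioned in the patent."""
    text = "".join(chars)
    variants = []
    for term, alts in synonyms.items():
        pos = text.find(term)
        if pos < 0:
            continue
        ent = tags[pos]  # label of the term's first character, e.g. "B-ORG"
        for alt in alts:
            if ent == "O":
                span = ["O"] * len(alt)
            else:
                ent_type = ent.split("-", 1)[1]
                span = ["B-" + ent_type] + ["I-" + ent_type] * (len(alt) - 1)
            new_chars = chars[:pos] + list(alt) + chars[pos + len(term):]
            new_tags = tags[:pos] + span + tags[pos + len(term):]
            variants.append((new_chars, new_tags))
            if len(variants) == max_variants:
                return variants
    return variants
```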
S3) Back-translate the original sequences and the sequences replaced in step S2 one or more times, and add the back-translated sequences to the training set.
In this embodiment, back-translation uses English as the pivot language: a Chinese sentence is first machine-translated into English, and the English sentence is then machine-translated back into Chinese.
For example, putting the sentence "Wang Xu, director of the development department of the Jiangsu power grid, stated that large-scale storage and flexible, stable dispatch of electricity has always been the direction of the power system's efforts." through several rounds of back-translation yields the new text "Wang Xu, director of the development department of State Grid Jiangsu Electric Power Co., Ltd., indicated that achieving flexible and stable delivery of large-capacity energy storage has always been the direction of the power system's efforts."
Back-translation yields sequences that express the same meaning, the same knowledge, and the same keywords with different sentence patterns; the corresponding segments of a back-translated sequence retain the labels of the original sequence.
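A sketch of step S3. The patent does not name a translation engine, so `translate` is a placeholder for any Chinese-English machine-translation call; re-anchoring each labeled keyword in the new text by exact string matching is one simple way to keep the original labels, not a method prescribed by the patent.

```python
def translate(text, src, tgt):
    """Placeholder for a machine-translation call; the patent does not
    specify an engine, so wire in any zh<->en MT system here."""
    raise NotImplementedError

def back_translate(chars, tags, rounds=1):
    """Step S3: back-translate via English as the pivot language, then
    project entity labels onto the new text by keyword matching."""
    text = "".join(chars)
    # Collect the labeled keywords (entity surface strings) first.
    keywords, i = [], 0
    while i < len(tags):
        if tags[i].startswith("B-"):
            j = i + 1
            while j < len(tags) and tags[j].startswith("I-"):
                j += 1
            keywords.append((text[i:j], tags[i].split("-", 1)[1]))
            i = j
        else:
            i += 1
    for _ in range(rounds):  # zh -> en -> zh, one or more times
        text = translate(translate(text, "zh", "en"), "en", "zh")
    new_tags = ["O"] * len(text)
    for surface, ent_type in keywords:
        pos = text.find(surface)  # keywords usually survive back-translation
        if pos >= 0:
            new_tags[pos] = "B-" + ent_type
            for k in range(pos + 1, pos + len(surface)):
                new_tags[k] = "I-" + ent_type
    return list(text), new_tags
```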
S4) Cut the back-translated sequences whose length exceeds a maximum threshold, add the cut sequences to the training set, and retain the original sequences; the maximum threshold is set manually according to the actual situation.
When a long sequence is cut, it is split into two or more sequences without cutting through any knowledge unit, and the original long sequence is still kept.
S5) Splice together multiple back-translated sequences whose length is below a minimum threshold, add the spliced sequences to the training set, and retain the original sequences; the minimum threshold is set manually according to the actual situation.
Two or more short sequences are spliced into one sequence, while the original short sequences are kept.
Cutting and splicing sequences alleviates the shortage of training data and, at the same time, evens out the uneven distribution of long and short sequences, preventing the model from under-learning positional information. Example: sequence cutting splits the long sequence "State Grid Jiangsu Electric Power Co., Ltd. was established on December 25, 1988, and is registered at No. 215 Shanghai Road, Nanjing." into "State Grid Jiangsu Electric Power Co., Ltd. was established on December 25, 1988." and "State Grid Jiangsu Electric Power Co., Ltd. is registered at No. 215 Shanghai Road, Nanjing."; sequence splicing works analogously. A sketch of both operations follows.
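In the sketch below, "knowledge is not cut" is approximated by splitting only at punctuation whose character is unlabeled, so no entity span is severed; the thresholds are the manually set values from the patent, and the caller keeps the original sequences alongside the new ones.

```python
CUT_PUNCT = "，。；！？"  # assumed clause boundaries that are safe cut points

def cut_long(chars, tags, max_len):
    """Step S4: split a sequence longer than max_len at punctuation
    outside labeled spans, so no knowledge unit is cut through."""
    if len(chars) <= max_len:
        return [(chars, tags)]
    pieces, start = [], 0
    for i, ch in enumerate(chars):
        if ch in CUT_PUNCT and tags[i] == "O" and i + 1 - start >= 2:
            pieces.append((chars[start:i + 1], tags[start:i + 1]))
            start = i + 1
    if start < len(chars):
        pieces.append((chars[start:], tags[start:]))
    return pieces if len(pieces) >= 2 else [(chars, tags)]

def splice_short(sequences, min_len):
    """Step S5: concatenate consecutive sequences shorter than min_len
    into one sequence (the originals are kept by the caller)."""
    spliced, buf_chars, buf_tags = [], [], []
    for chars, tags in sequences:
        if len(chars) >= min_len:
            continue  # only short sequences are spliced
        buf_chars += chars
        buf_tags += tags
        if len(buf_chars) >= min_len:
            spliced.append((buf_chars, buf_tags))
            buf_chars, buf_tags = [], []
    if buf_chars:
        spliced.append((buf_chars, buf_tags))
    return spliced
```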
Step 2: judge whether the trained labeling model meets the preset requirement. If not, label unlabeled electric power corpora with the labeling model, put the newly labeled corpora into the training set, expand the labeled electric power corpora in the training set, and retrain the labeling model by incremental learning; repeat this step until the trained labeling model meets the preset requirement.
Judging whether the trained labeling model meets the preset requirement means judging whether the model meets expectations or still has room for improvement. If it falls short of expectations or can still improve, the labeling model is retrained by incremental learning.
In this method, the initial labeling model is trained conventionally, and all subsequent retraining of the labeling model uses incremental learning.
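The loop of steps 1-2 can be summarized as below. The labeling model itself, the "preset requirement" (here: an F1 target on a held-out set), the batch size, and the incremental-training routine are all placeholders the patent leaves open.

```python
def build_labeling_model(model, train_set, unlabeled, dev_set,
                         expand, target_f1=0.9, batch=1000, max_rounds=10):
    """Snowball construction loop. Assumed model interface:
    model.fit(data), model.fit_incremental(data), model.label(texts),
    and model.f1(dev_set) -> float; none of these names come from the
    patent."""
    train_set = expand(train_set)               # step 1: augment labeled data
    model.fit(train_set)                        # conventional initial training
    for _ in range(max_rounds):                 # step 2: snowball iterations
        if model.f1(dev_set) >= target_f1:      # preset requirement met
            break
        pending, unlabeled = unlabeled[:batch], unlabeled[batch:]
        pseudo = model.label(pending)           # model labels unlabeled corpora
        train_set = expand(train_set + pseudo)  # add to training set and expand
        model.fit_incremental(train_set)        # retrain by incremental learning
    return model
```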
In this method, the labeled electric power corpora in the training set are expanded, a labeling model that does not yet meet the preset requirement is used to label further electric power corpora, and the newly labeled corpora are added to the training set; expanding the training set in this snowballing fashion greatly reduces labor cost.
The invention also relates to a system for constructing an electric power corpus labeling model, which comprises:
an initial module, which expands the labeled electric power corpora in the training set and trains an initial labeling model;
an incremental training module, which judges whether the trained labeling model meets the preset requirement and, if not, labels unlabeled electric power corpora with the labeling model, puts the newly labeled corpora into the training set, expands the labeled electric power corpora in the training set, and retrains the labeling model by incremental learning, repeating this step until the trained labeling model meets the preset requirement.
The initial module comprises a training sample expansion module for expanding the labeled electric power corpora in the training set.
The training sample expansion module comprises:
a sequence division module, which divides the labeled electric power corpora into sequences, one sentence per sequence;
a synonym replacement module, which replaces synonyms in the sequences based on a synonym dictionary and an electric-power-domain professional dictionary and adds the replaced sequences to the training set;
a back-translation module, which back-translates the original and replaced sequences and adds the back-translated sequences to the training set;
a cutting module, which cuts the back-translated sequences whose length exceeds a maximum threshold and adds the cut sequences to the training set; the cutting module retains each sequence before cutting;
a splicing module, which splices together multiple back-translated sequences whose length is below a minimum threshold and adds the spliced sequences to the training set; the splicing module retains each sequence before splicing.
A computer-readable storage medium stores one or more programs, the one or more programs comprising instructions which, when executed by a computing device, cause the computing device to perform the electric power corpus labeling model construction method.
A computing device comprises one or more processors, a memory, and one or more programs stored in the memory and configured to be executed by the one or more processors, the one or more programs including instructions for performing the electric power corpus labeling model construction method.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The present invention is not limited to the above embodiments; any modifications, equivalent replacements, and improvements made within the spirit and principle of the invention fall within the protection scope of its claims.

Claims (10)

1. A method for constructing an electric power corpus labeling model, characterized by comprising:
expanding the labeled electric power corpora in a training set, and training an initial labeling model;
judging whether the trained labeling model meets a preset requirement; if not, labeling unlabeled electric power corpora with the labeling model, putting the newly labeled corpora into the training set, expanding the labeled electric power corpora in the training set, and retraining the labeling model, repeating this step until the trained labeling model meets the preset requirement.
2. The method for constructing an electric power corpus labeling model according to claim 1, characterized in that the labeled electric power corpora in the training set are expanded as follows:
dividing the labeled electric power corpora into sequences, one sentence per sequence;
replacing synonyms in the sequences based on a synonym dictionary and an electric-power-domain professional dictionary, and adding the replaced sequences to the training set;
back-translating the original and replaced sequences, and adding the back-translated sequences to the training set;
cutting the back-translated sequences whose length exceeds a maximum threshold, and adding the cut sequences to the training set;
splicing together multiple back-translated sequences whose length is below a minimum threshold, and adding the spliced sequences to the training set.
3. The method for constructing an electric power corpus labeling model according to claim 2, characterized in that the sequences are preserved before cutting and splicing.
4. The method for constructing an electric power corpus labeling model according to claim 1, characterized in that the labeling model is retrained by incremental learning.
5. A system for constructing an electric power corpus labeling model, characterized by comprising:
an initial module, which expands the labeled electric power corpora in a training set and trains an initial labeling model;
an incremental training module, which judges whether the trained labeling model meets a preset requirement and, if not, labels unlabeled electric power corpora with the labeling model, puts the newly labeled corpora into the training set, expands the labeled electric power corpora in the training set, and retrains the labeling model, repeating this step until the trained labeling model meets the preset requirement.
6. The system for constructing an electric power corpus labeling model according to claim 5, characterized in that the initial module comprises a training sample expansion module for expanding the labeled electric power corpora in the training set;
the training sample expansion module comprises:
a sequence division module, which divides the labeled electric power corpora into sequences, one sentence per sequence;
a synonym replacement module, which replaces synonyms in the sequences based on a synonym dictionary and an electric-power-domain professional dictionary and adds the replaced sequences to the training set;
a back-translation module, which back-translates the original and replaced sequences and adds the back-translated sequences to the training set;
a cutting module, which cuts the back-translated sequences whose length exceeds a maximum threshold and adds the cut sequences to the training set;
a splicing module, which splices together multiple back-translated sequences whose length is below a minimum threshold and adds the spliced sequences to the training set.
7. The system for constructing an electric power corpus labeling model according to claim 6, characterized in that the cutting module retains each sequence before cutting, and the splicing module retains each sequence before splicing.
8. The system for constructing an electric power corpus labeling model according to claim 5, characterized in that the incremental training module retrains the labeling model by incremental learning.
9. A computer-readable storage medium storing one or more programs, characterized in that the one or more programs comprise instructions which, when executed by a computing device, cause the computing device to perform the method of any one of claims 1-4.
10. A computing device, characterized by comprising: one or more processors, a memory, and one or more programs stored in the memory and configured to be executed by the one or more processors, the one or more programs including instructions for performing the method of any one of claims 1-4.
CN202010547313.4A 2020-06-16 2020-06-16 Electric power corpus marking model construction method and system Pending CN111831788A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010547313.4A CN111831788A (en) 2020-06-16 2020-06-16 Electric power corpus marking model construction method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010547313.4A CN111831788A (en) 2020-06-16 2020-06-16 Electric power corpus marking model construction method and system

Publications (1)

Publication Number Publication Date
CN111831788A (en) 2020-10-27

Family

ID=72897692

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010547313.4A Pending CN111831788A (en) 2020-06-16 2020-06-16 Electric power corpus marking model construction method and system

Country Status (1)

Country Link
CN (1) CN111831788A (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110008335A (en) * 2018-12-12 2019-07-12 阿里巴巴集团控股有限公司 The method and device of natural language processing
CN110309280A (en) * 2019-05-27 2019-10-08 重庆小雨点小额贷款有限公司 A kind of corpus expansion method and relevant device
CN110597997A (en) * 2019-07-19 2019-12-20 中国人民解放军国防科技大学 Military scenario text event extraction corpus iterative construction method and device
CN111061847A (en) * 2019-11-22 2020-04-24 中国南方电网有限责任公司 Dialogue generation and corpus expansion method and device, computer equipment and storage medium
CN111177373A (en) * 2019-12-12 2020-05-19 北京明略软件系统有限公司 Method and device for obtaining training data and method and device for training model
CN111274798A (en) * 2020-01-06 2020-06-12 北京大米科技有限公司 Text subject term determining method and device, storage medium and terminal

Similar Documents

Publication Publication Date Title
CN110110054B (en) Method for acquiring question-answer pairs from unstructured text based on deep learning
CN110597997B (en) Military scenario text event extraction corpus iterative construction method and device
CN109284400B (en) Named entity identification method based on Lattice LSTM and language model
CN110543643B (en) Training method and device of text translation model
CN110134949B (en) Text labeling method and equipment based on teacher supervision
CN103823857B (en) Space information searching method based on natural language processing
CN110852117A (en) Effective data enhancement method for improving translation effect of neural machine
CN105068990B (en) A kind of English long sentence dividing method of more strategies of Machine oriented translation
CN108491459B (en) Optimization method for software code abstract automatic generation model
CN112463424A (en) End-to-end program repair method based on graph
CN108595427B (en) Subjective question scoring method and device, readable storage medium and electronic equipment
CN109753976B (en) Corpus labeling device and method
CN112966529A (en) Neural network machine translation training method, system, medium, equipment and application
CN114638220A (en) Training method of task processing model, task processing method and device
CN113420543A (en) Automatic mathematical test question labeling method based on improved Seq2Seq model
JP2019144706A (en) Device, method and program for learning relationship estimation model
CN112507117A (en) Deep learning-based maintenance suggestion automatic classification method and system
CN111831788A (en) Electric power corpus marking model construction method and system
CN111062216A (en) Named entity identification method, device, terminal and readable medium
CN109800423A (en) Method and apparatus are determined based on the power-off event of power failure plan sentence
CN114936567A (en) Unsupervised machine translation quality estimation method and device based on knowledge distillation
CN113886521A (en) Text relation automatic labeling method based on similar vocabulary
CN114239568A (en) Power grid document automatic generation method and device based on adaptive attention mechanism
CN111553158A (en) Method and system for identifying named entities in power scheduling field based on BilSTM-CRF model
CN111353291A (en) Method and system for calculating optimal label set based on complaint work order training text

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination