CN111831788A - Electric power corpus marking model construction method and system - Google Patents

Electric power corpus marking model construction method and system

Info

Publication number
CN111831788A
CN111831788A
Authority
CN
China
Prior art keywords
sequence
electric power
training set
corpus
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010547313.4A
Other languages
Chinese (zh)
Inventor
田然
苏杨
庞渊源
陈轩
顾彬
王磊
吴子辰
李霁轩
陈鑫
朱晓鸿
王鑫
于广荣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Information and Telecommunication Branch of State Grid Jiangsu Electric Power Co Ltd
Original Assignee
Information and Telecommunication Branch of State Grid Jiangsu Electric Power Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Information and Telecommunication Branch of State Grid Jiangsu Electric Power Co Ltd
Priority to CN202010547313.4A
Publication of CN111831788A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3344Query execution using natural language analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/237Lexical tools
    • G06F40/247Thesauruses; Synonyms
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities

Abstract

The invention discloses a method for constructing an electric power corpus labeling model. The method expands the labeled electric power corpora in a training set and trains an initial labeling model; it then judges whether the trained labeling model meets a preset requirement and, if not, uses the model to label unlabeled electric power corpora, puts the newly labeled corpora into the training set, expands the labeled corpora in the training set, and retrains the model, repeating this step until the trained model meets the preset requirement. A corresponding system is also disclosed. By expanding the labeled electric power corpora in the training set and feeding corpora labeled by the not-yet-satisfactory model back into it, the method expands the training set in a snowballing fashion and greatly reduces labor cost.

Description

Electric power corpus marking model construction method and system
Technical Field
The invention belongs to the field of text corpus labeling in the electric power domain, and relates to a method and system for constructing an electric power corpus labeling model.
Background
In the construction of text corpora for the electric power domain, the most critical part is the labeling of real corpora. Early on, all data had to be labeled manually, which was costly and produced inconsistent labels. To address these problems of the traditional approach, semi-automatic labeling methods are often used, the most common being active learning: a learner labels corpora automatically, while a selector uses some algorithm to pick a subset of the unlabeled samples for manual labeling by experts, and these samples are then added to the corpus to train the learner. However, for complex problems such as knowledge extraction there are few simple corpus samples, so the samples selected this way still require considerable time and labor to label, and the learner's training set cannot be expanded at scale.
Disclosure of Invention
In view of this, the invention provides a method and system for constructing an electric power corpus labeling model, which solve the problem that existing methods cannot expand the training set at scale.
To solve this technical problem, the invention adopts the following technical scheme:
A method for constructing an electric power corpus labeling model comprises:
expanding the labeled electric power corpora in a training set, and training an initial labeling model;
judging whether the trained labeling model meets a preset requirement; if not, labeling unlabeled electric power corpora with the labeling model, putting the newly labeled corpora into the training set, expanding the labeled electric power corpora in the training set, and retraining the labeling model, repeating this step until the trained labeling model meets the preset requirement.
The labeled electric power corpora in the training set are expanded as follows:
dividing the labeled electric power corpora into sequences, one sentence per sequence;
replacing synonyms in the sequences based on a synonym dictionary and an electric-power-domain professional dictionary, and adding the replaced sequences to the training set;
back-translating the original and replaced sequences, and adding the back-translated sequences to the training set;
cutting the back-translated sequences whose length exceeds a maximum threshold, and adding the cut sequences to the training set;
splicing together multiple back-translated sequences whose length is below a minimum threshold, and adding the spliced sequences to the training set.
The sequences are preserved before cutting and splicing.
The labeling model is retrained by incremental learning.
A system for constructing an electric power corpus labeling model comprises:
an initial module, which expands the labeled electric power corpora in the training set and trains an initial labeling model;
an incremental training module, which judges whether the trained labeling model meets a preset requirement and, if not, labels unlabeled electric power corpora with the labeling model, puts the newly labeled corpora into the training set, expands the labeled electric power corpora in the training set, and retrains the labeling model, repeating this step until the trained labeling model meets the preset requirement.
The initial module comprises a training sample expansion module for expanding the labeled electric power corpora in the training set.
The training sample expansion module comprises:
a sequence division module, which divides the labeled electric power corpora into sequences, one sentence per sequence;
a synonym replacement module, which replaces synonyms in the sequences based on a synonym dictionary and an electric-power-domain professional dictionary and adds the replaced sequences to the training set;
a back-translation module, which back-translates the original and replaced sequences and adds the back-translated sequences to the training set;
a cutting module, which cuts the back-translated sequences whose length exceeds a maximum threshold and adds the cut sequences to the training set;
a splicing module, which splices together multiple back-translated sequences whose length is below a minimum threshold and adds the spliced sequences to the training set.
The cutting module retains each sequence before cutting, and the splicing module retains each sequence before splicing.
The incremental training module retrains the labeling model by incremental learning.
A computer-readable storage medium stores one or more programs, the one or more programs comprising instructions which, when executed by a computing device, cause the computing device to perform the electric power corpus labeling model construction method.
A computing device comprises one or more processors, a memory, and one or more programs stored in the memory and configured to be executed by the one or more processors, the one or more programs including instructions for performing the electric power corpus labeling model construction method.
The invention achieves the following beneficial effects: the method expands the labeled electric power corpora in the training set, uses the labeling model that does not yet meet the preset requirement to label further electric power corpora, and adds the newly labeled corpora to the training set; expanding the training set in this snowballing fashion greatly reduces labor cost.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
Detailed Description
The invention is further described below with reference to the accompanying drawings. The following examples are intended only to illustrate the technical solutions of the invention more clearly, and do not limit its protection scope.
As shown in FIG. 1, a method for constructing an electric power corpus labeling model includes the following steps.
Step 1: expand the labeled electric power corpora in the training set, and train an initial labeling model. The electric power corpora here are unstructured text.
The labeled electric power corpora in the training set are expanded as follows.
S1) Divide the labeled electric power corpora into sequences, taking one complete sentence as a sequence.
S2) Replace synonyms in the sequences based on a synonym dictionary and an electric-power-domain professional dictionary, and add the replaced sequences to the training set.
Synonym substitution preserves the original labels of a sequence, and each sequence yields 2-6 new sequences. For example, in the sentence "Wang Xu, director of the development department of the Jiangsu power grid, stated that large-scale storage and flexible, stable dispatch of electricity has always been the direction of the power system's efforts.", the term "Jiangsu power grid" is replaced by its synonym "State Grid Jiangsu Electric Power Co., Ltd.", producing the new labeled sequence "Wang Xu, director of the development department of State Grid Jiangsu Electric Power Co., Ltd., stated that large-scale storage and flexible, stable dispatch of electricity has always been the direction of the power system's efforts."
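A sketch of step S2 under the same representation. The patent names a synonym dictionary and a power-domain professional dictionary but not their format; the flat term-to-synonyms mapping and the label-inheritance rule below are assumptions.

```python
def synonym_variants(chars, tags, synonyms, max_variants=6):
    """Step S2: generate labeled variants by dictionary-based synonym
    replacement. `synonyms` maps a surface term to alternative terms
    (assumed merged from both dictionaries). The substituted span
    inherits the label of the original term, so the sequence stays
    labeled; up to max_variants new sequences are returned, matching
    the 2-6 variants mentioned in the patent."""
    text = "".join(chars)
    variants = []
    for term, alts in synonyms.items():
        pos = text.find(term)
        if pos < 0:
            continue
        ent = tags[pos]  # label of the term's first character, e.g. "B-ORG"
        for alt in alts:
            if ent == "O":
                span = ["O"] * len(alt)
            else:
                ent_type = ent.split("-", 1)[1]
                span = ["B-" + ent_type] + ["I-" + ent_type] * (len(alt) - 1)
            new_chars = chars[:pos] + list(alt) + chars[pos + len(term):]
            new_tags = tags[:pos] + span + tags[pos + len(term):]
            variants.append((new_chars, new_tags))
            if len(variants) == max_variants:
                return variants
    return variants
```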
S3) Back-translate the original sequences and the sequences replaced in step S2 one or more times, and add the back-translated sequences to the training set.
In this embodiment, back-translation uses English as the pivot language: a Chinese sentence is first machine-translated into English, and the English sentence is then machine-translated back into Chinese.
For example, putting the sentence "Wang Xu, director of the development department of the Jiangsu power grid, stated that large-scale storage and flexible, stable dispatch of electricity has always been the direction of the power system's efforts." through several rounds of back-translation yields the new text "Wang Xu, director of the development department of State Grid Jiangsu Electric Power Co., Ltd., indicated that achieving flexible and stable delivery of large-capacity energy storage has always been the direction of the power system's efforts."
Back-translation yields sequences that express the same meaning, the same knowledge, and the same keywords with different sentence patterns; the corresponding segments of a back-translated sequence retain the labels of the original sequence.
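A sketch of step S3. The patent does not name a translation engine, so `translate` is a placeholder for any Chinese-English machine-translation call; re-anchoring each labeled keyword in the new text by exact string matching is one simple way to keep the original labels, not a method prescribed by the patent.

```python
def translate(text, src, tgt):
    """Placeholder for a machine-translation call; the patent does not
    specify an engine, so wire in any zh<->en MT system here."""
    raise NotImplementedError

def back_translate(chars, tags, rounds=1):
    """Step S3: back-translate via English as the pivot language, then
    project entity labels onto the new text by keyword matching."""
    text = "".join(chars)
    # Collect the labeled keywords (entity surface strings) first.
    keywords, i = [], 0
    while i < len(tags):
        if tags[i].startswith("B-"):
            j = i + 1
            while j < len(tags) and tags[j].startswith("I-"):
                j += 1
            keywords.append((text[i:j], tags[i].split("-", 1)[1]))
            i = j
        else:
            i += 1
    for _ in range(rounds):  # zh -> en -> zh, one or more times
        text = translate(translate(text, "zh", "en"), "en", "zh")
    new_tags = ["O"] * len(text)
    for surface, ent_type in keywords:
        pos = text.find(surface)  # keywords usually survive back-translation
        if pos >= 0:
            new_tags[pos] = "B-" + ent_type
            for k in range(pos + 1, pos + len(surface)):
                new_tags[k] = "I-" + ent_type
    return list(text), new_tags
```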
S4) Cut the back-translated sequences whose length exceeds a maximum threshold, add the cut sequences to the training set, and retain the original sequences; the maximum threshold is set manually according to the actual situation.
When a long sequence is cut, it is split into two or more sequences without cutting through any knowledge unit, and the original long sequence is still kept.
S5) Splice together multiple back-translated sequences whose length is below a minimum threshold, add the spliced sequences to the training set, and retain the original sequences; the minimum threshold is set manually according to the actual situation.
Two or more short sequences are spliced into one sequence, while the original short sequences are kept.
Cutting and splicing sequences alleviates the shortage of training data and, at the same time, evens out the uneven distribution of long and short sequences, preventing the model from under-learning positional information. Example: sequence cutting splits the long sequence "State Grid Jiangsu Electric Power Co., Ltd. was established on December 25, 1988, and is registered at No. 215 Shanghai Road, Nanjing." into "State Grid Jiangsu Electric Power Co., Ltd. was established on December 25, 1988." and "State Grid Jiangsu Electric Power Co., Ltd. is registered at No. 215 Shanghai Road, Nanjing."; sequence splicing works analogously. A sketch of both operations follows.
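In the sketch below, "knowledge is not cut" is approximated by splitting only at punctuation whose character is unlabeled, so no entity span is severed; the thresholds are the manually set values from the patent, and the caller keeps the original sequences alongside the new ones.

```python
CUT_PUNCT = "，。；！？"  # assumed clause boundaries that are safe cut points

def cut_long(chars, tags, max_len):
    """Step S4: split a sequence longer than max_len at punctuation
    outside labeled spans, so no knowledge unit is cut through."""
    if len(chars) <= max_len:
        return [(chars, tags)]
    pieces, start = [], 0
    for i, ch in enumerate(chars):
        if ch in CUT_PUNCT and tags[i] == "O" and i + 1 - start >= 2:
            pieces.append((chars[start:i + 1], tags[start:i + 1]))
            start = i + 1
    if start < len(chars):
        pieces.append((chars[start:], tags[start:]))
    return pieces if len(pieces) >= 2 else [(chars, tags)]

def splice_short(sequences, min_len):
    """Step S5: concatenate consecutive sequences shorter than min_len
    into one sequence (the originals are kept by the caller)."""
    spliced, buf_chars, buf_tags = [], [], []
    for chars, tags in sequences:
        if len(chars) >= min_len:
            continue  # only short sequences are spliced
        buf_chars += chars
        buf_tags += tags
        if len(buf_chars) >= min_len:
            spliced.append((buf_chars, buf_tags))
            buf_chars, buf_tags = [], []
    if buf_chars:
        spliced.append((buf_chars, buf_tags))
    return spliced
```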
Step 2: judge whether the trained labeling model meets the preset requirement. If not, label unlabeled electric power corpora with the labeling model, put the newly labeled corpora into the training set, expand the labeled electric power corpora in the training set, and retrain the labeling model by incremental learning; repeat this step until the trained labeling model meets the preset requirement.
Judging whether the trained labeling model meets the preset requirement means judging whether the model meets expectations or still has room for improvement. If it falls short of expectations or can still improve, the labeling model is retrained by incremental learning.
In this method, the initial labeling model is trained conventionally, and all subsequent retraining of the labeling model uses incremental learning.
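The loop of steps 1-2 can be summarized as below. The labeling model itself, the "preset requirement" (here: an F1 target on a held-out set), the batch size, and the incremental-training routine are all placeholders the patent leaves open.

```python
def build_labeling_model(model, train_set, unlabeled, dev_set,
                         expand, target_f1=0.9, batch=1000, max_rounds=10):
    """Snowball construction loop. Assumed model interface:
    model.fit(data), model.fit_incremental(data), model.label(texts),
    and model.f1(dev_set) -> float; none of these names come from the
    patent."""
    train_set = expand(train_set)               # step 1: augment labeled data
    model.fit(train_set)                        # conventional initial training
    for _ in range(max_rounds):                 # step 2: snowball iterations
        if model.f1(dev_set) >= target_f1:      # preset requirement met
            break
        pending, unlabeled = unlabeled[:batch], unlabeled[batch:]
        pseudo = model.label(pending)           # model labels unlabeled corpora
        train_set = expand(train_set + pseudo)  # add to training set and expand
        model.fit_incremental(train_set)        # retrain by incremental learning
    return model
```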
In this method, the labeled electric power corpora in the training set are expanded, a labeling model that does not yet meet the preset requirement is used to label further electric power corpora, and the newly labeled corpora are added to the training set; expanding the training set in this snowballing fashion greatly reduces labor cost.
The invention also relates to a system for constructing an electric power corpus labeling model, which comprises:
an initial module, which expands the labeled electric power corpora in the training set and trains an initial labeling model;
an incremental training module, which judges whether the trained labeling model meets the preset requirement and, if not, labels unlabeled electric power corpora with the labeling model, puts the newly labeled corpora into the training set, expands the labeled electric power corpora in the training set, and retrains the labeling model by incremental learning, repeating this step until the trained labeling model meets the preset requirement.
The initial module comprises a training sample expansion module for expanding the labeled electric power corpora in the training set.
The training sample expansion module comprises:
a sequence division module, which divides the labeled electric power corpora into sequences, one sentence per sequence;
a synonym replacement module, which replaces synonyms in the sequences based on a synonym dictionary and an electric-power-domain professional dictionary and adds the replaced sequences to the training set;
a back-translation module, which back-translates the original and replaced sequences and adds the back-translated sequences to the training set;
a cutting module, which cuts the back-translated sequences whose length exceeds a maximum threshold and adds the cut sequences to the training set; the cutting module retains each sequence before cutting;
a splicing module, which splices together multiple back-translated sequences whose length is below a minimum threshold and adds the spliced sequences to the training set; the splicing module retains each sequence before splicing.
A computer-readable storage medium stores one or more programs, the one or more programs comprising instructions which, when executed by a computing device, cause the computing device to perform the electric power corpus labeling model construction method.
A computing device comprises one or more processors, a memory, and one or more programs stored in the memory and configured to be executed by the one or more processors, the one or more programs including instructions for performing the electric power corpus labeling model construction method.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The present invention is not limited to the above embodiments; any modifications, equivalent replacements, and improvements made within the spirit and principle of the invention fall within the protection scope of its claims.

Claims (10)

1. A method for constructing an electric power corpus labeling model, characterized by comprising:
expanding the labeled electric power corpora in a training set, and training an initial labeling model;
judging whether the trained labeling model meets a preset requirement; if not, labeling unlabeled electric power corpora with the labeling model, putting the newly labeled corpora into the training set, expanding the labeled electric power corpora in the training set, and retraining the labeling model, repeating this step until the trained labeling model meets the preset requirement.
2. The method for constructing an electric power corpus labeling model according to claim 1, characterized in that the labeled electric power corpora in the training set are expanded as follows:
dividing the labeled electric power corpora into sequences, one sentence per sequence;
replacing synonyms in the sequences based on a synonym dictionary and an electric-power-domain professional dictionary, and adding the replaced sequences to the training set;
back-translating the original and replaced sequences, and adding the back-translated sequences to the training set;
cutting the back-translated sequences whose length exceeds a maximum threshold, and adding the cut sequences to the training set;
splicing together multiple back-translated sequences whose length is below a minimum threshold, and adding the spliced sequences to the training set.
3. The method for constructing an electric power corpus labeling model according to claim 2, characterized in that the sequences are preserved before cutting and splicing.
4. The method for constructing an electric power corpus labeling model according to claim 1, characterized in that the labeling model is retrained by incremental learning.
5. A system for constructing an electric power corpus labeling model, characterized by comprising:
an initial module, which expands the labeled electric power corpora in a training set and trains an initial labeling model;
an incremental training module, which judges whether the trained labeling model meets a preset requirement and, if not, labels unlabeled electric power corpora with the labeling model, puts the newly labeled corpora into the training set, expands the labeled electric power corpora in the training set, and retrains the labeling model, repeating this step until the trained labeling model meets the preset requirement.
6. The system for constructing an electric power corpus labeling model according to claim 5, characterized in that the initial module comprises a training sample expansion module for expanding the labeled electric power corpora in the training set;
the training sample expansion module comprises:
a sequence division module, which divides the labeled electric power corpora into sequences, one sentence per sequence;
a synonym replacement module, which replaces synonyms in the sequences based on a synonym dictionary and an electric-power-domain professional dictionary and adds the replaced sequences to the training set;
a back-translation module, which back-translates the original and replaced sequences and adds the back-translated sequences to the training set;
a cutting module, which cuts the back-translated sequences whose length exceeds a maximum threshold and adds the cut sequences to the training set;
a splicing module, which splices together multiple back-translated sequences whose length is below a minimum threshold and adds the spliced sequences to the training set.
7. The system for constructing an electric power corpus labeling model according to claim 6, characterized in that the cutting module retains each sequence before cutting, and the splicing module retains each sequence before splicing.
8. The system for constructing an electric power corpus labeling model according to claim 5, characterized in that the incremental training module retrains the labeling model by incremental learning.
9. A computer-readable storage medium storing one or more programs, characterized in that the one or more programs comprise instructions which, when executed by a computing device, cause the computing device to perform the method of any one of claims 1-4.
10. A computing device, characterized by comprising: one or more processors, a memory, and one or more programs stored in the memory and configured to be executed by the one or more processors, the one or more programs including instructions for performing the method of any one of claims 1-4.
CN202010547313.4A 2020-06-16 2020-06-16 Electric power corpus marking model construction method and system Pending CN111831788A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010547313.4A CN111831788A (en) 2020-06-16 2020-06-16 Electric power corpus marking model construction method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010547313.4A CN111831788A (en) 2020-06-16 2020-06-16 Electric power corpus marking model construction method and system

Publications (1)

Publication Number Publication Date
CN111831788A (en) 2020-10-27

Family

ID=72897692

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010547313.4A Pending CN111831788A (en) 2020-06-16 2020-06-16 Electric power corpus marking model construction method and system

Country Status (1)

Country Link
CN (1) CN111831788A (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110008335A (en) * 2018-12-12 2019-07-12 阿里巴巴集团控股有限公司 The method and device of natural language processing
CN110309280A (en) * 2019-05-27 2019-10-08 重庆小雨点小额贷款有限公司 A kind of corpus expansion method and relevant device
CN110597997A (en) * 2019-07-19 2019-12-20 中国人民解放军国防科技大学 Military scenario text event extraction corpus iterative construction method and device
CN111061847A (en) * 2019-11-22 2020-04-24 中国南方电网有限责任公司 Dialogue generation and corpus expansion method and device, computer equipment and storage medium
CN111177373A (en) * 2019-12-12 2020-05-19 北京明略软件系统有限公司 Method and device for obtaining training data and method and device for training model
CN111274798A (en) * 2020-01-06 2020-06-12 北京大米科技有限公司 Text subject term determining method and device, storage medium and terminal

Similar Documents

Publication Publication Date Title
CN110110054B (en) Method for acquiring question-answer pairs from unstructured text based on deep learning
CN110597997B (en) Military scenario text event extraction corpus iterative construction method and device
CN109284400B (en) Named entity identification method based on Lattice LSTM and language model
CN110543643B (en) Training method and device of text translation model
CN110134949B (en) Text labeling method and equipment based on teacher supervision
CN103823857B (en) Space information searching method based on natural language processing
CN110852117A (en) Effective data enhancement method for improving translation effect of neural machine
CN105068990B (en) A kind of English long sentence dividing method of more strategies of Machine oriented translation
CN108491459B (en) Optimization method for software code abstract automatic generation model
CN112463424A (en) End-to-end program repair method based on graph
CN108595427B (en) Subjective question scoring method and device, readable storage medium and electronic equipment
CN109753976B (en) Corpus labeling device and method
CN112966529A (en) Neural network machine translation training method, system, medium, equipment and application
CN114638220A (en) Training method of task processing model, task processing method and device
CN113420543A (en) Automatic mathematical test question labeling method based on improved Seq2Seq model
JP2019144706A (en) Device, method and program for learning relationship estimation model
CN112507117A (en) Deep learning-based maintenance suggestion automatic classification method and system
CN111831788A (en) Electric power corpus marking model construction method and system
CN111062216A (en) Named entity identification method, device, terminal and readable medium
CN109800423A (en) Method and apparatus are determined based on the power-off event of power failure plan sentence
CN114936567A (en) Unsupervised machine translation quality estimation method and device based on knowledge distillation
CN113886521A (en) Text relation automatic labeling method based on similar vocabulary
CN114239568A (en) Power grid document automatic generation method and device based on adaptive attention mechanism
CN111553158A (en) Method and system for identifying named entities in power scheduling field based on BilSTM-CRF model
CN111353291A (en) Method and system for calculating optimal label set based on complaint work order training text

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination