CN111489746B - Power grid dispatching voice recognition language model construction method based on BERT - Google Patents

Power grid dispatching voice recognition language model construction method based on BERT

Info

Publication number
CN111489746B
Authority
CN
China
Prior art keywords
word
power grid
grid dispatching
named entity
bert
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010148584.2A
Other languages
Chinese (zh)
Other versions
CN111489746A (en)
Inventor
陈蕾
郑伟彦
杨勇
黄武浩
张弛
乐全明
童力
陈彤
黄红兵
章毅
刘宏伟
姜健
余慧华
傅婧
郑洁
曹青
向新宇
卢家驹
何岳昊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
State Grid Zhejiang Electric Power Co Ltd
Hangzhou Power Supply Co of State Grid Zhejiang Electric Power Co Ltd
Original Assignee
State Grid Zhejiang Electric Power Co Ltd
Hangzhou Power Supply Co of State Grid Zhejiang Electric Power Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by State Grid Zhejiang Electric Power Co Ltd and Hangzhou Power Supply Co of State Grid Zhejiang Electric Power Co Ltd
Priority to CN202010148584.2A
Publication of CN111489746A
Application granted
Publication of CN111489746B
Legal status: Active
Anticipated expiration

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/02 - Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 - Training
    • G10L15/08 - Speech classification or search
    • G10L15/18 - Speech classification or search using natural language modelling
    • G10L15/183 - Speech classification or search using natural language modelling using context dependencies, e.g. language models

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to the field of power grid dispatching speech recognition, and in particular to a BERT-based power grid dispatching speech recognition language model construction method comprising the following steps: extracting character-granularity semantic features of a power grid dispatching sentence; extracting keyword features of the sentence; extracting named entity features of the sentence; and segmenting the sentence input to the original BERT model character by character to extract position features, then training the original BERT model on the semantic features, keyword features, named entity features and position features to obtain a power grid dispatching speech recognition language model. The beneficial effects of the invention are as follows: in light of the characteristics of power grid dispatching language and the dispatching speech recognition application scenario, the method improves how the BERT model builds input feature vectors for dispatching sentences and predicts output probabilities, realizes reasonableness judgment of dispatching sentences that incorporates the characteristics of the dispatching language, and achieves higher accuracy in power grid dispatching speech recognition than other commonly used language models.

Description

Power grid dispatching voice recognition language model construction method based on BERT
Technical Field
The invention relates to the field of power grid dispatching speech recognition, and in particular to a BERT-based power grid dispatching speech recognition language model construction method.
Background
With the growing scale of the power distribution network and the advance of informatization, the information involved in distribution network command keeps increasing, and dispatchers must perform large amounts of repetitive work every day, such as issuing orders, receiving orders and checking, which creates demand for an intelligent virtual dispatcher to take over this repetitive manual labor. The speech recognition link determines how accurately the virtual dispatcher understands the information reported by field personnel and is the basis for correctly processing and issuing dispatching instructions. As the two core modules of a speech recognition system, the acoustic model and the language model reconstruct text from the input speech from the perspectives of pronunciation and semantics respectively; the main function of the language model is to give the probability that an input sentence is a reasonable sentence, i.e., to measure the semantic reasonableness of the sentence. Since language models often involve semantic understanding in a specific field, they must be designed around the language characteristics of the application field to improve model accuracy.
At present there is little research on speech recognition language models in the power domain. Some studies build power speech recognition systems but design mainly for the acoustic model; on the language model side they only consider the choice of training corpus and do not improve the model structure. Some studies add grammar rules when applying a power dispatching language model to help judge the reasonableness of dispatching language, but the reasonableness of dispatching content involving power grid terminology, named entities and the like is difficult to determine through grammar rules. Some studies consider power terminology and propose a dynamic language model optimization method that can add domain words in real time, improving the accuracy of power speech recognition, but fuzzy matching of inaccurate pronunciation is not fully addressed. Moreover, the language models adopted in these studies are all statistical language models; neural network language models, which have advantages in accuracy and generalization ability, are not adopted.
Disclosure of Invention
To solve the above problems, the invention provides a BERT-based power grid dispatching speech recognition language model construction method.
A BERT-based power grid dispatching speech recognition language model construction method comprises the following steps:
extracting character-granularity semantic features of power grid dispatching sentences;
extracting keyword features of power grid dispatching sentences;
extracting named entity features of power grid dispatching sentences;
and segmenting the power grid dispatching sentences input to the original BERT model character by character to extract position features, and training the original BERT model based on the semantic features, keyword features, named entity features and position features to obtain the power grid dispatching speech recognition language model.
Preferably, the extracting character-granularity semantic features of a power grid dispatching sentence comprises:
segmenting the dispatching sentence at character granularity, the semantic feature vector of each character being generated by the skip-gram model of word2vec.
Preferably, the extracting keyword features of a power grid dispatching sentence comprises:
for each character in the power grid dispatching sentence, splitting its pinyin into an initial, a final and a tone, wherein a whole-syllable-recognition syllable is split directly into an initial and a final, a compound final is not split further, and for a character without an initial or without a tone the missing part is recorded as a null value;
calculating the similarity between each character in the power grid dispatching sentence and each keyword;
and for each character in the power grid dispatching sentence, taking the semantic feature vector of the keyword with the highest similarity and obtaining the character's keyword feature vector according to the similarity.
Preferably, the calculating the similarity between each character in the power grid dispatching sentence and each keyword comprises:
the calculation formula is as follows:
[Formula (1), shown as an image in the original: the pinyin similarity of two characters, computed from sim_sheng, sim_yun and sim_diao]
in the formula: sim_sheng is 1 when the two initials are the same, 0.5 when they differ but form a corresponding flat-tongue/retroflex pair, and 0 otherwise; sim_yun is 1 when the two finals are the same, 0.5 when they differ but form a corresponding front-nasal/back-nasal pair, and 0 otherwise; sim_diao is 1 when the two tones are the same, and 0 otherwise.
Preferably, the extracting named entity features of a power grid dispatching sentence comprises:
constructing a named entity dictionary from power grid ledger information, and counting the numbers of characters in the shortest and longest named entities in the dictionary, denoted c and d respectively;
for each character in the power grid dispatching sentence, taking all character sequences of length q (q = c, c+1, …, d) containing the character, and then calculating the similarity between each character sequence of length q and each entry of length q in the named entity dictionary;
and calculating the named entity feature of each character in the power grid dispatching sentence based on the similarity between each character sequence of length q and each entry of length q in the named entity dictionary.
Preferably, the calculating the similarity between each character sequence of length q and each entry of length q in the named entity dictionary comprises:
the calculation formula is as follows:
[Formula (2), shown as an image in the original: the similarity of a character sequence to a named entity, aggregated from the per-character similarities sim_zi(r)]
in the formula: sim_zi(r) denotes the similarity between the r-th character of the character sequence and the r-th character of the named entity.
Preferably, the calculating the named entity feature of each character in the power grid dispatching sentence based on the similarity between each character sequence of length q and each entry of length q in the named entity dictionary comprises:
for each character, there are e corresponding character sequences; the maximum similarity between the s-th sequence (s = 1, 2, …, e) and the named entities is denoted msim_xu(s), giving e such maxima in total; let the largest of them be msim_xu(t); the t-th sequence is then called the matching sequence of the character, and the character's named entity feature vector is calculated:
[Formula (3), shown as an image in the original: the named entity feature vector f(u), constructed from g·msim_xu(t), pos, len and dim]
in the formula: f(u) denotes the value of the named entity feature vector in the u-th dimension; g·msim_xu(t) denotes the misrecognition probability of the matching sequence, with g being 0 when the matching sequence is identical to the named entity and 1 otherwise; pos is the position of the character within the matching sequence; len denotes the length of the matching sequence; dim denotes the dimension of the named entity feature vector.
Preferably, the training the original BERT model based on the semantic features, keyword features, named entity features and position features to obtain the power grid dispatching speech recognition language model comprises:
performing unsupervised MLM-task pre-training of the original BERT model;
and performing supervised training of the original BERT model based on the reasonableness probability of dispatching sentences.
Preferably, the unsupervised MLM-task pre-training of the original BERT model comprises:
randomly masking, by the MLM task, the input of some segmentation units, attaching a softmax layer to the corresponding output representation vectors to predict the masked words or characters, and training the parameters of the original BERT model over repeated predictions.
Preferably, the performing supervised training of the original BERT model based on the reasonableness probability of dispatching sentences comprises:
for a power grid dispatching sentence containing j characters, masking the input of the k-th character (k = 1, 2, …, j) in turn, predicting the probability pro_k that the corresponding output is that character using the original BERT model pre-trained on the MLM task and a softmax layer, and finally calculating the probability that the power grid dispatching sentence is a reasonable sentence:
[Formula (4), shown as an image in the original: the probability that the sentence is reasonable, aggregated from pro_1, …, pro_j]
the invention has the beneficial effects that:
according to the method for constructing the power grid scheduling speech recognition language model based on the BERT, the input feature vector and output probability prediction method of the scheduling statement of the BERT model can be improved according to the power grid scheduling language characteristics and the scheduling speech recognition application scene, the rationality judgment of the power grid scheduling statement in combination with the scheduling language characteristics is realized, and the method has higher accuracy in the aspect of power grid scheduling speech recognition compared with other commonly used language models.
Drawings
The present invention will be described in further detail with reference to the accompanying drawings and specific embodiments.
Fig. 1 is a schematic flowchart of a method for building a speech recognition language model for power grid dispatching based on BERT according to an embodiment of the present invention.
Detailed Description
The technical solutions of the present invention will be further described below with reference to the accompanying drawings, but the present invention is not limited to these embodiments.
The basic idea of the invention is to combine the characteristics of power grid dispatching sentences and propose extraction methods for dispatching semantic features, keyword features and named entity features, so as to generate multiple categories of feature vectors for the model's input sentences; and to adjust BERT's training procedure according to the task characteristics of power grid dispatching speech recognition, so that the BERT-based neural network language model can judge the reasonableness of dispatching sentences without supervision.
Analysis shows that power grid dispatching sentences have the following characteristics: 1) they contain a large number of named entities; for example, field personnel reporting on operating equipment may mention substation names, line names, pole names, switch names and so on, which a general-purpose language model usually has difficulty recognizing accurately for lack of the corresponding prior knowledge; 2) the wording of dispatching instructions conforms to the relevant specifications of the power field, and some power terminology has relatively fixed naming patterns: for example, a substation is named 'place name + station', a line 'place name + number + line', a pole 'number + pole' or 'place name + number + pole', and so on; 3) because field personnel speak Mandarin with accents and outdoor environments are noisy, the sentence obtained from acoustic model recognition of on-site speech input may differ from the correctly pronounced one, e.g. 'construction branch line' being recognized as 'deadline branch line' (phonetically similar in the original Chinese); therefore, when the language model judges a recognition result, the difference between the language model's input sentence and the actual sentence must be fully considered.
Based on this idea, the invention provides a BERT-based power grid dispatching speech recognition language model construction method, as shown in FIG. 1, comprising the following steps:
S1: extracting the character-granularity semantic features of the power grid dispatching sentence.
Whether statistical, such as the n-gram model, or neural network based, language models usually segment sentences at word granularity. However, power grid dispatching sentences contain a large number of named entities specific to the power field, plus possible interference from inaccurate pronunciation, so performing word segmentation on dispatching text in advance can produce splits that deviate badly from the actual meaning, for example splitting an entity such as the 'Hucheng A555 line' (rendered literally as 'tiger becoming A555 line') in the middle of the station name. Even generating multiple candidate segmentations often fails to cover the correct one. To avoid segmentation errors harming feature extraction accuracy, dispatching sentences are segmented directly at character granularity, and the semantic feature vector of each character is generated with the skip-gram model of word2vec. In this character-granularity distributed representation, a piece of dispatching text containing a characters is converted into a b-dimensional vectors, where the p-th vector (p = 1, 2, …, a) represents the semantic features of the p-th character of the text and b is the dimension of each character's feature vector.
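For orientation, the following is a minimal sketch of this character-granularity embedding step using the gensim library's Word2Vec with sg=1 (skip-gram); the sample sentences, corpus size and all hyperparameters are invented stand-ins, not values from the patent:

```python
from gensim.models import Word2Vec

# Corpus: one dispatching sentence per entry, tokenized into characters
# (no word segmentation). The sentences here are made-up stand-ins.
corpus = [
    list("南洋T649线重合闸由信号改跳闸"),
    list("A555线停电检修"),
]

# sg=1 selects the skip-gram model; vector_size is b, the per-character
# feature dimension. All hyperparameters here are illustrative only.
model = Word2Vec(corpus, vector_size=100, window=5, min_count=1, sg=1, epochs=50)

sentence = list("南洋T649线跳闸")
# a characters -> a vectors of dimension b (the semantic features of step S1)
semantic_features = [model.wv[ch] for ch in sentence]
```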
S2: extracting the keyword features of the power grid dispatching sentence.
Although grid dispatching language belongs to the category of natural language, the professional terms it contains still conform to the specifications of the power field. Certain fixed keywords of power terminology can effectively delimit the semantic units before and after them; for example, the keywords for 'substation' and 'line' make it possible to identify the substation name field and line name field in a dispatching sentence such as 'flood domain south ocean T649 line reclosing changed from signal to trip'. Therefore, to let the language model understand the true meaning of grid dispatching language more accurately, keyword features must be extracted; the specific keywords are shown in Table 1:
TABLE 1 keywords for Power grid dispatching language
[Table 1, shown as an image in the original: keywords of the power grid dispatching language]
Since grid dispatching information is entered by field personnel through speech, the keyword features in the dispatching information are extracted in terms of the pronunciation of characters, and a similarity calculation method based on pinyin characteristics is therefore proposed. For each character in the dispatching information, its pinyin is first split into three parts: initial, final and tone. A whole-syllable-recognition syllable is split directly into an initial and a final, for example 'yin' into 'y' and 'in'; a compound final is not split further, for example 'uan', formed by combining the finals 'u' and 'an', is treated as a new final; for characters without an initial (e.g. 'an') or without a tone (e.g. neutral-tone characters), the missing part is recorded as a null value. Then the similarity between each character in the dispatching information and each keyword is calculated by the following formula:
[Formula (1), shown as an image in the original: the pinyin similarity of two characters, computed from sim_sheng, sim_yun and sim_diao]
in the formula: sim_sheng is 1 when the two initials are the same, 0.5 when they differ but form a corresponding flat-tongue/retroflex pair (such as 'z' and 'zh'), and 0 otherwise; sim_yun is 1 when the two finals are the same, 0.5 when they differ but form a corresponding front-nasal/back-nasal pair (such as 'an' and 'ang'), and 0 otherwise; sim_diao is 1 when the two tones are the same, and 0 otherwise. Finally, for each character in the dispatching information, the semantic feature vector of the keyword with the highest similarity is taken (if several keywords tie for the highest similarity, the mean of their semantic feature vectors is taken) and multiplied by the similarity to obtain the character's keyword feature vector.
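As an illustration, the following sketch implements the pinyin split and formula (1); since the formula itself is shown only as an image here, the unweighted average of sim_sheng, sim_yun and sim_diao is an assumption, and the small pair tables are likewise illustrative (in practice the character-to-pinyin lookup could come from a library such as pypinyin):

```python
import numpy as np

# Initials list, longest first so 'zh' is matched before 'z'. Following the
# patent's treatment of whole-syllable-recognition syllables, 'y' and 'w'
# are treated as initials (so 'yin' splits into 'y' + 'in').
INITIALS = ["zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l", "g",
            "k", "h", "j", "q", "x", "r", "z", "c", "s", "y", "w"]

def split_pinyin(syllable):
    """Split a toned pinyin string such as 'zhong1' into (initial, final, tone);
    a missing initial or tone is recorded as the null value ''."""
    tone = syllable[-1] if syllable[-1].isdigit() else ""
    body = syllable[:-1] if tone else syllable
    for ini in INITIALS:
        if body.startswith(ini):
            return ini, body[len(ini):], tone
    return "", body, tone

FLAT_RETROFLEX = {("z", "zh"), ("c", "ch"), ("s", "sh")}   # corresponding pairs
NASAL_PAIRS = {("an", "ang"), ("en", "eng"), ("in", "ing")}

def _component_sim(a, b, pairs):
    if a == b:
        return 1.0
    return 0.5 if (a, b) in pairs or (b, a) in pairs else 0.0

def char_sim(p1, p2):
    """Formula (1) sketch: pinyin similarity of two characters."""
    i1, f1, t1 = split_pinyin(p1)
    i2, f2, t2 = split_pinyin(p2)
    sim_sheng = _component_sim(i1, i2, FLAT_RETROFLEX)
    sim_yun = _component_sim(f1, f2, NASAL_PAIRS)
    sim_diao = 1.0 if t1 == t2 else 0.0
    return (sim_sheng + sim_yun + sim_diao) / 3.0   # unweighted average (assumed)

def keyword_feature(char_py, keywords):
    """keywords: list of (pinyin, semantic feature vector) for the Table 1
    keywords; ties on the highest similarity are resolved by averaging."""
    sims = [char_sim(char_py, kw_py) for kw_py, _ in keywords]
    best = max(sims)
    tied = [vec for (_, vec), s in zip(keywords, sims) if s == best]
    return best * np.mean(tied, axis=0)
```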
S3: extracting the named entity features of the power grid dispatching sentence.
Most named entities in grid dispatching language, such as substation names and line names, are not common Chinese vocabulary. They therefore occur infrequently in a power dispatching text corpus, and the available context information is very limited, so in practical applications the correctness of a recognized named entity is difficult to determine from context. Power grid ledger information therefore needs to be introduced, and named entity features of the grid dispatching language constructed, to assist in judging whether named entities have been recognized correctly.
To this end, a named entity dictionary is first constructed from grid ledger information, which includes the names of the individual power stations, devices and the like. At the same time, the numbers of characters in the shortest and longest named entities in the dictionary are counted and denoted c and d respectively.
Then, for each character in the grid dispatching sentence, all character sequences of length q (q = c, c+1, …, d) containing that character are taken, and the similarity between each sequence of length q and each entry of length q in the named entity dictionary is computed. This similarity likewise needs to be defined on the pronunciation of characters, with the following formula:
[Formula (2), shown as an image in the original: the similarity of a character sequence to a named entity, aggregated from the per-character similarities sim_zi(r)]
in the formula: sim_zi(r) denotes the similarity between the r-th character of the character sequence and the r-th character of the named entity, calculated according to formula (1).
Finally, the named entity feature of each character in the grid dispatching sentence is formed. Each character has e corresponding character sequences; the maximum similarity between the s-th sequence (s = 1, 2, …, e) and the named entities is denoted msim_xu(s), giving e such maxima in total. Let the largest of them be msim_xu(t) (i.e., the maximum similarity of the t-th sequence); the t-th sequence is then called the matching sequence of the character, and the character's named entity feature vector is computed according to formula (3):
[Formula (3), shown as an image in the original: the named entity feature vector f(u), constructed from g·msim_xu(t), pos, len and dim]
in the formula: f(u) denotes the value of the named entity feature vector in the u-th dimension; g·msim_xu(t) denotes the misrecognition probability of the matching sequence: the higher the similarity msim_xu(t) between the matching sequence and a named entity without the two being identical, the more likely the matching sequence is a misrecognition result (for example, a station name misrecognized as the phonetically similar 'recovery station'), so the misrecognition probability grows with msim_xu(t); if, however, the matching sequence is exactly identical to the named entity, it is considered correctly recognized, i.e. the misrecognition probability is 0, which is achieved by setting g = 0 (g = 1 otherwise), making g·msim_xu(t) = 0; pos indicates the position of the character within the matching sequence (the character is the pos-th character); len is the length of the matching sequence; dim is the dimension of the named entity feature vector.
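A sketch of the matching-sequence search follows; the per-character averaging in seq_entity_sim stands in for formula (2) (an assumption, since that formula is shown only as an image), the assembly of f(u) itself is omitted because formula (3) is likewise not reproduced, py is an assumed character-to-pinyin helper, and char_sim is the formula (1) sketch above:

```python
def seq_entity_sim(seq, entity, py):
    """Formula (2) sketch: similarity of a character sequence to an
    equal-length named entity, assumed to be the mean of the per-character
    pinyin similarities sim_zi(r)."""
    return sum(char_sim(py(a), py(b)) for a, b in zip(seq, entity)) / len(seq)

def matching_sequence(sentence, idx, entity_dict, c, d, py):
    """For the character sentence[idx], enumerate every character sequence of
    length q = c..d containing it and return (msim_xu(t), pos, len, g), the
    quantities that formula (3) packs into the feature vector f(u).
    Assumes len(sentence) >= c so that at least one window exists."""
    best = None   # (similarity, sequence, pos, length)
    for q in range(c, d + 1):
        entities_q = [e for e in entity_dict if len(e) == q]
        # every window of length q that covers position idx
        for start in range(max(0, idx - q + 1), min(idx, len(sentence) - q) + 1):
            seq = sentence[start:start + q]
            for ent in entities_q:
                s = seq_entity_sim(seq, ent, py)
                if best is None or s > best[0]:
                    best = (s, seq, idx - start + 1, q)
    msim, seq, pos, length = best
    g = 0 if seq in entity_dict else 1   # identical to an entity -> probability 0
    return msim, pos, length, g
```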
S4: segmenting the power grid dispatching sentences input to the original BERT model character by character to extract position features, and training the original BERT model based on semantic features, keyword features, named entity features and position features to obtain the power grid dispatching speech recognition language model.
The grid dispatching sentences input to BERT are segmented character by character. In the original BERT model structure, three categories of features are extracted for each segmentation unit: semantic features, segment features and position features. The semantic feature vector reflects the semantic information of each segmentation unit; the segment feature vector marks which sentence each unit belongs to when two sentences are input to BERT at once; the position feature vector represents the position of each unit in the sentence. In the power grid dispatching speech recognition language model, the semantic feature vector of each character of a dispatching sentence is generated in step S1; since power grid dispatching instructions appear as single sentences, segment features need not be added to the dispatching language model; and the position feature vectors are learned automatically during model training, following the BERT approach. At the same time, in view of the characteristics of grid dispatching language, the keyword feature vectors of step S2 and the named entity feature vectors of step S3 are added to improve the accuracy with which the language model understands grid dispatching language. Each character of the final dispatching sentence thus carries four categories of features: semantic features, position features, keyword features and named entity features.
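How the four per-character feature vectors are combined into one input vector is not spelled out in this text; since the original BERT combines its own embedding types by element-wise addition, the sketch below assumes addition of equal-dimensional vectors, purely as an illustration:

```python
import numpy as np

def character_input(semantic, position, keyword, entity):
    """Element-wise sum of the four per-character feature vectors.
    Addition and equal dimensionality are assumptions, mirroring how the
    original BERT sums its own embedding types."""
    return semantic + position + keyword + entity

# One input row per character of the dispatching sentence:
# X = np.stack([character_input(s, p, k, e) for s, p, k, e in zip(S, P, K, E)])
```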
During unsupervised pre-training, the original BERT model uses two training tasks, namely the Masked Language Model (MLM) and Next Sentence Prediction (NSP). The MLM task randomly masks the input of some segmentation units, attaches a softmax layer to the corresponding output representation vectors to predict the masked words or characters, and trains BERT's parameters over repeated predictions; the NSP task inputs two sentences at once and trains BERT by predicting whether they are consecutive sentences in a real document. Again because power grid dispatching instructions appear as single sentences, NSP pre-training is unnecessary when constructing the grid dispatching language model, and only MLM pre-training is performed.
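For orientation, a minimal MLM-only pre-training recipe with the Hugging Face transformers library is sketched below; the checkpoint name, corpus file and hyperparameters are assumptions, and wiring the keyword and named entity feature vectors into the embedding layer, which the invention requires, is beyond this sketch:

```python
from transformers import (BertForMaskedLM, BertTokenizerFast,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)
from datasets import load_dataset

tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")  # character-level
model = BertForMaskedLM.from_pretrained("bert-base-chinese")

# dispatch.txt: one dispatching sentence per line (hypothetical corpus file)
ds = load_dataset("text", data_files={"train": "dispatch.txt"})["train"]
ds = ds.map(lambda ex: tokenizer(ex["text"], truncation=True, max_length=64),
            batched=True, remove_columns=["text"])

# Randomly mask 15% of the input tokens, as in standard MLM pre-training
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bert-dispatch-mlm", num_train_epochs=3,
                           per_device_train_batch_size=32),
    train_dataset=ds,
    data_collator=collator,
)
trainer.train()   # NSP is simply not used; only the MLM objective is optimized
```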
After unsupervised pre-training, the original BERT model normally requires supervised fine-tuning to suit a specific natural language processing task, but the fine-tuning process consumes a great deal of manual data annotation. The invention therefore proposes, in line with the task of the grid dispatching language model, to calculate the reasonableness probability of a dispatching sentence directly, i.e. to judge the reasonableness of the sentence. For a dispatching sentence containing j characters, the input of the k-th character (k = 1, 2, …, j) is masked in turn, the probability pro_k that the corresponding output is that character is predicted with the MLM-pre-trained BERT and a softmax layer, and the probability that the dispatching sentence is a reasonable sentence is finally obtained:
[Formula (4), shown as an image in the original: the probability that the sentence is reasonable, aggregated from pro_1, …, pro_j]
the method can fully utilize the pre-training result of the model on the MLM task on one hand, and does not need to add extra labeled data on the other hand, thereby effectively reducing the model training threshold.
Various modifications or additions may be made to the described embodiments or alternatives may be employed by those skilled in the art without departing from the spirit or ambit of the invention as defined in the appended claims.

Claims (8)

1. A BERT-based power grid dispatching speech recognition language model construction method, characterized by comprising the following steps:
extracting character-granularity semantic features of a power grid dispatching sentence;
extracting keyword features of the power grid dispatching sentence;
extracting named entity features of the power grid dispatching sentence;
segmenting the power grid dispatching sentence input to the original BERT model character by character to extract position features, and training the original BERT model based on the semantic features, keyword features, named entity features and position features to obtain the power grid dispatching speech recognition language model;
the method for extracting the named entity features of the power grid dispatching statements comprises the following steps:
constructing a named entity dictionary by using the power grid standing book information, and counting the word numbers of the shortest named entity and the longest named entity in the named entity dictionary and respectively recording the word numbers as c and d;
for each word in the power grid dispatching sentence, extracting all word sequences with the length of q (q is c, c +1, …, d) including the word, and then calculating to obtain the similarity between each word sequence with the length of q and each word with the length of q in the named entity dictionary;
calculating the named entity characteristics of each character in the power grid dispatching information based on the similarity between each character sequence with the length of q and each word with the length of q in the named entity dictionary;
the method for calculating the named entity characteristics of each word in the power grid dispatching information based on the similarity between each word sequence with the length of q and each word with the length of q in the named entity dictionary comprises the following steps:
for each word, let it have e correspondencesThe maximum similarity between the s-th word sequence (s ═ 1,2, …, e) and each named entity is recorded as msim xu(s) E maximum similarity values are obtained, and the maximum value among the e maximum similarity values is assumed to be msim xu(t) Then, the tth word sequence is called as a matching word sequence of the word, and the named entity feature vector of the word is calculated:
[Formula (3), shown as an image in the original: the named entity feature vector f(u), constructed from g·msim_xu(t), pos, len and dim]
in the formula: f(u) denotes the value of the named entity feature vector in the u-th dimension; g·msim_xu(t) denotes the misrecognition probability of the matching sequence, with g being 0 when the matching sequence is identical to the named entity and 1 otherwise; pos is the position of the character within the matching sequence; len denotes the length of the matching sequence; dim denotes the dimension of the named entity feature vector.
2. The BERT-based power grid dispatching speech recognition language model construction method of claim 1, wherein the extracting character-granularity semantic features of the power grid dispatching sentence comprises:
segmenting the dispatching sentence at character granularity, the semantic feature vector of each character being generated by the skip-gram model of word2vec.
3. The BERT-based power grid dispatching speech recognition language model construction method of claim 1, wherein the extracting keyword features of the power grid dispatching sentence comprises:
for each character in the power grid dispatching sentence, splitting its pinyin into an initial, a final and a tone, wherein a whole-syllable-recognition syllable is split directly into an initial and a final, a compound final is not split further, and for a character without an initial or without a tone the missing part is recorded as a null value;
calculating the similarity between each character in the power grid dispatching sentence and each keyword;
and for each character in the power grid dispatching sentence, taking the semantic feature vector of the keyword with the highest similarity and obtaining the character's keyword feature vector according to the similarity.
4. The BERT-based power grid dispatching speech recognition language model construction method of claim 3, wherein the calculating the similarity between each character in the power grid dispatching sentence and each keyword comprises:
the calculation formula is as follows:
[Formula (1), shown as an image in the original: the pinyin similarity of two characters, computed from sim_sheng, sim_yun and sim_diao]
in the formula: sim_sheng is 1 when the two initials are the same, 0.5 when they differ but form a corresponding flat-tongue/retroflex pair, and 0 otherwise; sim_yun is 1 when the two finals are the same, 0.5 when they differ but form a corresponding front-nasal/back-nasal pair, and 0 otherwise; sim_diao is 1 when the two tones are the same, and 0 otherwise.
5. The BERT-based power grid dispatching speech recognition language model construction method of claim 1, wherein the calculating the similarity between each character sequence of length q and each entry of length q in the named entity dictionary comprises:
the calculation formula is as follows:
[Formula (2), shown as an image in the original: the similarity of a character sequence to a named entity, aggregated from the per-character similarities sim_zi(r)]
in the formula: sim_zi(r) denotes the similarity between the r-th character of the character sequence and the r-th character of the named entity.
6. The BERT-based power grid dispatching speech recognition language model construction method of claim 1, wherein the training the original BERT model based on the semantic features, keyword features, named entity features and position features to obtain the power grid dispatching speech recognition language model comprises:
performing unsupervised MLM-task pre-training of the original BERT model;
and performing supervised training of the original BERT model based on the reasonableness probability of dispatching sentences.
7. The BERT-based power grid dispatching speech recognition language model construction method of claim 6, wherein the unsupervised MLM-task pre-training of the original BERT model comprises:
randomly masking, by the MLM task, the input of some segmentation units, attaching a softmax layer to the corresponding output representation vectors to predict the masked words or characters, and training the parameters of the original BERT model over repeated predictions.
8. The BERT-based power grid dispatching speech recognition language model construction method of claim 7, wherein the supervised training of the original BERT model based on the reasonableness probability of dispatching sentences comprises:
for a power grid dispatching sentence containing j characters, masking the input of the k-th character (k = 1, 2, …, j) in turn, predicting the probability pro_k that the corresponding output is that character with the original BERT model pre-trained on the MLM task and a softmax layer, and finally calculating the probability that the power grid dispatching sentence is a reasonable sentence:
[Formula (4), shown as an image in the original: the probability that the sentence is reasonable, aggregated from pro_1, …, pro_j]
CN202010148584.2A 2020-03-05 2020-03-05 Power grid dispatching voice recognition language model construction method based on BERT Active CN111489746B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010148584.2A CN111489746B (en) 2020-03-05 2020-03-05 Power grid dispatching voice recognition language model construction method based on BERT

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010148584.2A CN111489746B (en) 2020-03-05 2020-03-05 Power grid dispatching voice recognition language model construction method based on BERT

Publications (2)

Publication Number Publication Date
CN111489746A CN111489746A (en) 2020-08-04
CN111489746B (en) 2022-07-26

Family

ID=71794395

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010148584.2A Active CN111489746B (en) 2020-03-05 2020-03-05 Power grid dispatching voice recognition language model construction method based on BERT

Country Status (1)

Country Link
CN (1) CN111489746B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112420042A (en) * 2020-11-19 2021-02-26 国网北京市电力公司 Control method and device of power system
CN113342585A (en) * 2021-06-28 2021-09-03 沈阳工业大学 PCB wiring fracture detection and identification method based on language semantic judgment
CN113591475B (en) * 2021-08-03 2023-07-21 美的集团(上海)有限公司 Method and device for unsupervised interpretable word segmentation and electronic equipment
CN113488061B (en) * 2021-08-05 2024-02-23 国网江苏省电力有限公司 Distribution network dispatcher identity verification method and system based on improved Synth2Aug
CN113688210B (en) * 2021-09-06 2024-02-09 北京科东电力控制系统有限责任公司 Power grid dispatching intention recognition method

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106980620A (en) * 2016-01-18 2017-07-25 阿里巴巴集团控股有限公司 A kind of method and device matched to Chinese character string
CN109800437A (en) * 2019-01-31 2019-05-24 北京工业大学 A kind of name entity recognition method based on Fusion Features
CN110083831A (en) * 2019-04-16 2019-08-02 武汉大学 A kind of Chinese name entity recognition method based on BERT-BiGRU-CRF
CN110263182A (en) * 2019-06-18 2019-09-20 京东方科技集团股份有限公司 Paintings recommended method and system, terminal device, computer equipment and medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3507708A4 (en) * 2016-10-10 2020-04-29 Microsoft Technology Licensing, LLC Combo of language understanding and information retrieval

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106980620A (en) * 2016-01-18 2017-07-25 阿里巴巴集团控股有限公司 A kind of method and device matched to Chinese character string
CN109800437A (en) * 2019-01-31 2019-05-24 北京工业大学 A kind of name entity recognition method based on Fusion Features
CN110083831A (en) * 2019-04-16 2019-08-02 武汉大学 A kind of Chinese name entity recognition method based on BERT-BiGRU-CRF
CN110263182A (en) * 2019-06-18 2019-09-20 京东方科技集团股份有限公司 Paintings recommended method and system, terminal device, computer equipment and medium

Also Published As

Publication number Publication date
CN111489746A (en) 2020-08-04

Similar Documents

Publication Publication Date Title
CN111489746B (en) Power grid dispatching voice recognition language model construction method based on BERT
CN108124477B (en) Improving word segmenters to process natural language based on pseudo data
CN100536532C Method and system for automatic subtitling
CN113591483A (en) Document-level event argument extraction method based on sequence labeling
CN108510976A (en) A kind of multilingual mixing voice recognition methods
Li et al. Towards zero-shot learning for automatic phonemic transcription
Jin et al. A Korean named entity recognition method using Bi-LSTM-CRF and masked self-attention
CN105404621A (en) Method and system for blind people to read Chinese character
CN107797987A (en) A kind of mixing language material name entity recognition method based on Bi LSTM CNN
CN114970529A (en) Weakly supervised and interpretable training of machine learning based Named Entity Recognition (NER) mechanisms
Mohammed Using machine learning to build POS tagger for under-resourced language: the case of Somali
CN110377882A (en) For determining the method, apparatus, system and storage medium of the phonetic of text
CN111222329B (en) Sentence vector training method, sentence vector model, sentence vector prediction method and sentence vector prediction system
CN114722832A (en) Abstract extraction method, device, equipment and storage medium
Yang et al. ASR error correction with constrained decoding on operation prediction
Zhu et al. Concept transfer learning for adaptive language understanding
CN112183060B (en) Reference resolution method of multi-round dialogue system
CN110750967B (en) Pronunciation labeling method and device, computer equipment and storage medium
Ananth et al. Grammatical tagging for the Kannada text documents using hybrid bidirectional long-short term memory model
Mamatov et al. Construction of language models for Uzbek language
Forsati et al. An efficient meta heuristic algorithm for pos-tagging
Naulla et al. Predicting the Next Word of a Sinhala Word Series Using Recurrent Neural Networks
CN111090720B (en) Hot word adding method and device
CN112613316A (en) Method and system for generating ancient Chinese marking model
CN111414747A (en) Time knowledge fuzzy measurement method and system based on weak supervised learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant