CN117292680A

CN117292680A - Voice recognition method for power transmission operation detection based on small sample synthesis

Info

Publication number: CN117292680A
Application number: CN202311215139.3A
Authority: CN
Inventors: 宋秦风; 王金志; 范跃祖; 李承斌; 王淮海; 董瑞; 王佳佳; 徐南锦; 田维斌; 孙瑞丽; 崔垚; 陆锦焱; 程益伟; 邓哲
Original assignee: Hefei Power Supply Co of State Grid Anhui Electric Power Co Ltd; Feidong Power Supply Co of State Grid Anhui Electric Power Co Ltd
Current assignee: Hefei Power Supply Co of State Grid Anhui Electric Power Co Ltd; Feidong Power Supply Co of State Grid Anhui Electric Power Co Ltd
Priority date: 2023-09-19
Filing date: 2023-09-19
Publication date: 2023-12-26

Abstract

The invention provides a voice recognition method for power transmission operation and inspection based on small sample synthesis. The method comprises: obtaining a plurality of power transmission operation and inspection professional text corpora, and from them establishing a professional language model and extracting a professional language organization structure; generating a large amount of professional text data by using the professional language organization structure; extracting a small amount of data from the large amount of professional text data and having a plurality of operation and inspection personnel record it as speech, so as to establish acoustic models of those personnel; training, by means of the acoustic models, on the professional text data remaining after the small amount has been extracted, to generate a corresponding voice recognition model, and recognizing test data with the voice recognition model; decoding the recognition result of the test data with the established professional language model to obtain a text result; and establishing a semantic recognition model from the generated professional text data and performing semantic recognition analysis on the decoded text result. Professional voice data of power transmission operation and inspection is thereby recognized accurately.

Description

Voice recognition method for power transmission operation detection based on small sample synthesis

Technical Field

The invention relates to the technical field of power mobile inspection voice recognition, in particular to a voice recognition method for power transmission operation inspection based on small sample synthesis.

Background

In current practice for recording power transmission operation and inspection results, pictures, voice recordings and text notes are kept on site, and the results are entered into the system after the inspection is finished. The voice data is usually transcribed manually, which is inefficient and highly repetitive. General-purpose speech recognition models, in turn, are often not accurate enough on the specialized vocabulary and line names that occur in operation and inspection speech.

Chinese patent publication No. 114822545A discloses a method for improving the speech recognition rate in a professional field, mainly intended for recognizing speech in a professional field or a specific industry. A professional field usually involves a large number of technical terms, and each application department combines them with proper nouns of local character, such as equipment names containing place names, work-section names and even the names of staff, so that the speech recognition error rate is high. That publication proposes a two-level difference-frequency principle: a special difference-frequency lexicon is built automatically, comprising a first-level sub-lexicon storing local special words and a second-level sub-lexicon storing technical terms; pinyin and characters are matched around the difference-frequency special words, and an arbitrary-position conversion mechanism is adopted. This improves recognition accuracy and allows the special vocabulary of a local professional department to be recognized. However, the disclosed method is suited neither to the way inspection results are currently recorded in the power mobile inspection field nor to the noisy on-site environment of power transmission operation and inspection.

Therefore, a method for accurately recognizing power transmission operation and inspection speech is provided, to solve the low working efficiency and inaccurate recognition of existing approaches.

Disclosure of Invention

The invention aims to provide a method for accurately identifying transmission operation detection voice.

In order to solve the technical problems, the invention provides a method for recognizing voice of power transmission operation detection based on small sample synthesis, which comprises the following steps:

acquiring a plurality of transmission operation detection professional text corpora, respectively establishing a transmission operation detection professional language model and extracting a transmission operation detection professional language organization structure according to the plurality of transmission operation detection professional text corpora;

generating a large amount of transmission operation and examination professional text data by using the transmission operation and examination professional language organization structure;

extracting a small amount of data from the large amount of transmission operation and detection professional text data, and performing voice input by a plurality of transmission operation and detection personnel to establish acoustic models of the plurality of transmission operation and detection personnel;

training the rest of the large-amount transmission operation detection professional text data after extracting a small amount of data from the large-amount transmission operation detection professional text data by utilizing the acoustic models of the plurality of transmission operation detection personnel to generate a corresponding voice recognition model, and recognizing test data by utilizing the voice recognition model;

decoding the recognition result of the test data by using the established transmission operation detection professional language model to recognize a text result; and

establishing a semantic recognition model by using the generated large amount of transmission operation and detection professional text data, and carrying out semantic recognition analysis on the text result obtained by decoding and recognition.

Further, the method further comprises judging whether the result of semantic recognition analysis of the decoded and recognized text result by the semantic recognition model is consistent with the decoded and recognized text result or not:

if yes, outputting the text result identified by decoding;

if not, carrying out semantic correction on the text result identified by the decoding based on the result of the semantic recognition analysis.

Further, the step of judging whether the result of the semantic recognition analysis of the text result recognized by the decoding by the semantic recognition model is consistent with the text result recognized by the decoding specifically includes the following steps:

extracting key fields from the generated large amount of transmission operation and detection professional text data, labeling and training to generate a corresponding voice recognition model;

carrying out semantic segmentation and keyword extraction on the text result of the voice recognition by the established semantic recognition model;

and carrying out similarity judgment on the extracted keywords and the key fields extracted from the large amount of transmission operation and inspection professional text data.

Further, in the process of performing similarity judgment on the extracted keywords and the key fields extracted from the large amount of transmission operation and detection professional text data, the similarity judgment satisfies the formula:

WER(n,m)＝(S+D+I)/N

wherein n represents the n-th field, m represents the m-th data in the field database, WER(n,m) is the word error rate of the output words against the reference text, S represents the number of words substituted, D represents the number of words deleted, I represents the number of words inserted, and N represents the number of words of the reference text.

Further, the method for performing semantic correction on the text result identified by decoding based on the result of the semantic recognition analysis further comprises setting a correction threshold value, and completing correction adjustment on field information conforming to a correction interval, so that a formula is satisfied:

output(n)＝data(n,x), 0＜WER(n)＜check_threshold

wherein check_threshold is the set correction threshold, and data(n,x) is the x-th data of the n-th field type in the database, x being the entry for which WER(n,x)＝WER(n); i.e. the correction method includes:

if the word error rate is 0, no correction is needed;

if the word error rate is greater than 0 and smaller than the correction threshold value, the field information with the highest similarity in the transmission operation detection professional text database is taken out for correction adjustment;

if the word error rate is greater than the correction threshold, information extraction errors exist and cannot be corrected.

Further, the step of extracting a small amount of data from the large amount of transmission operation and detection professional text data and performing voice input by a plurality of transmission operation and detection personnel to establish acoustic models of those personnel further comprises fusing simulated noise data of a transmission operation and detection site with the voice input of the personnel, so as to generate more complete and more realistic transmission operation and detection voice data.

Preferably, the decoding method in the voice recognition process is a greedy search method and satisfies the formula:

score_greedy(t)＝max(P_t)

where score_greedy(t) represents the score at time t, P_t represents the probability distribution output by the model at time t, and the max operation selects the maximum probability.

Further, the decoding method in the speech recognition process further comprises a set probability threshold value, and the output probability in the decoding method is compared with the set probability threshold value:

if the output probability is greater than or equal to the set probability threshold, the output is that of greedy decoding;

if the output probability is smaller than the set probability threshold, the result is output by the beam search decoding method;

that is, with threshold denoting the set judgment threshold:

if score_greedy(t) is greater than or equal to threshold at every time step, the decoding result output is the greedy decoding output sequence;

if score_greedy(i) is smaller than threshold at any step i, the decoding result output is the decoding output sequence Beam_sreach_output of the beam search.

Further, the Beam_sreach_output output formula is:

Beam_sreach_output＝max(ŷ^(i)), i＝1,…,k

wherein, in the beam search decoding process, k partial solutions are generated, each of which is a sequence ŷ^(i), i＝1,…,k. The score of each partial solution is calculated, the calculation formula of the score satisfying:

score(ŷ^(i))＝sum(logP(y_t(i)∣y_1(i),…,y_(t-1)(i)))

wherein y_t(i) represents the element of partial solution i at time t, and y_1(i),…,y_(t-1)(i) represent the elements of partial solution i at time steps 1 to t-1.

Further, the beam search is performed only on voice data that the judgment has found to have a low greedy search score, thereby achieving a dynamic balance between accuracy and response speed in the decoding process.

Compared with the prior art, the invention has the following beneficial effects:

according to the invention, a small amount of data is extracted from a large amount of transmission operation and detection professional text data, and voice input is carried out by different transmission operation and detection personnel to establish acoustic models of the different transmission operation and detection personnel, so that decoding, semantic analysis and correction are further carried out, a transmission operation and detection voice recognition method based on small sample synthesis is realized, and the voice recognition is accurate and high in recognition efficiency.

Drawings

FIG. 1 is a flow chart of the present invention;

Detailed Description

In order to make the technical solutions and technical effects of the present invention more clear, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments.

The invention provides a method for recognizing voice of power transmission operation detection based on small sample synthesis, which comprises the following steps:

acquiring a plurality of transmission operation detection professional text corpora, respectively establishing a transmission operation detection professional language model and extracting a transmission operation detection professional language organization structure according to the plurality of transmission operation detection professional text corpora: the method specifically comprises the steps of obtaining text corpus information such as names, voltage levels, defect positions, defect parts, defect basis content and the like of the power transmission lines in the operation and inspection area.

The following is a corpus example of a section of defect content of power transmission operation inspection:

"insulator broken";

training the collected text corpus by using the KenLM training tool to generate an N-gram language model for the decoding stage of speech recognition, wherein the model estimates probability according to the formula:

P(w_n∣w_1,…,w_(n-1))≈P(w_n∣w_(n-N+1),…,w_(n-1))

wherein the probability of predicting the current (n-th) word given all of the history information is taken to be equivalent to the probability of predicting the current word given the preceding N-1 words; the method also comprises optimizing the N-gram language model training with the modified Kneser-Ney smoothing method, so as to mitigate problems such as an incomplete corpus.
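The N-gram approximation above can be illustrated with a minimal sketch for N = 2. This is not the KenLM tool and does not implement Kneser-Ney smoothing; it swaps in simple add-k smoothing purely for illustration, and the names `train_bigram`, `prob` and `k` are hypothetical.

```python
from collections import Counter


def train_bigram(sentences, k=1.0):
    """Build an add-k smoothed bigram model from tokenised sentences.

    Assumption: add-k smoothing stands in for the modified Kneser-Ney
    smoothing named in the text; it is simpler but less accurate.
    """
    history_counts, bigram_counts, vocab = Counter(), Counter(), set()
    for words in sentences:
        padded = ["<s>"] + list(words) + ["</s>"]
        vocab.update(padded)
        for prev, cur in zip(padded, padded[1:]):
            history_counts[prev] += 1        # how often prev occurs as a history
            bigram_counts[(prev, cur)] += 1  # how often cur follows prev

    def prob(cur, prev):
        # P(cur | prev): only the previous N-1 = 1 word is used as history.
        return (bigram_counts[(prev, cur)] + k) / (history_counts[prev] + k * len(vocab))

    return prob


corpus = [["insulator", "broken"], ["insulator", "damaged"], ["insulator", "broken"]]
prob = train_bigram(corpus)
print(prob("broken", "insulator"), prob("damaged", "insulator"))
```

In this toy corpus "broken" follows "insulator" more often than "damaged" does, so it receives the higher conditional probability.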

And generating a large amount of transmission operation and examination professional text data by using the transmission operation and examination professional language organization structure.

The following is an example of integrated text data:

"channel inspection 220kV nest line No. 28 tower middle phase insulator broken", wherein, in this example of text data, "channel inspection" is the inspection type, "220kV" is the voltage level, "nest line" is the line name, "No. 28 tower" is the tower number, "middle phase" is the defect position, and "insulator broken" is the defect content.
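One way to read "generating a large amount of text data from the language organization structure" is as slot filling over the field order shown in the example. The sketch below assumes that reading; the slot values, `SLOTS`, `ORDER`, `generate_texts` and `split_small_sample` are all hypothetical and are not taken from the patent.

```python
import random
from itertools import product

# Hypothetical slot values; a real system would take them from the collected corpora.
SLOTS = {
    "inspection_type": ["channel inspection"],
    "voltage": ["220kV", "110kV"],
    "line": ["nest line"],
    "tower": ["No. 28 tower", "No. 29 tower"],
    "position": ["middle phase", "left phase"],
    "defect": ["insulator broken"],
}
# Assumed organization structure: the fixed order of fields in a sentence.
ORDER = ["inspection_type", "voltage", "line", "tower", "position", "defect"]


def generate_texts(slots, order):
    """Yield one sentence per combination of slot values, in the given field order."""
    for combo in product(*(slots[name] for name in order)):
        yield " ".join(combo)


def split_small_sample(texts, n_small, seed=0):
    """Split texts into a small set (to be read aloud) and the remainder."""
    shuffled = list(texts)
    random.Random(seed).shuffle(shuffled)
    return shuffled[:n_small], shuffled[n_small:]


texts = list(generate_texts(SLOTS, ORDER))
small, rest = split_small_sample(texts, n_small=2)
print(len(texts), len(small), len(rest))
```

The small set corresponds to the sentences recorded by personnel, and the remainder to the text that is later turned into training data.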

And carrying out data preprocessing on the transmission operation inspection professional text data, converting an original text into phonemes, segmenting Chinese and English, and simultaneously checking whether illegal data exist.

Extracting a small amount of data from the large amount of transmission operation and detection professional text data, and having a plurality of transmission operation and detection personnel record it as speech, so as to establish acoustic models of the plurality of personnel. An MFA (Montreal Forced Aligner) tool is used to align the audio with the professional text data and to complete duration segmentation, and duration, fundamental frequency and energy are each modeled independently: the duration of a phoneme directly affects pronunciation length and overall rhythm; the fundamental frequency is another feature affecting emotion and rhythm; and the energy directly affects the amplitude of the spectrum and hence the volume of the audio. The encoded duration information is used as input of the voice attribute modeling network structure; the output of the duration prediction is also used as input of the fundamental frequency and energy prediction; finally, the output of the duration prediction is added to the predicted outputs of fundamental frequency and energy and used as input of the downstream network, according to the formula:

x＝x+pitch_embedding+energy_embedding

wherein x is the encoder output after expansion according to the duration information, pitch_embedding is the fundamental frequency embedding vector, and energy_embedding is the energy embedding vector; the resulting output is converted into a mel spectrogram by the decoder and sent to the vocoder to generate audio data.
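The two operations described here, expanding the encoder output by the predicted durations and then adding the pitch and energy embeddings, can be sketched with plain Python lists. This is only an illustration of the arithmetic; the function names and the toy vectors are assumptions, and a real model would use learned embeddings and tensors.

```python
def expand_by_duration(encoder_out, durations):
    """Repeat each phoneme vector durations[i] times, giving one vector per frame."""
    frames = []
    for vec, d in zip(encoder_out, durations):
        frames.extend(list(vec) for _ in range(d))
    return frames


def add_variance(x, pitch_embedding, energy_embedding):
    """Compute x = x + pitch_embedding + energy_embedding, frame by frame."""
    return [
        [a + p + e for a, p, e in zip(x_frame, p_frame, e_frame)]
        for x_frame, p_frame, e_frame in zip(x, pitch_embedding, energy_embedding)
    ]


# Two phonemes with 2-dimensional encodings, lasting 2 frames and 1 frame.
x = expand_by_duration([[1.0, 0.0], [0.0, 1.0]], durations=[2, 1])
pitch = [[0.5, 0.5]] * 3     # hypothetical per-frame pitch embeddings
energy = [[0.25, 0.25]] * 3  # hypothetical per-frame energy embeddings
print(add_variance(x, pitch, energy))
```

The three embeddings must share the same number of frames and the same dimension, which is why the expansion by duration happens before the addition.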

Training, by means of the acoustic models of the plurality of power transmission operation inspection personnel, on the large amount of professional text data remaining after the small amount has been extracted, to generate a corresponding voice recognition model, and recognizing test data by using the voice recognition model: the operation detection voice data is trained with an end-to-end model method to generate the corresponding voice recognition model; the model is trained with a pre-train + fine-tune approach, which improves training efficiency and suits cases where the data volume is small and many models must be trained.

And decoding the identification result of the test data by using the established transmission operation detection professional language model to identify a text result, wherein the method further comprises the step of adding criterion setting in the decoding process of the transmission operation detection professional language model, and adaptively selecting a more proper and efficient decoding method according to the criterion result.

And establishing a semantic recognition model by using the generated large amount of transmission operation and detection professional text data, and carrying out semantic recognition analysis on the decoded and recognized text result.

Judging whether the result of semantic recognition analysis of the text result recognized by the decoding by the semantic recognition model is consistent with the text result recognized by the decoding or not:

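if yes, outputting the text result identified by decoding;

if not, performing semantic correction on the text result identified by decoding based on the result of the semantic recognition analysis.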

Judging whether the result of semantic recognition analysis on the decoded and recognized text result by the semantic recognition model is consistent with the decoded and recognized text result or not, specifically comprising the following steps:

and extracting key fields from the generated large amount of transmission operation and detection professional text data, labeling and training to generate a corresponding voice recognition model.

The following is an example of text data extracted from a key field:

channel inspection (inspection type) 220kV (voltage class) nest line (line name) No. 28 tower (tower number) middle phase (defect position) insulator (defect part) broken (defect content).

And carrying out semantic segmentation and keyword extraction on the text result of the voice recognition by the established semantic recognition model.

Performing similarity judgment between the extracted keywords and the key fields extracted from the large amount of transmission operation and inspection professional text data, removing abnormal extracted text field information, and reorganizing the corrected field information to generate the corrected voice recognition text.

The following is an example of an abnormal piece of text data:

"channel inspection bridge east line, oh, that is wrong, 220kV nest line No. 28 tower middle phase insulator broken"; this example is a slip of the tongue, and "bridge east line, oh, that is wrong" is marked with an abnormal-field label.

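Wherein, the similarity judgment satisfies the formula:

WER(n,m)＝(S+D+I)/N

wherein n represents the n-th field, m represents the m-th data in the field database, WER(n,m) is the word error rate of the output words against the reference text, S represents the number of words substituted, D represents the number of words deleted, I represents the number of words inserted, and N represents the number of words of the reference text.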
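The word error rate used for the similarity judgment, WER = (S + D + I) / N, can be computed from a minimum-edit alignment between the reference and the recognized tokens. The following is a minimal sketch; the function names are hypothetical, and whether tokens are words or characters is left to the caller.

```python
def edit_counts(reference, hypothesis):
    """Return (S, D, I) for a minimum-edit alignment of two token lists."""
    rows, cols = len(reference) + 1, len(hypothesis) + 1
    # cost[i][j] = (total edits, S, D, I) for reference[:i] versus hypothesis[:j]
    cost = [[None] * cols for _ in range(rows)]
    cost[0][0] = (0, 0, 0, 0)
    for i in range(1, rows):
        cost[i][0] = (i, 0, i, 0)  # every reference token deleted
    for j in range(1, cols):
        cost[0][j] = (j, 0, 0, j)  # every hypothesis token inserted
    for i in range(1, rows):
        for j in range(1, cols):
            if reference[i - 1] == hypothesis[j - 1]:
                cost[i][j] = cost[i - 1][j - 1]  # match, no edit
                continue
            t, s, d, ins = cost[i - 1][j - 1]
            substitution = (t + 1, s + 1, d, ins)
            t, s, d, ins = cost[i - 1][j]
            deletion = (t + 1, s, d + 1, ins)
            t, s, d, ins = cost[i][j - 1]
            insertion = (t + 1, s, d, ins + 1)
            cost[i][j] = min(substitution, deletion, insertion)
    _, s, d, ins = cost[-1][-1]
    return s, d, ins


def wer(reference, hypothesis):
    """Word error rate (S + D + I) / N, where N is the reference length."""
    if not reference:
        raise ValueError("reference must contain at least one token")
    s, d, ins = edit_counts(reference, hypothesis)
    return (s + d + ins) / len(reference)


reference = ["middle", "phase", "insulator", "broken"]
print(wer(reference, ["middle", "phase", "insulator", "broke"]))
```

Because N is the length of the reference, the value can exceed 1 when the hypothesis contains many insertions.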

The method for carrying out semantic correction on the decoded and recognized text result based on the result of semantic recognition analysis further comprises the steps of setting a correction threshold value, and completing correction adjustment on field information conforming to a correction interval so as to satisfy the formula:

output(n)＝data(n,x), 0＜WER(n)＜check_threshold

wherein check_threshold is the set correction threshold, and data(n,x) is the x-th data of the n-th field type in the database, x being the entry for which WER(n,x)＝WER(n); i.e. the correction method includes:

if the word error rate is 0, no correction is needed;

if the word error rate is greater than 0 and smaller than the correction threshold value, extracting field information with highest similarity in the transmission operation detection professional text database, and correcting and adjusting;

if the word error rate is greater than the correction threshold, there is an information extraction error, which cannot be corrected.
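The three-way correction rule can be sketched as follows. The sketch compares an extracted field against every database entry at character level and keeps the closest one. The function names and the status strings are hypothetical, and since the text does not say what happens when the word error rate equals the threshold exactly, the sketch treats that case as uncorrectable.

```python
def wer(reference, hypothesis):
    """Word error rate (S + D + I) / N via Levenshtein distance on token lists."""
    prev = list(range(len(hypothesis) + 1))
    for i, ref_tok in enumerate(reference, start=1):
        cur = [i]
        for j, hyp_tok in enumerate(hypothesis, start=1):
            cur.append(min(
                prev[j] + 1,                         # deletion
                cur[j - 1] + 1,                      # insertion
                prev[j - 1] + (ref_tok != hyp_tok),  # match or substitution
            ))
        prev = cur
    return prev[-1] / len(reference)


def correct_field(field_tokens, database_entries, check_threshold):
    """Apply the correction rule to one extracted field.

    Returns (tokens, status) with status "unchanged", "corrected" or
    "uncorrectable". Assumption: WER equal to the threshold is uncorrectable.
    """
    best = min(database_entries, key=lambda entry: wer(entry, field_tokens))
    best_wer = wer(best, field_tokens)
    if best_wer == 0:
        return list(field_tokens), "unchanged"
    if best_wer < check_threshold:
        return list(best), "corrected"  # take the most similar database entry
    return list(field_tokens), "uncorrectable"


database = [list("insulator broken"), list("wire strand broken")]
tokens, status = correct_field(list("insulator broke"), database, check_threshold=0.3)
print("".join(tokens), status)
```

Taking the entry with the lowest word error rate is how the sketch interprets "the field information with the highest similarity".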

Further, the step of extracting a small amount of data from the large amount of transmission operation and detection professional text data and performing voice input by a plurality of transmission operation and detection personnel to establish acoustic models of those personnel further comprises fusing simulated noise data of a transmission operation and detection site with the voice input of the personnel, so as to generate more complete and more realistic transmission operation and detection voice data and to improve model robustness.
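The text does not say how the noise is fused with the recordings. One common approach, assumed here purely for illustration, is additive mixing at a chosen signal-to-noise ratio; the function name `mix_at_snr` and the toy signals are hypothetical.

```python
import math


def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech, scaled so the mixture has the requested SNR in dB.

    Assumption: additive mixing; the patent does not specify the fusion method.
    """
    if not speech or not noise:
        raise ValueError("speech and noise must be non-empty")
    # Loop the noise so that it covers the whole utterance.
    tiled = [noise[i % len(noise)] for i in range(len(speech))]
    speech_power = sum(s * s for s in speech) / len(speech)
    noise_power = sum(n * n for n in tiled) / len(tiled)
    if noise_power == 0:
        raise ValueError("noise must not be silent")
    # Choose gain so that speech_power / (gain^2 * noise_power) = 10^(snr_db / 10).
    gain = math.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return [s + gain * n for s, n in zip(speech, tiled)]


speech = [math.sin(0.1 * i) for i in range(1000)]  # stand-in for a recorded utterance
noise = [1.0, -1.0, 0.5, -0.5]                     # stand-in for site noise
noisy = mix_at_snr(speech, noise, snr_db=10.0)
print(len(noisy))
```

Mixing the same utterance at several SNR values is a simple way to multiply the amount of training audio obtained from a small set of recordings.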

The decoding method in the speech recognition process is preferably a greedy search method and satisfies the formula:

score_greedy(t)＝max(P_t)

The decoding method in the speech recognition process further comprises a set probability threshold value, and the output probability in the decoding method is compared with the set probability threshold value:

if the output probability is greater than or equal to the set probability threshold, the output is that of greedy decoding;

wherein, with threshold denoting the set judgment threshold,

if score_greedy(t) is greater than or equal to threshold at every time step, the decoding result output is the greedy decoding output sequence;

if score_greedy(i) is smaller than threshold at any step i, the final decoding result output is the decoding output sequence Beam_sreach_output of the beam search. The Beam_sreach_output output formula is as follows:

Beam_sreach_output＝max(ŷ^(i)), i＝1,…,k

wherein, in the beam search decoding process, k partial solutions are generated, each of which is a sequence ŷ^(i), i＝1,…,k. The score of each partial solution is the logarithm of the product of the probabilities of its elements, i.e. the sum of their log-probabilities, the calculation formula satisfying:

score(ŷ^(i))＝sum(logP(y_t(i)∣y_1(i),…,y_(t-1)(i)))

where y_t(i) represents the element of partial solution i at time t, and y_1(i),…,y_(t-1)(i) represent the elements of partial solution i at time steps 1 to t-1.

The beam search is more accurate than the greedy search but takes longer; beam search is therefore performed only on voice data that the judgment has found to have a low greedy search score, which achieves a dynamic balance between accuracy and response speed in the decoding process.
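The adaptive decoding strategy can be sketched as greedy decoding with a fallback to beam search. The sketch assumes a hypothetical `step_fn(prefix)` that returns the next-token probability distribution as a dictionary; the toy model, the end token and all function names are illustrative and not part of the patent.

```python
import math

EOS = "</s>"


def greedy_decode(step_fn, max_len):
    """Pick the most probable token at each step; also return each max(P_t)."""
    seq, scores = [], []
    for _ in range(max_len):
        token, p = max(step_fn(tuple(seq)).items(), key=lambda kv: kv[1])
        scores.append(p)  # score_greedy(t) = max(P_t)
        if token == EOS:
            break
        seq.append(token)
    return seq, scores


def beam_search(step_fn, max_len, k):
    """Keep the k best partial solutions, scored by the sum of log-probabilities."""
    beams = [((), 0.0, False)]  # (tokens, score, finished)
    for _ in range(max_len):
        candidates = []
        for tokens, score, done in beams:
            if done:
                candidates.append((tokens, score, True))
                continue
            for token, p in step_fn(tokens).items():
                if p <= 0:
                    continue
                new_score = score + math.log(p)
                if token == EOS:
                    candidates.append((tokens, new_score, True))
                else:
                    candidates.append((tokens + (token,), new_score, False))
        beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:k]
        if all(done for _, _, done in beams):
            break
    return list(max(beams, key=lambda b: b[1])[0])


def adaptive_decode(step_fn, max_len, threshold, k=3):
    """Use the greedy result unless some step score falls below threshold."""
    seq, scores = greedy_decode(step_fn, max_len)
    if all(s >= threshold for s in scores):
        return seq, "greedy"
    return beam_search(step_fn, max_len, k), "beam"


# Toy model in which the greedy path "a x" is less probable than "b z".
TOY = {
    (): {"a": 0.6, "b": 0.4},
    ("a",): {"x": 0.4, "y": 0.35, EOS: 0.25},
    ("b",): {"z": 0.95, EOS: 0.05},
}


def toy_step(prefix):
    return TOY.get(prefix, {EOS: 1.0})


print(adaptive_decode(toy_step, max_len=5, threshold=0.5, k=2))
```

With a threshold of 0.5 the second greedy step (probability 0.4) triggers the fallback, so the slower beam search is only paid for on uncertain utterances.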

Although embodiments of the present invention have been shown and described, it will be understood by those skilled in the art that various changes, modifications, substitutions and alterations can be made therein without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims

1. A method for voice recognition of power transmission operation based on small sample synthesis, the method comprising:

2. The method of claim 1, further comprising determining whether the result of the semantic recognition analysis of the decoded and recognized word result by the semantic recognition model corresponds to the decoded and recognized word result:

if yes, outputting the text result identified by decoding;

3. The method for recognizing voice of electric transmission operation and test based on small sample synthesis according to claim 2, wherein the step of judging whether the result of the semantic recognition analysis of the decoded and recognized text result by the semantic recognition model is consistent with the decoded and recognized text result comprises the following steps:

4. The method for voice recognition of electric transmission and detection based on small sample synthesis according to claim 3, wherein in the process of performing similarity judgment between the extracted keywords and the key fields extracted from the large amount of electric transmission and detection professional text data, the similarity judgment satisfies the formula:

WER(n,m)＝(S+D+I)/N

wherein n represents the n-th field, m represents the m-th data in the field database, WER(n,m) is the word error rate of the output words against the reference text, S represents the number of words substituted, D represents the number of words deleted, I represents the number of words inserted, and N represents the number of words of the reference text.

5. The method for voice recognition of electric transmission and detection based on small sample synthesis according to claim 4, wherein the method for performing semantic correction on the decoded and recognized text result based on the result of the semantic recognition analysis further comprises setting a correction threshold, and performing correction adjustment on field information conforming to a correction interval, so as to satisfy a formula:

output(n)＝data(n,x), 0＜WER(n)＜check_threshold

wherein check_threshold is the set correction threshold, and data(n,x) is the x-th data of the n-th field type in the database, x being the entry for which WER(n,x)＝WER(n); i.e. the correction method includes:

if the word error rate is 0, no correction is needed;

6. The method for voice recognition of power transmission tests based on small sample synthesis according to claim 1, wherein,

the method comprises the steps of extracting a small amount of data from a large amount of transmission operation and detection professional text data, and performing voice input by a plurality of transmission operation and detection personnel to establish an acoustic model of the plurality of transmission operation and detection personnel, wherein the method further comprises the step of fusing noise data of a simulated transmission operation and detection site with the voice input of the plurality of transmission operation and detection personnel to generate more complete and more real transmission operation and detection voice data.

7. The method for voice recognition of electric transmission operation and test based on small sample synthesis according to claim 1, wherein the decoding method in the voice recognition process is a greedy search method and satisfies the formula:

score_greedy(t)＝max(P_t)

8. The method for voice recognition of power transmission tests based on small sample synthesis of claim 7, wherein the decoding method in the voice recognition process further comprises a set probability threshold, and the output probability in the decoding method is compared with the set probability threshold:

wherein, with threshold denoting the set judgment threshold,

if score_greedy(t) is greater than or equal to threshold at every time step, the decoding result output is the greedy decoding output sequence;

if score_greedy(i) is smaller than threshold at any step i, the decoding result output is the decoding output sequence Beam_sreach_output of the beam search.

9. The method for voice recognition of power transmission tests based on small sample synthesis of claim 8, wherein the Beam_sreach_output output formula is:

Beam_sreach_output＝max(ŷ^(i)), i＝1,…,k

score(ŷ^(i))＝sum(logP(y_t(i)∣y_1(i),…,y_(t-1)(i)))

10. The method for voice recognition of electric transmission and detection based on small sample synthesis according to claim 9, wherein the beam search is performed only on voice data that the judgment has found to have a low greedy search score, so as to achieve a dynamic balance between accuracy and response speed in the decoding process.