CN115249479A - BRNN-based power grid dispatching complex speech recognition method, system and terminal - Google Patents
- Publication number
- CN115249479A CN115249479A CN202210078771.7A CN202210078771A CN115249479A CN 115249479 A CN115249479 A CN 115249479A CN 202210078771 A CN202210078771 A CN 202210078771A CN 115249479 A CN115249479 A CN 115249479A
- Authority
- CN
- China
- Prior art keywords
- brnn
- acoustic model
- speech recognition
- recognition method
- power grid
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G10L15/26 — Speech recognition; speech to text systems
- G10L15/14 — Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
- G10L15/16 — Speech classification or search using artificial neural networks
- G10L25/87 — Detection of discrete points within a voice signal
- Y04S10/50 — Systems or methods supporting the power network operation or management, involving a certain degree of interaction with the load-side end user applications
Abstract
The invention discloses a BRNN-based power grid dispatching complex speech recognition method, system and terminal, relating to the technical field of speech recognition. The technical scheme is as follows: acquire original speech information and preprocess it to obtain speech data; extract key feature parameters from the speech data to obtain a key feature sequence; construct an initial acoustic model and train it on a data set to obtain a final acoustic model; input the key feature sequence into the final acoustic model to obtain the text recognized from the original speech information. The initial acoustic model is an end-to-end BRNN-CTC model constructed by combining a BRNN neural network with a CTC target loss function, where the BRNN neural network is built with initials and finals as basic units. Because the BRNN network structure makes better use of context information, the BRNN-CTC model achieves higher recognition accuracy in complex environments.
Description
Technical Field
The invention relates to the technical field of speech recognition, and in particular to a BRNN-based power grid dispatching complex speech recognition method, system and terminal.
Background
Automatic Speech Recognition (ASR) is a technology that converts speech signals into text. Acoustic models based on GMM-HMM long served as the dominant framework in the ASR field: the GMM classifies frames into states, mapping the speech input onto HMM states, while the HMM combines states into phonemes and phonemes into words, capturing how the speech signal changes over a time series. With the rise of deep learning, DNNs were introduced into ASR acoustic modeling. Unlike GMM-HMM-based acoustic models, the DNN models the observation state probability in place of the GMM, which makes better use of context information, realizes nonlinear feature transformation, and improves recognition accuracy. However, the DNN-HMM framework still requires forced alignment of the training data, which complicates optimization, and the HMM is a generative model whose conditional independence assumption does not hold in practice. Against this background, RNNs and LSTMs, with their strong sequence-modeling capability for sequence labeling tasks, can further improve the accuracy of speech recognition.
The dispatching team of a power dispatching control center is one of the operating mechanisms of power production, and the existing manual dispatching mode is inefficient and costly and cannot keep up with the growing workload. For this reason, the prior art applies speech recognition technology to power dispatching operations, converting speech information into a text signal and then into commands that the equipment can recognize. However, power grid services cover a wide area: not only are the operating environments complex and varied, but the accuracy of existing speech recognition technology is also relatively low, being affected by the operators' wording and language ability.
Therefore, how to design a BRNN-based power grid dispatching complex speech recognition method, system and terminal that overcomes the above defects is a problem to be solved.
Disclosure of Invention
To remedy the defects of the prior art, the invention aims to provide a BRNN-based power grid dispatching complex speech recognition method, system and terminal. A BRNN neural network is constructed with initials and finals as basic units, and an end-to-end BRNN-CTC model is built by combining it with a CTC target loss function; the BRNN network structure makes better use of context information and achieves a higher recognition accuracy rate in complex environments.
The technical purpose of the invention is realized by the following technical scheme:
in a first aspect, a BRNN-based power grid dispatching complex speech recognition method is provided, which includes the following steps:
acquiring original voice information, and preprocessing the original voice information to obtain voice data;
extracting key feature parameters in the voice data to obtain a key feature sequence;
constructing an initial acoustic model, and training the initial acoustic model by using a data set to obtain a final acoustic model;
inputting the key characteristic sequence into a final acoustic model to obtain a text recognized by original voice information;
the method is characterized in that the initial acoustic model is an end-to-end BRNN-CTC model constructed by combining a BRNN neural network and a CTC target loss function, and the BRNN neural network is constructed by taking initials and finals as basic units.
Further, the preprocessing of the original voice information comprises:
pre-emphasis, namely performing high-pass filtering on the original voice information to enhance high-frequency signals and weaken low-frequency signals to obtain voice signals;
framing by adopting an overlapped sampling method, wherein the frame length is 10 ms to 30 ms and the ratio of frame shift to frame length is between 0 and 0.5;
windowing, namely performing transformation calculation on elements of each frame after framing and elements corresponding to a window sequence by adopting a Hamming window function;
and end point detection, namely performing effective voice detection based on short-time energy and short-time average zero crossing rate.
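For illustration, the pre-emphasis, framing and windowing steps above can be sketched in Python as follows. The 0.97 pre-emphasis coefficient and the 25 ms frame length with 10 ms frame shift are common illustrative choices that fall within the stated ranges, not values fixed by the patent:

```python
import numpy as np

def preprocess(signal, sample_rate=16000):
    # Pre-emphasis: high-pass filter y[n] = x[n] - a * x[n-1]
    # (a = 0.97 is a common choice, assumed here).
    emphasized = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])

    # Overlapped framing: 25 ms frames, 10 ms shift,
    # so shift/length = 0.4, inside the stated 0-0.5 range.
    frame_len = int(0.025 * sample_rate)
    frame_shift = int(0.010 * sample_rate)
    num_frames = 1 + max(0, (len(emphasized) - frame_len) // frame_shift)
    frames = np.stack([
        emphasized[i * frame_shift : i * frame_shift + frame_len]
        for i in range(num_frames)
    ])

    # Windowing: multiply each frame element-wise by a Hamming window.
    return frames * np.hamming(frame_len)
```

Endpoint detection would then operate on the windowed frames returned here.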
Further, the initial acoustic model specifically includes:
let the training data be S = { (x) 1 ,z 1 ),(x 2 ,z 2 ),...(x N ,z N ) Where the number of samples is N, the samples are x, x = (x) 1 ,x 2 ,x 3 ,...,x T ),x i ∈R m Denotes a division of a sample into T frames, x i Representing the characteristic parameters of the ith frame, and labeled as z, z = (z) 1 ,z 2 ,z 3 ,...z U ) Indicating the correct phoneme corresponding to the sample x;
after the characteristics are processed by two RNNs, the posterior probability y of the phoneme is calculated by softmax,representing the probability that the phoneme is k at time t, the probabilities of all phonemes over a frame add up to 1, i.e.
Further, the data set contains multiple types of samples, including speech information from male and female speakers, Mandarin and dialect speech, noisy and quiet environments, single and multiple speakers, and different age groups.
Further, the recognition method comprises the following steps:
calculating a character error rate by character comparison between the recognized text and the standard result obtained from feedback;
and adjusting the proportion of the corresponding sample class in the data set in positive correlation with the character error rate.
Further, the calculation formula of the word error rate is specifically: P = (S + D + I) / N × 100%, wherein P represents the word error rate; N represents the total number of characters in the standard result; S represents the number of characters to be substituted; D represents the number of characters to be deleted; and I represents the number of characters to be inserted.
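The error rate above can be obtained from a Levenshtein alignment between the recognized text and the standard result. A self-contained sketch, working at the character level as is appropriate for Chinese text (the function name is illustrative):

```python
def char_error_rate(ref, hyp):
    """P = (S + D + I) / N via Levenshtein alignment, where N = len(ref) and
    S/D/I count substituted/deleted/inserted characters."""
    n, m = len(ref), len(hyp)
    # dp[i][j] = (cost, S, D, I) for aligning ref[:i] with hyp[:j]
    dp = [[None] * (m + 1) for _ in range(n + 1)]
    dp[0][0] = (0, 0, 0, 0)
    for i in range(1, n + 1):
        dp[i][0] = (i, 0, i, 0)      # delete every character of ref[:i]
    for j in range(1, m + 1):
        dp[0][j] = (j, 0, 0, j)      # insert every character of hyp[:j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            c, s, d, ins = dp[i - 1][j - 1]
            best = (c + sub, s + sub, d, ins)          # match / substitute
            c, s, d, ins = dp[i - 1][j]
            if c + 1 < best[0]:
                best = (c + 1, s, d + 1, ins)          # delete
            c, s, d, ins = dp[i][j - 1]
            if c + 1 < best[0]:
                best = (c + 1, s, d, ins + 1)          # insert
            dp[i][j] = best
    _, S, D, I = dp[n][m]
    return (S + D + I) / max(n, 1)
```

For example, a single substitution in a four-character reference gives an error rate of 0.25.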
Further, the occupation ratio of the samples in the data set is synchronously adjusted according to the word error rate of the multiple types of samples.
In a second aspect, a BRNN-based power grid dispatching complex speech recognition system is provided, including:
the preprocessing module is used for acquiring original voice information and preprocessing the original voice information to obtain voice data;
the feature extraction module is used for extracting key feature parameters in the voice data to obtain a key feature sequence;
the model building module is used for building an initial acoustic model and training the initial acoustic model by using a data set to obtain a final acoustic model;
the text recognition module is used for inputting the key characteristic sequence into the final acoustic model to obtain a text recognized by the original voice information;
the initial acoustic model in the model construction module is a BRNN-CTC model which is constructed in an end-to-end mode by combining a BRNN neural network and a CTC target loss function, and the BRNN neural network is constructed by taking initials and finals as basic units.
In a third aspect, a computer terminal is provided, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the BRNN-based power grid dispatching complex speech recognition method according to any one of the first aspect.
In a fourth aspect, a computer-readable medium is provided, on which a computer program is stored, wherein the computer program, when executed by a processor, implements the BRNN-based power grid dispatching complex speech recognition method according to any one of the first aspect.
Compared with the prior art, the invention has the following beneficial effects:
1. according to the BRNN-based power grid dispatching complex speech recognition method, the BRNN neural network is constructed by taking initial consonants and final consonants as basic units, and a BRNN-CTC model in an end-to-end mode is constructed by combining with a CTC target loss function, so that context information can be better utilized by utilizing a BRNN network structure, and higher recognition accuracy rate is achieved in a complex environment;
2. the method dynamically adjusts the proportions of the training data set for the BRNN-CTC model according to the word error rates of different sample types, so that high accuracy is maintained throughout long-term recognition.
Drawings
The accompanying drawings, which are included to provide a further understanding of the embodiments of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the principles of the invention. In the drawings:
FIG. 1 is a flow chart in an embodiment of the invention;
FIG. 2 is a CTC-based end-to-end framework diagram in an embodiment of the present invention;
fig. 3 is a block diagram of a system in an embodiment of the invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to examples and accompanying drawings, and the exemplary embodiments and descriptions thereof are only used for explaining the present invention and are not meant to limit the present invention.
Example 1: the BRNN-based power grid dispatching complex speech recognition method, as shown in FIG. 1, comprises the following steps:
S1: acquiring original voice information, and preprocessing the original voice information to obtain voice data;
S2: extracting key feature parameters in the voice data to obtain a key feature sequence;
S3: constructing an initial acoustic model, and training the initial acoustic model by using a data set to obtain a final acoustic model; the initial acoustic model is a BRNN-CTC model which is constructed in an end-to-end mode by combining a BRNN neural network and a CTC target loss function, and the BRNN neural network is constructed by taking initials and finals as basic units;
S4: inputting the key feature sequence into the final acoustic model to obtain the text recognized from the original speech information; the text is then converted into a Linux command and transmitted to the designated dispatcher console, where the console workstation executes the command to open the corresponding display.
In this embodiment, the preprocessing of the original speech information includes pre-emphasis, framing, windowing and endpoint detection. Pre-emphasis applies high-pass filtering to the original speech information to enhance high-frequency components and attenuate low-frequency components, yielding the speech signal. Framing uses an overlapped sampling method with a frame length of 10 ms to 30 ms and a frame shift to frame length ratio between 0 and 0.5. Windowing applies a Hamming window function, multiplying the elements of each frame by the corresponding elements of the window sequence. Endpoint detection performs effective speech detection based on short-time energy and the short-time average zero-crossing rate.
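A minimal sketch of endpoint detection from short-time energy and zero-crossing rate, assuming framed input of the kind produced by the earlier preprocessing steps; the decision rule and threshold here are deliberately simplified stand-ins for the combined energy/zero-crossing logic, not the patent's exact method:

```python
import numpy as np

def short_time_energy(frames):
    # Sum of squared samples per frame.
    return (frames ** 2).sum(axis=1)

def zero_crossing_rate(frames):
    # Fraction of adjacent sample pairs whose sign differs, per frame.
    signs = np.sign(frames)
    signs[signs == 0] = 1
    return (np.abs(np.diff(signs, axis=1)) / 2).mean(axis=1)

def detect_speech(frames, energy_ratio=0.1):
    """Mark a frame as speech when its energy exceeds a fraction of the
    maximum frame energy (an illustrative single-threshold rule)."""
    energy = short_time_energy(frames)
    return energy > energy_ratio * energy.max()
```

On a toy signal of low-level noise followed by a pure tone, the tone frames have high energy and a low zero-crossing rate, while the noise frames show the opposite, which is exactly what the two features are meant to separate.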
As shown in fig. 2, the initial acoustic model is specifically: let the training data be S = {(x_1, z_1), (x_2, z_2), ..., (x_N, z_N)}, where N is the number of samples. Each sample x = (x_1, x_2, x_3, ..., x_T), with x_i ∈ R^m, denotes a division of the sample into T frames, where x_i is the feature parameter vector of the i-th frame, and the label z = (z_1, z_2, z_3, ..., z_U) is the correct phoneme sequence corresponding to sample x. After the features pass through the two RNN layers (forward and backward), the phoneme posterior probability y is computed by softmax: y_k^t denotes the probability that the phoneme at time t is k, and the probabilities of all phonemes within a frame sum to 1, i.e. Σ_k y_k^t = 1.
In this embodiment, the data set is randomly divided in a 7:3 ratio into a training set and a test set, and the training set is further divided in a certain ratio into a training subset and a validation set, where the validation set is used for parameter tuning and the test set is used to verify the effectiveness of the model.
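The random 7:3 split can be sketched as follows; the helper name and fixed seed are illustrative:

```python
import random

def split_dataset(samples, train_ratio=0.7, seed=42):
    """Randomly split samples into a training set and a test set.
    The same idea applies when carving a validation set out of the
    training set afterwards."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]
```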
The data set contains multiple types of samples, including speech information from male and female speakers, Mandarin and dialect speech, noisy and quiet environments, single and multiple speakers, and different age groups.
The recognition method further comprises: calculating a character error rate by character comparison between the recognized text and the standard result obtained from feedback, and adjusting the proportion of the corresponding sample class in the data set in positive correlation with the character error rate.
The calculation formula of the word error rate is specifically: P = (S + D + I) / N × 100%, wherein P represents the word error rate; N represents the total number of characters in the standard result; S represents the number of characters to be substituted; D represents the number of characters to be deleted; and I represents the number of characters to be inserted.
As an alternative embodiment, the occupation ratio of samples in the data set is synchronously adjusted according to the word error rate of multiple types of samples.
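One simple realization of the positively correlated adjustment is to give each sample class a share of the training data proportional to its word error rate, so that poorly recognized classes receive more data. This is a sketch of one possible rule; the patent does not specify the exact update:

```python
def adjust_dataset_proportions(error_rates):
    """Map each sample class's word error rate to its share of the
    training data, normalized so the shares sum to 1."""
    total = sum(error_rates.values())
    return {cls: wer / total for cls, wer in error_rates.items()}
```

With hypothetical error rates of 0.2 for dialect, 0.05 for Mandarin and 0.15 for noisy-environment samples, dialect speech would receive half of the adjusted data set.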
Experimental verification: the 13652 speech recordings provided in this embodiment were processed into 14 h 50 min of effective data. The speech quality is uneven, and the collected data exhibits problems such as mixed male and female voices, dialects and accents. The constructed speech uses Mandarin, each utterance lasts about 10 seconds, and there are 36115 distinct words. The data were then divided into a training set and a test set in a 7:3 ratio, and feature extraction was performed on the data set.
The training set is input into the BRNN-CTC model for training; with the hidden-layer configuration otherwise kept consistent, each hidden layer has twice the number of neurons because the BRNN is bidirectional.
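The neuron doubling can be seen in a minimal NumPy forward pass of a single bidirectional layer: the forward and backward hidden states are concatenated, so the layer's output dimension is twice the per-direction hidden size. All weights here are random placeholders, not a trained model:

```python
import numpy as np

def brnn_layer(x, Wf, Uf, Wb, Ub):
    """One bidirectional RNN layer: h_t = tanh(W x_t + U h_prev), run both
    forward and backward over time, with the two state sequences concatenated."""
    T, _ = x.shape
    H = Wf.shape[0]
    h_fwd = np.zeros((T, H))
    h_bwd = np.zeros((T, H))
    h = np.zeros(H)
    for t in range(T):                      # forward pass in time
        h = np.tanh(Wf @ x[t] + Uf @ h)
        h_fwd[t] = h
    h = np.zeros(H)
    for t in reversed(range(T)):            # backward pass in time
        h = np.tanh(Wb @ x[t] + Ub @ h)
        h_bwd[t] = h
    return np.concatenate([h_fwd, h_bwd], axis=1)   # shape (T, 2H)

rng = np.random.default_rng(3)
T, D, H = 6, 13, 32                         # frames, feature dim, hidden units
x = rng.standard_normal((T, D))
out = brnn_layer(x, rng.standard_normal((H, D)), rng.standard_normal((H, H)),
                 rng.standard_normal((H, D)), rng.standard_normal((H, H)))
```

At every time step the output thus combines context from both the past (forward states) and the future (backward states), which is what lets the BRNN exploit context better than a unidirectional RNN.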
The training results are shown in Table 1 and the average accuracy comparison in Table 2: speech recognition with the deep neural network outperforms the shallow GMM-HMM-based network, and the BRNN structure makes better use of context information to achieve higher recognition accuracy.
TABLE 1 BRNN-CTC-based training results
TABLE 2 comparison of average accuracy
From the above, BRNN-based speech recognition for complex environments indeed outperforms recognition with traditional models.
Example 2: the BRNN-based power grid dispatching complex speech recognition system comprises a preprocessing module, a feature extraction module, a model construction module and a text recognition module, as shown in FIG. 3.
The preprocessing module is used for acquiring original voice information and preprocessing the original voice information to obtain voice data. And the feature extraction module is used for extracting key feature parameters in the voice data to obtain a key feature sequence. The model building module is used for building an initial acoustic model and training the initial acoustic model by using a data set to obtain a final acoustic model; the initial acoustic model in the model construction module is a BRNN-CTC model which is constructed in an end-to-end mode by combining a BRNN neural network and a CTC target loss function, and the BRNN neural network is constructed by taking initials and finals as basic units. And the text recognition module is used for inputting the key characteristic sequence into the final acoustic model to obtain a text recognized by the original voice information.
The working principle is as follows: the invention constructs the BRNN neural network with initials and finals as basic units and builds an end-to-end BRNN-CTC model by combining it with a CTC target loss function, so that the BRNN network structure makes better use of context information and achieves higher recognition accuracy in complex environments. In addition, the proportions of the training data set used by the BRNN-CTC model are dynamically adjusted according to the word error rates of the different sample types, so that high accuracy is maintained throughout long-term recognition.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The above embodiments are provided to further explain the objects, technical solutions and advantages of the present invention in detail, it should be understood that the above embodiments are merely exemplary embodiments of the present invention and are not intended to limit the scope of the present invention, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.
Claims (10)
1. The BRNN-based power grid dispatching complex speech recognition method comprises the following steps:
acquiring original voice information, and preprocessing the original voice information to obtain voice data;
extracting key feature parameters in the voice data to obtain a key feature sequence;
constructing an initial acoustic model, and training the initial acoustic model by using a data set to obtain a final acoustic model;
inputting the key characteristic sequence into a final acoustic model to obtain a text recognized by original voice information;
the method is characterized in that the initial acoustic model is an end-to-end BRNN-CTC model constructed by combining a BRNN neural network and a CTC target loss function, and the BRNN neural network is constructed by taking initials and finals as basic units.
2. The BRNN-based power grid dispatching complex speech recognition method of claim 1, wherein the preprocessing of the raw speech information comprises:
pre-emphasis, namely performing high-pass filtering on the original voice information to enhance high-frequency signals and weaken low-frequency signals to obtain voice signals;
framing by adopting an overlapped sampling method, wherein the frame length is 10 ms to 30 ms and the ratio of frame shift to frame length is between 0 and 0.5;
windowing, namely performing transformation calculation on elements of each frame after framing and elements corresponding to a window sequence by adopting a Hamming window function;
and end point detection, namely performing effective voice detection based on short-time energy and short-time average zero crossing rate.
3. The BRNN-based power grid dispatching complex speech recognition method of claim 1, wherein the initial acoustic model is specifically:
let the training data be S = { (x) 1 ,z 1 ),(x 2 ,z 2 ),...(x N ,z N ) Where the number of samples is N, the samples are x, x = (x) 1 ,x 2 ,x 3 ,...,x T ),x i ∈R m Denotes a division of a sample into T frames, x i Representing the characteristic parameters of the ith frame, and labeled as z, z = (z) 1 ,z 2 ,z 3 ,...z U ) Indicating the correct phoneme corresponding to the sample x;
4. The BRNN-based power grid dispatching complex speech recognition method of claim 1, wherein the data set contains multiple types of samples, the multiple types of samples comprising speech information from male and female speakers, Mandarin and dialect speech, noisy and quiet environments, single and multiple speakers, and different age groups.
5. The BRNN-based power grid dispatching complex speech recognition method of claim 4, wherein the recognition method further comprises:
calculating by character comparison according to the recognized text and a standard result obtained by feedback to obtain a character error rate;
and adjusting the proportion of the corresponding class sample in the data set in a positive correlation mode according to the word error rate.
6. The BRNN-based power grid dispatching complex speech recognition method of claim 5, wherein the calculation formula of the word error rate is specifically: P = (S + D + I) / N × 100%, wherein P represents the word error rate; N represents the total number of characters in the standard result; S represents the number of characters to be substituted; D represents the number of characters to be deleted; and I represents the number of characters to be inserted.
7. The BRNN-based power grid dispatching complex speech recognition method of claim 1, wherein the fraction of samples in the data set is synchronously adjusted according to the word error rate of multiple types of samples.
8. A BRNN-based power grid dispatching complex speech recognition system is characterized by comprising:
the preprocessing module is used for acquiring original voice information and preprocessing the original voice information to obtain voice data;
the feature extraction module is used for extracting key feature parameters in the voice data to obtain a key feature sequence;
the model building module is used for building an initial acoustic model and training the initial acoustic model by using a data set to obtain a final acoustic model;
the text recognition module is used for inputting the key characteristic sequence into the final acoustic model to obtain a text recognized by the original voice information;
the method is characterized in that an initial acoustic model in the model building module is a BRNN-CTC model which is built in an end-to-end mode by combining a BRNN neural network and a CTC target loss function, and the BRNN neural network is built by taking initials and finals as basic units.
9. A computer terminal comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor when executing the program implements the BRNN-based grid scheduling complex speech recognition method according to any of claims 1-7.
10. A computer-readable medium, on which a computer program is stored, the computer program being executable by a processor to implement the BRNN-based grid dispatch complex speech recognition method according to any of claims 1-7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210078771.7A CN115249479A (en) | 2022-01-24 | 2022-01-24 | BRNN-based power grid dispatching complex speech recognition method, system and terminal |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115249479A true CN115249479A (en) | 2022-10-28 |
Family
ID=83697990
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210078771.7A Pending CN115249479A (en) | 2022-01-24 | 2022-01-24 | BRNN-based power grid dispatching complex speech recognition method, system and terminal |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115249479A (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116631379A (en) * | 2023-07-20 | 2023-08-22 | 中邮消费金融有限公司 | Speech recognition method, device, equipment and storage medium |
CN116631379B (en) * | 2023-07-20 | 2023-09-26 | 中邮消费金融有限公司 | Speech recognition method, device, equipment and storage medium |
CN116825109A (en) * | 2023-08-30 | 2023-09-29 | 深圳市友杰智新科技有限公司 | Processing method, device, equipment and medium for voice command misrecognition |
CN116825109B (en) * | 2023-08-30 | 2023-12-08 | 深圳市友杰智新科技有限公司 | Processing method, device, equipment and medium for voice command misrecognition |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11062699B2 (en) | Speech recognition with trained GMM-HMM and LSTM models | |
CN111429889B (en) | Method, apparatus, device and computer readable storage medium for real-time speech recognition based on truncated attention | |
WO2022083083A1 (en) | Sound conversion system and training method for same | |
WO2018227781A1 (en) | Voice recognition method, apparatus, computer device, and storage medium | |
CN110782872A (en) | Language identification method and device based on deep convolutional recurrent neural network | |
CN110246488B (en) | Voice conversion method and device of semi-optimized cycleGAN model | |
CN107408384A (en) | The end-to-end speech recognition of deployment | |
CN109147774B (en) | Improved time-delay neural network acoustic model | |
CN112802448A (en) | Speech synthesis method and system for generating new tone | |
JP2023542685A (en) | Speech recognition method, speech recognition device, computer equipment, and computer program | |
CN115249479A (en) | BRNN-based power grid dispatching complex speech recognition method, system and terminal | |
CN115019776A (en) | Voice recognition model, training method thereof, voice recognition method and device | |
Bhosale et al. | End-to-End Spoken Language Understanding: Bootstrapping in Low Resource Scenarios. | |
Ismail et al. | Mfcc-vq approach for qalqalahtajweed rule checking | |
CN111862934A (en) | Method for improving speech synthesis model and speech synthesis method and device | |
Wang et al. | Research on speech emotion recognition technology based on deep and shallow neural network | |
CN110910891A (en) | Speaker segmentation labeling method and device based on long-time memory neural network | |
CN114944150A (en) | Dual-task-based Conformer land-air communication acoustic model construction method | |
CN111862956A (en) | Data processing method, device, equipment and storage medium | |
CN113571095B (en) | Speech emotion recognition method and system based on nested deep neural network | |
CN111090726A (en) | NLP-based electric power industry character customer service interaction method | |
CN112133292A (en) | End-to-end automatic voice recognition method for civil aviation land-air communication field | |
CN115836300A (en) | Self-training WaveNet for text-to-speech | |
Radha et al. | Speech and speaker recognition using raw waveform modeling for adult and children’s speech: a comprehensive review | |
CN111933121B (en) | Acoustic model training method and device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||