CN110767218A - End-to-end speech recognition method, system, device and storage medium thereof - Google Patents

End-to-end speech recognition method, system, device and storage medium thereof

Info

Publication number
CN110767218A
CN110767218A
Authority
CN
China
Prior art keywords
voice
model
speech recognition
attention mechanism
attention
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911057703.7A
Other languages
Chinese (zh)
Inventor
李浩然
颜丙聪
赵力
张玲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Lizhi Psychological Big Data Industry Research Institute Co Ltd
Original Assignee
Nanjing Lizhi Psychological Big Data Industry Research Institute Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Lizhi Psychological Big Data Industry Research Institute Co Ltd filed Critical Nanjing Lizhi Psychological Big Data Industry Research Institute Co Ltd
Priority to CN201911057703.7A priority Critical patent/CN110767218A/en
Publication of CN110767218A publication Critical patent/CN110767218A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/02 - Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 - Training
    • G10L 15/08 - Speech classification or search
    • G10L 15/16 - Speech classification or search using artificial neural networks
    • G10L 15/26 - Speech to text systems
    • G10L 25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/27 - Speech or voice analysis techniques characterised by the analysis technique
    • G10L 25/30 - Speech or voice analysis techniques characterised by the analysis technique using neural networks
    • G10L 25/45 - Speech or voice analysis techniques characterised by the type of analysis window

Abstract

The application discloses an end-to-end speech recognition method, system and device, and a storage medium thereof. The end-to-end speech recognition system is based on a convolutional neural network (CNN) and an attention mechanism: the attention mechanism is fused into the convolutional neural network, and a complete speech recognition network model is built with a CTC loss function to realize deep learning. A spectrogram is extracted from the original speech data and used as the input of the CNN, which improves recognition performance and greatly reduces the information loss caused by manual feature extraction, so the method has good application prospects.

Description

End-to-end speech recognition method, system, device and storage medium thereof
Technical Field
The present application relates to the field of speech recognition technologies, and in particular, to an end-to-end speech recognition method, system, and apparatus based on a convolutional neural network and an attention mechanism, and a storage medium thereof.
Background
Speech recognition has been an active research area in recent years and is an important means of human-machine interaction. A typical speech recognition system works as follows: the input analog speech signal is first pre-processed, including pre-filtering, sampling and quantization, windowing, endpoint detection, pre-emphasis, and so on. After the speech signal is pre-processed, the next important step is feature parameter extraction. The features are then learned with machine learning and deep learning algorithms such as HMMs or LSTMs.
The above work has advanced the study of speech recognition, but several problems still deserve deeper study, as follows:
(1) recognition of accented (dialect) speech;
(2) the extraction of features from the original speech inevitably loses information, and it is unknown whether the lost information affects the final recognition result;
(3) the effect of background noise on recognition accuracy.
Ways to overcome these problems are currently needed.
Disclosure of Invention
In order to solve the above technical problems, embodiments of the present application provide an end-to-end speech recognition method, system, apparatus and storage medium thereof based on a convolutional neural network and an attention mechanism.
A first aspect of an embodiment of the present application provides an end-to-end speech recognition method based on a convolutional neural network and an attention mechanism, which may include:
collecting voice data, carrying out unified normalization processing on the whole voice data, and then segmenting the voice data according to a database label;
performing frame windowing on the segmented voice and then acquiring a frequency spectrum by using fast Fourier transform;
introducing an attention mechanism, and combining the attention mechanism with a convolutional neural network to construct a complete speech recognition network model;
and training the speech recognition network model, taking the pre-processed speech data as the input of the speech recognition network model, training and learning the parameters of the model, and obtaining the required speech recognition network model for recognition after evaluating the word error rate.
Further, the step of performing unified normalization processing on the whole voice data and then segmenting the voice data according to the database tags includes:
normalizing the amplitude range of the whole speech to a threshold range symmetric about the zero point, wherein a value of zero has the same physical meaning before and after normalization, namely an unvoiced (silent) segment.
Further, the introducing attention mechanism, combining the attention mechanism with the convolutional neural network, includes:
and introducing an attention mechanism into the convolutional neural network, wherein the attention mechanism is realized by multiplying two fully-connected layers A and B, the fully-connected layer B is used as an attention weight, and the weight is an attention distribution probability distribution numerical value which is obtained after the weight of A is subjected to Softmax regression and accords with a probability distribution value interval.
Furthermore, the speech recognition network model adopts a CNN + CTC model with a VGG16-style base architecture: 10 convolutional layers, 5 pooling layers and 5 fully-connected layers, of which three fully-connected layers are used to realize the attention mechanism; a CTC loss function is adopted as the loss function, and an Adam optimizer is adopted as the network optimizer.
A second aspect of the embodiments of the present application provides an end-to-end speech recognition system based on a convolutional neural network and an attention mechanism, including:
the voice receiving unit is used for receiving the whole voice and segmenting the voice after normalizing the voice;
the spectrum acquisition unit is used for acquiring spectrum data from the segmented voice data by utilizing Fourier transform;
the model building unit is used for combining an attention mechanism with a convolutional neural network to build a complete speech recognition network model;
and a model training unit, configured to optimize the model parameters using the speech data as training content and to train the model with the word error rate as the optimization target.
Further, the voice receiving unit is configured to: normalize the amplitude range of the whole speech to a threshold range symmetric about the zero point, wherein a value of zero has the same physical meaning before and after normalization, namely an unvoiced (silent) segment.
Further, the spectrum acquisition unit includes:
the window function processing unit is used for performing frame windowing on the voice data obtained by segmentation by using a preset window function;
and the spectrum acquisition unit is used for performing a fast Fourier transform on the data processed by the window function and keeping only half of the spectrum length.
Further, the building model unit includes:
introducing an attention mechanism into the convolutional neural network, wherein the attention mechanism is realized by multiplying two fully-connected layers A and B: the fully-connected layer B serves as the attention weights, and the weights are attention-distribution probability values, obtained by passing the output of A through Softmax so that they conform to a probability-distribution value range;
the speech recognition network model adopts a CNN + CTC model with a VGG16-style base architecture: 10 convolutional layers, 5 pooling layers and 5 fully-connected layers, of which three fully-connected layers are used to realize the attention mechanism; a CTC loss function is adopted as the loss function, and an Adam optimizer is adopted as the network optimizer.
In a third aspect, an embodiment of the present application provides an identification apparatus, which includes a memory and a processor, where the memory stores computer-executable instructions, and the processor executes the computer-executable instructions on the memory to implement the method of the first aspect.
In a fourth aspect, the present application provides a computer-readable storage medium, on which a computer program is stored, and when the computer program is executed by a processor, the computer program implements the method of the first aspect.
In the embodiments of the application, the end-to-end speech recognition system based on a convolutional neural network and an attention mechanism realizes deep learning by fusing the attention mechanism into the convolutional neural network and building a complete speech recognition network model with a CTC loss function. A spectrogram is extracted from the original speech data and used as the input of the CNN (convolutional neural network), which improves recognition performance, greatly reduces the information loss caused by manual feature extraction, and gives the system good application prospects.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present application, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
FIG. 1 is a schematic diagram of the steps of an end-to-end speech recognition system based on a convolutional neural network and an attention mechanism of the present invention.
FIG. 2 is a schematic flow diagram of FIG. 1;
fig. 3 is a line graph of WER results from testing the model of the present invention on a validation set.
FIG. 4 is a schematic block diagram of an identification system provided by an embodiment of the present application;
fig. 5 is a schematic structural diagram of an identification device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some, but not all, embodiments of the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
It will be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the present application herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in the specification of the present application and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should be further understood that the term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.
As used in this specification and the appended claims, the term "if" may be interpreted contextually as "when", "upon" or "in response to a determination" or "in response to a detection". Similarly, the phrase "if it is determined" or "if a [ described condition or event ] is detected" may be interpreted contextually to mean "upon determining" or "in response to determining" or "upon detecting [ described condition or event ]" or "in response to detecting [ described condition or event ]".
Referring to fig. 1, which is a schematic flow chart of an identification method provided in an embodiment of the present application, as shown in the figure, the method may include:
101: and voice data is collected, and the whole voice data is subjected to unified normalization processing and then is segmented according to the database label.
It is understood that the whole utterance is determined according to the pause intervals in the collected speech: sentence breaks are given by pauses in the dialogue, and a continuous stretch of speech is taken as one whole utterance. After collection, the whole utterance is normalized to an interval symmetric about the zero point; in this embodiment the normalization range is [-1, 1], and the physical meaning of a zero value is unchanged before and after normalization, namely an unvoiced (silent) segment.
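As a minimal illustration of this normalization step (the helper name and the use of NumPy are assumptions of this sketch, not part of the patent), an utterance can be scaled into [-1, 1] by dividing by its peak absolute amplitude, which leaves zero-valued (silent) samples at zero:

```python
import numpy as np

def normalize_speech(waveform: np.ndarray) -> np.ndarray:
    """Scale a whole utterance into [-1, 1], symmetric about the zero point.

    Dividing by the peak absolute amplitude keeps zero-valued samples
    (silent segments) at zero, so their physical meaning is unchanged.
    """
    peak = np.max(np.abs(waveform))
    if peak == 0:
        return waveform.astype(np.float32)  # all-silence input: nothing to scale
    return (waveform / peak).astype(np.float32)
```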
When segmenting, the uniformly normalized speech is cut according to the database labels. The database is a professional database built by phoneticians for speech recognition research; in this embodiment it is the professional speech recognition database built by Tsinghua University, recorded with a single carbon-granule microphone in a quiet office environment, with a total duration of more than 30 hours. Most of the recorded speakers are college students who speak fluent Mandarin. The sampling frequency is 16 kHz and the sample size is 16 bits. After segmentation according to the database labels, 10,000 valid utterances are obtained, of which 500 serve as the validation set, 500 as the test set, and the rest as the training set; the longest utterance has a length L of 343,208 samples, about 21.45 seconds.
102: and performing frame windowing on the segmented voice and then acquiring a frequency spectrum by using fast Fourier transform.
It will be appreciated that framing and windowing are both pre-processing stages of speech signal feature extraction: framing is performed first, then windowing, and then the fast Fourier transform.
Framing: in short, a segment of speech signal is not stationary as a whole, but can be considered stationary locally. The later processing stages require a stationary input, so the whole speech signal is framed, i.e. divided into segments. The signal can be considered stationary within 10-30 ms; a frame is generally not shorter than 20 ms, with a frame shift of about half the frame duration. The frame shift creates an overlap between adjacent frames, which avoids excessive change from one frame to the next.
Windowing: after framing as above, discontinuities appear at the beginning and end of each frame, so the more frames there are, the larger the error relative to the original signal. Windowing solves this problem: it makes the framed signal continuous, so that each frame exhibits the characteristics of a periodic function. A Hamming window is typically used in speech signal processing.
As a specific embodiment, the segmented speech is windowed and framed;
in the framing processing, the frame length I is 1024, the interframe overlapping rate p is 25%, and the maximum frame number H is 447.
The window function is a Hamming window W(n, α), calculated as W(n, α) = (1 − α) − α·cos(2πn/(N − 1)), 0 ≤ n ≤ N − 1, where α takes the value 0.46 and N, the range of n, is the length of the Hamming window.
The speech is then passed through a fast Fourier transform to obtain its spectrum; since the spectrum of a real signal is symmetric, only half of its length is kept. The fast Fourier transform is formulated as:
X(k) = Σ_{n=0}^{N−1} x(n)·e^(−j2πnk/N), k = 0, 1, …, N − 1
since this step belongs to a common technical means in speech recognition, it is not repeated.
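Even so, a compact sketch of the framing, windowing and FFT steps above may be helpful. It uses the frame length, overlap rate and Hamming coefficient given in this embodiment; the use of the magnitude spectrum and the NumPy-based implementation are assumptions of the sketch, not mandated by the patent:

```python
import numpy as np

FRAME_LEN = 1024   # frame length I from the embodiment
OVERLAP = 0.25     # inter-frame overlap rate p
ALPHA = 0.46       # Hamming window coefficient

def spectrogram(waveform: np.ndarray) -> np.ndarray:
    """Frame, window (Hamming) and FFT a normalized utterance.

    Only the first half of each FFT is kept, because the spectrum of a
    real-valued signal is symmetric.
    """
    hop = int(FRAME_LEN * (1 - OVERLAP))  # frame shift between adjacent frames
    n = np.arange(FRAME_LEN)
    window = (1 - ALPHA) - ALPHA * np.cos(2 * np.pi * n / (FRAME_LEN - 1))
    frames = []
    for start in range(0, len(waveform) - FRAME_LEN + 1, hop):
        frame = waveform[start:start + FRAME_LEN] * window
        frames.append(np.abs(np.fft.fft(frame))[:FRAME_LEN // 2])  # keep half the spectrum
    return np.stack(frames)  # shape: (num_frames, FRAME_LEN // 2)
```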
103: and (4) introducing an attention mechanism, and combining the attention mechanism with the convolutional neural network to construct a complete speech recognition network model.
It can be understood that, in the present application, an attention mechanism is introduced into the convolutional neural network and realized by multiplying two fully-connected layers A and B: the fully-connected layer B serves as the attention weights, and the weights are attention-distribution probability values, obtained by passing the output of A through Softmax so that they conform to a probability-distribution value range.
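Read literally, this can be sketched with Keras layers as two parallel fully-connected branches applied to the same features, one of which is passed through Softmax so that its outputs lie in a probability-distribution range before the element-wise multiplication. The layer size and the choice of which branch receives the Softmax are assumptions of the sketch, since the patent text does not fix them:

```python
from tensorflow.keras import layers

def attention_block(features, units=128):
    """Attention realized as the product of two fully-connected layers (sketch).

    `branch_b` plays the role of the attention weights: its Softmax output is a
    probability-distribution value that rescales `branch_a` element-wise.
    `units` is an assumed size, not taken from the patent.
    """
    branch_a = layers.Dense(units)(features)                        # fully-connected layer A
    branch_b = layers.Dense(units, activation="softmax")(features)  # fully-connected layer B -> weights
    return layers.Multiply()([branch_a, branch_b])                  # element-wise product
```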
In constructing the speech recognition network model, a CNN + CTC model with a VGG16-style base architecture is adopted: 10 convolutional layers, 5 pooling layers and 5 fully-connected layers, of which three fully-connected layers are used to realize the attention mechanism; a CTC loss function is adopted as the loss function, and an Adam optimizer is adopted as the network optimizer.
The convolutional layers extract features from the spectrogram; the pooling layers further extract the main features and reduce the number of parameters, and after each pooling layer dropout randomly discards part of the neurons to prevent over-fitting. After the convolutional and pooling layers, a reshape layer compresses the feature maps into a form the fully-connected layers can accept as input, the weighted attention mechanism is then introduced as a multiplication of fully-connected layers, and a final fully-connected layer performs the classification. The other network parameter settings are shown in Table 1, and a sketch of this architecture in code follows the table.
Table 1. Network parameter settings:
Initial learning rate: 0.0001
Training batch size: 32
Inter-layer unit connection rate (dropout): 0.6
Convolution output channels: 2
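The following Keras-style sketch puts the pieces together: five blocks of two convolutional layers each, a pooling layer and dropout 0.6 per block (10 convolutional and 5 pooling layers in total), a reshape, the fully-connected attention product, and a Softmax classifier. The filter counts, layer sizes, input shape and vocabulary size are assumptions, and the sketch uses fewer fully-connected layers than the five described, so it illustrates the structure rather than reproducing the patent's exact network:

```python
from tensorflow.keras import layers, models

NUM_CLASSES = 1200   # assumed output vocabulary size (not given in the patent)

def build_recognition_model(input_shape=(447, 512, 1)):
    """VGG-style CNN with a fully-connected attention product and a Softmax head."""
    inp = layers.Input(shape=input_shape)        # spectrogram: up to 447 frames x 512 spectrum bins
    x = inp
    for filters in (32, 64, 128, 128, 128):      # assumed filter counts, 2 conv layers per block
        x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
        x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
        x = layers.MaxPooling2D(2)(x)            # 5 pooling layers in total
        x = layers.Dropout(0.6)(x)               # dropout rate from Table 1
    # reshape so each remaining time step becomes a flat feature vector
    x = layers.Reshape((-1, x.shape[2] * x.shape[3]))(x)
    branch_a = layers.Dense(256)(x)                          # fully-connected layer A
    branch_b = layers.Dense(256, activation="softmax")(x)    # fully-connected layer B -> attention weights
    x = layers.Multiply()([branch_a, branch_b])              # attention as an element-wise product
    out = layers.Dense(NUM_CLASSES + 1, activation="softmax")(x)  # +1 for the CTC blank label
    return models.Model(inp, out)
```

Training would pair such a network with a CTC loss (for example tf.nn.ctc_loss, with the input and label lengths wired in) and an Adam optimizer at the Table 1 learning rate of 0.0001 and batch size 32; that wiring is omitted from the sketch.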
104: training the speech recognition network model, taking the pre-processed speech data as its input, training and learning the model parameters, and obtaining the required speech recognition network model for recognition after evaluating the word error rate.
It can be understood that, after the model is built, its parameters need to be continuously adjusted through training on a large amount of data, so that the model better fits the intended application and can accurately output text from speech data in actual use.
As a specific embodiment, when training the speech recognition network model, the pre-processed speech data is used as the input of the model and the model parameters are trained and learned; the model is then evaluated by the WER (Word Error Rate). To make the recognized word sequence consistent with the standard word sequence, some words need to be substituted, deleted or inserted; the total number of inserted, substituted or deleted words, divided by the total number of words in the standard word sequence and expressed as a percentage, is the WER. The calculation formula is as follows:
WER = (S + D + I) / N × 100%
where S is the number of substitutions, D the number of deletions, I the number of insertions, and N the total number of Chinese characters.
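A minimal sketch of this evaluation (a plain edit-distance implementation; the function name and the character-level granularity are assumptions of the sketch) could look as follows:

```python
def word_error_rate(reference, hypothesis):
    """WER = (S + D + I) / N via Levenshtein alignment of two unit sequences.

    `reference` and `hypothesis` are sequences of recognition units
    (Chinese characters here); N is the length of the reference.
    """
    n, m = len(reference), len(hypothesis)
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i                              # i deletions
    for j in range(m + 1):
        dist[0][j] = j                              # j insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1,         # insertion
                             dist[i - 1][j - 1] + cost)  # match / substitution
    return dist[n][m] / n if n else 0.0

# e.g. word_error_rate(list("今天天气很好"), list("今天天汽很好")) ≈ 0.167 (one substitution out of six characters)
```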
Through this evaluation, the whole model is continuously corrected until it achieves the desired output.
In the specific training process, a validation pass on the validation set is performed every 200 training steps and the WER on the validation set is recorded. The final results on the validation set are shown in Fig. 3: as the training steps accumulate, the overall WER converges to 20.35%, and a WER of 19.80% is finally obtained on the test set.
To sum up, the end-to-end speech recognition system based on a convolutional neural network and an attention mechanism of the present invention realizes deep learning by fusing the attention mechanism into the convolutional neural network and building a complete speech recognition network model with a CTC loss function. It extracts a spectrogram from the original speech data as the input of the CNN, which improves recognition performance, greatly reduces the information loss caused by manual feature extraction, and gives the system good application prospects.
Embodiments of the present application also provide an end-to-end speech recognition system based on a convolutional neural network and an attention mechanism, the system being configured to perform any of the methods described above. Specifically, referring to Fig. 4, Fig. 4 is a schematic block diagram of the recognition system provided in an embodiment of the present application. The system of this embodiment comprises: a speech receiving unit 310, a spectrum obtaining unit 320, a model building unit 330 and a model training unit 340.
The voice receiving unit 310 is configured to receive a whole segment of voice, normalize the segment of voice, and then segment the normalized segment of voice.
And a spectrum obtaining unit 320, configured to obtain spectrum data from the segmented voice data by using fourier transform.
And a model building unit 330 for combining the attention mechanism with the convolutional neural network to build a complete speech recognition network model.
And a training model unit 340, configured to optimize model parameters by using the speech data as training content, and train a model with the word error rate as an optimization target.
The speech receiving unit 310 normalizes the amplitude range of the whole speech to a threshold range symmetric about the zero point, wherein a value of zero has the same physical meaning before and after normalization, namely an unvoiced (silent) segment.
As an alternative embodiment, the normalization range is [-1, 1]; the physical meaning of a zero value is unchanged before and after normalization, namely a silent segment.
When segmenting, the uniformly normalized speech is cut according to the database labels. The database is a professional database built by phoneticians for speech recognition research; in this embodiment it is the professional speech recognition database built by Tsinghua University, recorded with a single carbon-granule microphone in a quiet office environment, with a total duration of more than 30 hours. Most of the recorded speakers are college students who speak fluent Mandarin. The sampling frequency is 16 kHz and the sample size is 16 bits. After segmentation according to the database labels, 10,000 valid utterances are obtained, of which 500 serve as the validation set, 500 as the test set, and the rest as the training set; the longest utterance has a length L of 343,208 samples, about 21.45 seconds.
The spectrum obtaining unit 320 is specifically configured to perform frame windowing on the segmented speech and then obtain a spectrum by using fast fourier transform.
As an optional implementation manner, the spectrum obtaining unit 320 includes:
the framing unit 321 determines the number of frames of the segmented speech. In this embodiment, the frame length I in the framing processing is 1024, the interframe overlapping rate p is 25%, and the maximum frame number H is 447.
After framing, discontinuities appear at the beginning and end of each frame, so the more frames there are, the larger the error relative to the original signal. The windowing unit 322 solves this problem: windowing makes the framed signal continuous, so that each frame exhibits the characteristics of a periodic function. A Hamming window is typically used in speech signal processing.
In this embodiment, the window function is a Hamming window W(n, α), calculated as W(n, α) = (1 − α) − α·cos(2πn/(N − 1)), 0 ≤ n ≤ N − 1, where α takes the value 0.46 and N, the range of n, is the length of the Hamming window.
The fast Fourier transform unit 323 performs a fast Fourier transform on the speech to obtain its spectrum; since the spectrum is symmetric, only half of its length is kept. The fast Fourier transform is formulated as:
X(k) = Σ_{n=0}^{N−1} x(n)·e^(−j2πnk/N), k = 0, 1, …, N − 1
the model building unit 330 is used to combine the attention mechanism with the convolutional neural network to build a complete speech recognition network model.
It can be understood that, in the present application, an attention mechanism is introduced into the convolutional neural network and realized by multiplying two fully-connected layers A and B: the fully-connected layer B serves as the attention weights, and the weights are attention-distribution probability values, obtained by passing the output of A through Softmax so that they conform to a probability-distribution value range. In constructing the speech recognition network model, a CNN + CTC model with a VGG16-style base architecture is adopted: 10 convolutional layers, 5 pooling layers and 5 fully-connected layers, of which three fully-connected layers are used to realize the attention mechanism; a CTC loss function is adopted as the loss function, and an Adam optimizer is adopted as the network optimizer.
The model training unit 340 is configured to optimize the model parameters using the speech data as training content and to train the model with the word error rate as the optimization target.
As a specific embodiment, this unit takes the pre-processed speech data as the input of the speech recognition network model, trains and learns the parameters of the model, and evaluates the model by the WER (Word Error Rate). To make the recognized word sequence consistent with the standard word sequence, some words need to be substituted, deleted or inserted; the total number of inserted, substituted or deleted words, divided by the total number of words in the standard word sequence and expressed as a percentage, is the WER. The calculation formula is as follows:
WER = (S + D + I) / N × 100%
where S is the number of substitutions, D the number of deletions, I the number of insertions, and N the total number of Chinese characters. Through this evaluation, the whole model is continuously corrected until it achieves the desired output.
Fig. 5 is a schematic structural diagram of a recognition device according to an embodiment of the present application. The recognition apparatus 4000 comprises a processor 41 and may further comprise an input device 42, an output device 43 and a memory 44. The input device 42, the output device 43, the memory 44 and the processor 41 are connected to each other via a bus.
The memory includes, but is not limited to, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM), or a portable read-only memory (CD-ROM), which is used for storing instructions and data.
The input means are for inputting data and/or signals and the output means are for outputting data and/or signals. The output means and the input means may be separate devices or may be an integral device.
The processor may include one or more processors, for example, one or more Central Processing Units (CPUs), and in the case of one CPU, the CPU may be a single-core CPU or a multi-core CPU. The processor may also include one or more special purpose processors, which may include GPUs, FPGAs, etc., for accelerated processing.
The memory is used to store program codes and data of the network device.
The processor is used for calling the program codes and data in the memory and executing the steps in the method embodiment. Specifically, reference may be made to the description of the method embodiment, which is not repeated herein.
It will be appreciated that Fig. 5 only shows a simplified design of the recognition device. In practical applications, the recognition devices may also respectively include other necessary components, including but not limited to any number of input/output devices, processors, controllers, memories, etc., and all recognition devices that can implement the embodiments of the present application are within the scope of the present application.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. For example, the division of the unit is only one logical function division, and other division may be implemented in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. The shown or discussed mutual coupling, direct coupling or communication connection may be an indirect coupling or communication connection of devices or units through some interfaces, and may be in an electrical, mechanical or other form.
Units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In the above embodiments, the implementation may be wholly or partially realized by software, hardware, firmware, or any combination thereof. When implemented in software, may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. The procedures or functions according to the embodiments of the present application are wholly or partially generated when the computer program instructions are loaded and executed on a computer. The computer may be a general purpose computer, a special purpose computer, a network of computers, or other programmable device. The computer instructions may be stored on or transmitted over a computer-readable storage medium. The computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wire (e.g., coaxial cable, fiber optic, Digital Subscriber Line (DSL)), or wirelessly (e.g., infrared, wireless, microwave, etc.). The computer-readable storage medium can be any available medium that can be accessed by a computer or a data storage device, such as a server, a data center, etc., that includes one or more of the available media. The usable medium may be a read-only memory (ROM), or a Random Access Memory (RAM), or a magnetic medium, such as a floppy disk, a hard disk, a magnetic tape, a magnetic disk, or an optical medium, such as a Digital Versatile Disk (DVD), or a semiconductor medium, such as a Solid State Disk (SSD).
Although the preferred embodiments of the present invention have been described in detail, the present invention is not limited to the details of the foregoing embodiments; various equivalent changes (such as in number, shape, position, etc.) may be made to the technical solution of the present invention within its technical spirit, and such equivalents fall within the protection scope of the present invention.

Claims (10)

1. An end-to-end speech recognition method, characterized by: the method comprises the following steps:
collecting voice data, carrying out unified normalization processing on the whole voice data, and then segmenting the voice data according to a database label;
performing frame windowing on the segmented voice and then acquiring a frequency spectrum by using fast Fourier transform;
introducing an attention mechanism, and combining the attention mechanism with a convolutional neural network to construct a complete speech recognition network model;
and training the speech recognition network model, taking the pre-processed speech data as the input of the speech recognition network model, training and learning the parameters of the speech recognition network model, and obtaining the required speech recognition network model for recognition after evaluating the word error rate.
2. The end-to-end speech recognition method of claim 1,
the step of performing unified normalization processing on the whole voice data and then segmenting the voice data according to the database tags comprises the following steps:
normalizing the amplitude range of the whole speech to a threshold range symmetric about the zero point, wherein a value of zero has the same physical meaning before and after normalization, namely an unvoiced (silent) segment.
3. The end-to-end speech recognition method of claim 1,
the attention-drawing mechanism, which is combined with the convolutional neural network, comprises:
and introducing an attention mechanism into the convolutional neural network, wherein the attention mechanism is realized by multiplying two fully-connected layers A and B, the fully-connected layer B is used as an attention weight, and the weight is an attention distribution probability distribution numerical value which is obtained after the weight of A passes Softmax and accords with a probability distribution value interval.
4. The end-to-end speech recognition method of claim 1,
the voice recognition network model adopts a CNN + CTC model, a VGG16 basic model architecture, 10 convolutional layers, 5 pooling layers and 5 full-connection layers, wherein the three full-connection layers are used for realizing an attention mechanism, a CTC loss function is adopted as a loss function, and an Adam optimizer is adopted as a network optimizer.
5. An end-to-end speech recognition system, comprising:
the voice receiving unit is used for receiving the whole voice and segmenting the voice after normalizing the voice;
the spectrum acquisition unit is used for acquiring spectrum data from the segmented voice data by utilizing Fourier transform;
the model building unit is used for combining an attention mechanism with a convolutional neural network to build a complete speech recognition network model;
and a model training unit, configured to optimize the model parameters using the speech data as training content and to train the model with the word error rate as the optimization target.
6. The end-to-end speech recognition system of claim 5,
the voice receiving unit is configured to normalize the amplitude range of the whole speech to a threshold range symmetric about the zero point, wherein a value of zero has the same physical meaning before and after normalization, namely an unvoiced (silent) segment.
7. The end-to-end speech recognition system of claim 6,
the spectrum acquisition unit includes:
the window function processing unit is used for performing frame windowing on the voice data obtained by segmentation by using a preset window function;
and the spectrum acquisition unit is used for performing a fast Fourier transform on the data processed by the window function and keeping only half of the spectrum length.
8. The end-to-end speech recognition system of claim 7,
the model building unit comprises:
introducing an attention mechanism into the convolutional neural network, wherein the attention mechanism is realized by multiplying two fully-connected layers A and B: the fully-connected layer B serves as the attention weights, and the weights are attention-distribution probability values, obtained by passing the output of A through Softmax so that they conform to a probability-distribution value range;
the speech recognition network model adopts a CNN + CTC model with a VGG16-style base architecture: 10 convolutional layers, 5 pooling layers and 5 fully-connected layers, of which three fully-connected layers are used to realize the attention mechanism; a CTC loss function is adopted as the loss function, and an Adam optimizer is adopted as the network optimizer.
9. An identification device, comprising a memory having computer-executable instructions stored thereon and a processor that executes the computer-executable instructions on the memory to implement the method of any one of claims 1 to 4.
10. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the method of any one of the preceding claims 1 to 4.
CN201911057703.7A 2019-10-31 2019-10-31 End-to-end speech recognition method, system, device and storage medium thereof Pending CN110767218A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911057703.7A CN110767218A (en) 2019-10-31 2019-10-31 End-to-end speech recognition method, system, device and storage medium thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911057703.7A CN110767218A (en) 2019-10-31 2019-10-31 End-to-end speech recognition method, system, device and storage medium thereof

Publications (1)

Publication Number Publication Date
CN110767218A true CN110767218A (en) 2020-02-07

Family

ID=69335600

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911057703.7A Pending CN110767218A (en) 2019-10-31 2019-10-31 End-to-end speech recognition method, system, device and storage medium thereof

Country Status (1)

Country Link
CN (1) CN110767218A (en)

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111179918A (en) * 2020-02-20 2020-05-19 中国科学院声学研究所 Joint meaning time classification and truncation type attention combined online voice recognition technology
CN111508487A (en) * 2020-04-13 2020-08-07 深圳市友杰智新科技有限公司 Feature extraction method and voice command recognition method based on expansion mechanism
CN111508493A (en) * 2020-04-20 2020-08-07 Oppo广东移动通信有限公司 Voice wake-up method and device, electronic equipment and storage medium
CN111824879A (en) * 2020-07-02 2020-10-27 南京安杰信息科技有限公司 Intelligent voice contactless elevator control method, system and storage medium
CN111933154A (en) * 2020-07-16 2020-11-13 平安科技(深圳)有限公司 Method and device for identifying counterfeit voice and computer readable storage medium
CN112489677A (en) * 2020-11-20 2021-03-12 平安科技(深圳)有限公司 Voice endpoint detection method, device, equipment and medium based on neural network
CN112786019A (en) * 2021-01-04 2021-05-11 中国人民解放军32050部队 System and method for realizing voice transcription through image recognition mode
CN112967739A (en) * 2021-02-26 2021-06-15 山东省计算中心(国家超级计算济南中心) Voice endpoint detection method and system based on long-term and short-term memory network
CN113127622A (en) * 2021-04-29 2021-07-16 西北师范大学 Method and system for generating voice to image
CN113257240A (en) * 2020-10-30 2021-08-13 国网天津市电力公司 End-to-end voice recognition method based on countermeasure training
CN113327590A (en) * 2021-04-15 2021-08-31 中标软件有限公司 Speech recognition method
CN113409827A (en) * 2021-06-17 2021-09-17 山东省计算中心(国家超级计算济南中心) Voice endpoint detection method and system based on local convolution block attention network
CN113422875A (en) * 2021-06-22 2021-09-21 中国银行股份有限公司 Voice seat response method, device, equipment and storage medium
CN115294973A (en) * 2022-09-30 2022-11-04 云南师范大学 Va-language isolated vocabulary identification method based on convolutional neural network and attention mechanism

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180061439A1 (en) * 2016-08-31 2018-03-01 Gregory Frederick Diamos Automatic audio captioning
CN108520753A (en) * 2018-02-26 2018-09-11 南京工程学院 Voice lie detection method based on the two-way length of convolution memory network in short-term
CN109215662A (en) * 2018-09-18 2019-01-15 平安科技(深圳)有限公司 End-to-end audio recognition method, electronic device and computer readable storage medium
CN110164416A (en) * 2018-12-07 2019-08-23 腾讯科技(深圳)有限公司 A kind of audio recognition method and its device, equipment and storage medium
CN110176228A (en) * 2019-05-29 2019-08-27 广州伟宏智能科技有限公司 A kind of small corpus audio recognition method and system

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180061439A1 (en) * 2016-08-31 2018-03-01 Gregory Frederick Diamos Automatic audio captioning
CN108520753A (en) * 2018-02-26 2018-09-11 南京工程学院 Voice lie detection method based on the two-way length of convolution memory network in short-term
CN109215662A (en) * 2018-09-18 2019-01-15 平安科技(深圳)有限公司 End-to-end audio recognition method, electronic device and computer readable storage medium
CN110164416A (en) * 2018-12-07 2019-08-23 腾讯科技(深圳)有限公司 A kind of audio recognition method and its device, equipment and storage medium
CN110176228A (en) * 2019-05-29 2019-08-27 广州伟宏智能科技有限公司 A kind of small corpus audio recognition method and system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
GONG SHUAI ET AL.: "A Convenient and Extensible Offline Chinese Speech Recognition System Based on Convolutional CTC Networks", 《2019 CHINESE CONTROL CONFERENCE (CCC)》 *
龙星延: "Research on End-to-End Speech Recognition Technology Based on an Attention Mechanism", 《中国硕士电子期刊》 (China Master's Electronic Journal) *

Cited By (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111179918A (en) * 2020-02-20 2020-05-19 中国科学院声学研究所 Joint meaning time classification and truncation type attention combined online voice recognition technology
CN111508487A (en) * 2020-04-13 2020-08-07 深圳市友杰智新科技有限公司 Feature extraction method and voice command recognition method based on expansion mechanism
CN111508487B (en) * 2020-04-13 2023-07-18 深圳市友杰智新科技有限公司 Feature extraction method and voice command recognition method based on expansion mechanism
CN111508493A (en) * 2020-04-20 2020-08-07 Oppo广东移动通信有限公司 Voice wake-up method and device, electronic equipment and storage medium
CN111508493B (en) * 2020-04-20 2022-11-15 Oppo广东移动通信有限公司 Voice wake-up method and device, electronic equipment and storage medium
CN111824879A (en) * 2020-07-02 2020-10-27 南京安杰信息科技有限公司 Intelligent voice contactless elevator control method, system and storage medium
CN111824879B (en) * 2020-07-02 2021-03-30 南京安杰信息科技有限公司 Intelligent voice contactless elevator control method, system and storage medium
CN111933154A (en) * 2020-07-16 2020-11-13 平安科技(深圳)有限公司 Method and device for identifying counterfeit voice and computer readable storage medium
CN111933154B (en) * 2020-07-16 2024-02-13 平安科技(深圳)有限公司 Method, equipment and computer readable storage medium for recognizing fake voice
CN113257240A (en) * 2020-10-30 2021-08-13 国网天津市电力公司 End-to-end voice recognition method based on countermeasure training
CN112489677A (en) * 2020-11-20 2021-03-12 平安科技(深圳)有限公司 Voice endpoint detection method, device, equipment and medium based on neural network
WO2021208728A1 (en) * 2020-11-20 2021-10-21 平安科技(深圳)有限公司 Method and apparatus for speech endpoint detection based on neural network, device, and medium
CN112489677B (en) * 2020-11-20 2023-09-22 平安科技(深圳)有限公司 Voice endpoint detection method, device, equipment and medium based on neural network
CN112786019A (en) * 2021-01-04 2021-05-11 中国人民解放军32050部队 System and method for realizing voice transcription through image recognition mode
CN112967739B (en) * 2021-02-26 2022-09-06 山东省计算中心(国家超级计算济南中心) Voice endpoint detection method and system based on long-term and short-term memory network
CN112967739A (en) * 2021-02-26 2021-06-15 山东省计算中心(国家超级计算济南中心) Voice endpoint detection method and system based on long-term and short-term memory network
CN113327590A (en) * 2021-04-15 2021-08-31 中标软件有限公司 Speech recognition method
CN113127622A (en) * 2021-04-29 2021-07-16 西北师范大学 Method and system for generating voice to image
CN113409827A (en) * 2021-06-17 2021-09-17 山东省计算中心(国家超级计算济南中心) Voice endpoint detection method and system based on local convolution block attention network
CN113422875A (en) * 2021-06-22 2021-09-21 中国银行股份有限公司 Voice seat response method, device, equipment and storage medium
CN113422875B (en) * 2021-06-22 2022-11-25 中国银行股份有限公司 Voice seat response method, device, equipment and storage medium
CN115294973A (en) * 2022-09-30 2022-11-04 云南师范大学 Va-language isolated vocabulary identification method based on convolutional neural network and attention mechanism

Similar Documents

Publication Publication Date Title
CN110767218A (en) End-to-end speech recognition method, system, device and storage medium thereof
CN111402891B (en) Speech recognition method, device, equipment and storage medium
CN110909613A (en) Video character recognition method and device, storage medium and electronic equipment
CN107492382A (en) Voiceprint extracting method and device based on neutral net
CN103971675A (en) Automatic voice recognizing method and system
CN113094578B (en) Deep learning-based content recommendation method, device, equipment and storage medium
CN113066499B (en) Method and device for identifying identity of land-air conversation speaker
CN111833845A (en) Multi-language speech recognition model training method, device, equipment and storage medium
CN103871424A (en) Online speaking people cluster analysis method based on bayesian information criterion
JP6875819B2 (en) Acoustic model input data normalization device and method, and voice recognition device
CN111081219A (en) End-to-end voice intention recognition method
CN112735385A (en) Voice endpoint detection method and device, computer equipment and storage medium
CN114694255B (en) Sentence-level lip language recognition method based on channel attention and time convolution network
CN105869622B (en) Chinese hot word detection method and device
CN117115581A (en) Intelligent misoperation early warning method and system based on multi-mode deep learning
CN115312033A (en) Speech emotion recognition method, device, equipment and medium based on artificial intelligence
Elbarougy Speech emotion recognition based on voiced emotion unit
CN110853669B (en) Audio identification method, device and equipment
US20220277732A1 (en) Method and apparatus for training speech recognition model, electronic device and storage medium
CN115831125A (en) Speech recognition method, device, equipment, storage medium and product
CN112863518B (en) Method and device for recognizing voice data subject
CN115050350A (en) Label checking method and related device, electronic equipment and storage medium
CN113158669B (en) Method and system for identifying positive and negative comments of employment platform
CN110660384B (en) Mongolian special-shaped homophone acoustic modeling method based on end-to-end
CN114841143A (en) Voice room quality evaluation method and device, equipment, medium and product thereof

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20200207