CN110767218A - End-to-end speech recognition method, system, device and storage medium thereof - Google Patents

End-to-end speech recognition method, system, device and storage medium thereof

Info

Publication number
CN110767218A
CN110767218A
Authority
CN
China
Prior art keywords
voice
model
speech recognition
attention mechanism
attention
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911057703.7A
Other languages
Chinese (zh)
Inventor
李浩然
颜丙聪
赵力
张玲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Lizhi Psychological Big Data Industry Research Institute Co Ltd
Original Assignee
Nanjing Lizhi Psychological Big Data Industry Research Institute Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Lizhi Psychological Big Data Industry Research Institute Co Ltd filed Critical Nanjing Lizhi Psychological Big Data Industry Research Institute Co Ltd
Priority to CN201911057703.7A priority Critical patent/CN110767218A/en
Publication of CN110767218A publication Critical patent/CN110767218A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/02 - Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 - Training
    • G10L 15/08 - Speech classification or search
    • G10L 15/16 - Speech classification or search using artificial neural networks
    • G10L 15/26 - Speech to text systems
    • G10L 25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/27 - Speech or voice analysis techniques characterised by the analysis technique
    • G10L 25/30 - Speech or voice analysis techniques characterised by the analysis technique using neural networks
    • G10L 25/45 - Speech or voice analysis techniques characterised by the type of analysis window

Abstract

The application discloses an end-to-end speech recognition method, system and device, and a storage medium thereof. The end-to-end speech recognition system is based on a convolutional neural network (CNN) and an attention mechanism: the attention mechanism is fused into the convolutional neural network, and a complete speech recognition network model is built with a CTC loss function to realize deep learning. A spectrogram is extracted from the original speech data and used as the input of the CNN, which improves recognition performance and greatly reduces the information loss caused by manual feature extraction, so the method has good application prospects.

Description

End-to-end speech recognition method, system, device and storage medium thereof
Technical Field
The present application relates to the field of speech recognition technologies, and in particular, to an end-to-end speech recognition method, system, and apparatus based on a convolutional neural network and an attention mechanism, and a storage medium thereof.
Background
Speech recognition has been an active research area in recent years and is an important means of human-machine interaction. A typical speech recognition system works as follows: the input analog speech signal is first pre-processed, including pre-filtering, sampling and quantization, windowing, endpoint detection, pre-emphasis, and so on. After the speech signal is pre-processed, the next important step is feature parameter extraction. The features are then learned with machine learning and deep learning algorithms such as HMMs or LSTMs.
The above work has advanced the study of speech recognition, but several problems still deserve deeper study, as follows:
(1) recognition of accented (dialect) speech;
(2) the extraction of features from the original speech inevitably loses information, and it is unknown whether the lost information affects the final recognition result;
(3) the effect of background noise on recognition accuracy.
Ways to overcome these problems are currently needed.
Disclosure of Invention
In order to solve the above technical problems, embodiments of the present application provide an end-to-end speech recognition method, system, apparatus and storage medium thereof based on a convolutional neural network and an attention mechanism.
A first aspect of an embodiment of the present application provides an end-to-end speech recognition method based on a convolutional neural network and an attention mechanism, which may include:
collecting voice data, carrying out unified normalization processing on the whole voice data, and then segmenting the voice data according to a database label;
performing frame windowing on the segmented voice and then acquiring a frequency spectrum by using fast Fourier transform;
introducing an attention mechanism, and combining the attention mechanism with a convolutional neural network to construct a complete speech recognition network model;
and training the speech recognition network model, taking the pre-processed speech data as the input of the speech recognition network model, training and learning the parameters of the model, and obtaining the required speech recognition network model for recognition after evaluating the word error rate.
Further, the step of performing unified normalization processing on the whole voice data and then segmenting the voice data according to the database tags includes:
normalizing the amplitude range of the whole speech to a threshold range symmetric about the zero point, wherein a value of zero has the same physical meaning before and after normalization, namely an unvoiced (silent) segment.
Further, the introducing attention mechanism, combining the attention mechanism with the convolutional neural network, includes:
and introducing an attention mechanism into the convolutional neural network, wherein the attention mechanism is realized by multiplying two fully-connected layers A and B, the fully-connected layer B is used as an attention weight, and the weight is an attention distribution probability distribution numerical value which is obtained after the weight of A is subjected to Softmax regression and accords with a probability distribution value interval.
Furthermore, the speech recognition network model adopts a CNN + CTC model with a VGG16-style base architecture: 10 convolutional layers, 5 pooling layers and 5 fully-connected layers, of which three fully-connected layers are used to realize the attention mechanism; a CTC loss function is adopted as the loss function, and an Adam optimizer is adopted as the network optimizer.
A second aspect of the embodiments of the present application provides an end-to-end speech recognition system based on a convolutional neural network and an attention mechanism, including:
the voice receiving unit is used for receiving the whole voice and segmenting the voice after normalizing the voice;
the spectrum acquisition unit is used for acquiring spectrum data from the segmented voice data by utilizing Fourier transform;
the model building unit is used for combining an attention mechanism with a convolutional neural network to build a complete speech recognition network model;
and a model training unit, configured to optimize the model parameters using the speech data as training content and to train the model with the word error rate as the optimization target.
Further, the voice receiving unit is configured to: normalize the amplitude range of the whole speech to a threshold range symmetric about the zero point, wherein a value of zero has the same physical meaning before and after normalization, namely an unvoiced (silent) segment.
Further, the spectrum acquisition unit includes:
the window function processing unit is used for performing frame windowing on the voice data obtained by segmentation by using a preset window function;
and the spectrum acquisition unit is used for performing a fast Fourier transform on the data processed by the window function and keeping only half of the spectrum length.
Further, the building model unit includes:
introducing an attention mechanism into the convolutional neural network, wherein the attention mechanism is realized by multiplying two fully-connected layers A and B: the fully-connected layer B serves as the attention weights, and the weights are attention-distribution probability values, obtained by passing the output of A through Softmax so that they conform to a probability-distribution value range;
the speech recognition network model adopts a CNN + CTC model with a VGG16-style base architecture: 10 convolutional layers, 5 pooling layers and 5 fully-connected layers, of which three fully-connected layers are used to realize the attention mechanism; a CTC loss function is adopted as the loss function, and an Adam optimizer is adopted as the network optimizer.
In a third aspect, an embodiment of the present application provides an identification apparatus, which includes a memory and a processor, where the memory stores computer-executable instructions, and the processor executes the computer-executable instructions on the memory to implement the method of the first aspect.
In a fourth aspect, the present application provides a computer-readable storage medium, on which a computer program is stored, and when the computer program is executed by a processor, the computer program implements the method of the first aspect.
In the embodiments of the application, the end-to-end speech recognition system based on a convolutional neural network and an attention mechanism realizes deep learning by fusing the attention mechanism into the convolutional neural network and building a complete speech recognition network model with a CTC loss function. A spectrogram is extracted from the original speech data and used as the input of the CNN (convolutional neural network), which improves recognition performance, greatly reduces the information loss caused by manual feature extraction, and gives the system good application prospects.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present application, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
FIG. 1 is a schematic diagram of the steps of an end-to-end speech recognition system based on a convolutional neural network and an attention mechanism of the present invention.
FIG. 2 is a schematic flow diagram of FIG. 1;
fig. 3 is a line graph of WER results from testing the model of the present invention on a validation set.
FIG. 4 is a schematic block diagram of an identification system provided by an embodiment of the present application;
fig. 5 is a schematic structural diagram of an identification device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some, but not all, embodiments of the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
It will be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the present application herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in the specification of the present application and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should be further understood that the term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.
As used in this specification and the appended claims, the term "if" may be interpreted contextually as "when", "upon" or "in response to a determination" or "in response to a detection". Similarly, the phrase "if it is determined" or "if a [ described condition or event ] is detected" may be interpreted contextually to mean "upon determining" or "in response to determining" or "upon detecting [ described condition or event ]" or "in response to detecting [ described condition or event ]".
Referring to fig. 1, which is a schematic flow chart of an identification method provided in an embodiment of the present application, as shown in the figure, the method may include:
101: and voice data is collected, and the whole voice data is subjected to unified normalization processing and then is segmented according to the database label.
It is understood that the whole utterance is determined according to the pause intervals in the collected speech: sentence breaks are given by pauses in the dialogue, and a continuous stretch of speech is taken as one whole utterance. After collection, the whole utterance is normalized to an interval symmetric about the zero point; in this embodiment the normalization range is [-1, 1], and the physical meaning of a zero value is unchanged before and after normalization, namely an unvoiced (silent) segment.
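As a minimal illustration of this normalization step (the helper name and the use of NumPy are assumptions of this sketch, not part of the patent), an utterance can be scaled into [-1, 1] by dividing by its peak absolute amplitude, which leaves zero-valued (silent) samples at zero:

```python
import numpy as np

def normalize_speech(waveform: np.ndarray) -> np.ndarray:
    """Scale a whole utterance into [-1, 1], symmetric about the zero point.

    Dividing by the peak absolute amplitude keeps zero-valued samples
    (silent segments) at zero, so their physical meaning is unchanged.
    """
    peak = np.max(np.abs(waveform))
    if peak == 0:
        return waveform.astype(np.float32)  # all-silence input: nothing to scale
    return (waveform / peak).astype(np.float32)
```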
When segmenting, the uniformly normalized speech is cut according to the database labels. The database is a professional database built by phoneticians for speech recognition research; in this embodiment it is the professional speech recognition database built by Tsinghua University, recorded with a single carbon-granule microphone in a quiet office environment, with a total duration of more than 30 hours. Most of the recorded speakers are college students who speak fluent Mandarin. The sampling frequency is 16 kHz and the sample size is 16 bits. After segmentation according to the database labels, 10,000 valid utterances are obtained, of which 500 serve as the validation set, 500 as the test set, and the rest as the training set; the longest utterance has a length L of 343,208 samples, about 21.45 seconds.
102: and performing frame windowing on the segmented voice and then acquiring a frequency spectrum by using fast Fourier transform.
It will be appreciated that framing and windowing are both pre-processing stages of speech signal feature extraction: framing is performed first, then windowing, and then the fast Fourier transform.
Framing: in short, a segment of speech signal is not stationary as a whole, but can be considered stationary locally. The later processing stages require a stationary input, so the whole speech signal is framed, i.e. divided into segments. The signal can be considered stationary within 10-30 ms; a frame is generally not shorter than 20 ms, with a frame shift of about half the frame duration. The frame shift creates an overlap between adjacent frames, which avoids excessive change from one frame to the next.
Windowing: after framing as above, discontinuities appear at the beginning and end of each frame, so the more frames there are, the larger the error relative to the original signal. Windowing solves this problem: it makes the framed signal continuous, so that each frame exhibits the characteristics of a periodic function. A Hamming window is typically used in speech signal processing.
As a specific embodiment, the segmented speech is windowed and framed;
in the framing processing, the frame length I is 1024, the interframe overlapping rate p is 25%, and the maximum frame number H is 447.
The window function is a Hamming window W(n, α), calculated as W(n, α) = (1 − α) − α·cos(2πn/(N − 1)), 0 ≤ n ≤ N − 1, where α takes the value 0.46 and N, the range of n, is the length of the Hamming window.
The speech is then passed through a fast Fourier transform to obtain its spectrum; since the spectrum of a real signal is symmetric, only half of its length is kept. The fast Fourier transform is formulated as:
X(k) = Σ_{n=0}^{N−1} x(n)·e^(−j2πnk/N), k = 0, 1, …, N − 1
since this step belongs to a common technical means in speech recognition, it is not repeated.
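Even so, a compact sketch of the framing, windowing and FFT steps above may be helpful. It uses the frame length, overlap rate and Hamming coefficient given in this embodiment; the use of the magnitude spectrum and the NumPy-based implementation are assumptions of the sketch, not mandated by the patent:

```python
import numpy as np

FRAME_LEN = 1024   # frame length I from the embodiment
OVERLAP = 0.25     # inter-frame overlap rate p
ALPHA = 0.46       # Hamming window coefficient

def spectrogram(waveform: np.ndarray) -> np.ndarray:
    """Frame, window (Hamming) and FFT a normalized utterance.

    Only the first half of each FFT is kept, because the spectrum of a
    real-valued signal is symmetric.
    """
    hop = int(FRAME_LEN * (1 - OVERLAP))  # frame shift between adjacent frames
    n = np.arange(FRAME_LEN)
    window = (1 - ALPHA) - ALPHA * np.cos(2 * np.pi * n / (FRAME_LEN - 1))
    frames = []
    for start in range(0, len(waveform) - FRAME_LEN + 1, hop):
        frame = waveform[start:start + FRAME_LEN] * window
        frames.append(np.abs(np.fft.fft(frame))[:FRAME_LEN // 2])  # keep half the spectrum
    return np.stack(frames)  # shape: (num_frames, FRAME_LEN // 2)
```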
103: and (4) introducing an attention mechanism, and combining the attention mechanism with the convolutional neural network to construct a complete speech recognition network model.
It can be understood that, in the present application, an attention mechanism is introduced into the convolutional neural network and realized by multiplying two fully-connected layers A and B: the fully-connected layer B serves as the attention weights, and the weights are attention-distribution probability values, obtained by passing the output of A through Softmax so that they conform to a probability-distribution value range.
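Read literally, this can be sketched with Keras layers as two parallel fully-connected branches applied to the same features, one of which is passed through Softmax so that its outputs lie in a probability-distribution range before the element-wise multiplication. The layer size and the choice of which branch receives the Softmax are assumptions of the sketch, since the patent text does not fix them:

```python
from tensorflow.keras import layers

def attention_block(features, units=128):
    """Attention realized as the product of two fully-connected layers (sketch).

    `branch_b` plays the role of the attention weights: its Softmax output is a
    probability-distribution value that rescales `branch_a` element-wise.
    `units` is an assumed size, not taken from the patent.
    """
    branch_a = layers.Dense(units)(features)                        # fully-connected layer A
    branch_b = layers.Dense(units, activation="softmax")(features)  # fully-connected layer B -> weights
    return layers.Multiply()([branch_a, branch_b])                  # element-wise product
```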
In constructing the speech recognition network model, a CNN + CTC model with a VGG16-style base architecture is adopted: 10 convolutional layers, 5 pooling layers and 5 fully-connected layers, of which three fully-connected layers are used to realize the attention mechanism; a CTC loss function is adopted as the loss function, and an Adam optimizer is adopted as the network optimizer.
The convolutional layers extract features from the spectrogram; the pooling layers further extract the main features and reduce the number of parameters, and after each pooling layer dropout randomly discards part of the neurons to prevent over-fitting. After the convolutional and pooling layers, a reshape layer compresses the feature maps into a form the fully-connected layers can accept as input, the weighted attention mechanism is then introduced as a multiplication of fully-connected layers, and a final fully-connected layer performs the classification. The other network parameter settings are shown in Table 1, and a sketch of this architecture in code follows the table.
Table 1. Network parameter settings:
Initial learning rate: 0.0001
Training batch size: 32
Inter-layer unit connection rate (dropout): 0.6
Convolution output channels: 2
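The following Keras-style sketch puts the pieces together: five blocks of two convolutional layers each, a pooling layer and dropout 0.6 per block (10 convolutional and 5 pooling layers in total), a reshape, the fully-connected attention product, and a Softmax classifier. The filter counts, layer sizes, input shape and vocabulary size are assumptions, and the sketch uses fewer fully-connected layers than the five described, so it illustrates the structure rather than reproducing the patent's exact network:

```python
from tensorflow.keras import layers, models

NUM_CLASSES = 1200   # assumed output vocabulary size (not given in the patent)

def build_recognition_model(input_shape=(447, 512, 1)):
    """VGG-style CNN with a fully-connected attention product and a Softmax head."""
    inp = layers.Input(shape=input_shape)        # spectrogram: up to 447 frames x 512 spectrum bins
    x = inp
    for filters in (32, 64, 128, 128, 128):      # assumed filter counts, 2 conv layers per block
        x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
        x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
        x = layers.MaxPooling2D(2)(x)            # 5 pooling layers in total
        x = layers.Dropout(0.6)(x)               # dropout rate from Table 1
    # reshape so each remaining time step becomes a flat feature vector
    x = layers.Reshape((-1, x.shape[2] * x.shape[3]))(x)
    branch_a = layers.Dense(256)(x)                          # fully-connected layer A
    branch_b = layers.Dense(256, activation="softmax")(x)    # fully-connected layer B -> attention weights
    x = layers.Multiply()([branch_a, branch_b])              # attention as an element-wise product
    out = layers.Dense(NUM_CLASSES + 1, activation="softmax")(x)  # +1 for the CTC blank label
    return models.Model(inp, out)
```

Training would pair such a network with a CTC loss (for example tf.nn.ctc_loss, with the input and label lengths wired in) and an Adam optimizer at the Table 1 learning rate of 0.0001 and batch size 32; that wiring is omitted from the sketch.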
104: training the speech recognition network model, taking the pre-processed speech data as its input, training and learning the model parameters, and obtaining the required speech recognition network model for recognition after evaluating the word error rate.
It can be understood that, after the model is built, its parameters need to be continuously adjusted through training on a large amount of data, so that the model better fits the intended application and can accurately output text from speech data in actual use.
As a specific embodiment, when training the speech recognition network model, the pre-processed speech data is used as the input of the model and the model parameters are trained and learned; the model is then evaluated by the WER (Word Error Rate). To make the recognized word sequence consistent with the standard word sequence, some words need to be substituted, deleted or inserted; the total number of inserted, substituted or deleted words, divided by the total number of words in the standard word sequence and expressed as a percentage, is the WER. The calculation formula is as follows:
WER = (S + D + I) / N × 100%
where S is the number of substitutions, D the number of deletions, I the number of insertions, and N the total number of Chinese characters.
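A minimal sketch of this evaluation (a plain edit-distance implementation; the function name and the character-level granularity are assumptions of the sketch) could look as follows:

```python
def word_error_rate(reference, hypothesis):
    """WER = (S + D + I) / N via Levenshtein alignment of two unit sequences.

    `reference` and `hypothesis` are sequences of recognition units
    (Chinese characters here); N is the length of the reference.
    """
    n, m = len(reference), len(hypothesis)
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i                              # i deletions
    for j in range(m + 1):
        dist[0][j] = j                              # j insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1,         # insertion
                             dist[i - 1][j - 1] + cost)  # match / substitution
    return dist[n][m] / n if n else 0.0

# e.g. word_error_rate(list("今天天气很好"), list("今天天汽很好")) ≈ 0.167 (one substitution out of six characters)
```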
Through this evaluation, the whole model is continuously corrected until it achieves the desired output.
In the specific training process, a validation pass on the validation set is performed every 200 training steps and the WER on the validation set is recorded. The final results on the validation set are shown in Fig. 3: as the training steps accumulate, the overall WER converges to 20.35%, and a WER of 19.80% is finally obtained on the test set.
To sum up, the end-to-end speech recognition system based on a convolutional neural network and an attention mechanism of the present invention realizes deep learning by fusing the attention mechanism into the convolutional neural network and building a complete speech recognition network model with a CTC loss function. It extracts a spectrogram from the original speech data as the input of the CNN, which improves recognition performance, greatly reduces the information loss caused by manual feature extraction, and gives the system good application prospects.
Embodiments of the present application also provide an end-to-end speech recognition system based on a convolutional neural network and an attention mechanism, the system being configured to perform any of the methods described above. Specifically, referring to Fig. 4, Fig. 4 is a schematic block diagram of the recognition system provided in an embodiment of the present application. The system of this embodiment comprises: a speech receiving unit 310, a spectrum obtaining unit 320, a model building unit 330 and a model training unit 340.
The voice receiving unit 310 is configured to receive a whole segment of voice, normalize the segment of voice, and then segment the normalized segment of voice.
And a spectrum obtaining unit 320, configured to obtain spectrum data from the segmented voice data by using fourier transform.
And a model building unit 330 for combining the attention mechanism with the convolutional neural network to build a complete speech recognition network model.
And a training model unit 340, configured to optimize model parameters by using the speech data as training content, and train a model with the word error rate as an optimization target.
The speech receiving unit 310 normalizes the amplitude range of the whole speech to a threshold range symmetric about the zero point, wherein a value of zero has the same physical meaning before and after normalization, namely an unvoiced (silent) segment.
As an alternative embodiment, the normalization range is [-1, 1]; the physical meaning of a zero value is unchanged before and after normalization, namely a silent segment.
When segmenting, the uniformly normalized speech is cut according to the database labels. The database is a professional database built by phoneticians for speech recognition research; in this embodiment it is the professional speech recognition database built by Tsinghua University, recorded with a single carbon-granule microphone in a quiet office environment, with a total duration of more than 30 hours. Most of the recorded speakers are college students who speak fluent Mandarin. The sampling frequency is 16 kHz and the sample size is 16 bits. After segmentation according to the database labels, 10,000 valid utterances are obtained, of which 500 serve as the validation set, 500 as the test set, and the rest as the training set; the longest utterance has a length L of 343,208 samples, about 21.45 seconds.
The spectrum obtaining unit 320 is specifically configured to perform frame windowing on the segmented speech and then obtain a spectrum by using fast fourier transform.
As an optional implementation manner, the spectrum obtaining unit 320 includes:
the framing unit 321 determines the number of frames of the segmented speech. In this embodiment, the frame length I in the framing processing is 1024, the interframe overlapping rate p is 25%, and the maximum frame number H is 447.
After framing, discontinuities appear at the beginning and end of each frame, so the more frames there are, the larger the error relative to the original signal. The windowing unit 322 solves this problem: windowing makes the framed signal continuous, so that each frame exhibits the characteristics of a periodic function. A Hamming window is typically used in speech signal processing.
In this embodiment, the window function is a Hamming window W(n, α), calculated as W(n, α) = (1 − α) − α·cos(2πn/(N − 1)), 0 ≤ n ≤ N − 1, where α takes the value 0.46 and N, the range of n, is the length of the Hamming window.
The fast Fourier transform unit 323 performs a fast Fourier transform on the speech to obtain its spectrum; since the spectrum is symmetric, only half of its length is kept. The fast Fourier transform is formulated as:
X(k) = Σ_{n=0}^{N−1} x(n)·e^(−j2πnk/N), k = 0, 1, …, N − 1
the model building unit 330 is used to combine the attention mechanism with the convolutional neural network to build a complete speech recognition network model.
It can be understood that, in the present application, an attention mechanism is introduced into the convolutional neural network and realized by multiplying two fully-connected layers A and B: the fully-connected layer B serves as the attention weights, and the weights are attention-distribution probability values, obtained by passing the output of A through Softmax so that they conform to a probability-distribution value range. In constructing the speech recognition network model, a CNN + CTC model with a VGG16-style base architecture is adopted: 10 convolutional layers, 5 pooling layers and 5 fully-connected layers, of which three fully-connected layers are used to realize the attention mechanism; a CTC loss function is adopted as the loss function, and an Adam optimizer is adopted as the network optimizer.
The model training unit 340 is configured to optimize the model parameters using the speech data as training content and to train the model with the word error rate as the optimization target.
As a specific embodiment, this unit takes the pre-processed speech data as the input of the speech recognition network model, trains and learns the parameters of the model, and evaluates the model by the WER (Word Error Rate). To make the recognized word sequence consistent with the standard word sequence, some words need to be substituted, deleted or inserted; the total number of inserted, substituted or deleted words, divided by the total number of words in the standard word sequence and expressed as a percentage, is the WER. The calculation formula is as follows:
WER = (S + D + I) / N × 100%
where S is the number of substitutions, D the number of deletions, I the number of insertions, and N the total number of Chinese characters. Through this evaluation, the whole model is continuously corrected until it achieves the desired output.
Fig. 5 is a schematic structural diagram of a recognition device according to an embodiment of the present application. The recognition apparatus 4000 comprises a processor 41 and may further comprise an input device 42, an output device 43 and a memory 44. The input device 42, the output device 43, the memory 44 and the processor 41 are connected to each other via a bus.
The memory includes, but is not limited to, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM), or a portable read-only memory (CD-ROM), which is used for storing instructions and data.
The input means are for inputting data and/or signals and the output means are for outputting data and/or signals. The output means and the input means may be separate devices or may be an integral device.
The processor may include one or more processors, for example, one or more Central Processing Units (CPUs), and in the case of one CPU, the CPU may be a single-core CPU or a multi-core CPU. The processor may also include one or more special purpose processors, which may include GPUs, FPGAs, etc., for accelerated processing.
The memory is used to store program codes and data of the network device.
The processor is used for calling the program codes and data in the memory and executing the steps in the method embodiment. Specifically, reference may be made to the description of the method embodiment, which is not repeated herein.
It will be appreciated that Fig. 5 only shows a simplified design of the recognition device. In practical applications, the recognition devices may also respectively include other necessary components, including but not limited to any number of input/output devices, processors, controllers, memories, etc., and all recognition devices that can implement the embodiments of the present application are within the scope of the present application.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. For example, the division of the unit is only one logical function division, and other division may be implemented in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. The shown or discussed mutual coupling, direct coupling or communication connection may be an indirect coupling or communication connection of devices or units through some interfaces, and may be in an electrical, mechanical or other form.
Units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In the above embodiments, the implementation may be wholly or partially realized by software, hardware, firmware, or any combination thereof. When implemented in software, may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. The procedures or functions according to the embodiments of the present application are wholly or partially generated when the computer program instructions are loaded and executed on a computer. The computer may be a general purpose computer, a special purpose computer, a network of computers, or other programmable device. The computer instructions may be stored on or transmitted over a computer-readable storage medium. The computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wire (e.g., coaxial cable, fiber optic, Digital Subscriber Line (DSL)), or wirelessly (e.g., infrared, wireless, microwave, etc.). The computer-readable storage medium can be any available medium that can be accessed by a computer or a data storage device, such as a server, a data center, etc., that includes one or more of the available media. The usable medium may be a read-only memory (ROM), or a Random Access Memory (RAM), or a magnetic medium, such as a floppy disk, a hard disk, a magnetic tape, a magnetic disk, or an optical medium, such as a Digital Versatile Disk (DVD), or a semiconductor medium, such as a Solid State Disk (SSD).
Although the preferred embodiments of the present invention have been described in detail, the present invention is not limited to the details of the foregoing embodiments; various equivalent changes (such as in number, shape, position, etc.) may be made to the technical solution of the present invention within its technical spirit, and such equivalents fall within the protection scope of the present invention.

Claims (10)

1. An end-to-end speech recognition method, characterized by: the method comprises the following steps:
collecting voice data, carrying out unified normalization processing on the whole voice data, and then segmenting the voice data according to a database label;
performing frame windowing on the segmented voice and then acquiring a frequency spectrum by using fast Fourier transform;
introducing an attention mechanism, and combining the attention mechanism with a convolutional neural network to construct a complete speech recognition network model;
and training the speech recognition network model, taking the pre-processed speech data as the input of the speech recognition network model, training and learning the parameters of the speech recognition network model, and obtaining the required speech recognition network model for recognition after evaluating the word error rate.
2. The end-to-end speech recognition method of claim 1,
the step of performing unified normalization processing on the whole voice data and then segmenting the voice data according to the database tags comprises the following steps:
normalizing the amplitude range of the whole speech to a threshold range symmetric about the zero point, wherein a value of zero has the same physical meaning before and after normalization, namely an unvoiced (silent) segment.
3. The end-to-end speech recognition method of claim 1,
the attention-drawing mechanism, which is combined with the convolutional neural network, comprises:
and introducing an attention mechanism into the convolutional neural network, wherein the attention mechanism is realized by multiplying two fully-connected layers A and B, the fully-connected layer B is used as an attention weight, and the weight is an attention distribution probability distribution numerical value which is obtained after the weight of A passes Softmax and accords with a probability distribution value interval.
4. The end-to-end speech recognition method of claim 1,
the voice recognition network model adopts a CNN + CTC model, a VGG16 basic model architecture, 10 convolutional layers, 5 pooling layers and 5 full-connection layers, wherein the three full-connection layers are used for realizing an attention mechanism, a CTC loss function is adopted as a loss function, and an Adam optimizer is adopted as a network optimizer.
5. An end-to-end speech recognition system, comprising:
the voice receiving unit is used for receiving the whole voice and segmenting the voice after normalizing the voice;
the spectrum acquisition unit is used for acquiring spectrum data from the segmented voice data by utilizing Fourier transform;
the model building unit is used for combining an attention mechanism with a convolutional neural network to build a complete speech recognition network model;
and a model training unit, configured to optimize the model parameters using the speech data as training content and to train the model with the word error rate as the optimization target.
6. The end-to-end speech recognition system of claim 5,
the voice receiving unit is configured to normalize the amplitude range of the whole speech to a threshold range symmetric about the zero point, wherein a value of zero has the same physical meaning before and after normalization, namely an unvoiced (silent) segment.
7. The end-to-end speech recognition system of claim 6,
the spectrum acquisition unit includes:
the window function processing unit is used for performing frame windowing on the voice data obtained by segmentation by using a preset window function;
and the spectrum acquisition unit is used for performing a fast Fourier transform on the data processed by the window function and keeping only half of the spectrum length.
8. The end-to-end speech recognition system of claim 7,
the model building unit comprises:
introducing an attention mechanism into the convolutional neural network, wherein the attention mechanism is realized by multiplying two fully-connected layers A and B: the fully-connected layer B serves as the attention weights, and the weights are attention-distribution probability values, obtained by passing the output of A through Softmax so that they conform to a probability-distribution value range;
the speech recognition network model adopts a CNN + CTC model with a VGG16-style base architecture: 10 convolutional layers, 5 pooling layers and 5 fully-connected layers, of which three fully-connected layers are used to realize the attention mechanism; a CTC loss function is adopted as the loss function, and an Adam optimizer is adopted as the network optimizer.
9. An identification device, comprising a memory having computer-executable instructions stored thereon and a processor that executes the computer-executable instructions on the memory to implement the method of any one of claims 1 to 4.
10. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the method of any one of the preceding claims 1 to 4.
CN201911057703.7A 2019-10-31 2019-10-31 End-to-end speech recognition method, system, device and storage medium thereof Pending CN110767218A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911057703.7A CN110767218A (en) 2019-10-31 2019-10-31 End-to-end speech recognition method, system, device and storage medium thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911057703.7A CN110767218A (en) 2019-10-31 2019-10-31 End-to-end speech recognition method, system, device and storage medium thereof

Publications (1)

Publication Number Publication Date
CN110767218A true CN110767218A (en) 2020-02-07

Family

ID=69335600

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911057703.7A Pending CN110767218A (en) 2019-10-31 2019-10-31 End-to-end speech recognition method, system, device and storage medium thereof

Country Status (1)

Country Link
CN (1) CN110767218A (en)

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111179918A (en) * 2020-02-20 2020-05-19 中国科学院声学研究所 Joint meaning time classification and truncation type attention combined online voice recognition technology
CN111508487A (en) * 2020-04-13 2020-08-07 深圳市友杰智新科技有限公司 Feature extraction method and voice command recognition method based on expansion mechanism
CN111508493A (en) * 2020-04-20 2020-08-07 Oppo广东移动通信有限公司 Voice wake-up method and device, electronic equipment and storage medium
CN111824879A (en) * 2020-07-02 2020-10-27 南京安杰信息科技有限公司 Intelligent voice contactless elevator control method, system and storage medium
CN111933154A (en) * 2020-07-16 2020-11-13 平安科技(深圳)有限公司 Method and device for identifying counterfeit voice and computer readable storage medium
CN112489677A (en) * 2020-11-20 2021-03-12 平安科技(深圳)有限公司 Voice endpoint detection method, device, equipment and medium based on neural network
CN112786019A (en) * 2021-01-04 2021-05-11 中国人民解放军32050部队 System and method for realizing voice transcription through image recognition mode
CN112967739A (en) * 2021-02-26 2021-06-15 山东省计算中心(国家超级计算济南中心) Voice endpoint detection method and system based on long-term and short-term memory network
CN113127622A (en) * 2021-04-29 2021-07-16 西北师范大学 Method and system for generating voice to image
CN113257240A (en) * 2020-10-30 2021-08-13 国网天津市电力公司 End-to-end voice recognition method based on countermeasure training
CN113327590A (en) * 2021-04-15 2021-08-31 中标软件有限公司 Speech recognition method
CN113409827A (en) * 2021-06-17 2021-09-17 山东省计算中心(国家超级计算济南中心) Voice endpoint detection method and system based on local convolution block attention network
CN113422875A (en) * 2021-06-22 2021-09-21 中国银行股份有限公司 Voice seat response method, device, equipment and storage medium
CN115294973A (en) * 2022-09-30 2022-11-04 云南师范大学 Va-language isolated vocabulary identification method based on convolutional neural network and attention mechanism

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180061439A1 (en) * 2016-08-31 2018-03-01 Gregory Frederick Diamos Automatic audio captioning
CN108520753A (en) * 2018-02-26 2018-09-11 南京工程学院 Voice lie detection method based on the two-way length of convolution memory network in short-term
CN109215662A (en) * 2018-09-18 2019-01-15 平安科技(深圳)有限公司 End-to-end audio recognition method, electronic device and computer readable storage medium
CN110164416A (en) * 2018-12-07 2019-08-23 腾讯科技(深圳)有限公司 A kind of audio recognition method and its device, equipment and storage medium
CN110176228A (en) * 2019-05-29 2019-08-27 广州伟宏智能科技有限公司 A kind of small corpus audio recognition method and system

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180061439A1 (en) * 2016-08-31 2018-03-01 Gregory Frederick Diamos Automatic audio captioning
CN108520753A (en) * 2018-02-26 2018-09-11 南京工程学院 Voice lie detection method based on the two-way length of convolution memory network in short-term
CN109215662A (en) * 2018-09-18 2019-01-15 平安科技(深圳)有限公司 End-to-end audio recognition method, electronic device and computer readable storage medium
CN110164416A (en) * 2018-12-07 2019-08-23 腾讯科技(深圳)有限公司 A kind of audio recognition method and its device, equipment and storage medium
CN110176228A (en) * 2019-05-29 2019-08-27 广州伟宏智能科技有限公司 A kind of small corpus audio recognition method and system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
GONG SHUAI ET AL.: "A Convenient and Extensible Offline Chinese Speech Recognition System Based on Convolutional CTC Networks", 《2019 CHINESE CONTROL CONFERENCE (CCC)》 *
龙星延: "Research on End-to-End Speech Recognition Technology Based on an Attention Mechanism", 《中国硕士电子期刊》 (China Master's Electronic Journal) *

Cited By (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111179918A (en) * 2020-02-20 2020-05-19 中国科学院声学研究所 Joint meaning time classification and truncation type attention combined online voice recognition technology
CN111508487A (en) * 2020-04-13 2020-08-07 深圳市友杰智新科技有限公司 Feature extraction method and voice command recognition method based on expansion mechanism
CN111508487B (en) * 2020-04-13 2023-07-18 深圳市友杰智新科技有限公司 Feature extraction method and voice command recognition method based on expansion mechanism
CN111508493A (en) * 2020-04-20 2020-08-07 Oppo广东移动通信有限公司 Voice wake-up method and device, electronic equipment and storage medium
CN111508493B (en) * 2020-04-20 2022-11-15 Oppo广东移动通信有限公司 Voice wake-up method and device, electronic equipment and storage medium
CN111824879A (en) * 2020-07-02 2020-10-27 南京安杰信息科技有限公司 Intelligent voice contactless elevator control method, system and storage medium
CN111824879B (en) * 2020-07-02 2021-03-30 南京安杰信息科技有限公司 Intelligent voice contactless elevator control method, system and storage medium
CN111933154A (en) * 2020-07-16 2020-11-13 平安科技(深圳)有限公司 Method and device for identifying counterfeit voice and computer readable storage medium
CN111933154B (en) * 2020-07-16 2024-02-13 平安科技(深圳)有限公司 Method, equipment and computer readable storage medium for recognizing fake voice
CN113257240A (en) * 2020-10-30 2021-08-13 国网天津市电力公司 End-to-end voice recognition method based on countermeasure training
CN112489677A (en) * 2020-11-20 2021-03-12 平安科技(深圳)有限公司 Voice endpoint detection method, device, equipment and medium based on neural network
WO2021208728A1 (en) * 2020-11-20 2021-10-21 平安科技(深圳)有限公司 Method and apparatus for speech endpoint detection based on neural network, device, and medium
CN112489677B (en) * 2020-11-20 2023-09-22 平安科技(深圳)有限公司 Voice endpoint detection method, device, equipment and medium based on neural network
CN112786019A (en) * 2021-01-04 2021-05-11 中国人民解放军32050部队 System and method for realizing voice transcription through image recognition mode
CN112967739B (en) * 2021-02-26 2022-09-06 山东省计算中心(国家超级计算济南中心) Voice endpoint detection method and system based on long-term and short-term memory network
CN112967739A (en) * 2021-02-26 2021-06-15 山东省计算中心(国家超级计算济南中心) Voice endpoint detection method and system based on long-term and short-term memory network
CN113327590A (en) * 2021-04-15 2021-08-31 中标软件有限公司 Speech recognition method
CN113127622A (en) * 2021-04-29 2021-07-16 西北师范大学 Method and system for generating voice to image
CN113409827A (en) * 2021-06-17 2021-09-17 山东省计算中心(国家超级计算济南中心) Voice endpoint detection method and system based on local convolution block attention network
CN113422875A (en) * 2021-06-22 2021-09-21 中国银行股份有限公司 Voice seat response method, device, equipment and storage medium
CN113422875B (en) * 2021-06-22 2022-11-25 中国银行股份有限公司 Voice seat response method, device, equipment and storage medium
CN115294973A (en) * 2022-09-30 2022-11-04 云南师范大学 Va-language isolated vocabulary identification method based on convolutional neural network and attention mechanism

Similar Documents

Publication Publication Date Title
CN110767218A (en) End-to-end speech recognition method, system, device and storage medium thereof
CN111402891B (en) Speech recognition method, device, equipment and storage medium
CN110909613A (en) Video character recognition method and device, storage medium and electronic equipment
CN107492382A (en) Voiceprint extracting method and device based on neutral net
CN103971675A (en) Automatic voice recognizing method and system
CN113094578B (en) Deep learning-based content recommendation method, device, equipment and storage medium
CN113066499B (en) Method and device for identifying identity of land-air conversation speaker
CN111833845A (en) Multi-language speech recognition model training method, device, equipment and storage medium
CN103871424A (en) Online speaking people cluster analysis method based on bayesian information criterion
JP6875819B2 (en) Acoustic model input data normalization device and method, and voice recognition device
CN111081219A (en) End-to-end voice intention recognition method
CN112735385A (en) Voice endpoint detection method and device, computer equipment and storage medium
CN114694255B (en) Sentence-level lip language recognition method based on channel attention and time convolution network
CN105869622B (en) Chinese hot word detection method and device
CN117115581A (en) Intelligent misoperation early warning method and system based on multi-mode deep learning
CN115312033A (en) Speech emotion recognition method, device, equipment and medium based on artificial intelligence
Elbarougy Speech emotion recognition based on voiced emotion unit
CN110853669B (en) Audio identification method, device and equipment
US20220277732A1 (en) Method and apparatus for training speech recognition model, electronic device and storage medium
CN115831125A (en) Speech recognition method, device, equipment, storage medium and product
CN112863518B (en) Method and device for recognizing voice data subject
CN115050350A (en) Label checking method and related device, electronic equipment and storage medium
CN113158669B (en) Method and system for identifying positive and negative comments of employment platform
CN110660384B (en) Mongolian special-shaped homophone acoustic modeling method based on end-to-end
CN114841143A (en) Voice room quality evaluation method and device, equipment, medium and product thereof

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20200207