CN111667835A - Voice recognition method, living body detection method, model training method and device - Google Patents

Voice recognition method, living body detection method, model training method and device

Info

Publication number
CN111667835A
Authority
CN
China
Prior art keywords
neural network
network
voice
character sequence
sequence
Prior art date
Legal status
Pending
Application number
CN202010493390.6A
Other languages
Chinese (zh)
Inventor
赵幸福
蒋宁
赵立军
Current Assignee
Mashang Xiaofei Finance Co Ltd
Mashang Consumer Finance Co Ltd
Original Assignee
Mashang Xiaofei Finance Co Ltd
Priority date
Filing date
Publication date
Application filed by Mashang Xiaofei Finance Co Ltd
Priority to CN202010493390.6A
Publication of CN111667835A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/02: Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/06: Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063: Training
    • G10L 15/08: Speech classification or search
    • G10L 15/16: Speech classification or search using artificial neural networks
    • G10L 15/24: Speech recognition using non-acoustical features
    • G10L 15/25: Speech recognition using non-acoustical features using position of the lips, movement of the lips or face analysis
    • G10L 15/26: Speech to text systems
    • G10L 2015/0631: Creating reference templates; Clustering

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a voice recognition method, a living body detection method, a model training method, and corresponding apparatuses. The voice recognition method comprises: extracting spectral features of a voice signal to be recognized; inputting the spectral features into a stacked convolutional neural network for processing; inputting the resulting feature information into a recurrent neural network; inputting the feature information output by the recurrent neural network into a sequence-to-sequence network for encoding and decoding; and outputting the character sequence corresponding to the voice signal as the recognition result. The recurrent neural network comprises a bidirectional gated recurrent unit network or a long short-term memory network. The voice recognition method provided by the invention achieves end-to-end voice recognition; it is not only fast, but also improves the accuracy of voice recognition of preset characters.

Description

Voice recognition method, living body detection method, model training method and device
Technical Field
The invention relates to the technical field of information processing, and in particular to a voice recognition method, a living body detection method, a model training method, and a model training apparatus.
Background
With the development of electronic technology and natural language processing, voice recognition is applied ever more widely. Existing voice recognition systems are typically designed to recognize all types of characters (e.g., Chinese characters, letters, and digits). To cover these different character types, such systems tend to have complex structures and slow recognition speeds. For example, a voice signal to be recognized is framed at a preset length, voice features are extracted, the extracted features are input into a trained phoneme acoustic model to obtain a phoneme result, and the phoneme result is input into a language model to obtain the recognition result.
However, some scenarios require voice recognition of only preset characters, such as password input or verification-code input, which usually involves recognizing only digits or only letters. In such cases, using a general-purpose voice recognition system is not only slow, but also yields poor recognition accuracy.
Disclosure of Invention
The embodiments of the invention provide a voice recognition method, a living body detection method, a model training method, and corresponding apparatuses, aiming to solve the prior-art problem that voice recognition of preset characters is insufficiently accurate.
In order to solve the technical problem, the invention is realized as follows:
in a first aspect, an embodiment of the present invention provides a speech recognition method. The method comprises the following steps:
extracting spectral features of a voice signal to be recognized;
inputting the spectral features into a stacked convolutional neural network for processing, inputting the resulting feature information into a recurrent neural network, inputting the feature information output by the recurrent neural network into a sequence-to-sequence network for encoding and decoding, and outputting a character sequence corresponding to the voice signal to obtain a recognition result;
wherein the recurrent neural network comprises a bidirectional gated recurrent unit network or a long short-term memory network.
In a second aspect, embodiments of the present invention provide a method for detecting a living organism. The method comprises the following steps:
collecting a voice signal of an object to be detected reading a target character sequence and a video signal containing the object's lips, wherein all characters in the target character sequence are preset characters;
recognizing the voice signal by using the voice recognition method to obtain a first character sequence corresponding to the voice signal;
performing lip language identification on the video signal to obtain a second character sequence corresponding to the video signal;
and judging whether the object is a living body according to the first character sequence and the second character sequence.
In a third aspect, an embodiment of the present invention provides a model training method. The method comprises the following steps:
acquiring N voice samples, wherein each voice sample corresponds to preset characters and N is a positive integer;
extracting spectral features of each of the N voice samples;
training a target network on the spectral features of the N voice samples to obtain a voice recognition model;
wherein the target network comprises a stacked convolutional neural network, a recurrent neural network, and a sequence-to-sequence network; the feature information output by the stacked convolutional neural network is input into the recurrent neural network, and the feature information output by the recurrent neural network is input into the sequence-to-sequence network; and the recurrent neural network comprises a bidirectional gated recurrent unit network or a long short-term memory network.
In a fourth aspect, an embodiment of the present invention further provides a speech recognition apparatus. The speech recognition apparatus includes:
an extraction module, configured to extract spectral features of a voice signal to be recognized;
a recognition module, configured to input the spectral features into a stacked convolutional neural network for processing, input the resulting feature information into a recurrent neural network, input the feature information output by the recurrent neural network into a sequence-to-sequence network for encoding and decoding, and output a character sequence corresponding to the voice signal to obtain a recognition result;
wherein the recurrent neural network comprises a bidirectional gated recurrent unit network or a long short-term memory network.
In a fifth aspect, an embodiment of the present invention further provides a living body detection apparatus. The living body detecting device includes:
the device comprises an acquisition module, a detection module and a display module, wherein the acquisition module is used for acquiring a voice signal of a target character sequence read by an object to be detected and a video signal containing lips, and characters in the target character sequence are preset characters;
the first recognition module is used for recognizing the voice signal by using the voice recognition method to obtain a first character sequence corresponding to the voice signal;
the second identification module is used for carrying out lip language identification on the video signal to obtain a second character sequence corresponding to the video signal;
and the judging module is used for judging whether the object is a living body according to the first character sequence and the second character sequence.
In a sixth aspect, an embodiment of the present invention further provides a model training apparatus. The model training device includes:
an acquisition module, configured to acquire N voice samples, wherein each voice sample corresponds to preset characters and N is a positive integer;
an extraction module, configured to extract spectral features of each of the N voice samples;
a training module, configured to train a target network on the spectral features of the N voice samples to obtain a voice recognition model;
wherein the target network comprises a stacked convolutional neural network, a recurrent neural network, and a sequence-to-sequence network; the feature information output by the stacked convolutional neural network is input into the recurrent neural network, and the feature information output by the recurrent neural network is input into the sequence-to-sequence network; and the recurrent neural network comprises a bidirectional gated recurrent unit network or a long short-term memory network.
In a seventh aspect, an embodiment of the invention further provides an electronic device, which includes a processor, a memory, and a computer program stored on the memory and executable on the processor, wherein the computer program, when executed by the processor, implements the steps of the above voice recognition method, or the steps of the above living body detection method, or the steps of the above model training method.
In an eighth aspect, an embodiment of the invention further provides a computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the steps of the above voice recognition method, or the steps of the above living body detection method, or the steps of the above model training method.
In the embodiments of the invention, end-to-end voice recognition is achieved by extracting spectral features of the voice signal to be recognized, inputting the spectral features into the stacked convolutional neural network for processing, inputting the resulting feature information into the recurrent neural network, inputting the feature information output by the recurrent neural network into the sequence-to-sequence network for encoding and decoding, and outputting the character sequence corresponding to the voice signal to obtain the recognition result.
Drawings
To illustrate the technical solutions of the embodiments of the invention more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the invention, and those skilled in the art can derive other drawings from them without inventive effort.
FIG. 1 is a flow chart of a speech recognition method provided by an embodiment of the present invention;
FIG. 2 is a schematic diagram of a target network provided by an embodiment of the invention;
FIG. 3 is a schematic diagram of a speech recognition module provided by an embodiment of the present invention;
FIG. 4 is a flowchart of a method for detecting a living body according to an embodiment of the present invention;
FIG. 5 is a flow chart of a model training method provided by an embodiment of the invention;
FIG. 6 is a block diagram of a speech recognition apparatus according to an embodiment of the present invention;
FIG. 7 is a structural view of a living body detecting apparatus provided in an embodiment of the present invention;
FIG. 8 is a block diagram of a model training apparatus according to an embodiment of the present invention;
FIG. 9 is a block diagram of a voice recognition apparatus according to still another embodiment of the present invention;
FIG. 10 is a structural view of a living body detecting apparatus according to still another embodiment of the present invention;
FIG. 11 is a block diagram of a model training apparatus according to still another embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The embodiment of the invention provides a voice recognition method. Referring to fig. 1, fig. 1 is a flowchart of a speech recognition method according to an embodiment of the present invention, as shown in fig. 1, including the following steps:
Step 101: extract spectral features of the voice signal to be recognized.
In this embodiment, the voice signal to be recognized may be any voice signal; for example, it may be a voice signal corresponding to preset characters. The preset characters may be characters of a preset type, such as digits or letters; alternatively, a preset character set may include S characters, where S is less than or equal to a preset value. The preset value may be set according to actual requirements but should not be too large; for example, it may be 100, 50, or 30.
A voice signal corresponding to preset characters may be understood as a recorded voice signal of a user reading a character sequence composed of preset characters. For example, where the preset characters are digits, it may be a recorded voice signal of the user reading a digit sequence.
The spectral feature may be a Mel-frequency feature, such as a Mel spectrum, Mel-frequency cepstral coefficients (MFCC), or log Mel-frequency energies (LMFE).
Step 102: input the spectral features into a stacked convolutional neural network for processing, input the resulting feature information into a recurrent neural network for processing, input the feature information output by the recurrent neural network into a sequence-to-sequence network for encoding and decoding, and output the character sequence corresponding to the voice signal to obtain the recognition result.
The stacked convolutional neural network, the recurrent neural network, and the sequence-to-sequence network form the voice recognition model of this embodiment. The model is obtained by training a target network on voice samples corresponding to preset characters, and the recurrent neural network comprises a bidirectional gated recurrent unit network or a long short-term memory network.
In this embodiment, the target network may include a stacked convolutional neural network, a recurrent neural network, and a sequence-to-sequence network, connected in sequence.
Optionally, the stacked convolutional neural network enhances the frequency-domain features of the input spectral features; the recurrent neural network enhances the temporal feature information of the features processed by the stacked convolutional neural network; and the sequence-to-sequence network encodes and decodes the features processed by the recurrent neural network and outputs a character sequence.
The stacked convolutional neural network may be understood as a convolutional neural network comprising a plurality of stacked or serially connected convolutional layers, used to extract frequency-domain feature information from the input spectral features. For example, the stacked convolutional neural network may include, but is not limited to, an Inception network, a residual network (ResNet), a densely connected convolutional network (DenseNet), or a custom convolutional neural network.
Optionally, the output of each convolutional layer of the stacked convolutional neural network is connected to a pooling layer. In this embodiment, the pooling layer reduces only the frequency dimension and leaves the time dimension unchanged: the frequency-domain feature information extracted by the convolutional layers is reduced in dimension while the temporal feature information is preserved intact. For example, the stacked convolutional neural network may be a custom network with four convolutional layers, the output of each connected to a pooling layer. The stacked convolutional neural network thus strengthens both the extraction of frequency-domain feature information during voice recognition and the subsequent recurrent neural network's extraction of temporal feature information.
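To make the frequency-only pooling concrete, here is a minimal PyTorch sketch (the framework, kernel size, and tensor layout are illustrative assumptions; the patent does not prescribe them):

```python
import torch
import torch.nn as nn

# Pool only along the frequency axis (dim 2); the time axis (dim 3) is untouched.
pool = nn.MaxPool2d(kernel_size=(2, 1))

x = torch.randn(1, 64, 40, 100)  # (batch, channels, frequency=40, time=100)
y = pool(x)
print(y.shape)                   # torch.Size([1, 64, 20, 100]): frequency halved, time kept
```

Repeating such a pooling after each convolutional layer reduces the frequency dimension step by step while leaving the temporal feature information intact for the recurrent neural network.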
The recurrent neural network may include a bidirectional gated recurrent unit (GRU) network or a long short-term memory (LSTM) network, and extracts temporal feature information from the frequency-domain feature information output by the stacked convolutional neural network. Optionally, the number of layers of the recurrent neural network is in the range [1, 3]. The recurrent neural network thus strengthens the extraction of the temporal feature information of the voice signal during recognition.
The sequence-to-sequence (Seq2Seq) network may include an encoding layer and a decoding layer connected in sequence; it encodes and decodes the features output by the recurrent neural network and outputs the voice recognition result, i.e., a character sequence, enabling end-to-end conversion of a voice signal into text.
Optionally, the Seq2Seq network may further include an attention layer: the encoding results output by the encoding layer are processed by the attention layer before being input to the decoding layer. Based on an attention mechanism, the attention layer assigns different weights to the encoding results of each encoding time step, so that each decoding time step can focus on the feature information most relevant to the character currently being recognized, thereby improving the accuracy of the recognition result.
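A minimal sketch of such an attention layer in PyTorch (an assumption for illustration; the patent does not specify the scoring function, so simple dot-product scoring is used here):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DotAttention(nn.Module):
    """Weight the encoder outputs for one decoding time step."""
    def forward(self, dec_state, enc_outputs):
        # dec_state: (B, H) decoder hidden state; enc_outputs: (B, T, H)
        scores = torch.bmm(enc_outputs, dec_state.unsqueeze(2)).squeeze(2)  # (B, T)
        weights = F.softmax(scores, dim=1)         # one weight per encoding time step
        context = torch.bmm(weights.unsqueeze(1), enc_outputs).squeeze(1)   # (B, H)
        return context, weights
```

Each decoding step receives a context vector built from the weighted encoder outputs, so it attends to the frames most relevant to the character currently being decoded.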
In summary, in the target network for voice recognition provided in this embodiment, the stacked convolutional neural network strengthens the extraction of frequency-domain feature information, the recurrent neural network strengthens the extraction of temporal feature information, and the Seq2Seq network achieves end-to-end conversion of the voice signal to text, so a voice recognition model trained from this target network has high recognition accuracy and supports end-to-end voice recognition. Moreover, because the voice recognition model of this embodiment is trained on voice samples corresponding to preset characters, compared with a traditional voice recognition model covering all characters it has a simpler structure, a smaller size, and easier deployment; it is targeted at the preset characters and achieves higher recognition accuracy for them.
For example, as shown in fig. 2, the target network includes a stacked convolutional neural network, a two-layer bidirectional GRU network, and a Seq2Seq network connected in sequence, where the input of the stacked convolutional neural network receives the spectral features and the output of the Seq2Seq network produces the voice recognition result, i.e., the character sequence.
A voice sample corresponding to preset characters may be understood as a voice sample corresponding to a preset type of character (such as digits or letters); for example, a recorded voice signal of a user reading a sequence of such characters may be used as a voice sample. It may also be understood as a voice sample corresponding to characters in a preset character set; for example, a recorded voice signal of a user reading a character sequence composed of characters from the preset character set may be used as a voice sample.
In practice, a large number of voice signals of character sequences composed of preset characters being read aloud can be collected as voice samples to train the target network, yielding a voice recognition model for the preset characters. Compared with prior-art voice recognition systems covering all character types, the voice recognition model of this embodiment has a simpler network structure, recognizes the preset characters faster, and achieves higher recognition accuracy.
In the voice recognition method provided by this embodiment, the spectral features of the voice signal to be recognized are extracted and input into the voice recognition model to recognize the character sequence corresponding to the voice signal. Because the voice recognition model is obtained by training the target network on voice samples corresponding to preset characters, and the target network comprises a stacked convolutional neural network, a recurrent neural network, and a sequence-to-sequence network, end-to-end voice recognition is achieved, recognition is fast, and the accuracy of voice recognition of the preset characters is improved.
Alternatively, the number of layers of the recurrent neural network may be 2.
In this embodiment, temporal feature information of the voice signal to be recognized may be extracted with two layers of bidirectional GRUs or two layers of LSTMs. This ensures that relatively rich temporal feature information is extracted while maintaining a good convergence rate during model training, improving the accuracy of the trained voice recognition model.
Optionally, the stacked convolutional neural network is a residual network.
In the embodiment, richer frequency domain characteristic information can be extracted based on ResNet, so that the accuracy of the recognition result of the voice recognition model can be improved.
Optionally, the spectral feature is a log Mel-frequency energy (LMFE) feature.
In this embodiment, extraction of the LMFE features may include the following steps:
Step a: sequentially apply pre-emphasis, framing, and windowing to the input voice signal.
In this step, the frame length for framing and the window length for windowing may be set according to actual requirements; for example, the frame length may be 20 ms and the window length 512 samples.
Step b: apply a fast Fourier transform (FFT) to the windowed signal.
Step c: take the magnitude or squared magnitude of the FFT output.
Step d: apply Mel filtering to the result of step c.
Step e: take the logarithm of the Mel-filtered signal to obtain the LMFE features.
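A sketch of steps a through e in Python with NumPy (the Mel filterbank is borrowed from librosa for brevity; the 0.97 pre-emphasis coefficient, the Hamming window, and the 16 kHz sampling rate are assumptions, not values fixed by the patent):

```python
import numpy as np
import librosa

def lmfe(signal, sr=16000, frame_len=0.02, frame_step=0.01, n_fft=512, n_mels=40):
    # Step a: pre-emphasis, framing, windowing (signal assumed at least one frame long)
    emphasized = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    flen, fstep = int(frame_len * sr), int(frame_step * sr)
    n_frames = 1 + (len(emphasized) - flen) // fstep
    idx = np.arange(flen)[None, :] + fstep * np.arange(n_frames)[:, None]
    frames = emphasized[idx] * np.hamming(flen)
    # Steps b and c: FFT, then squared magnitude
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2       # (n_frames, n_fft // 2 + 1)
    # Step d: Mel filtering
    mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)
    mel_energies = power @ mel_fb.T                       # (n_frames, n_mels)
    # Step e: logarithm (with a small floor to avoid log(0))
    return np.log(mel_energies + 1e-10)
```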
Optionally, this embodiment may extract the LMFE features quickly by calling the speechpy module, which provides functions for extracting voice features such as MFCC features, Mel-frequency energy (MFE) features, and LMFE features.
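For example (a sketch assuming the speechpy API, a 16 kHz mono WAV file, and parameter names matching speechpy's documented feature functions):

```python
import numpy as np
import speechpy
from scipy.io import wavfile

sr, signal = wavfile.read("digits.wav")       # hypothetical input file
signal = signal.astype(np.float64)

# 20 ms frames, 40 Mel filters, 512-point FFT, matching the 40 x N x 1 input below
features = speechpy.feature.lmfe(signal, sampling_frequency=sr,
                                 frame_length=0.020, frame_stride=0.010,
                                 num_filters=40, fft_length=512)
print(features.shape)                          # (N, 40), later laid out as 40 x N x 1
```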
In this embodiment, voice recognition is performed on the extracted LMFE features of the voice signal; because LMFE features contain rich voice feature information, the accuracy of voice recognition can be improved.
The speech recognition method provided by the embodiment of the invention is illustrated below with reference to fig. 3:
referring to fig. 3, the speech signal to be recognized is input to a preprocessing module, which is configured to extract LMFE features of the speech signal to be recognized, for example, a feature matrix of 40 × N × 1, where 40 denotes the number of filters when extracting the LMFE features, N denotes a time length, and is related to a step size, a window length, and a time length of the speech signal when extracting the LMFE features, and 1 denotes the number of channels.
After the LMFE features are obtained, the LMFE features are input to the stacked convolutional neural network to extract frequency domain features, for example, a feature matrix of 40 × N × 1 is input to the stacked convolutional neural network, and a feature matrix of 5 × N × 256 is output. The frequency domain features output by the stacked convolutional neural network are input to a resizing layer (i.e., Reshape layer) to adjust the dimensions of the input frequency domain features so that the adjusted features meet the two-layer bidirectional GRU input requirements, for example, a feature matrix of 5 × N × 256 is input to the resizing layer, and a feature matrix of N × 1280 is output.
Further, the feature output by the resizing layer is input to the dual-layer bidirectional GRU network to extract the temporal feature, wherein the dimension of the extracted temporal feature is related to the number of units (i.e., num _ unit) of the dual-layer bidirectional GRU network, for example, a feature matrix of N × 1280 is input to the dual-layer bidirectional GRU network, and if the number of units is 256, a feature matrix of N × 512 is output. The time characteristics output by the two-layer bidirectional GRU network are input into a Seq2Seq network to output a character sequence, for example, a characteristic matrix of N512 is input into the Seq2Seq network to output the character sequence.
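The shape walkthrough above can be reproduced with a small PyTorch sketch (a sketch under assumptions: the patent does not fix the framework, kernel sizes, or channel counts; three frequency-only poolings of 2 are chosen here so that 40 Mel bins reduce to 5):

```python
import torch
import torch.nn as nn

class SpecEncoder(nn.Module):
    """Stacked CNN + two-layer bidirectional GRU front end."""
    def __init__(self, channels=256, hidden=256):
        super().__init__()
        def block(c_in, c_out):
            return nn.Sequential(
                nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),
                nn.ReLU(),
                nn.MaxPool2d(kernel_size=(2, 1)),  # pool frequency only, keep time
            )
        self.cnn = nn.Sequential(block(1, 64), block(64, 128), block(128, channels))
        # 40 Mel bins pooled by 2 three times -> 5; 5 * 256 = 1280 features per step
        self.rnn = nn.GRU(input_size=5 * channels, hidden_size=hidden,
                          num_layers=2, batch_first=True, bidirectional=True)

    def forward(self, x):                        # x: (B, 1, 40, T) ~ 40 x N x 1
        f = self.cnn(x)                          # (B, 256, 5, T)  ~ 5 x N x 256
        f = f.permute(0, 3, 1, 2).flatten(2)     # (B, T, 1280)    ~ N x 1280
        out, _ = self.rnn(f)                     # (B, T, 512)     ~ N x 512
        return out                               # fed to the Seq2Seq decoder

x = torch.randn(2, 1, 40, 100)
print(SpecEncoder()(x).shape)                    # torch.Size([2, 100, 512])
```

The final (B, T, 512) tensor corresponds to the N × 512 matrix fed into the Seq2Seq network.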
The embodiment of the invention also provides a living body detection method. Referring to fig. 4, fig. 4 is a flowchart of a living body detecting method according to an embodiment of the present invention, as shown in fig. 4, including the following steps:
Step 401: collect a voice signal of an object to be detected reading a target character sequence and a video signal containing the object's lips, wherein all characters in the target character sequence are preset characters.
In this embodiment, the object to be detected may be any user. All characters in the target character sequence are preset characters, where the preset characters may be characters of a preset type, such as digits or letters; alternatively, a preset character set may include S characters, where S is less than or equal to a preset value. The preset value may be set according to actual requirements but should not be too large; for example, it may be 100, 50, or 30.
Optionally, the target character sequence may be a preset character sequence, or may be a randomly generated character sequence. For example, in the case where a living body test is required, a randomly generated number sequence may be displayed, and a voice signal and a video signal including lips may be acquired during the user's reading of the number sequence.
Step 402: recognize the voice signal using the above voice recognition method to obtain a first character sequence corresponding to the voice signal.
In this step, the speech signal may be recognized based on the speech recognition method provided in any of the above embodiments, so as to obtain a first character sequence corresponding to the speech signal. The relevant content of the voice recognition method can be referred to the foregoing discussion, and is not described herein again.
Step 403: perform lip language recognition on the video signal to obtain a second character sequence corresponding to the video signal.
For example, lip language recognition may be performed on the acquired video signal based on a pre-trained lip language recognition model, so as to obtain a second character sequence corresponding to the video signal. It should be noted that, in this embodiment, a specific implementation manner of lip language recognition is not limited.
Step 404: judge whether the object is a living body according to the first character sequence and the second character sequence.
In this embodiment, the first character sequence and the second character sequence may be compared: if their similarity is greater than a threshold, the object is determined to be a living body, and otherwise it is determined not to be a living body. Alternatively, whether the object is a living body may be determined from the first character sequence, the second character sequence, and the target character sequence; for example, when the similarity between the first character sequence and the second character sequence is greater than the threshold, the determination may be based on the comparison between the first character sequence and the target character sequence.
In practical applications, lip-language-based living body detection may be assisted by voice recognition. When performing living body detection based on lip language, selecting preset characters (such as digits) as the content to be read keeps the task simple and suitable for users of all educational backgrounds. Moreover, because the collected voice signal is recognized with the preset-character voice recognition method of any of the above embodiments, the machine learning network used for voice recognition has a simple structure, is easy to integrate into a lip-language-based living body detection module, recognizes quickly, and produces accurate results, which in turn improves the speed and accuracy of living body detection.
Optionally, the determining whether the object is a living body according to the first character sequence and the second character sequence may include:
calculating the similarity of the first character sequence and the second character sequence;
determining that the object is not a living body if the similarity is less than a threshold;
and judging whether the object is a living body according to the comparison result of the first character sequence and the target character sequence under the condition that the similarity is greater than or equal to a threshold value.
In this embodiment, the threshold may be set reasonably according to actual requirements, for example, 90%, 95%, and the like.
Specifically, when the similarity between the first character sequence and the second character sequence is smaller than the threshold, it can be directly determined that the object has failed living body detection. When the similarity is greater than or equal to the threshold, to prevent a user from cheating with a pre-recorded video, the first character sequence can be further compared with the target character sequence, and whether the object is a living body is judged from that comparison result.
For example, the object may be determined to be a living body when the first character sequence and the target character sequence are identical, and otherwise determined not to be a living body; or the object may be determined to be a living body when the number of differing characters between the first character sequence and the target character sequence is smaller than a preset value, and otherwise determined not to be a living body.
It should be noted that, because the first character sequence is recognized by the preset-character voice recognition method of any of the above embodiments, its accuracy is high. Therefore, when the similarity between the first character sequence and the second character sequence is greater than or equal to the threshold, judging whether the object is a living body from the comparison between the first character sequence and the target character sequence reduces failures of living body detection caused by insufficiently accurate recognition of the preset characters.
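A sketch of this decision logic in Python (the patent does not fix the similarity measure or the thresholds; normalized Levenshtein similarity and a character-difference count are illustrative choices):

```python
def similarity(a: str, b: str) -> float:
    """Normalized edit-distance similarity between two character sequences."""
    m, n = len(a), len(b)
    if max(m, n) == 0:
        return 1.0
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cur[j] = min(prev[j] + 1,                            # deletion
                         cur[j - 1] + 1,                         # insertion
                         prev[j - 1] + (a[i - 1] != b[j - 1]))   # substitution
        prev = cur
    return 1.0 - prev[n] / max(m, n)

def is_live(first_seq: str, second_seq: str, target_seq: str,
            threshold: float = 0.9, max_diff: int = 1) -> bool:
    # Step 1: compare the speech result with the lip-reading result.
    if similarity(first_seq, second_seq) < threshold:
        return False
    # Step 2: compare the speech result with the prompted target sequence.
    diffs = sum(c1 != c2 for c1, c2 in zip(first_seq, target_seq)) \
            + abs(len(first_seq) - len(target_seq))
    return diffs < max_diff

print(is_live("382916", "382916", "382916"))     # True
```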
An embodiment of the invention further provides a model training method; the voice recognition model of any of the above embodiments may be a model trained by this method. Referring to fig. 5, fig. 5 is a flowchart of a model training method according to an embodiment of the present invention; as shown in fig. 5, the method includes the following steps:
Step 501: obtain N voice samples, wherein each voice sample corresponds to preset characters and N is a positive integer.
In this embodiment, a voice sample corresponding to preset characters may be understood as a voice sample corresponding to a preset type of character (such as digits or letters); for example, a recorded voice signal of a user reading a sequence of such characters may be collected as a voice sample. It may also be understood as a voice sample corresponding to characters in a preset character set; for example, a recorded voice signal of a user reading a character sequence composed of characters from the preset character set may be collected as a voice sample.
Step 502: extract spectral features of each of the N voice samples.
In this embodiment, the spectral feature may be a Mel-frequency feature, such as a Mel spectrum, MFCC, or LMFE.
Step 503: train a target network on the spectral features of the N voice samples to obtain a voice recognition model.
The target network comprises a stacked convolutional neural network, a recurrent neural network, and a sequence-to-sequence network; the feature information output by the stacked convolutional neural network is input into the recurrent neural network, and the feature information output by the recurrent neural network is input into the sequence-to-sequence network; and the recurrent neural network comprises a bidirectional gated recurrent unit network or a long short-term memory network.
In this embodiment, the target network may include a stacked convolutional neural network, a recurrent neural network, and a sequence-to-sequence network, connected in sequence.
Optionally, the stacked convolutional neural network is configured to enhance the frequency-domain features of the input spectral features; the recurrent neural network is configured to enhance the temporal feature information of the features processed by the stacked convolutional neural network; and the sequence-to-sequence network is configured to encode and decode the features processed by the recurrent neural network and output a character sequence.
The stacked convolutional neural network may be understood as a convolutional neural network comprising a plurality of stacked or serially connected convolutional layers, used to extract frequency-domain feature information from the input spectral features. For example, the stacked convolutional neural network may include, but is not limited to, an Inception network, ResNet, DenseNet, or a custom convolutional neural network.
Optionally, the output of each convolutional layer of the stacked convolutional neural network is connected to a pooling layer. In this embodiment, the pooling layer reduces only the frequency dimension and leaves the time dimension unchanged: the frequency-domain feature information extracted by the convolutional layers is reduced in dimension while the temporal feature information is preserved intact. For example, the stacked convolutional neural network may be a custom network with four convolutional layers, the output of each connected to a pooling layer. The stacked convolutional neural network thus strengthens both the extraction of frequency-domain feature information during voice recognition and the subsequent recurrent neural network's extraction of temporal feature information.
The recurrent neural network may include a bidirectional GRU network or an LSTM network, and extracts temporal feature information from the frequency-domain feature information output by the stacked convolutional neural network. Optionally, the number of layers of the recurrent neural network is in the range [1, 3]. The recurrent neural network thus strengthens the extraction of the temporal feature information of the voice signal during recognition.
The Seq2Seq network may include an encoding layer and a decoding layer connected in sequence; it outputs the voice recognition result, i.e., a character sequence, based on the temporal feature information output by the recurrent neural network, enabling end-to-end conversion of a voice signal into text.
Optionally, the Seq2Seq network may further include an attention layer: the encoding results output by the encoding layer are processed by the attention layer before being input to the decoding layer. Based on an attention mechanism, the attention layer assigns different weights to the encoding results of each encoding time step, so that each decoding time step can focus on the feature information most relevant to the character currently being recognized, thereby improving the accuracy of the recognition result.
In summary, in the target network for voice recognition provided in this embodiment, the stacked convolutional neural network strengthens the extraction of frequency-domain feature information, the recurrent neural network strengthens the extraction of temporal feature information, and the Seq2Seq network achieves end-to-end conversion of the voice signal to text, so a voice recognition model trained from this target network has high recognition accuracy and supports end-to-end voice recognition.
For example, as shown in fig. 2, the target network includes a stacked convolutional neural network, a two-layer bidirectional GRU network, and a Seq2Seq network connected in sequence, where the input of the stacked convolutional neural network receives the spectral features and the output of the Seq2Seq network produces the voice recognition result, i.e., the character sequence.
The model training method provided by this embodiment trains the target network on the spectral features of the N voice samples, yielding a voice recognition model targeted at the preset characters; the network structure is simpler, recognition is faster, and the recognition results are more accurate.
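A minimal teacher-forced training step for such a target network (a sketch reusing the SpecEncoder and DotAttention classes sketched earlier; the vocabulary size, token ids, optimizer, and decoder-state initialization are assumptions, not details fixed by the patent):

```python
import torch
import torch.nn as nn

VOCAB, H, PAD = 13, 512, 0     # e.g. ten digits plus PAD/SOS/EOS for digit recognition

encoder = SpecEncoder()        # sketched earlier
embed = nn.Embedding(VOCAB, H, padding_idx=PAD)
decoder = nn.GRUCell(2 * H, H) # input: embedded previous token + attention context
attn = DotAttention()          # sketched earlier
proj = nn.Linear(H, VOCAB)
params = (list(encoder.parameters()) + list(embed.parameters())
          + list(decoder.parameters()) + list(proj.parameters()))
opt = torch.optim.Adam(params, lr=1e-3)
loss_fn = nn.CrossEntropyLoss(ignore_index=PAD)

def train_step(spec, tokens):
    """spec: (B, 1, 40, T) spectral features; tokens: (B, L) with SOS first."""
    enc = encoder(spec)                  # (B, T, 512)
    state = enc.mean(dim=1)              # crude decoder-state initialization
    loss = 0.0
    for t in range(tokens.size(1) - 1):  # teacher forcing over the label sequence
        ctx, _ = attn(state, enc)
        state = decoder(torch.cat([embed(tokens[:, t]), ctx], dim=1), state)
        loss = loss + loss_fn(proj(state), tokens[:, t + 1])
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```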
Alternatively, the number of layers of the recurrent neural network may be 2.
In this embodiment, temporal feature information of the voice signal to be recognized may be extracted with two layers of bidirectional GRUs or two layers of LSTMs. This ensures that relatively rich temporal feature information is extracted while maintaining a good convergence rate during model training, improving the accuracy of the trained voice recognition model.
Optionally, the stacked convolutional neural network is a residual network.
In the embodiment, richer frequency domain characteristic information can be extracted based on ResNet, so that the accuracy of the recognition result of the voice recognition model can be improved.
Optionally, M speech samples in the N speech samples are speech samples to which noise is added, and M is a positive integer smaller than or equal to N.
In this embodiment, noise may be added to some or all of the N voice samples, so that the voice recognition model trained on them has stronger noise robustness and generalization ability, giving a more stable recognition effect in deployment.
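One simple way to build the M noisy samples (a sketch; the patent does not specify the noise type or level, so additive white Gaussian noise at a chosen signal-to-noise ratio is assumed):

```python
import numpy as np

def add_noise(signal, snr_db=20.0, rng=None):
    """Add white Gaussian noise to a voice sample at the given SNR in dB."""
    rng = rng or np.random.default_rng()
    sig_power = np.mean(signal ** 2)
    noise_power = sig_power / (10 ** (snr_db / 10))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=signal.shape)
    return signal + noise
```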
Referring to fig. 6, fig. 6 is a block diagram of a speech recognition apparatus according to an embodiment of the present invention. As shown in fig. 6, the speech recognition apparatus 600 includes:
an extraction module 601, configured to extract spectral features of a voice signal to be recognized;
a recognition module 602, configured to input the spectral features into a stacked convolutional neural network for processing, input the resulting feature information into a recurrent neural network, input the feature information output by the recurrent neural network into a sequence-to-sequence network for encoding and decoding, and output a character sequence corresponding to the voice signal to obtain a recognition result;
wherein the recurrent neural network comprises a bidirectional gated recurrent unit network or a long short-term memory network.
The stacked convolutional neural network, the recurrent neural network, and the sequence-to-sequence network form the voice recognition model of this embodiment, and the voice recognition model is obtained by training a target network on voice samples corresponding to preset characters.
Optionally, the stacked convolutional neural network is configured to enhance the frequency-domain features of the input spectral features; the recurrent neural network is configured to enhance the temporal feature information of the features whose frequency-domain features have been enhanced; and the sequence-to-sequence network is configured to output a character sequence based on the features whose temporal feature information has been enhanced.
Optionally, the number of layers of the recurrent neural network is 2.
Optionally, the stacked convolutional neural network is a residual network.
Optionally, the spectral feature is a log Mel-frequency energy (LMFE) feature.
The speech recognition apparatus 600 provided in the embodiment of the present invention can implement each process in the speech recognition method embodiments, and is not described here again to avoid repetition.
The voice recognition apparatus 600 of this embodiment includes an extraction module 601, configured to extract spectral features of a voice signal to be recognized, and a recognition module 602, configured to input the spectral features into the voice recognition model to recognize the character sequence corresponding to the voice signal. Because the voice recognition model is obtained by training a target network on voice samples corresponding to preset characters, and the target network comprises a stacked convolutional neural network, a recurrent neural network, and a sequence-to-sequence network, end-to-end voice recognition is achieved, recognition is fast, and the accuracy of voice recognition of the preset characters is improved.
Referring to fig. 7, fig. 7 is a structural diagram of a living body detecting apparatus according to an embodiment of the present invention. As shown in fig. 7, the living body detecting apparatus 700 includes:
the acquisition module 701 is used for acquiring a voice signal of a target character sequence read by an object to be detected and a video signal containing lips, wherein characters in the target character sequence are preset characters;
a first recognition module 702, configured to recognize the voice signal by using the voice recognition method described above, so as to obtain a first character sequence corresponding to the voice signal;
a second identifying module 703, configured to perform lip language identification on the video signal to obtain a second character sequence corresponding to the video signal;
a judging module 704, configured to judge whether the object is a living body according to the first character sequence and the second character sequence.
Optionally, the determining module is specifically configured to:
calculating the similarity of the first character sequence and the second character sequence;
determining that the object is not a living body if the similarity is less than a threshold;
and judging whether the object is a living body according to the comparison result of the first character sequence and the target character sequence under the condition that the similarity is greater than or equal to a threshold value.
The living body detection apparatus 700 provided in the embodiment of the present invention can implement each process in the above living body detection method embodiments, and the details are not repeated here to avoid repetition.
The living body detection apparatus 700 of the embodiment of the invention includes an acquisition module 701, configured to collect a voice signal of an object to be detected reading a target character sequence and a video signal containing the object's lips, wherein all characters in the target character sequence are preset characters; a first recognition module 702, configured to recognize the voice signal using the above voice recognition method to obtain a first character sequence corresponding to the voice signal; a second recognition module 703, configured to perform lip language recognition on the video signal to obtain a second character sequence corresponding to the video signal; and a judging module 704, configured to judge whether the object is a living body according to the first character sequence and the second character sequence. The speed and accuracy of living body detection can thereby be improved.
Referring to fig. 8, fig. 8 is a structural diagram of a model training apparatus according to an embodiment of the present invention. As shown in fig. 8, the model training apparatus 800 includes:
an obtaining module 801, configured to obtain N voice samples, wherein each voice sample corresponds to preset characters and N is a positive integer;
an extraction module 802, configured to extract spectral features of each of the N voice samples;
a training module 803, configured to train a target network on the spectral features of the N voice samples to obtain a voice recognition model;
wherein the target network comprises a stacked convolutional neural network, a recurrent neural network, and a sequence-to-sequence network; the feature information output by the stacked convolutional neural network is input into the recurrent neural network, and the feature information output by the recurrent neural network is input into the sequence-to-sequence network; and the recurrent neural network comprises a bidirectional gated recurrent unit network or a long short-term memory network.
Optionally, the stacked convolutional neural network is configured to enhance the frequency-domain features of the input spectral features; the recurrent neural network is configured to enhance the temporal feature information of the features whose frequency-domain features have been enhanced; and the sequence-to-sequence network is configured to output a character sequence based on the features whose temporal feature information has been enhanced.
Optionally, M speech samples in the N speech samples are speech samples to which noise is added, and M is a positive integer smaller than or equal to N.
The model training device 800 provided in the embodiment of the present invention can implement each process in the above-described model training method embodiments, and is not described here again to avoid repetition.
The model training apparatus 800 of the embodiment of the invention includes an obtaining module 801, configured to obtain N voice samples, where N is a positive integer; an extraction module 802, configured to extract spectral features of each of the N voice samples; and a training module 803, configured to train the target network on the spectral features of the N voice samples to obtain a voice recognition model. Because the target network is trained on the spectral features of the N voice samples, a voice recognition model targeted at the preset characters is obtained; the network structure is simpler, recognition is faster, and the recognition results are more accurate.
Referring to fig. 9, fig. 9 is a block diagram of a voice recognition apparatus according to still another embodiment of the present invention. As shown in fig. 9, the voice recognition apparatus 900 includes a processor 901, a memory 902, and a computer program stored on the memory 902 and executable on the processor; the components of the voice recognition apparatus 900 are coupled together by a bus interface 903. The computer program, when executed by the processor 901, implements the following steps:
extracting spectral features of a voice signal to be recognized;
inputting the spectral features into a stacked convolutional neural network for processing, inputting the resulting feature information into a recurrent neural network, inputting the feature information output by the recurrent neural network into a sequence-to-sequence network for encoding and decoding, and outputting a character sequence corresponding to the voice signal to obtain a recognition result;
wherein the recurrent neural network comprises a bidirectional gated recurrent unit network or a long short-term memory network.
Optionally, the stacked convolutional neural network is configured to enhance the frequency-domain features of the input spectral features; the recurrent neural network is configured to enhance the temporal feature information of the features whose frequency-domain features have been enhanced; and the sequence-to-sequence network is configured to output a character sequence based on the features whose temporal feature information has been enhanced.
Optionally, the number of layers of the recurrent neural network is 2.
Optionally, the stacked convolutional neural network is a residual network.
Optionally, the spectral feature is a log Mel-frequency energy (LMFE) feature.
Referring to fig. 10, fig. 10 is a structural view of a living body detecting apparatus according to still another embodiment of the present invention, and as shown in fig. 10, the living body detecting apparatus 1000 includes: a processor 1001, a memory 1002 and a computer program stored on the memory 1002 and operable on the processor, the various components in the data transmission device 1000 being coupled together by a bus interface 1003, the computer program, when executed by the processor 1001, performing the steps of:
collecting a voice signal of an object to be detected reading a target character sequence and a video signal containing the object's lips, wherein the characters in the target character sequence are preset characters;
recognizing the voice signal by using the voice recognition method to obtain a first character sequence corresponding to the voice signal;
performing lip language identification on the video signal to obtain a second character sequence corresponding to the video signal;
and judging whether the object is a living body according to the first character sequence and the second character sequence.
Optionally, the computer program, when executed by the processor 1001, further performs the following steps:
calculating the similarity of the first character sequence and the second character sequence;
determining that the object is not a living body if the similarity is less than a threshold;
and judging whether the object is a living body according to a comparison result between the first character sequence and the target character sequence when the similarity is greater than or equal to the threshold.
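A compact, non-limiting sketch of this two-stage decision follows. The normalised edit-distance similarity, the 0.8 threshold, and the exact-match comparison against the target sequence are assumptions for illustration; the disclosure requires only some similarity measure, a threshold, and a comparison result.

def similarity(a: str, b: str) -> float:
    # Normalised Levenshtein similarity in [0, 1].
    m, n = len(a), len(b)
    d = [[i + j if i * j == 0 else 0 for j in range(n + 1)] for i in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1,
                          d[i - 1][j - 1] + (a[i - 1] != b[j - 1]))
    return 1.0 - d[m][n] / max(m, n, 1)

def is_live(first_seq: str, second_seq: str, target_seq: str,
            threshold: float = 0.8) -> bool:
    # Stage 1: the audio transcript and the lip-reading transcript must agree.
    if similarity(first_seq, second_seq) < threshold:
        return False
    # Stage 2: compare against the prompted target sequence
    # (an exact match is assumed here; a second threshold could be used).
    return first_seq == target_seq

print(is_live("3517", "3517", "3517"))  # True
print(is_live("3517", "9902", "3517"))  # False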
Referring to fig. 11, fig. 11 is a block diagram of a model training apparatus according to still another embodiment of the present invention. As shown in fig. 11, the model training apparatus 1100 includes: a processor 1101, a memory 1102, and a computer program stored on the memory 1102 and executable on the processor, where the various components in the model training apparatus 1100 are coupled together by a bus interface 1103, and the computer program performs the following steps when executed by the processor 1101:
acquiring N voice samples, wherein the voice samples are voice samples corresponding to preset characters, and N is a positive integer;
respectively extracting the sound spectrum characteristics of each voice sample in the N voice samples;
training a target network according to the sound spectrum characteristics of the N voice samples to obtain a voice recognition model;
wherein the target network comprises a stacked convolutional neural network, a recurrent neural network, and a sequence-to-sequence network; the feature information output by the stacked convolutional neural network is input into the recurrent neural network, and the feature information output by the recurrent neural network is input into the sequence-to-sequence network; and the recurrent neural network comprises a bidirectional gated recurrent unit network or a long short-term memory network.
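A brief, non-limiting sketch of this training step, reusing the SpeechRecognizer sketch given earlier, could look as follows; teacher forcing, cross-entropy loss, and the Adam optimiser are assumptions, since the disclosure does not specify the training procedure.

import torch
import torch.nn as nn

model = SpeechRecognizer()                 # from the earlier sketch
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# Dummy batch: sound spectrum features and target token sequences.
feats = torch.randn(4, 100, 80)            # (batch, frames, n_mels)
targets = torch.randint(0, 12, (4, 6))     # (batch, sequence length)

for _ in range(3):                         # a few illustrative steps
    logits = model(feats, targets[:, :-1])             # predict the next token
    loss = loss_fn(logits.reshape(-1, 12), targets[:, 1:].reshape(-1))
    opt.zero_grad()
    loss.backward()
    opt.step()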
Optionally, the stacked convolutional neural network is configured to enhance the frequency-domain features in the input sound spectrum features; the recurrent neural network is configured to enhance the temporal feature information in the sound spectrum features whose frequency-domain features have been enhanced; and the sequence-to-sequence network is configured to encode and decode the sound spectrum features whose temporal feature information has been enhanced and to output the character sequence.
Optionally, M speech samples among the N speech samples are speech samples to which noise has been added, where M is a positive integer less than or equal to N.
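One simple way to produce such noisy samples is to mix noise into M of the N clean waveforms at a chosen signal-to-noise ratio, as in the non-limiting sketch below; white Gaussian noise and a 15 dB SNR are assumptions, and recorded environmental noise could be mixed in the same way.

import numpy as np

def add_noise(clean, snr_db=15.0, rng=None):
    # Mix white Gaussian noise into a waveform at the requested SNR.
    rng = rng or np.random.default_rng()
    noise = rng.standard_normal(clean.shape)
    clean_pow = np.mean(clean ** 2) + 1e-12
    noise_pow = np.mean(noise ** 2)
    scale = np.sqrt(clean_pow / (noise_pow * 10 ** (snr_db / 10)))
    return clean + scale * noise

samples = [np.random.randn(16000) for _ in range(8)]   # N = 8 dummy clips
m = 4                                                  # M noisy samples, M <= N
augmented = [add_noise(s) if i < m else s for i, s in enumerate(samples)]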
An embodiment of the present invention further provides an electronic device, which includes a processor, a memory, and a computer program stored in the memory and executable on the processor. When executed by the processor, the computer program implements each process of the foregoing speech recognition method embodiment, the foregoing living body detection method embodiment, or the foregoing model training method embodiment, and can achieve the same technical effects; the details are not repeated here to avoid repetition.
An embodiment of the present invention further provides a computer-readable storage medium storing a computer program. When executed by a processor, the computer program implements each process of the foregoing speech recognition method embodiment, the foregoing living body detection method embodiment, or the foregoing model training method embodiment, and can achieve the same technical effects; the details are not repeated here to avoid repetition. The computer-readable storage medium may be a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, so that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such a process, method, article, or apparatus. Without further limitation, an element preceded by the phrase "comprising a" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
Through the above description of the embodiments, those skilled in the art will clearly understand that the methods of the above embodiments can be implemented by software plus a necessary general-purpose hardware platform, and certainly also by hardware, although in many cases the former is the better implementation. Based on this understanding, the technical solutions of the present invention may be embodied in the form of a software product that is stored in a storage medium (such as ROM/RAM, a magnetic disk, or an optical disk) and includes instructions for causing a terminal (such as a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the methods of the embodiments of the present invention.
While the present invention has been described with reference to the embodiments shown in the drawings, the present invention is not limited to the embodiments, which are illustrative and not restrictive, and it will be apparent to those skilled in the art that various changes and modifications can be made therein without departing from the spirit and scope of the invention as defined in the appended claims.

Claims (10)

1. A speech recognition method, comprising:
extracting the sound spectrum feature of the voice signal to be recognized;
inputting the sound spectrum feature into a stacked convolutional neural network for processing, inputting the feature information output by the stacked convolutional neural network into a recurrent neural network, inputting the feature information output by the recurrent neural network into a sequence-to-sequence network for encoding and decoding, and outputting a character sequence corresponding to the voice signal to obtain a recognition result;
wherein the recurrent neural network comprises a bidirectional gated recurrent unit network or a long short-term memory network.
2. The method according to claim 1, wherein inputting the sound spectrum feature into the stacked convolutional neural network for processing, inputting the feature information output by the stacked convolutional neural network into the recurrent neural network, inputting the feature information output by the recurrent neural network into the sequence-to-sequence network for encoding and decoding, and outputting the character sequence corresponding to the voice signal to obtain the recognition result specifically comprises:
the stacked convolutional neural network performs enhancement processing on the frequency-domain features in the input sound spectrum features;
the recurrent neural network performs enhancement processing on the temporal feature information in the sound spectrum features processed by the stacked convolutional neural network;
and the sequence-to-sequence network encodes and decodes the sound spectrum features processed by the recurrent neural network and outputs the character sequence.
3. A living body detection method, comprising:
collecting a voice signal of an object to be detected reading a target character sequence and a video signal containing the object's lips, wherein the characters in the target character sequence are preset characters;
recognizing the voice signal by using the voice recognition method of any one of claims 1 to 2 to obtain a first character sequence corresponding to the voice signal;
performing lip language identification on the video signal to obtain a second character sequence corresponding to the video signal;
and judging whether the object is a living body according to the first character sequence and the second character sequence.
4. The method according to claim 3, wherein the determining whether the object is a living body according to the first character sequence and the second character sequence comprises:
calculating the similarity of the first character sequence and the second character sequence;
determining that the object is not a living body if the similarity is less than a threshold;
and judging whether the object is a living body according to a comparison result between the first character sequence and the target character sequence when the similarity is greater than or equal to the threshold.
5. A method of model training, comprising:
acquiring N voice samples, wherein the voice samples are voice samples corresponding to preset characters, and N is a positive integer;
respectively extracting the sound spectrum characteristics of each voice sample in the N voice samples;
training a target network according to the sound spectrum characteristics of the N voice samples to obtain a voice recognition model;
wherein the target network comprises a stacked convolutional neural network, a recurrent neural network, and a sequence-to-sequence network; the feature information output by the stacked convolutional neural network is input into the recurrent neural network, and the feature information output by the recurrent neural network is input into the sequence-to-sequence network; and the recurrent neural network comprises a bidirectional gated recurrent unit network or a long short-term memory network.
6. The method of claim 5, wherein the stacked convolutional neural network is configured to enhance the frequency-domain features in the input sound spectrum features; the recurrent neural network is configured to enhance the temporal feature information in the sound spectrum features processed by the stacked convolutional neural network; and the sequence-to-sequence network is configured to encode and decode the sound spectrum features processed by the recurrent neural network and to output a character sequence.
7. A speech recognition apparatus, comprising:
the extraction module is used for extracting the sound spectrum feature of the voice signal to be recognized;
the recognition module is used for inputting the sound spectrum feature into a stacked convolutional neural network for processing, inputting the feature information output by the stacked convolutional neural network into a recurrent neural network, inputting the feature information output by the recurrent neural network into a sequence-to-sequence network for encoding and decoding, and outputting a character sequence corresponding to the voice signal to obtain a recognition result;
wherein the recurrent neural network comprises a bidirectional gated recurrent unit network or a long short-term memory network.
8. A living body detection device, comprising:
an acquisition module, configured to collect a voice signal of an object to be detected reading a target character sequence and a video signal containing the object's lips, wherein the characters in the target character sequence are preset characters;
a first recognition module, configured to recognize the voice signal by using the voice recognition method according to any one of claims 1 to 2, so as to obtain a first character sequence corresponding to the voice signal;
the second identification module is used for carrying out lip language identification on the video signal to obtain a second character sequence corresponding to the video signal;
and the judging module is used for judging whether the object is a living body according to the first character sequence and the second character sequence.
9. An electronic device, comprising a processor, a memory and a computer program stored on the memory and being executable on the processor, the computer program, when executed by the processor, implementing the steps of a speech recognition method as claimed in any one of claims 1 to 2, or implementing the steps of a liveness detection method as claimed in any one of claims 3 to 4, or implementing the steps of a model training method as claimed in any one of claims 5 to 6.
10. A computer-readable storage medium, characterized in that a computer program is stored thereon, which computer program, when being executed by a processor, carries out the steps of the speech recognition method according to any one of claims 1 to 2, or the living body detection method according to any one of claims 3 to 4, or the model training method according to any one of claims 5 to 6.
CN202010493390.6A 2020-06-01 2020-06-01 Voice recognition method, living body detection method, model training method and device Pending CN111667835A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010493390.6A CN111667835A (en) 2020-06-01 2020-06-01 Voice recognition method, living body detection method, model training method and device


Publications (1)

Publication Number Publication Date
CN111667835A true CN111667835A (en) 2020-09-15

Family

ID=72385689

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010493390.6A Pending CN111667835A (en) 2020-06-01 2020-06-01 Voice recognition method, living body detection method, model training method and device

Country Status (1)

Country Link
CN (1) CN111667835A (en)


Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190371334A1 * 2014-11-26 2019-12-05 Panasonic Intellectual Property Corporation of America Method and apparatus for recognizing speech by lip reading
CN105426723A (en) * 2015-11-20 2016-03-23 北京得意音通技术有限责任公司 Voiceprint identification, face identification and synchronous in-vivo detection-based identity authentication method and system
CN107404381A (en) * 2016-05-19 2017-11-28 阿里巴巴集团控股有限公司 A kind of identity identifying method and device
CN106710591A (en) * 2016-12-13 2017-05-24 云南电网有限责任公司电力科学研究院 Voice customer service system for power terminal
CN110476206A (en) * 2017-03-29 2019-11-19 谷歌有限责任公司 End-to-end Text To Speech conversion
CN108682421A (en) * 2018-04-09 2018-10-19 平安科技(深圳)有限公司 A kind of audio recognition method, terminal device and computer readable storage medium
US20200126537A1 (en) * 2018-07-13 2020-04-23 Google Llc End-to-End Streaming Keyword Spotting
CN109584881A (en) * 2018-11-29 2019-04-05 平安科技(深圳)有限公司 Number identification method, device and terminal device based on speech processes
CN110032924A (en) * 2019-02-21 2019-07-19 百度在线网络技术(北京)有限公司 Recognition of face biopsy method, terminal device, storage medium and electronic equipment
CN110459208A (en) * 2019-09-09 2019-11-15 极限元(杭州)智能科技股份有限公司 A kind of sequence of knowledge based migration is to sequential speech identification model training method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
TARA N. SAINATH, ORIOL VINYALS, ET AL.: "Convolutional, Long Short-Term Memory, Fully Connected Deep Neural Networks", 《2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)》 *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113257240A (en) * 2020-10-30 2021-08-13 国网天津市电力公司 End-to-end voice recognition method based on countermeasure training
CN112420042A (en) * 2020-11-19 2021-02-26 国网北京市电力公司 Control method and device of power system
CN114664292A (en) * 2020-12-22 2022-06-24 马上消费金融股份有限公司 Model training method, model training device, speech recognition method, speech recognition device, speech recognition equipment and readable storage medium
CN114664292B (en) * 2020-12-22 2023-08-01 马上消费金融股份有限公司 Model training method, speech recognition method, device, equipment and readable storage medium
CN113158776A (en) * 2021-03-08 2021-07-23 国网河北省电力有限公司 Invoice text recognition method and device based on coding and decoding structure
CN113158776B (en) * 2021-03-08 2022-11-11 国网河北省电力有限公司 Invoice text recognition method and device based on coding and decoding structure
CN113409775A (en) * 2021-06-25 2021-09-17 展讯通信(上海)有限公司 Keyword recognition method and device, storage medium and computer equipment
CN113542180A (en) * 2021-06-30 2021-10-22 北京频谱视觉科技有限公司 Frequency domain identification method of radio signal
CN113436609A (en) * 2021-07-06 2021-09-24 南京硅语智能科技有限公司 Voice conversion model and training method thereof, voice conversion method and system
CN114399005A (en) * 2022-03-10 2022-04-26 深圳市声扬科技有限公司 Training method, device, equipment and storage medium of living body detection model

Similar Documents

Publication Publication Date Title
CN111667835A (en) Voice recognition method, living body detection method, model training method and device
CN110827801B (en) Automatic voice recognition method and system based on artificial intelligence
CN110457432B (en) Interview scoring method, interview scoring device, interview scoring equipment and interview scoring storage medium
CN111145786A (en) Speech emotion recognition method and device, server and computer readable storage medium
CN110880329B (en) Audio identification method and equipment and storage medium
CN111028845A (en) Multi-audio recognition method, device, equipment and readable storage medium
CN109801638B (en) Voice verification method, device, computer equipment and storage medium
CN110970036B (en) Voiceprint recognition method and device, computer storage medium and electronic equipment
CN113488058A (en) Voiceprint recognition method based on short voice
CN112669820B (en) Examination cheating recognition method and device based on voice recognition and computer equipment
CN113823293B (en) Speaker recognition method and system based on voice enhancement
WO2021007856A1 (en) Identity verification method, terminal device, and storage medium
CN111986675A (en) Voice conversation method, device and computer readable storage medium
CN112507311A (en) High-security identity verification method based on multi-mode feature fusion
CN112509568A (en) Voice awakening method and device
CN110570870A (en) Text-independent voiceprint recognition method, device and equipment
CN111696580A (en) Voice detection method and device, electronic equipment and storage medium
CN110136726A (en) A kind of estimation method, device, system and the storage medium of voice gender
CN115171731A (en) Emotion category determination method, device and equipment and readable storage medium
CN111798846A (en) Voice command word recognition method and device, conference terminal and conference terminal system
CN109273012B (en) Identity authentication method based on speaker recognition and digital voice recognition
CN115512692B (en) Voice recognition method, device, equipment and storage medium
JP2022534003A (en) Speech processing method, speech processing device and human-computer interaction system
CN115985320A (en) Intelligent device control method and device, electronic device and storage medium
CN115331703A (en) Song voice detection method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination