CN111341307A

CN111341307A - Voice recognition method and device, electronic equipment and storage medium

Info

Publication number: CN111341307A
Application number: CN202010174196.1A
Authority: CN
Inventors: 张菁芸; 王少鸣; 郭润增
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Tencent Technology Shenzhen Co Ltd
Priority date: 2020-03-13
Filing date: 2020-03-13
Publication date: 2020-06-26

Abstract

The application discloses a voice recognition method, a voice recognition device, electronic equipment and a storage medium, and belongs to the technical field of computers. At least one voice frame of the voice to be recognized is obtained and input into a voice recognition model, which performs weighted transformation based on a residual structure on the at least one voice frame and outputs at least one prediction probability. Because the residual structure allows the corresponding context information to be introduced directly and rapidly into the weighted transformation of each voice frame, deeper voice features can be extracted and the prediction probability output by the voice recognition model is more accurate. The voice keywords contained in the voice to be recognized are then determined based on the at least one prediction probability, so that keyword recognition based on the prediction probability is also more accurate.

Description

Voice recognition method and device, electronic equipment and storage medium

Technical Field

The present application relates to the field of computer technologies, and in particular, to a speech recognition method and apparatus, an electronic device, and a storage medium.

Background

With the development of computer technology, users can conveniently perform voice ordering, voice shopping and other operations through terminals, and voice keyword recognition is a core problem in voice interaction technology. Among current speech keyword recognition systems, recognition methods based on the LSTM (Long Short-Term Memory) model perform comparatively well: through the interaction of an input gate, an output gate and a forget gate, the LSTM model addresses the vanishing gradient problem inherent in the traditional RNN (Recurrent Neural Network). However, the performance and accuracy of LSTM-based speech keyword recognition still leave room for improvement.

Disclosure of Invention

The embodiment of the application provides a voice recognition method, a voice recognition device, electronic equipment and a storage medium, and can improve the performance and accuracy of a voice keyword recognition process. The technical scheme is as follows:

in one aspect, a speech recognition method is provided, and the method includes:

acquiring at least one voice frame of a voice to be recognized;

inputting the at least one voice frame into a voice recognition model, performing weighted transformation based on a residual structure on the at least one voice frame through the voice recognition model, and outputting at least one prediction probability, wherein each prediction probability represents the probability that the voice to be recognized contains one voice keyword;

and determining the voice keywords contained in the voice to be recognized based on the at least one prediction probability.

In one aspect, a speech recognition apparatus is provided, the apparatus comprising:

the acquisition module is used for acquiring at least one voice frame of the voice to be recognized;

the weighted transformation module is used for inputting the at least one voice frame into a voice recognition model, performing weighted transformation based on a residual structure on the at least one voice frame through the voice recognition model, and outputting at least one prediction probability, wherein each prediction probability represents the probability that the voice to be recognized contains one voice keyword;

and the first determining module is used for determining the voice keywords contained in the voice to be recognized based on the at least one prediction probability.

In one possible implementation, the acquisition module is further configured to: in response to the voice keywords including a target keyword, acquire a target voice for voiceprint recognition;

the device also comprises a voiceprint recognition module used for carrying out voiceprint recognition on the target voice to obtain a voiceprint recognition result of the target voice, wherein the voiceprint recognition result is used for indicating whether the user to which the target voice belongs is the target user.

In one possible embodiment, the voiceprint recognition module comprises:

a second extraction unit, configured to input the target speech into a voiceprint recognition model, and perform feature extraction on the target speech through the voiceprint recognition model to obtain a voiceprint feature containing noise of the target speech;

the noise reduction unit is used for carrying out noise reduction processing on the noise-containing voiceprint feature of the target voice to obtain a pure voiceprint feature of the target voice;

and the determining unit is used for determining the voiceprint recognition result based on the similarity between the pure voiceprint characteristics and the voiceprint characteristics of the target user stored in the voiceprint library.

In one possible embodiment, the noise reduction unit is configured to:

inputting the noise-containing voiceprint feature into a deep neural network, performing nonlinear mapping on the noise-containing voiceprint feature through the deep neural network, and outputting the pure voiceprint feature.

In one aspect, an electronic device is provided that includes one or more processors and one or more memories having at least one program code stored therein, the at least one program code being loaded by the one or more processors and executed to implement the operations performed by the speech recognition method according to any of the possible implementations described above.

In one aspect, a storage medium is provided, in which at least one program code is stored, the at least one program code being loaded and executed by a processor to implement the operations performed by the speech recognition method according to any of the above possible implementations.

The beneficial effects brought by the technical scheme provided by the embodiment of the application at least comprise:

at least one voice frame of the voice to be recognized is obtained and input into a voice recognition model, and the voice recognition model performs weighted transformation based on a residual structure on the at least one voice frame and outputs at least one prediction probability. Because the residual structure can directly and quickly introduce the corresponding context information into the weighted transformation of each voice frame, deeper voice features can be extracted, so that the prediction probability output by the voice recognition model is more accurate. The voice keywords contained in the voice to be recognized are then determined based on the at least one prediction probability.

Drawings

In order to illustrate the technical solutions in the embodiments of the present application more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings in the following description show only some embodiments of the present application, and those skilled in the art can obtain other drawings from them without creative effort.

Fig. 1 is a schematic diagram of an implementation environment of a speech recognition method according to an embodiment of the present application;

Fig. 2 is an interaction flow diagram of a speech recognition method provided by an embodiment of the present application;

Fig. 3 is a flow chart of a speech recognition method provided by an embodiment of the present application;

Fig. 4 is a schematic diagram of an LSTM model provided in an embodiment of the present application;

Fig. 5 is a flowchart of a voiceprint recognition method provided by an embodiment of the present application;

Fig. 6 is a flowchart of a voice ordering system according to an embodiment of the present application;

Fig. 7 is a comparison of ROC curves provided by examples of the present application;

Fig. 8 is a schematic structural diagram of a speech recognition apparatus according to an embodiment of the present application;

Fig. 9 is a schematic structural diagram of an electronic device according to an embodiment of the present application.

Detailed Description

To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.

The terms "first," "second," and the like in this application are used to distinguish between identical or similar items that have substantially the same function. It should be understood that "first," "second," and "nth" imply no logical or temporal dependency and place no limitation on number or order of execution.

The term "at least one" in this application means one or more, and "a plurality" means two or more; for example, a plurality of first locations means two or more first locations.

Artificial Intelligence (AI) is a theory, method, technique and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use that knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making.

Artificial intelligence technology is a comprehensive discipline that covers a wide range of fields, including both hardware-level and software-level technologies. The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. Artificial intelligence software technology mainly comprises audio processing technology, computer vision technology, natural language processing technology, machine learning/deep learning and the like.

Audio processing technology (Speech Technology, also called speech processing technology) is regarded as one of the most promising human-computer interaction modes of the future. It specifically includes speech keyword recognition technology, voiceprint recognition technology, speech separation technology, Automatic Speech Recognition (ASR) technology, speech synthesis technology (Text To Speech, TTS), and the like.

With the development of the AI technology, research and application of the audio processing technology have been developed in a plurality of fields, such as common intelligent voice assistants, voice shopping systems, intelligent speakers, voice front-end processing on vehicle-mounted or television boxes, voice recognition products, voiceprint recognition products, and the like.

The embodiment of the application relates to the speech keyword recognition technology and the voiceprint recognition technology in the field of audio processing. The speech keyword recognition technology is used to recognize which speech keywords are contained in what a user says, and the voiceprint recognition technology is used to recognize whether a segment of speech was spoken by a target user, that is, whether the speaker is the user himself or herself.

Take a voice shopping system (for example, a voice ordering system) as an example. In a voice ordering scenario, a user can input the voice to be recognized, "I want to order fish-flavored shredded pork", to the voice shopping system through a terminal. After the voice keyword "fish-flavored shredded pork" is recognized through the voice keyword recognition technology, a voiceprint payment portal may be provided to the user's terminal. After the user confirms voiceprint payment, the terminal collects the user's target voice, which may be the user reciting specified content, and sends the target voice to the voice shopping system. The background of the voice shopping system performs voiceprint recognition on the target voice to determine whether it belongs to the user. If it does, the voiceprint recognition passes, the transfer system in the background settles payment for the dishes ordered by the user, and a payment result is returned to the user's terminal.

In the above process, speech keyword recognition is a central problem of the voice shopping system. Acoustic models based on HMM-GMM (Hidden Markov Model-Gaussian Mixture Model) have been widely applied to speech keyword recognition, but the GMM has limited descriptive capability. DNN (Deep Neural Network) models express information better than the HMM-GMM model and have therefore seen some development. However, the DNN model still has disadvantages: it lacks the ability to retain history information, whereas a speech signal is strongly context-dependent, so the accuracy of the DNN model in speech keyword recognition is not high. The LSTM (Long Short-Term Memory) model has therefore attracted attention: it has a long-term memory capability and, through the interaction of an input gate, an output gate and a forget gate, addresses the vanishing gradient problem inherent in the RNN (Recurrent Neural Network) model. Nevertheless, in pursuit of better voice shopping system performance, how to improve the speech recognition accuracy of the LSTM model remains a problem to be solved.

In view of this, the embodiment of the present application provides a speech recognition method based on what may be referred to as a novel improved residual LSTM model, in which the residual structure uses a shortcut path ("fast channel") in the spatial domain to separate the memory units in the time domain, so as to further improve the accuracy of the speech recognition process.

Fig. 1 is a schematic diagram of an implementation environment of a speech recognition method according to an embodiment of the present application. Referring to Fig. 1, the implementation environment includes a terminal 101 and a server 102, and the terminal 101 and the server 102 are both electronic devices.

The terminal 101 may be configured to collect a voice signal, and a collection component of the voice signal, such as a recording element like a microphone, may be installed on the terminal 101, or the terminal 101 may also directly download a segment of audio file, and decode the audio file to obtain the voice signal.

In some embodiments, a processing component for voice signals may be installed on the terminal 101, so that the terminal 101 can independently implement the voice recognition method provided by the embodiment of the present application. For example, the processing component may be a Digital Signal Processor (DSP), and the program code of the voice recognition model and the voiceprint recognition model provided by the embodiment of the present application may be run on the DSP to implement keyword recognition and voiceprint recognition on the voice signal.

In some embodiments, after the terminal 101 collects the speech to be recognized through the collection component, it may send the speech to be recognized to the server 102, and the server 102 performs speech recognition processing on it. For example, the program code of the speech recognition model provided in the embodiments of the present application is run on the server 102 to recognize the speech keywords included in the speech to be recognized. If the speech keywords hit certain target keywords, the terminal 101 collects a target speech for voiceprint recognition through the collection component and sends the target speech to the server 102. The server 102 runs the program code of the voiceprint recognition model provided in the embodiments of the present application to recognize whether the target speech belongs to a target user (a user associated with the terminal), thereby completing the voiceprint recognition task for the target speech. Finally, in some voiceprint payment scenarios, if the voiceprint recognition passes (that is, the target speech belongs to the target user), the server 102 may also call the transfer system to settle the bill, thereby completing the overall voice shopping process.

The terminal 101 and the server 102 may be connected through a wired network or a wireless network.

The server 102 may be configured to process voice signals, and the server 102 may include at least one of a server, a plurality of servers, a cloud computing platform, or a virtualization center. Alternatively, the server 102 may undertake primary computational tasks and the terminal 101 may undertake secondary computational tasks; or, the server 102 undertakes the secondary computing work, and the terminal 101 undertakes the primary computing work; alternatively, the terminal 101 and the server 102 perform cooperative computing by using a distributed computing architecture.

Optionally, the terminal 101 may refer generally to one of a plurality of terminals, and the device type of the terminal 101 includes but is not limited to at least one of: a smart phone, a smart speaker, a tablet computer, an e-book reader, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (Moving Picture Experts Group Audio Layer IV) player, a laptop or a desktop computer. The following embodiments take the case where the terminal is a smartphone as an example.

Those skilled in the art will appreciate that the number of terminals 101 described above may be greater or fewer. For example, the number of the terminals 101 may be only one, or the number of the terminals 101 may be several tens or hundreds, or more. The number and the device type of the terminals 101 are not limited in the embodiment of the present application.

Fig. 2 is an interaction flowchart of a speech recognition method according to an embodiment of the present application. Referring to Fig. 2, the embodiment is applied to an interaction process between a terminal and a server, and includes:

200. The terminal collects the voice to be recognized and sends the voice to be recognized to the server.

Wherein, the speech to be recognized comprises at least one speech frame.

An application program may be installed on the terminal. A user can trigger an audio acquisition instruction in the application program; in response, the terminal operating system calls a recording interface and drives the voice signal collection component (such as a microphone) to collect the voice to be recognized as an audio stream. Alternatively, the terminal may select a segment of audio from locally pre-stored audio as the voice to be recognized, or may download an audio file from the cloud and parse it to obtain the voice to be recognized.

In an exemplary voice ordering scenario, a user may install an application program of a voice ordering system on a terminal. After starting the voice ordering system, the user inputs the voice to be recognized, "I want a fish-flavored shredded pork", to the voice ordering system. After the voice ordering system collects the voice to be recognized, it sends the voice to a server in the form of a TCP (Transmission Control Protocol) message.

201. The server obtains at least one voice frame of the voice to be recognized.

In the above process, the server may receive a voice transmission packet carrying the voice to be recognized and parse the packet to obtain the voice to be recognized. The server may then perform VAD (Voice Activity Detection) processing on the voice to be recognized to remove silent periods, and perform framing processing to obtain the at least one voice frame.

In some embodiments, the server may use VAD to detect the portions of the voice to be recognized whose signal energy is lower than an energy threshold and determine those portions to be silent periods, which are then deleted from the voice to be recognized, completing a preliminary filtering. Optionally, the server may perform voice activity detection by using a dual-threshold method, a correlation coefficient method, an Empirical Mode Decomposition (EMD) method, a wavelet transform method, or another method; the embodiment of the present application does not limit which method is used.
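As a rough illustration of the energy-threshold idea described above, the following is a minimal sketch and not the method of the application. It assumes the signal is a plain list of samples, that silence is judged per fixed-length frame, and that framing happens before silent frames are dropped; the function names, frame length and threshold value are all hypothetical.

```python
def frame_signal(samples, frame_len, hop_len):
    """Split samples into frames of frame_len, advancing by hop_len.

    A trailing partial frame is dropped for simplicity.
    """
    last_start = len(samples) - frame_len
    return [samples[i:i + frame_len] for i in range(0, last_start + 1, hop_len)]


def frame_energy(frame):
    """Mean squared amplitude of one frame."""
    return sum(s * s for s in frame) / len(frame)


def remove_silence(frames, energy_threshold):
    """Keep only frames whose energy reaches the (assumed) threshold."""
    return [f for f in frames if frame_energy(f) >= energy_threshold]


# Toy signal: silence, a short burst of "speech", then silence again.
signal = [0.0] * 8 + [0.5, -0.5] * 4 + [0.0] * 8
frames = frame_signal(signal, frame_len=4, hop_len=4)
voiced_frames = remove_silence(frames, energy_threshold=0.01)
```

In practice the threshold would need to be tuned to the recording conditions, and the other VAD methods listed above may behave better in noise.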

202. The server inputs the at least one voice frame into a voice recognition model, performs weighted transformation based on a residual structure on the at least one voice frame through the voice recognition model, and outputs at least one prediction probability, wherein each prediction probability represents the probability that the voice to be recognized contains one voice keyword.

In the above process, the speech recognition model may be an LSTM model based on a residual structure, may also be a BLSTM (Bidirectional Long Short-Term Memory) model based on a residual structure, and may also be another acoustic model for performing speech keyword recognition, and the embodiment of the present application does not specifically limit the type of the speech recognition model.

In some embodiments, the speech recognition model is, for example, the LSTM model based on the residual structure (hereinafter referred to as the "LSTM model"). The LSTM model may include an input layer, a hidden layer and an output layer connected in series, that is, the output of each layer is used as the input of the next layer. Based on this model structure, Fig. 3 is a flowchart of a speech recognition method provided in this embodiment. As shown in Fig. 3, the server may obtain the above prediction probability by performing the following sub-steps 2021 to 2026:

2021. The server inputs the at least one speech frame into an input layer in the LSTM model, through which the frequency characteristics of the at least one speech frame are extracted.

In some embodiments, the server may perform Mel-frequency cepstrum analysis on the at least one speech frame in the input layer, and use the Mel-Frequency Cepstral Coefficients (MFCC) obtained by the analysis as the frequency characteristic of the at least one speech frame, so that the MFCC information reflects the different sensitivity of the human ear to sound waves of different frequencies.
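The application does not spell out how the MFCC is computed. The sketch below follows one common textbook pipeline (power spectrum, triangular mel filterbank, logarithm, DCT-II) using only the standard library. The filter count, coefficient count, sample rate and test tone are assumptions chosen for illustration, and windowing and pre-emphasis are omitted for brevity.

```python
import cmath
import math


def hz_to_mel(hz):
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def power_spectrum(frame):
    """Naive DFT power spectrum for bins 0..N/2 (slow, but dependency-free)."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        spectrum.append(abs(acc) ** 2 / n)
    return spectrum


def mel_filterbank(num_filters, n_fft, sample_rate):
    """Triangular filters spaced evenly on the mel scale."""
    low, high = hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0)
    mels = [low + (high - low) * i / (num_filters + 1) for i in range(num_filters + 2)]
    bins = [int(math.floor((n_fft + 1) * mel_to_hz(m) / sample_rate)) for m in mels]
    bank = []
    for j in range(1, num_filters + 1):
        left, center, right = bins[j - 1], bins[j], bins[j + 1]
        row = [0.0] * (n_fft // 2 + 1)
        for k in range(left, center):      # rising edge
            row[k] = (k - left) / (center - left)
        for k in range(center, right):     # falling edge
            row[k] = (right - k) / (right - center)
        bank.append(row)
    return bank


def mfcc(frame, sample_rate, num_filters=10, num_coeffs=5):
    """Return num_coeffs cepstral coefficients for one frame."""
    spec = power_spectrum(frame)
    bank = mel_filterbank(num_filters, len(frame), sample_rate)
    log_energies = [math.log(sum(w * p for w, p in zip(row, spec)) + 1e-10)
                    for row in bank]
    # DCT-II over the log filterbank energies.
    return [sum(e * math.cos(math.pi * c * (j + 0.5) / num_filters)
                for j, e in enumerate(log_energies))
            for c in range(num_coeffs)]


# Hypothetical input: a 64-sample frame of a 1 kHz tone sampled at 8 kHz.
SAMPLE_RATE = 8000
tone = [math.sin(2 * math.pi * 1000 * n / SAMPLE_RATE) for n in range(64)]
coeffs = mfcc(tone, SAMPLE_RATE)
```

A production system would more likely use an FFT and an established audio library; the naive DFT here only keeps the example self-contained.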

In some embodiments, the input layer may be further regarded as a feature extraction layer, or may be regarded as a feature extraction sub-network, and in the feature extraction sub-network, the server may perform convolution processing on the at least one speech frame to obtain the frequency feature of the at least one speech frame, so that the accuracy of the feature extraction process can be further improved through optimization of the feature extraction sub-network.

2022. The server inputs the frequency characteristics of the at least one voice frame into at least one memory unit of a hidden layer in the LSTM model respectively, and the at least one memory unit performs weighted transformation based on a residual structure on the frequency characteristics of the at least one voice frame to output a feature vector of the at least one voice frame.

In the above process, the hidden layer of the LSTM model includes at least one memory unit, and each memory unit corresponds to the frequency characteristic of one speech frame in the input layer. Inputting the frequency characteristic of the at least one speech frame into the at least one memory unit means that the frequency characteristic of each speech frame is input into its corresponding memory unit; for example, the frequency characteristic of the nth (n ≥ 1) speech frame is input into the nth memory unit.

In some embodiments, for any memory unit, the server performs weighted transformation on the frequency characteristic of the speech frame corresponding to that memory unit, in response to that frequency characteristic and the processing result of the previous memory unit, to obtain an intermediate vector of the speech frame. The server then fuses the intermediate vector with the processing result of the previous memory unit to obtain the feature vector of the speech frame. Performing this operation for each memory unit yields the feature vector of the at least one speech frame. Because the intermediate vector is fused with the processing result of the previous memory unit, the shortcut path of the residual structure in the spatial domain can separate the memory units in the time domain, which helps extract deeper feature vectors with better expressive capability.

2023. The server inputs the feature vector of the at least one speech frame into a projection layer, and determines whether to carry out iterative projection on the at least one speech frame through the projection layer.

The projection layer is used to reduce the dimension of the feature vector, that is, to project an original high-dimensional vector to a low-dimensional vector. Depending on whether iterative projection is performed, the projection layer can be an iterative projection layer or a non-iterative projection layer. With the iterative projection layer, the following step 2024 is executed and iterative projection continues on the original high-dimensional feature vector. With the non-iterative projection layer, the following step 2025 is executed and the low-dimensional vector obtained by projection is input into the output layer.

Fig. 4 is a schematic diagram of an LSTM model according to an embodiment of the present application. Referring to Fig. 4, in the LSTM model 400, after the frequency characteristic x_t of the speech frame at time t is input into the t-th memory unit (cell), a tanh function (an activation function) is applied to x_t and to r_{t-1}, the processing result of the previous memory unit from the iterative projection layer. The result passes through the input gate i_t, the forget gate f_t and the output gate o_t to obtain the intermediate vector c_t of the speech frame at time t. A residual structure 401 then performs weighted transformation on c_t and the processing result r_{t-1} of the previous memory unit to obtain the feature vector m_t of the speech frame at time t. The feature vector m_t is input into a projection layer 402, which determines whether to perform iterative projection. If iterative projection is performed, the feature vector r_t output by the iterative projection layer is input into the next memory unit; if not, the feature vector p_t output by the non-iterative projection layer is input into the next memory unit. The output of the t-th memory unit is finally determined to be h_t (h_t is equal to p_t or r_t), where t is a number greater than or equal to 0.

Because the LSTM model can separate the memory units in the time domain through a shortcut path in the spatial domain (namely the residual structure), it helps extract deeper feature vectors, so that the LSTM model has better system performance and speech recognition accuracy.
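The data flow described for Fig. 4 can be sketched as a single time step in plain Python. This is one possible reading, not the model of the application: the gate equations are the standard LSTM ones without biases, and the fusion in the residual structure is assumed to be an addition of the gated cell output and a linearly mapped r_{t-1}. The matrix names (W_i, W_f, W_o, W_g, W_s, W_p), the dimensions and the constant weights are all hypothetical.

```python
import math


def matvec(matrix, vec):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def residual_lstm_step(x_t, r_prev, c_prev, params):
    """One time step; returns (r_t, c_t, m_t)."""
    z = x_t + r_prev  # list concatenation: [x_t ; r_{t-1}]
    i_t = [sigmoid(a) for a in matvec(params["W_i"], z)]    # input gate
    f_t = [sigmoid(a) for a in matvec(params["W_f"], z)]    # forget gate
    o_t = [sigmoid(a) for a in matvec(params["W_o"], z)]    # output gate
    g_t = [math.tanh(a) for a in matvec(params["W_g"], z)]  # candidate values
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, g_t)]
    gated = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    # Assumed residual fusion: add a shortcut carrying r_{t-1}.
    shortcut = matvec(params["W_s"], r_prev)
    m_t = [a + b for a, b in zip(gated, shortcut)]
    # Projection layer: map m_t down to the lower recurrent dimension.
    r_t = matvec(params["W_p"], m_t)
    return r_t, c_t, m_t


def const_matrix(rows, cols, value=0.1):
    return [[value] * cols for _ in range(rows)]


INPUT_DIM, CELL_DIM, PROJ_DIM = 3, 4, 2  # toy sizes
params = {name: const_matrix(CELL_DIM, INPUT_DIM + PROJ_DIM)
          for name in ("W_i", "W_f", "W_o", "W_g")}
params["W_s"] = const_matrix(CELL_DIM, PROJ_DIM)
params["W_p"] = const_matrix(PROJ_DIM, CELL_DIM)

r, c = [0.0] * PROJ_DIM, [0.0] * CELL_DIM
for x in ([1.0, 0.5, -0.5], [0.2, 0.1, 0.0]):  # two toy frames
    r, c, m = residual_lstm_step(x, r, c, params)
```

The shortcut matrix W_s exists only because r_{t-1} and the cell output have different sizes in this toy setup; a real model would use learned weights and biases rather than constants.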

2024. In response to determining to perform iterative projection, the server re-inputs the feature vector of the at least one speech frame into the at least one memory unit to perform iterative weighted transformation.

In this process, the server effectively performs one dimensionality reduction mapping through the projection layer, mapping the high-dimensional feature vector to a low-dimensional feature vector. If iterative projection is determined, the projection layer acts as an iterative projection layer, and the output low-dimensional feature vector can be input directly into the next memory unit for iterative weighted transformation; reducing the dimension of the output feature vector in this way reduces the total number of parameters to be trained in the LSTM model. Conversely, if it is determined not to perform iterative projection, the projection layer acts as a non-iterative projection layer, which indicates that the dimensionality reduction mapping contributes little to reducing the model parameters. In that case the following step 2025 may be performed, and the feature vector output by the non-iterative projection layer is input directly into the output layer.

2025. The server inputs the feature vector of the at least one speech frame into an output layer of the LSTM model in response to determining not to perform the iterative projection.

In the above process, if it is determined that iterative projection is not performed, the original feature vector may be directly input to the output layer, or the feature vector obtained after projection may also be input to the output layer.

In steps 2023 to 2025, by introducing the projection layer, the server can perform dimension reduction on the feature vector output by the hidden layer, which reduces the number of parameters to be trained in the LSTM model and thereby greatly improves the training efficiency of the LSTM model.
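To make the parameter saving concrete, the following worked count compares the weights of one LSTM layer with and without a projection of the recurrent output. It is a simplified estimate that counts only the four gate matrices and the projection matrix, ignoring biases and any shortcut weights; the dimensions 40, 1024 and 256 are hypothetical and not taken from the application.

```python
def lstm_weight_count(input_dim, cell_dim, recurrent_dim=None):
    """Approximate weight count for one LSTM layer.

    Without projection the recurrent input has cell_dim entries.
    With projection it has recurrent_dim entries, plus one extra
    recurrent_dim x cell_dim projection matrix.
    """
    if recurrent_dim is None:
        return 4 * cell_dim * (input_dim + cell_dim)
    gate_weights = 4 * cell_dim * (input_dim + recurrent_dim)
    projection_weights = recurrent_dim * cell_dim
    return gate_weights + projection_weights


plain = lstm_weight_count(input_dim=40, cell_dim=1024)
projected = lstm_weight_count(input_dim=40, cell_dim=1024, recurrent_dim=256)
print(plain, projected)  # 4358144 1474560
```

Under these assumed sizes the projected layer has roughly a third of the weights of the plain layer, which is the kind of reduction the paragraph above refers to.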

In some embodiments, the server may skip steps 2023 to 2025, that is, no projection layer is introduced into the LSTM model, so as to simplify the training process of the speech recognition method.

2026. The server inputs the feature vector of the at least one speech frame into an output layer in the LSTM model, through which the feature vector of the at least one speech frame is mapped to the at least one prediction probability.

In this process, the output layer of the LSTM model includes at least one keyword tag. A phoneme sequence of the speech to be recognized is obtained by applying exponential normalization (softmax) to the feature vector of the at least one speech frame; the language model decodes the phoneme sequence into the corresponding text sequence; and the text sequence is then matched against the pre-stored speech keywords to obtain the prediction probability that the speech to be recognized includes each speech keyword.
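The exponential normalization (softmax) step alone can be sketched as follows. This is only an illustration of the normalization: the language-model decoding and keyword matching described above are not shown, and the input scores are made-up values.

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    """Exponential normalization: map raw scores to probabilities that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical raw output-layer scores for three labels.
probs = softmax([2.0, 1.0, 0.1])
print(probs)
```

Larger raw scores receive larger probabilities, and the ordering of the scores is preserved.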

Steps 2021-2026 above describe the process of obtaining the prediction probability, taking only the case where the speech recognition model is an LSTM model based on a residual structure as an example. In some embodiments, the LSTM model may include multiple hidden layers and may thus be referred to as a multi-layer LSTM model, for example a 3-layer LSTM model, in which the feature vectors output by each hidden layer are input into the next hidden layer rather than directly into the output layer, and only the feature vectors output by the last hidden layer are input into the output layer.

203. The server determines the voice keywords contained in the voice to be recognized based on the at least one prediction probability.

In some embodiments, for any prediction probability, if the prediction probability is greater than a probability threshold, the server may determine that the speech to be recognized includes the speech keyword corresponding to the prediction probability, otherwise, the server may determine that the speech to be recognized does not include the speech keyword corresponding to the prediction probability, where the probability threshold may be any value greater than or equal to 0 and less than or equal to 1, and a value of the probability threshold is not specifically limited in the embodiments of the present application.

In some embodiments, the server may also sort the speech keywords in descending order of their prediction probabilities and determine that the speech to be recognized includes the speech keywords ranked within the first target number of positions, where the target number may be any integer greater than or equal to 1, and the value of the target number is not specifically limited in the embodiments of the present application.
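The two selection strategies of step 203 (probability threshold and ranking) can be sketched as follows. The function names, the keyword strings, and the probability values are hypothetical and serve only as an illustration.

```python
from typing import Dict, List


def keywords_above_threshold(probs: Dict[str, float], threshold: float) -> List[str]:
    """Keep every keyword whose prediction probability exceeds the threshold."""
    return [kw for kw, p in probs.items() if p > threshold]


def top_n_keywords(probs: Dict[str, float], n: int) -> List[str]:
    """Sort keywords by descending probability and keep the first n."""
    ranked = sorted(probs, key=probs.get, reverse=True)
    return ranked[:n]


# Hypothetical prediction probabilities for three keywords.
predictions = {"noodles": 0.91, "dumplings": 0.40, "pay": 0.75}
print(keywords_above_threshold(predictions, threshold=0.5))  # ['noodles', 'pay']
print(top_n_keywords(predictions, n=1))                      # ['noodles']
```

The threshold variant may return any number of keywords (including none), while the ranking variant always returns a fixed number as long as enough keywords exist.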

204. In response to the voice keywords including a target keyword, the terminal collects a target voice for voiceprint recognition and sends the target voice to the server.

In the above process, if the voice keywords include a target keyword, the server may issue specified content for the target voice to the terminal. After the user obtains the specified content through the application program, the user may trigger a voiceprint recognition interface in the application program, which calls a recording component to record the user reciting the specified content, and the recording is sent to the server as the target voice. The specific processes of collecting the voice signal and sending the voice are similar to step 200 above and are not repeated here.

205. The server acquires a target voice for voiceprint recognition.

Step 205 is similar to step 201 and will not be described herein.

206. The server performs voiceprint recognition on the target voice to obtain a voiceprint recognition result of the target voice, where the voiceprint recognition result indicates whether the user to whom the target voice belongs is the target user.

In the foregoing process, the server may perform voiceprint recognition through a voiceprint recognition model, or alternatively through a template matching method, a nearest neighbor clustering method, or the like. The embodiment of the present application takes the voiceprint recognition model as an example. Fig. 5 is a flowchart of a voiceprint recognition method provided in an embodiment of the present application. Referring to fig. 5, the voiceprint recognition process may include the following sub-steps 2061-2063:

2061. The server inputs the target voice into the voiceprint recognition model and performs feature extraction on the target voice through the voiceprint recognition model to obtain the noisy voiceprint feature of the target voice.

In the above process, the noisy voiceprint feature may be a noisy identity vector (I-Vector), where an I-Vector is a compact vector that represents the voice characteristics of a speaker. In some embodiments, the noisy voiceprint feature may also be a noisy mean supervector; the embodiment of the present application does not specifically limit the type of the noisy voiceprint feature.

In some embodiments, the server may perform feature extraction on the target speech based on a total variability space model (also rendered as a global difference space model) to obtain the noisy I-Vector of the target speech, or the server may perform Joint Factor Analysis (JFA) on the target speech to obtain the noisy mean supervector of the target speech.

2062. The server performs noise reduction on the noisy voiceprint feature of the target voice to obtain a clean voiceprint feature of the target voice.

In some embodiments, the server may perform noise reduction processing through a deep neural network, and at this time, the server may input the noisy voiceprint feature into the deep neural network, perform nonlinear mapping on the noisy voiceprint feature through the deep neural network, and output a clean voiceprint feature.

In this process, a complex nonlinear functional relationship exists between the noisy voiceprint feature and the clean voiceprint feature. Performing noise reduction with a deep neural network exploits its strong fitting capacity: the network is trained to learn the nonlinear mapping from noisy voiceprint features to clean voiceprint features, so that, given a noisy voiceprint feature, an approximate representation of the clean voiceprint feature can be obtained, which improves the accuracy of the noise reduction process.

In some embodiments, the server may also perform noise reduction on the noisy voiceprint feature by means of nearest neighbor clustering, a convolutional neural network, a support vector machine, or the like; the embodiment of the present application does not specifically limit the noise reduction method.

When the voiceprint of clean speech is disturbed by background noise, the voiceprint feature of the speech changes accordingly, and this change significantly degrades the performance of a voiceprint recognition system in noisy environments. In step 2062, performing noise reduction on the noisy voiceprint feature enhances the voiceprint feature of the target voice before voiceprint recognition is performed, which helps improve both the accuracy of the voiceprint recognition process and the performance of the voiceprint recognition system.

In some embodiments, the server may not perform noise reduction on the noisy voiceprint feature, that is, the step 2062 is not performed, and calculates the similarity between the noisy voiceprint feature and the voiceprint feature of the target user based on the noisy voiceprint feature, so as to simplify the flow of the voiceprint recognition process.

It should be noted that, taking the case where the voiceprint feature is an I-Vector as an example, before performing noise reduction based on the deep neural network, the server may train the deep neural network on sample noisy I-Vectors and sample clean I-Vectors, with the noisy I-Vector as input and the clean I-Vector as output. In other words, during training, the I-Vector of a sample noisy speech is collected as the noisy I-Vector and the I-Vector of the sample clean speech is collected as the clean I-Vector; the sample noisy I-Vector serves as the input and the sample clean I-Vector serves as the label data. A deep neural network for I-Vector enhancement is trained in this way, which achieves the effect of I-Vector data enhancement and improves the stability and accuracy of the voiceprint recognition process.

2063. The server determines a voiceprint recognition result based on the similarity between the clean voiceprint features and the voiceprint features of the target user stored in the voiceprint library.

In the foregoing process, the server may calculate the similarity between the clean voiceprint feature and the voiceprint feature of the target user; for example, the similarity may be a cosine similarity, the reciprocal of a Euclidean distance, or the like, and the embodiment of the present application does not specifically limit the form of the similarity. Further, if the similarity is greater than the similarity threshold, the voiceprint recognition result may be determined to be that recognition passes; otherwise, the voiceprint recognition result may be determined to be that recognition fails.
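The similarity comparison of step 2063 can be sketched with cosine similarity, one of the similarity forms mentioned above. The function names, the example vectors, and the threshold of 0.8 are hypothetical; both feature vectors are assumed to be non-zero and of equal length.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def verify_speaker(clean_feature: Sequence[float],
                   enrolled_feature: Sequence[float],
                   threshold: float = 0.8) -> bool:
    """Recognition passes when the similarity exceeds the (assumed) threshold."""
    return cosine_similarity(clean_feature, enrolled_feature) > threshold


# Hypothetical low-dimensional voiceprint features for illustration.
print(verify_speaker([1.0, 2.0], [1.1, 1.9]))  # similar direction
print(verify_speaker([1.0, 0.0], [0.0, 1.0]))  # orthogonal vectors
```

Cosine similarity ignores vector magnitude, so it compares only the direction of the two voiceprint features; a distance-based similarity would also be sensitive to scale.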

Fig. 6 is a flowchart of a voice ordering system provided in an embodiment of the present application. Referring to fig. 6, which illustrates an exemplary scenario, in the voice ordering system 600, in response to the voice to be recognized input by a user including the name of any dish, the server may provide an ordering interface for that dish to the terminal where the user is located, and the terminal displays the ordering interface, which may include dish information and an ordering option. Optionally, the dish information may include at least one of a dish picture, a dish description, a dish price, or the merchant to which the dish belongs. In response to the user clicking the ordering option, the terminal may first prompt the user to log in to a personal account, which is associated with user information (in particular the voiceprint feature of the user). After logging in to the personal account, the user may confirm the order in the ordering interface, at which point the terminal may present at least one payment verification mode to the user, where the payment verification mode may include voiceprint recognition, password recognition, face recognition, fingerprint recognition, and the like. In response to the user triggering the voiceprint recognition mode, the terminal may complete the voiceprint recognition through interaction with the server. Specifically, the terminal submits the order information of the dish to the server, where the order information may include the dish name, the dish price, and the dish quantity, and the server issues specified content for voiceprint recognition to the terminal. After the user inputs a target voice reciting the specified content to the voice ordering system 600, the server performs voice confirmation (i.e., voiceprint recognition) on the target voice to confirm whether the speaker of the target voice is the same as the corresponding target user in the user information. If so, the server generates a deduction certificate for the order and requests a deduction from the bank system according to the certificate; after the deduction is completed, the user can query the payment result of the voiceprint payment at the terminal.

It should be noted that the above example takes only the voice ordering scenario as an illustration, which should not be construed as limiting the application scenarios of the embodiment of the present application. Optionally, the voice recognition and voiceprint recognition methods provided in the embodiment of the present application may be applied to any voice shopping scenario, such as voice ride-hailing, voice online shopping, and the like.

All the above optional technical solutions may be combined arbitrarily to form the optional embodiments of the present disclosure, and are not described herein again.

According to the method provided by the embodiment of the present application, at least one voice frame of the voice to be recognized is obtained and input into a voice recognition model, which performs weighted transformation based on a residual structure on the at least one voice frame and outputs at least one prediction probability. Because the residual structure can directly and quickly introduce the corresponding context information into the weighted transformation of each voice frame, deeper voice features can be extracted and the prediction probabilities output by the voice recognition model are more accurate. The voice keywords contained in the voice to be recognized are then determined based on the at least one prediction probability, so the recognition of voice keywords based on the prediction probabilities is also more accurate.

Next, taking a three-layer LSTM model based on a residual structure as an example, the results of a system performance evaluation are presented. Table 1 shows the accuracy with which three different speech recognition models predict 10 speech keywords (KeyWord, KW): an LSTM model, a BLSTM model, and the LSTM model based on a residual structure (referred to as the residual LSTM model for short) provided in the embodiment of the present application.

TABLE 1

Accuracy rate	LSTM	BLSTM	Residual LSTM
KW1	78.3%	82.6%	84.8%
KW2	77.4%	91.9%	79.6%
KW3	79.1%	92.1%	91.4%
KW4	91.3%	95.9%	96.7%
KW5	76.4%	96.4%	88.9%
KW6	79.8%	78.1%	85.2%
KW7	74.7%	81.1%	88.3%
KW8	86.9%	75.3%	89.1%
KW9	87.8%	85.5%	82.5%
KW10	89.5%	84.1%	79.7%

From the data in the above table, averaging the accuracy over the 10 keywords gives 82.1% for the LSTM model, 86.3% for the BLSTM model, and 86.6% for the residual LSTM model provided in the embodiment of the present application. Overall, the residual LSTM model therefore delivers the desired improvement in speech keyword recognition accuracy, and its accuracy on KW3 and KW4 exceeds 90%, indicating good speech recognition accuracy.
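The averages quoted above can be reproduced directly from the values in Table 1. The following sketch only restates that arithmetic; the variable names are illustrative.

```python
# Per-keyword accuracy (percent) for KW1..KW10, copied from Table 1.
accuracy = {
    "LSTM": [78.3, 77.4, 79.1, 91.3, 76.4, 79.8, 74.7, 86.9, 87.8, 89.5],
    "BLSTM": [82.6, 91.9, 92.1, 95.9, 96.4, 78.1, 81.1, 75.3, 85.5, 84.1],
    "Residual LSTM": [84.8, 79.6, 91.4, 96.7, 88.9, 85.2, 88.3, 89.1, 82.5, 79.7],
}

# Mean accuracy per model, rounded to one decimal place.
averages = {model: round(sum(vals) / len(vals), 1) for model, vals in accuracy.items()}
print(averages)
```

The per-keyword rows also show that the residual LSTM model is not the best model on every keyword (for example KW2 and KW5 favor the BLSTM model), so the stated advantage is an average over the 10 keywords.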

Further, please refer to table 2, where table 2 shows respective quantities of parameters to be trained of the LSTM model, the BLSTM model, and the residual LSTM model provided in the embodiment of the present application.

TABLE 2

Model	LSTM	BLSTM	Residual LSTM
Quantity of model parameters	24M	37M	34M

As can be seen from the above table, the number of parameters directly affects the training efficiency of the model. The residual LSTM model has more parameters than the LSTM model but fewer than the BLSTM model. Compared with the BLSTM model, the residual LSTM model therefore offers not only higher prediction accuracy but also higher training efficiency: because a projection layer is introduced into the residual LSTM model and the feature vector undergoes dimensionality-reduction mapping, the number of parameters to be trained is reduced and the training efficiency of the model is optimized.

Further, referring to fig. 7, fig. 7 shows the operating performance curves 700 of the LSTM model, the BLSTM model, and the residual LSTM model provided in the embodiment of the present application. The performance curve used is the ROC (Receiver Operating Characteristic) curve, in which the false alarm rate is the horizontal axis and the recall rate is the vertical axis; at the same false alarm rate, a higher recall rate indicates better model performance. As can be seen from fig. 7, the residual LSTM model performs better than both the LSTM model and the BLSTM model.
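How the points of such a curve are obtained can be sketched as follows: for each decision threshold, the false alarm rate and the recall rate are computed from scored examples. The scores, labels, and thresholds below are made-up illustration data and are unrelated to the measurements behind fig. 7; the function name `roc_points` is hypothetical.

```python
from typing import List, Tuple


def roc_points(scores: List[float], labels: List[int],
               thresholds: List[float]) -> List[Tuple[float, float]]:
    """Return (false_alarm_rate, recall) for each threshold.

    labels: 1 if the utterance truly contains the keyword, else 0.
    A keyword counts as detected when its score is >= the threshold.
    Assumes at least one positive and one negative example.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    points = []
    for t in thresholds:
        true_pos = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        false_pos = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((false_pos / negatives, true_pos / positives))
    return points


# Made-up scores and ground-truth labels for six utterances.
scores = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]
curve = roc_points(scores, labels, thresholds=[0.85, 0.5, 0.2])
print(curve)
```

Lowering the threshold moves along the curve toward higher recall at the cost of more false alarms, which is why models are compared at the same false alarm rate.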

Fig. 8 is a schematic structural diagram of a speech recognition apparatus according to an embodiment of the present application, please refer to fig. 8, where the apparatus includes:

an obtaining module 801, configured to obtain at least one speech frame of a speech to be recognized;

a weighted transformation module 802, configured to input the at least one speech frame into a speech recognition model, perform weighted transformation based on a residual structure on the at least one speech frame through the speech recognition model, and output at least one prediction probability, where each prediction probability indicates the probability that the speech to be recognized includes one speech keyword;

a first determining module 803, configured to determine a speech keyword included in the speech to be recognized based on the at least one prediction probability.

The device provided by the embodiment of the present application obtains at least one voice frame of the voice to be recognized, inputs the at least one voice frame into a voice recognition model, performs weighted transformation based on a residual structure on the at least one voice frame through the voice recognition model, and outputs at least one prediction probability.

In one possible embodiment, the speech recognition model is a long-short term memory network (LSTM) model based on a residual structure;

based on the apparatus components of fig. 8, the weighted transformation module 802 includes:

a first extraction unit, configured to input the at least one speech frame into an input layer in the LSTM model, and extract a frequency characteristic of the at least one speech frame through the input layer;

a weighted transformation unit, configured to input the frequency characteristics of the at least one speech frame into at least one memory unit of a hidden layer in the LSTM model, respectively, perform weighted transformation based on a residual structure on the frequency characteristics of the at least one speech frame through the at least one memory unit, and output a feature vector of the at least one speech frame;

a mapping unit, configured to input the feature vector of the at least one speech frame into an output layer in the LSTM model, and map the feature vector of the at least one speech frame to the at least one prediction probability through the output layer.

In one possible embodiment, the weighted transformation unit is configured to:

for any memory unit, in response to receiving the frequency characteristic of the speech frame corresponding to the memory unit and the processing result of the previous memory unit, perform weighted transformation on the frequency characteristic of the speech frame to obtain an intermediate vector of the speech frame, and fuse the intermediate vector of the speech frame with the processing result of the previous memory unit to obtain the feature vector of the speech frame.
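The fusion performed by a memory unit can be sketched in a heavily simplified form. The sketch below makes two assumptions that are not stated in the application: a plain tanh recurrent transformation stands in for the full set of LSTM gates, and the fusion is taken to be element-wise addition of the intermediate vector and the previous processing result. The names `matvec`, `residual_step`, `w_in`, and `w_rec` are hypothetical.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def residual_step(x: Vector, prev: Vector, w_in: Matrix, w_rec: Matrix) -> Vector:
    """One simplified memory-unit step with a residual (skip) connection."""
    # Weighted transformation of the current frame and the previous result
    # (assumption: a tanh cell replaces the LSTM gates for brevity).
    pre = [a + b for a, b in zip(matvec(w_in, x), matvec(w_rec, prev))]
    intermediate = [math.tanh(p) for p in pre]
    # Fusion (assumption: element-wise addition of the previous result).
    return [i + p for i, p in zip(intermediate, prev)]


# With all-zero weights the intermediate vector is zero, so the previous
# result passes through unchanged: the skip path carries context directly.
zeros = [[0.0, 0.0], [0.0, 0.0]]
out = residual_step(x=[1.0, 0.0], prev=[0.5, -0.5], w_in=zeros, w_rec=zeros)
print(out)  # [0.5, -0.5]
```

The zero-weight case shows why a residual structure introduces context quickly: even before the weights have learned anything useful, the previous processing result reaches the output unchanged.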

In a possible embodiment, based on the apparatus composition of fig. 8, the apparatus further comprises:

a second determining module, configured to input the feature vector of the at least one voice frame into a projection layer and determine, through the projection layer, whether to perform iterative projection on the at least one voice frame;

an iterative transformation module, configured to, in response to determining to perform iterative projection, input the feature vector of the at least one voice frame into the at least one memory unit again for iterative weighted transformation;

an input module, configured to input the feature vector of the at least one speech frame into the output layer in response to determining not to perform iterative projection.

In a possible implementation, the obtaining module 801 is further configured to: in response to the voice keywords including a target keyword, acquire a target voice for voiceprint recognition;

based on the apparatus composition of fig. 8, the apparatus further includes a voiceprint recognition module, configured to perform voiceprint recognition on the target voice, so as to obtain a voiceprint recognition result of the target voice, where the voiceprint recognition result is used to indicate whether a user to which the target voice belongs is a target user.

In a possible implementation, based on the apparatus composition of fig. 8, the voiceprint recognition module includes:

a second extraction unit, configured to input the target voice into the voiceprint recognition model and perform feature extraction on the target voice through the voiceprint recognition model to obtain the noisy voiceprint feature of the target voice;

In one possible embodiment, the noise reduction unit is configured to:

input the noisy voiceprint feature into a deep neural network, perform nonlinear mapping on the noisy voiceprint feature through the deep neural network, and output the clean voiceprint feature.

It should be noted that: in the speech recognition apparatus provided in the above embodiment, only the division of the functional modules is illustrated, and in practical applications, the functions may be distributed by different functional modules according to needs, that is, the internal structure of the electronic device may be divided into different functional modules to complete all or part of the functions described above. In addition, the speech recognition apparatus and the speech recognition method provided by the above embodiments belong to the same concept, and specific implementation processes thereof are described in the speech recognition method embodiments, and are not described herein again.

Fig. 9 is a schematic structural diagram of an electronic device provided in an embodiment of the present application, where the electronic device may be a terminal or a server. The electronic device 900 may be: a smart phone, a tablet computer, an MP3 player (Moving Picture Experts Group Audio Layer III), an MP4 player (Moving Picture Experts Group Audio Layer IV), a notebook computer, or a desktop computer. The electronic device 900 may also be referred to by other names such as user equipment, portable terminal, laptop terminal, or desktop terminal.

In general, the electronic device 900 includes: a processor 901 and a memory 902.

Processor 901 may include one or more processing cores, such as a 4-core processor, an 8-core processor, and so forth. The processor 901 may be implemented in at least one hardware form of a DSP (Digital Signal Processing), an FPGA (Field-Programmable Gate Array), and a PLA (Programmable Logic Array). The processor 901 may also include a main processor and a coprocessor, where the main processor is a processor for processing data in an awake state, and is also called a Central Processing Unit (CPU); a coprocessor is a low power processor for processing data in a standby state. In some embodiments, the processor 901 may be integrated with a GPU (Graphics Processing Unit), which is responsible for rendering and drawing the content required to be displayed on the display screen. In some embodiments, the processor 901 may further include an AI (Artificial Intelligence) processor for processing computing operations related to machine learning.

Memory 902 may include one or more computer-readable storage media, which may be non-transitory. The memory 902 may also include high-speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In some embodiments, a non-transitory computer readable storage medium in memory 902 is used to store at least one instruction for execution by processor 901 to implement the speech recognition methods provided by the various embodiments herein.

In some embodiments, the electronic device 900 may further optionally include: a peripheral interface 903 and at least one peripheral. The processor 901, memory 902, and peripheral interface 903 may be connected by buses or signal lines. Various peripheral devices may be connected to the peripheral interface 903 via a bus, signal line, or circuit board. Specifically, the peripheral device includes: at least one of a radio frequency circuit 904, a touch display screen 905, a camera assembly 906, an audio circuit 907, a positioning assembly 908, and a power supply 909.

The peripheral interface 903 may be used to connect at least one peripheral related to I/O (Input/Output) to the processor 901 and the memory 902. In some embodiments, the processor 901, memory 902, and peripheral interface 903 are integrated on the same chip or circuit board; in some other embodiments, any one or two of the processor 901, the memory 902 and the peripheral interface 903 may be implemented on a separate chip or circuit board, which is not limited by this embodiment.

The Radio Frequency circuit 904 is used for receiving and transmitting RF (Radio Frequency) signals, also called electromagnetic signals. The radio frequency circuitry 904 communicates with communication networks and other communication devices via electromagnetic signals. The radio frequency circuit 904 converts an electrical signal into an electromagnetic signal to transmit, or converts a received electromagnetic signal into an electrical signal. Optionally, the radio frequency circuit 904 comprises: an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a subscriber identity module card, and so forth. The radio frequency circuit 904 may communicate with other terminals via at least one wireless communication protocol. The wireless communication protocols include, but are not limited to: metropolitan area networks, various generation mobile communication networks (2G, 3G, 4G, and 5G), Wireless local area networks, and/or WiFi (Wireless Fidelity) networks. In some embodiments, the radio frequency circuit 904 may also include NFC (Near Field Communication) related circuits, which are not limited in this application.

The display screen 905 is used to display a UI (User Interface). The UI may include graphics, text, icons, video, and any combination thereof. When the display screen 905 is a touch display screen, the display screen 905 also has the ability to capture touch signals on or over the surface of the display screen 905. The touch signal may be input to the processor 901 as a control signal for processing. In this case, the display screen 905 may also be used to provide virtual buttons and/or a virtual keyboard, also referred to as soft buttons and/or a soft keyboard. In some embodiments, there may be one display screen 905, disposed on the front panel of the electronic device 900; in other embodiments, there may be at least two display screens 905, respectively disposed on different surfaces of the electronic device 900 or in a folding design; in still other embodiments, the display screen 905 may be a flexible display disposed on a curved surface or a folded surface of the electronic device 900. The display screen 905 may even be arranged in a non-rectangular irregular shape, i.e., a shaped screen. The display screen 905 can be made of materials such as LCD (Liquid Crystal Display) or OLED (Organic Light-Emitting Diode).

The camera assembly 906 is used to capture images or video. Optionally, the camera assembly 906 includes a front camera and a rear camera. Generally, the front camera is disposed on the front panel of the terminal, and the rear camera is disposed on the rear surface of the terminal. In some embodiments, there are at least two rear cameras, each being any one of a main camera, a depth-of-field camera, a wide-angle camera, and a telephoto camera, so that the main camera and the depth-of-field camera can be fused to realize a background blurring function, and the main camera and the wide-angle camera can be fused to realize panoramic shooting and VR (Virtual Reality) shooting functions or other fused shooting functions. In some embodiments, the camera assembly 906 may also include a flash. The flash may be a single-color-temperature flash or a dual-color-temperature flash. A dual-color-temperature flash is a combination of a warm-light flash and a cold-light flash and can be used for light compensation at different color temperatures.

Audio circuit 907 may include a microphone and a speaker. The microphone is used for collecting sound waves of a user and the environment, converting the sound waves into electric signals, and inputting the electric signals to the processor 901 for processing, or inputting the electric signals to the radio frequency circuit 904 for realizing voice communication. For stereo capture or noise reduction purposes, the microphones may be multiple and located at different locations of the electronic device 900. The microphone may also be an array microphone or an omni-directional pick-up microphone. The speaker is used to convert electrical signals from the processor 901 or the radio frequency circuit 904 into sound waves. The loudspeaker can be a traditional film loudspeaker or a piezoelectric ceramic loudspeaker. When the speaker is a piezoelectric ceramic speaker, the speaker can be used for purposes such as converting an electric signal into a sound wave audible to a human being, or converting an electric signal into a sound wave inaudible to a human being to measure a distance. In some embodiments, audio circuit 907 may also include a headphone jack.

The positioning component 908 is used to locate the current geographic location of the electronic device 900 to implement navigation or LBS (Location Based Service). The positioning component 908 may be a positioning component based on the GPS (Global Positioning System) of the United States, the BeiDou system of China, the GLONASS system of Russia, or the Galileo system of the European Union.

The power supply 909 is used to supply power to various components in the electronic device 900. The power source 909 may be alternating current, direct current, disposable or rechargeable. When power source 909 comprises a rechargeable battery, the rechargeable battery may support wired or wireless charging. The rechargeable battery may also be used to support fast charge technology.

In some embodiments, the electronic device 900 also includes one or more sensors 910. The one or more sensors 910 include, but are not limited to: acceleration sensor 911, gyro sensor 912, pressure sensor 913, fingerprint sensor 914, optical sensor 915, and proximity sensor 916.

The acceleration sensor 911 may detect the magnitude of acceleration in three coordinate axes of a coordinate system established with the electronic device 900. For example, the acceleration sensor 911 may be used to detect the components of the gravitational acceleration in three coordinate axes. The processor 901 can control the touch display 905 to display the user interface in a landscape view or a portrait view according to the gravitational acceleration signal collected by the acceleration sensor 911. The acceleration sensor 911 may also be used for acquisition of motion data of a game or a user.

The gyro sensor 912 may detect the body orientation and rotation angle of the electronic device 900, and may cooperate with the acceleration sensor 911 to capture the user's 3D motion applied to the electronic device 900. Based on the data collected by the gyro sensor 912, the processor 901 can implement the following functions: motion sensing (such as changing the UI according to a tilting operation by the user), image stabilization during photographing, game control, and inertial navigation.

The pressure sensor 913 may be disposed on a side frame of the electronic device 900 and/or at a lower layer of the touch display screen 905. When the pressure sensor 913 is disposed on the side frame of the electronic device 900, it can detect the user's grip signal on the electronic device 900, and the processor 901 performs left/right hand recognition or a shortcut operation according to the grip signal collected by the pressure sensor 913. When the pressure sensor 913 is disposed at the lower layer of the touch display screen 905, the processor 901 controls an operable control on the UI according to the user's pressure operation on the touch display screen 905. The operable control includes at least one of a button control, a scroll bar control, an icon control, and a menu control.

The fingerprint sensor 914 is used to collect the user's fingerprint. Either the processor 901 identifies the user according to the fingerprint collected by the fingerprint sensor 914, or the fingerprint sensor 914 itself identifies the user according to the collected fingerprint. Upon recognizing that the user's identity is a trusted identity, the processor 901 authorizes the user to perform relevant sensitive operations, including unlocking the screen, viewing encrypted information, downloading software, making payments, and changing settings. The fingerprint sensor 914 may be disposed on the front, back, or side of the electronic device 900. When a physical button or vendor logo is provided on the electronic device 900, the fingerprint sensor 914 may be integrated with the physical button or vendor logo.

The optical sensor 915 is used to collect ambient light intensity. In one embodiment, the processor 901 may control the display brightness of the touch display screen 905 based on the ambient light intensity collected by the optical sensor 915. Specifically, when the ambient light intensity is high, the display brightness of the touch display screen 905 is turned up; when the ambient light intensity is low, the display brightness of the touch display screen 905 is turned down. In another embodiment, the processor 901 may also dynamically adjust the shooting parameters of the camera assembly 906 according to the ambient light intensity collected by the optical sensor 915.
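The brightness adjustment described above can be sketched as a clamped linear mapping from ambient light to a brightness level. The function name, the lux thresholds, and the brightness range are all hypothetical values chosen for illustration; the application only states that brightness rises with ambient light and falls when it is low.

```python
def display_brightness(
    ambient_lux: float,
    min_lux: float = 10.0,      # assumed lower bound of the useful range
    max_lux: float = 1000.0,    # assumed upper bound of the useful range
    min_level: float = 0.1,     # assumed dimmest brightness (fraction of max)
    max_level: float = 1.0,     # assumed brightest level
) -> float:
    """Map ambient light intensity to a display brightness in [min_level, max_level]."""
    # Clamp readings outside the assumed range.
    if ambient_lux <= min_lux:
        return min_level
    if ambient_lux >= max_lux:
        return max_level
    # Interpolate linearly between the two bounds.
    fraction = (ambient_lux - min_lux) / (max_lux - min_lux)
    return min_level + fraction * (max_level - min_level)
```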

The proximity sensor 916, also known as a distance sensor, is typically disposed on the front panel of the electronic device 900. The proximity sensor 916 is used to measure the distance between the user and the front of the electronic device 900. In one embodiment, when the proximity sensor 916 detects that the distance between the user and the front of the electronic device 900 gradually decreases, the processor 901 controls the touch display screen 905 to switch from the bright screen state to the dark screen state; when the proximity sensor 916 detects that the distance between the user and the front of the electronic device 900 gradually increases, the processor 901 controls the touch display screen 905 to switch from the dark screen state to the bright screen state.
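The screen switching driven by the proximity sensor can be sketched as a two-state transition function. The state labels "bright" and "dark", the function name, and the use of two consecutive distance readings to detect a trend are assumptions for illustration only.

```python
def next_screen_state(state: str, prev_distance: float, distance: float) -> str:
    """Return the next screen state given two consecutive distance readings.

    Assumption: a single pair of readings is enough to decide the trend;
    a real device would probably smooth the signal first.
    """
    # User is approaching the front of the device: turn the screen dark.
    if distance < prev_distance and state == "bright":
        return "dark"
    # User is moving away from the front of the device: light the screen.
    if distance > prev_distance and state == "dark":
        return "bright"
    # No change in distance, or already in the target state.
    return state
```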

Those skilled in the art will appreciate that the configuration shown in fig. 9 does not constitute a limitation of the electronic device 900, and that the electronic device 900 may include more or fewer components than those shown, combine certain components, or employ a different arrangement of components.

In an exemplary embodiment, there is also provided a computer readable storage medium, such as a memory, including at least one program code, which is executable by a processor in a terminal to perform the speech recognition method in the above embodiments. For example, the computer-readable storage medium may be a ROM (Read-Only Memory), a RAM (Random-Access Memory), a CD-ROM (Compact Disc Read-Only Memory), a magnetic tape, a floppy disk, an optical data storage device, and the like.

It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, and the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.

The above description is only exemplary of the present application and should not be taken as limiting, as any modification, equivalent replacement, or improvement made within the spirit and principle of the present application should be included in the protection scope of the present application.

Claims

1. A method of speech recognition, the method comprising:

acquiring at least one voice frame of a voice to be recognized;

2. The method of claim 1, wherein the speech recognition model is a long short term memory network (LSTM) model based on a residual structure;

the inputting the at least one speech frame into a speech recognition model, performing a weighted transformation based on a residual structure on the at least one speech frame through the speech recognition model, and outputting at least one prediction probability comprises:

inputting the at least one voice frame into an input layer in the LSTM model, and extracting the frequency characteristics of the at least one voice frame through the input layer;

respectively inputting the frequency characteristics of the at least one voice frame into at least one memory unit of a hidden layer in the LSTM model, and performing weighted transformation based on a residual structure on the frequency characteristics of the at least one voice frame through the at least one memory unit to output a feature vector of the at least one voice frame;

inputting the feature vector of the at least one speech frame into an output layer in the LSTM model, mapping the feature vector of the at least one speech frame to the at least one prediction probability by the output layer.

3. The method according to claim 2, wherein the performing, by the at least one memory unit, a weighted transformation based on a residual structure on the frequency characteristics of the at least one speech frame, and outputting the feature vector of the at least one speech frame comprises:

for any memory unit, in response to the frequency characteristics of the speech frame corresponding to the memory unit and the processing result of the previous memory unit, performing weighted transformation on the frequency characteristics of the speech frame to obtain an intermediate vector of the speech frame, and fusing the intermediate vector of the speech frame and the processing result of the previous memory unit to obtain the feature vector of the speech frame.
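A minimal sketch of the memory-unit step recited above, under strong simplifying assumptions: the weighted transformation is reduced to a single matrix-vector product (the LSTM gates are omitted), and "fusing" is taken to be element-wise addition, as in a typical residual connection. The names `weighted_transform`, `residual_unit`, and `run_units` are hypothetical and do not come from the application.

```python
from typing import List

Vector = List[float]
Matrix = List[Vector]


def weighted_transform(weights: Matrix, features: Vector) -> Vector:
    """Multiply a weight matrix (list of rows) by a feature vector."""
    return [sum(w * x for w, x in zip(row, features)) for row in weights]


def residual_unit(weights: Matrix, features: Vector, prev_result: Vector) -> Vector:
    """One simplified memory unit with a residual (skip) connection."""
    # Weighted transformation of the frame's frequency features.
    intermediate = weighted_transform(weights, features)
    # Fusion assumed to be element-wise addition with the previous result.
    return [i + p for i, p in zip(intermediate, prev_result)]


def run_units(weights: Matrix, frames: List[Vector], initial: Vector) -> List[Vector]:
    """Apply the unit frame by frame, feeding each result to the next unit."""
    result = initial
    outputs = []
    for features in frames:
        result = residual_unit(weights, features, result)
        outputs.append(result)
    return outputs
```

Because the previous result is added directly to the intermediate vector, context from earlier frames reaches each later frame without passing through the weighted transformation again.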

4. The method of claim 2, wherein before inputting the feature vector of the at least one speech frame into an output layer in the LSTM model, the method further comprises:

inputting the feature vector of the at least one voice frame into a projection layer, and determining whether to carry out iterative projection on the at least one voice frame through the projection layer;

in response to determining to perform iterative projection, re-inputting the feature vector of the at least one speech frame into the at least one memory unit for iterative weighted transformation;

in response to determining not to iteratively project, inputting the feature vector of the at least one speech frame into the output layer.
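The control flow of the projection step can be sketched as a loop that either sends the feature vector back through the memory units or forwards it to the output layer. The decision rule is not given in the claim, so it is passed in as a hypothetical callable `should_iterate`; `memory_units`, `output_layer`, and the `max_rounds` safety cap are likewise assumptions for illustration.

```python
from typing import Callable, List

Vector = List[float]


def project_until_done(
    feature: Vector,
    memory_units: Callable[[Vector], Vector],
    should_iterate: Callable[[Vector, int], bool],
    output_layer: Callable[[Vector], Vector],
    max_rounds: int = 10,
) -> Vector:
    """Iterate through the memory units until the projection rule says stop."""
    rounds = 0
    # Re-input the feature vector while the (assumed) rule asks for it.
    while rounds < max_rounds and should_iterate(feature, rounds):
        feature = memory_units(feature)
        rounds += 1
    # No further iterative projection: hand the vector to the output layer.
    return output_layer(feature)
```

For example, with `memory_units` halving each element and `should_iterate` allowing two rounds, the vector `[4.0, 8.0]` reaches the output layer as `[1.0, 2.0]`.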

5. The method according to claim 1, wherein after determining the speech keyword contained in the speech to be recognized based on the at least one prediction probability, the method further comprises:

in response to the voice keyword including a target keyword, acquiring a target voice for voiceprint recognition;

and performing voiceprint recognition on the target voice to obtain a voiceprint recognition result of the target voice, wherein the voiceprint recognition result indicates whether the user to whom the target voice belongs is a target user.

6. The method according to claim 5, wherein the performing voiceprint recognition on the target speech to obtain a voiceprint recognition result of the target speech comprises:

inputting the target voice into a voiceprint recognition model, and performing feature extraction on the target voice through the voiceprint recognition model to obtain the noise-containing voiceprint features of the target voice;

denoising the noise-containing voiceprint feature of the target voice to obtain a pure voiceprint feature of the target voice;

and determining the voiceprint recognition result based on the similarity between the pure voiceprint features and the voiceprint features of the target user stored in the voiceprint library.
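The final similarity step can be sketched as follows. The claim does not name a similarity measure or a decision threshold, so cosine similarity and the value 0.8 are assumptions chosen for illustration, as are the function names.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two voiceprint feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # A zero vector carries no direction; treat it as dissimilar.
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_target_user(
    clean_feature: Sequence[float],
    enrolled_feature: Sequence[float],
    threshold: float = 0.8,  # assumed threshold, not from the application
) -> bool:
    """Decide whether the denoised voiceprint matches the enrolled one."""
    return cosine_similarity(clean_feature, enrolled_feature) >= threshold
```

Here `enrolled_feature` stands for the target user's voiceprint features stored in the voiceprint library.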

7. The method according to claim 6, wherein the denoising the noisy voiceprint feature of the target speech to obtain a clean voiceprint feature of the target speech comprises:

8. A speech recognition apparatus, characterized in that the apparatus comprises:

9. The apparatus of claim 8, wherein the speech recognition model is a long short term memory network (LSTM) model based on a residual structure;

the weighted transformation module includes:

a first extraction unit, configured to input the at least one speech frame into an input layer in the LSTM model, and extract, through the input layer, a frequency feature of the at least one speech frame;

10. The apparatus of claim 9, wherein the weighted transform unit is configured to:

11. The apparatus of claim 9, further comprising:

12. The apparatus of claim 8, wherein the obtaining module is further configured to: in response to the voice keyword including a target keyword, acquire a target voice for voiceprint recognition;

13. The apparatus of claim 12, wherein the voiceprint recognition module comprises:

14. An electronic device, comprising one or more processors and one or more memories having at least one program code stored therein, the at least one program code being loaded into and executed by the one or more processors to perform operations performed by the speech recognition method of any one of claims 1 to 7.

15. A storage medium having stored therein at least one program code, which is loaded and executed by a processor to perform the operations performed by the speech recognition method according to any one of claims 1 to 7.