CN111210829A - Speech recognition method, apparatus, system, device and computer readable storage medium - Google Patents
- Publication number
- CN111210829A (application number CN202010102418.9A)
- Authority
- CN
- China
- Prior art keywords: voiceprint, audio, awakening, wake, terminal
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- All of the following classifications fall under G (PHYSICS); G10 (MUSICAL INSTRUMENTS; ACOUSTICS); G10L (SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING); G10L17/00 (Speaker identification or verification techniques):
- G10L17/02: Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
- G10L17/04: Training, enrolment or model building
- G10L17/06: Decision making techniques; Pattern matching strategies
- G10L17/20: Pattern transformations or operations aimed at increasing system robustness, e.g. against channel noise or different working conditions
- G10L17/22: Interactive procedures; Man-machine interfaces
- G10L17/24: Interactive procedures; Man-machine interfaces, the user being prompted to utter a password or a predefined phrase
Landscapes
- Engineering & Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Business, Economics & Management (AREA)
- Game Theory and Decision Science (AREA)
- Telephonic Communication Services (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The present application relates to a speech recognition method, apparatus, system, device and computer-readable storage medium. The method comprises the following steps: acquiring a wake-up voiceprint feature from the wake-up audio with which a terminal is woken up; acquiring speech recognition feedback data for dialogue audio according to the wake-up voiceprint feature and a dialogue voiceprint feature in the dialogue audio received after the terminal is woken up; and sending the speech recognition feedback data to the terminal for the terminal to present. With this method, audio from users other than the one who woke the terminal is not misrecognized, and the misrecognition rate for abnormal speech, such as speech from non-wake-up users and noise, is effectively reduced, so the accuracy of speech recognition is effectively improved.
Description
Technical Field
The present application relates to the field of speech processing technologies, and in particular, to a speech recognition method, apparatus, system, device, and computer-readable storage medium.
Background
With the rapid development of speech processing technology, full-duplex voice interaction technology is increasingly applied in long-range voice interaction scenarios, owing to its ability to predict in real time what a person is about to say, generate responses in real time, and control the rhythm of the conversation.
In the conventional technology, a full-duplex semantic anti-noise model is trained at the cloud to recognize specific noise texts, and the recognized noise texts are then masked to reduce the probability of noise misrecognition. However, when this approach is applied in special scenarios such as noisy public places, the speech of the user to be recognized and the simultaneous speech of other users are easily captured and misrecognized together, so the effective audio information in the speech cannot be distinguished.
Current speech recognition methods therefore suffer from low recognition accuracy.
Disclosure of Invention
In view of the above, it is necessary to provide a speech recognition method, apparatus, system, device and computer readable storage medium capable of improving the accuracy of speech recognition.
A speech recognition method, the method comprising:
acquiring a wake-up voiceprint feature from the wake-up audio with which a terminal is woken up;
acquiring speech recognition feedback data for dialogue audio according to the wake-up voiceprint feature and a dialogue voiceprint feature in the dialogue audio received after the terminal is woken up;
and sending the speech recognition feedback data to the terminal for the terminal to present the speech recognition feedback data.
A speech recognition method, the method comprising:
receiving a speech recognition request initiated by a user through wake-up audio;
determining a wake-up word in the wake-up audio in response to the speech recognition request;
when the wake-up word in the wake-up audio matches a preset wake-up word, sending the wake-up audio to a server;
receiving dialogue audio and sending the dialogue audio to the server, so that the server obtains speech recognition feedback data according to a dialogue voiceprint feature in the dialogue audio and a wake-up voiceprint feature in the wake-up audio;
and receiving the speech recognition feedback data from the server.
A speech recognition apparatus, the apparatus comprising:
a feature acquisition module, configured to acquire a wake-up voiceprint feature from the wake-up audio with which a terminal is woken up;
a data acquisition module, configured to acquire speech recognition feedback data for dialogue audio according to the wake-up voiceprint feature and a dialogue voiceprint feature in the dialogue audio received after the terminal is woken up;
and a data sending module, configured to send the speech recognition feedback data to the terminal for the terminal to present the speech recognition feedback data.
A speech recognition apparatus, the apparatus comprising:
a recognition request receiving module, configured to receive a speech recognition request initiated by a user through wake-up audio;
a recognition request response module, configured to determine a wake-up word in the wake-up audio in response to the speech recognition request;
a wake-up audio sending module, configured to send the wake-up audio to a server when the wake-up word in the wake-up audio matches a preset wake-up word;
a dialogue audio sending module, configured to receive dialogue audio and send the dialogue audio to the server, so that the server obtains speech recognition feedback data according to a dialogue voiceprint feature in the dialogue audio and a wake-up voiceprint feature in the wake-up audio;
and a feedback data receiving module, configured to receive the speech recognition feedback data from the server.
A speech recognition system, the system comprising:
a server and a terminal;
the terminal is configured to determine a wake-up word in wake-up audio in response to a speech recognition request initiated by a user through the wake-up audio, send the wake-up audio to the server when the wake-up word matches a preset wake-up word, and receive dialogue audio and send it to the server, so as to receive the speech recognition feedback data from the server;
the server is configured to acquire a wake-up voiceprint feature from the wake-up audio with which the terminal is woken up, acquire speech recognition feedback data for the dialogue audio according to the wake-up voiceprint feature and a dialogue voiceprint feature in the dialogue audio received after the terminal is woken up, and finally send the speech recognition feedback data to the terminal for the terminal to present.
A computer device comprising a memory and a processor, the memory storing a computer program, the processor implementing the following steps when executing the computer program:
acquiring a wake-up voiceprint feature from the wake-up audio with which a terminal is woken up;
acquiring speech recognition feedback data for dialogue audio according to the wake-up voiceprint feature and a dialogue voiceprint feature in the dialogue audio received after the terminal is woken up;
and sending the speech recognition feedback data to the terminal for the terminal to present the speech recognition feedback data.
A computer-readable storage medium, on which a computer program is stored which, when executed by a processor, carries out the following steps:
acquiring a wake-up voiceprint feature from the wake-up audio with which a terminal is woken up;
acquiring speech recognition feedback data for dialogue audio according to the wake-up voiceprint feature and a dialogue voiceprint feature in the dialogue audio received after the terminal is woken up;
and sending the speech recognition feedback data to the terminal for the terminal to present the speech recognition feedback data.
According to the above speech recognition method, apparatus, system, device and computer-readable storage medium, by acquiring the wake-up voiceprint feature in the wake-up audio when the terminal is woken up, speech recognition feedback data for the dialogue audio can be further acquired according to the wake-up voiceprint feature and the dialogue voiceprint feature in the dialogue audio received after the terminal is woken up, and then sent to the terminal for the terminal to present. With this method, audio from users other than the one who woke the terminal is not misrecognized, and the misrecognition rate for abnormal speech, such as speech from non-wake-up users and noise, is effectively reduced, so the accuracy of speech recognition is effectively improved.
Drawings
FIG. 1 is a diagram of an exemplary implementation of a speech recognition method;
FIG. 2 is a flow diagram illustrating a speech recognition method in one embodiment;
FIG. 3 is a flowchart illustrating the step of obtaining a wake-up voiceprint feature in one embodiment;
FIG. 4 is a flowchart illustrating the wake-up voiceprint feature extraction step in one embodiment;
FIG. 5 is a schematic flow chart diagram illustrating the speech recognition feedback data acquisition step in one embodiment;
FIG. 6 is a flowchart illustrating the wake voiceprint identification determination step in one embodiment;
FIG. 7 is a flow diagram that illustrates the registration of a user with a voiceprint in one embodiment;
FIG. 8 is a flowchart illustrating the wake voiceprint identification determination step in another embodiment;
FIG. 9 is a flowchart illustrating a wake voiceprint identification determination step in yet another embodiment;
FIG. 10 is a flowchart illustrating the classifier model training step in one embodiment;
FIG. 11 is a flowchart illustrating the speech recognition feedback data acquisition step in another embodiment;
FIG. 12 is a flow chart illustrating a speech recognition method according to another embodiment;
FIG. 13 is a flow diagram of a speech recognition method in accordance with one embodiment;
FIG. 14 is a diagram illustrating a multi-scene speech recognition method, according to one embodiment;
FIG. 15 is a block diagram showing the structure of a speech recognition apparatus according to an embodiment;
FIG. 16 is a block diagram showing the construction of a speech recognition apparatus according to another embodiment;
FIG. 17 is a block diagram of the structure of a speech recognition system in one embodiment;
FIG. 18 is a diagram illustrating an internal structure of a computer device according to an embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
It should be noted that the speech technology applied in the present application includes automatic speech recognition (ASR), speech synthesis (TTS) and voiceprint recognition. Enabling computers to listen, see, speak and feel is the development direction of human-computer interaction, and speech is expected to become one of the most favoured modes of human-computer interaction in the future.
The speech recognition method provided by the present application can be applied in the application environment shown in FIG. 1, where the terminal 102 communicates with the server 104 via a network. While continuously capturing the user's speech, the terminal 102 may send it over the network to the server 104, so that the server 104 can continuously perform speech recognition on the audio as it arrives and then, once certain preset features have been recognized, conduct a voice conversation with the user through the terminal 102 (feeding back the information the user requested). The terminal 102 may be, but is not limited to, a personal computer, a notebook computer, a smartphone, a tablet computer or a portable wearable device; the server 104 may be implemented as an independent server or as a cluster of servers; and the network includes, but is not limited to, a wide area network, a metropolitan area network or a local area network.
In one embodiment, as shown in fig. 2, a speech recognition method is provided, which is exemplified by the application of the method to the server 104 in fig. 1, and includes the following steps:
Step 202: acquiring a wake-up voiceprint feature from the wake-up audio when the terminal is woken up.
Audio may refer to any sound signal that people can hear and that can be stored in a computer, such as speech, singing, musical instruments or noise. Wake-up audio is a sound signal carrying the special information that can trigger the terminal 102 to respond.
A voiceprint is a sound-wave spectrum carrying speech information, a voiceprint feature is the identifying information that characterizes that spectrum, and the wake-up voiceprint feature is the voiceprint feature in the wake-up audio that triggers the terminal 102 to respond.
Specifically, the terminal 102 and the server 104 establish a communication connection through the network. Before performing speech recognition, the server 104 needs to acquire the wake-up audio sent by the terminal 102; this audio carries not only the specific information that triggers the terminal 102 to respond to the user's voice request, but also the user's voiceprint feature, namely the wake-up voiceprint feature. The server 104 acquires the wake-up voiceprint feature from the wake-up audio when the terminal 102 is woken up and can use it as the basis for subsequent speech recognition, so that the voiceprint feature serves to match the user's identity.
Step 204: acquiring speech recognition feedback data for the dialogue audio according to the wake-up voiceprint feature and a dialogue voiceprint feature in the dialogue audio received after the terminal is woken up.
The dialogue audio may refer to the user's speech received after the terminal 102 is woken up.
The dialogue voiceprint feature refers to the voiceprint feature of the user in the dialogue audio.
For example, if the text information in the dialogue audio is "what is the weather today", the speech recognition feedback data may be weather data, such as temperature and air quality, that the server 104 queries over the Internet.
Specifically, to improve recognition accuracy, the server 104 needs to acquire not only the wake-up voiceprint feature in the wake-up audio but also the dialogue voiceprint feature in the dialogue audio received after the terminal 102 is woken up. By analyzing the voiceprint features captured at the two different times, it can judge whether the user currently interacting with the terminal 102 is the same user who earlier triggered the terminal 102 to respond. In a scenario full of voices or noise, such as a public place, this makes it possible to mask the speech of non-wake-up users and noise, respond only to the speech recognition request of the user whose two voiceprint features match, and query and acquire the speech recognition feedback data that user requested.
For example, once the server 104 determines that the current wake-up voiceprint feature matches the dialogue voiceprint feature, it may look up the query "what is the weather today" carried in the dialogue audio, and the weather data obtained can serve as the speech recognition feedback data for the dialogue audio.
Step 206: sending the speech recognition feedback data to the terminal for the terminal to present the speech recognition feedback data.
Specifically, the speech recognition feedback data may include a voice stream containing the text to be broadcast, for example the opening line of a weather forecast, and/or recognition feedback data such as the weather data itself (temperature, air quality, and so on).
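Purely as an illustration, such feedback data might be carried in a structure like the following sketch; the field names are assumptions made here, not names defined by this application:

```python
from dataclasses import dataclass, field
from typing import Optional

# Illustrative container for speech recognition feedback data: an
# optional synthesized voice stream for the terminal to play, plus
# structured data for the terminal to display.
@dataclass
class SpeechRecognitionFeedback:
    voice_stream: Optional[bytes] = None              # synthesized audio to play
    display_data: dict = field(default_factory=dict)  # data for the interactive interface

feedback = SpeechRecognitionFeedback(
    voice_stream=b"...",  # e.g. TTS of a weather-forecast opening line
    display_data={"temperature": "23°C", "air_quality": "good"},
)
```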
In this speech recognition method, by acquiring the wake-up voiceprint feature in the wake-up audio when the terminal is woken up, the speech recognition feedback data for the dialogue audio can be further acquired according to the wake-up voiceprint feature and the dialogue voiceprint feature in the dialogue audio received after the terminal is woken up, and then conveniently sent to the terminal for the terminal to present. With this method, audio from users other than the one who woke the terminal is not misrecognized, and the misrecognition rate for abnormal speech, such as speech from non-wake-up users and noise, is effectively reduced, so the accuracy of speech recognition is effectively improved.
In one embodiment, as shown in FIG. 3, step 202 comprises:
Step 302: acquiring the wake-up audio received when the terminal is woken up by a preset wake-up word.
The preset wake-up word is a wake-up word stored in the terminal 102 in advance that can trigger the terminal 102 to respond to subsequent instructions, such as the factory-set wake-up word "jingle" of an intelligent voice device, or a user-defined wake-up word such as "comprehend".
Specifically, before acquiring the wake-up voiceprint feature, the server 104 first needs to acquire the wake-up audio received when the terminal 102 is woken up by the preset wake-up word. Only from wake-up audio that satisfies the preset wake-up word condition can the wake-up voiceprint feature be effectively extracted; otherwise, no wake-up voiceprint feature is obtained.
Step 304: extracting the user voiceprint feature in the wake-up audio as the wake-up voiceprint feature.
The user voiceprint feature refers to the voiceprint feature of the speaking user in the wake-up audio.
Specifically, the user voiceprint feature in wake-up audio that satisfies the preset wake-up word condition can then be extracted as the wake-up voiceprint feature.
In this embodiment, extracting the user's voiceprint feature from the wake-up audio yields a wake-up voiceprint feature that is more accurate than one taken from ordinary audio, and also effectively reduces the misrecognition rate for abnormal speech, thereby effectively improving the accuracy of speech recognition.
In one embodiment, as shown in FIG. 4, step 304 comprises:
Step 402: framing the wake-up audio to obtain at least one wake-up audio frame.
Step 404: windowing the at least one wake-up audio frame to obtain at least one wake-up windowed audio frame.
Step 406: extracting the mel-frequency cepstral coefficients of the at least one wake-up windowed audio frame as the wake-up voiceprint feature.
Mel-frequency cepstral coefficients (MFCCs) are the coefficients that make up a mel-frequency cepstrum. In the field of sound processing, the mel-frequency cepstrum is a linear transformation of the logarithmic energy spectrum based on the nonlinear mel scale of sound frequency.
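For reference, one widely used convention maps a frequency f in hertz to the mel scale as

    m = 2595 * log10(1 + f / 700)

and the mel filter banks underlying the cepstrum are spaced evenly in m rather than in f.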
Specifically, after a segment of wake-up audio is acquired, the distribution of its frequency components needs to be determined, and a Fourier transform can be used for this analysis. The concrete operations are: framing the wake-up audio (for example, one frame every 20 ms); windowing each framed wake-up audio frame by weighting it with a movable window function of specified length; and then extracting the mel-frequency cepstral coefficients of each wake-up windowed audio frame as the wake-up voiceprint feature.
More specifically, before the wake-up audio is framed, endpoint detection may be performed on it in advance, that is, the speech start point and speech end point are located in the current wake-up audio, for example with the double-threshold method.
In this embodiment, framing and windowing the wake-up audio allows the speech signal to be analyzed accurately, which improves the recognition rate, while endpoint detection reduces the amount of calculation, shortens the processing time and eliminates the noise interference of silent segments, further improving the accuracy of speech recognition.
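The endpoint detection, framing, windowing and MFCC extraction described above might look like the following Python sketch built on the librosa library; the 16 kHz sampling rate, 20 ms frames with 50% overlap, Hann window, 30 dB silence threshold and 13 coefficients are illustrative assumptions, not values prescribed by this application:

```python
import librosa
import numpy as np

def extract_wake_voiceprint(wav_path, sr=16000, n_mfcc=13):
    y, sr = librosa.load(wav_path, sr=sr)

    # Endpoint detection: keep only non-silent intervals, playing
    # roughly the role of the double-threshold method above.
    intervals = librosa.effects.split(y, top_db=30)
    voiced = np.concatenate([y[s:e] for s, e in intervals]) if len(intervals) else y

    # Framing (one ~20 ms frame every hop) and Hann windowing happen
    # inside the short-time transform; MFCCs are taken per windowed frame.
    frame_length = int(0.020 * sr)   # 20 ms frame
    hop_length = frame_length // 2   # 50% overlap
    mfcc = librosa.feature.mfcc(y=voiced, sr=sr, n_mfcc=n_mfcc,
                                n_fft=frame_length, hop_length=hop_length,
                                window="hann")
    return mfcc.T                    # (n_frames, n_mfcc): the N-frame MFCC set
```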
In one embodiment, as shown in FIG. 5, step 204 comprises:
Step 502: determining the wake-up voiceprint identifier of the wake-up voiceprint feature.
The wake-up voiceprint identifier is a globally unique character string capable of characterizing the identity of the user, for example "12345" or "a23d4".
Specifically, the wake-up voiceprint identifier can be determined in several ways. For example, when the server 104 stores voiceprint identifiers registered by users in advance, the wake-up voiceprint feature can be matched against the pre-stored registered voiceprint features to determine the wake-up voiceprint identifier of the current wake-up voiceprint feature. Alternatively, when the server 104 stores no pre-registered voiceprint identifier, or the user is not required to register a voiceprint before waking the device, the wake-up voiceprint feature can be used as the input of a trained model, and the model output taken as the wake-up voiceprint identifier.
Step 504: acquiring speech recognition feedback data for the dialogue audio according to the wake-up voiceprint identifier and the dialogue voiceprint feature in the dialogue audio received after the terminal is woken up.
Specifically, the speech recognition feedback data for the dialogue audio can be obtained according to the result of matching the wake-up voiceprint identifier against the dialogue voiceprint feature.
In this embodiment, whether to acquire the speech recognition feedback data is decided by analyzing the match between the wake-up voiceprint identifier and the dialogue voiceprint feature, which improves calculation efficiency and effectively improves the accuracy of speech recognition.
In one embodiment, as shown in FIG. 6, step 502 includes:
Step 602: determining the wake-up voiceprint identifier among at least one pre-stored registered voiceprint identifier according to the wake-up voiceprint feature, each registered voiceprint identifier having a corresponding registered voiceprint feature; the wake-up voiceprint identifier is the registered voiceprint identifier whose registered voiceprint feature matches the wake-up voiceprint feature.
The registered voiceprint identifier refers to the identifier of a registered voiceprint.
Specifically, in a practical application scenario, before triggering the terminal 102 to respond to instructions, the user needs to register his or her identity information. Exploiting the fact that every voiceprint is different, a globally unique registered voiceprint identifier is generated at registration and stored in the database on the server 104. When a user subsequently wakes up the terminal 102, the wake-up voiceprint feature is matched against the at least one registered voiceprint identifier pre-stored on the server 104, and the registered voiceprint identifier of the registered voiceprint feature matching the wake-up voiceprint feature is determined, namely the wake-up voiceprint identifier.
More specifically, the user voiceprint registration flow is shown in FIG. 7. The user submits wake-up audio containing the preset wake-up word "jingle" to the terminal 102 as guided. After the first submission, the server 104 performs speech recognition on the wake-up audio; if it recognizes that the voiceprint feature in the audio is already registered, it generates an "already registered" prompt and sends it to the terminal 102 for display. If no such prompt is fed back after the first submission, the server continues to receive the wake-up audio the user submits for voiceprint registration until a preset number of submissions is reached, then performs model training on the voiceprint features in the submitted wake-up audio, finally obtaining the user's registered voiceprint identifier and storing it in the database.
In this embodiment, determining the wake-up voiceprint identifier of the wake-up voiceprint feature from pre-stored registered voiceprint identifiers allows the identity of the wake-up user to be established quickly, effectively improving the accuracy of speech recognition.
In one embodiment, as shown in FIG. 8, step 602 comprises:
Step 802: determining at least one pre-stored registered voiceprint identifier, each having a registered voiceprint feature.
Step 804: calculating the feature similarity between each registered voiceprint feature and the wake-up voiceprint feature.
Step 806: determining, as the wake-up voiceprint identifier, the registered voiceprint identifier of the registered voiceprint feature whose feature similarity reaches the preset similarity threshold and is the largest.
Specifically, many algorithms can be used to calculate the feature similarity between a registered voiceprint feature and the wake-up voiceprint feature, such as the Minkowski, Manhattan, Euclidean or Chebyshev distance, or cosine similarity; the calculated similarity may be expressed on a numerical range such as 0 to 1 or 0 to 10, or as a percentage, 0 to 100%.
More specifically, if there are multiple registered voiceprint features, each is compared with the wake-up voiceprint feature, giving multiple feature similarities. These are first screened against the preset similarity threshold; if more than one similarity still satisfies the threshold, the largest one is taken and the identifier of its registered voiceprint feature serves as the wake-up voiceprint identifier. The server 104 can thus determine, from the pre-registered voiceprint identifiers, the identity of the user currently waking up the terminal 102.
In this embodiment, calculating and screening the feature similarities between the registered voiceprint features and the wake-up voiceprint feature to determine the wake-up voiceprint identifier improves the accuracy of speech recognition.
In one embodiment, as shown in FIG. 9, step 502 includes:
and step 902, training a voiceprint classifier by using the awakening voiceprint characteristics.
And 904, acquiring the optimal structure data of the trained voiceprint classifier as the awakening voiceprint identifier.
Wherein the voiceprint classifier may include at least one of a Gaussian mixture classifier (GMM), a convolutional neural network Classifier (CNN), a recurrent neural network classifier (RNN), a deep neural network classifier (DNN), and a Support Vector Machine (SVM).
Specifically, this embodiment proposes using a Gaussian mixture classifier for voiceprint feature classification training: the wake-up voiceprint feature is input into the Gaussian mixture classifier for training, yielding the optimal structure data of the trained classifier, a Gaussian parameter set that includes the per-component parameter vectors (means and standard deviations), the weight coefficient vector, and so on. This Gaussian parameter set can represent the wake-up voiceprint identifier.
More specifically, since the wake-up voiceprint feature is in fact the set of N frames of mel-frequency cepstral coefficients obtained after framing and windowing, it can be input into a voiceprint classifier chosen according to service requirements in order to acquire the wake-up voiceprint identifier.
It should be noted that, for the classifier model training process involved in this embodiment, the concrete processing steps in practical applications are shown in FIG. 10. The aim of training the voiceprint model is to obtain the voiceprint identifier of the user submitting the wake-up audio. This step can be applied to obtaining the registered voiceprint identifier in the foregoing embodiment, and also to voice interaction scenarios without prior voiceprint registration, that is, the model is trained on the spot when the user wakes up the terminal 102. The voiceprint identifier acquisition process mainly comprises: (1) wake-up audio input: a preset number of consecutive wake-up audio samples are fed into the model training thread; (2) audio preprocessing: the wake-up audio is preprocessed, including endpoint detection, framing and windowing; (3) voiceprint feature extraction: mel-frequency cepstral coefficient (MFCC) features are extracted from each frame of audio data to obtain the voiceprint features; (4) model training: the N-frame MFCC array is analyzed to compute a GMM Gaussian parameter set; this parameter set is the voiceprint model and represents the voiceprint identifier.
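Step (4) might be sketched as follows with scikit-learn's GaussianMixture; the component count and covariance type are illustrative choices, and the returned dictionary corresponds to the GMM Gaussian parameter set described above:

```python
from sklearn.mixture import GaussianMixture

def train_voiceprint_gmm(mfcc_frames, n_components=8):
    # mfcc_frames: array of shape (n_frames, n_mfcc), e.g. the output
    # of extract_wake_voiceprint above.
    gmm = GaussianMixture(n_components=n_components, covariance_type="diag",
                          max_iter=200, random_state=0)
    gmm.fit(mfcc_frames)
    # The Gaussian parameter set: weight coefficients plus per-component
    # mean and variance vectors; it stands for the voiceprint identifier.
    return {"weights": gmm.weights_, "means": gmm.means_,
            "covariances": gmm.covariances_, "model": gmm}
```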
In addition, FIG. 10 also covers the voiceprint matching process, which is used in two places. One sub-process matches the wake-up voiceprint feature in the wake-up audio against the registered voiceprint identifiers and can determine the unique wake-up voiceprint identifier that matches the wake-up voiceprint feature; the other matches the dialogue voiceprint feature in the dialogue audio against the wake-up voiceprint identifier and determines whether to acquire the speech recognition feedback data the user requested in the dialogue audio. The voiceprint matching process mainly comprises: (1) audio input: wake-up audio and/or dialogue audio; (2) audio preprocessing: the same as in the training process, not repeated here; (3) voiceprint feature extraction: the same as in the training process, not repeated here; (4) voiceprint matching: a GMM model takes the MFCC features extracted in the previous step as input and outputs a matching result (yes/no).
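Continuing the sketch, the matching model of step (4) can be approximated by scoring the incoming MFCC frames under the enrolled GMM and thresholding the average log-likelihood; the threshold value here is an assumption that would be tuned in practice:

```python
def match_against_voiceprint(voiceprint, mfcc_frames, threshold=-45.0):
    # voiceprint: the dictionary returned by train_voiceprint_gmm.
    avg_log_likelihood = voiceprint["model"].score(mfcc_frames)
    return avg_log_likelihood >= threshold  # the yes/no matching result
```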
In this embodiment, obtaining the wake-up voiceprint identifier of the wake-up voiceprint feature from the model training result effectively improves the accuracy of speech recognition.
In one embodiment, as shown in FIG. 11, step 504 includes:
and 1102, matching the awakening voiceprint identifier with the dialogue voiceprint feature in the dialogue audio after the terminal is awakened by adopting the voiceprint classifier trained by the awakening voiceprint feature in advance.
And 1104, if the awakening voiceprint identifier is matched with the dialogue voiceprint feature, acquiring text query feedback data of a dialogue text in the dialogue audio as the voice recognition feedback data.
Specifically, if the awakening voiceprint identifier is matched with the dialogue voiceprint feature, namely the matching result of model output is yes, text query feedback data of a dialogue text in dialogue audio is obtained and used as voice recognition feedback data; if the awakening voiceprint identifier is not matched with the dialogue voiceprint feature, namely the matching result of model output is 'no', generating a voiceprint registration prompt, and sending the voiceprint registration prompt to the terminal 102, so that the terminal 102 displays the voiceprint registration prompt and/or plays a voiceprint registration prompt tone to prompt a user to register the voiceprint.
More specifically, step 1104 specifically further includes: if the awakening voiceprint identification is matched with the dialogue voiceprint characteristics, acquiring a dialogue result based on a voice recognition text in the dialogue audio, and carrying out voice synthesis on the dialogue result to obtain voice recognition feedback data; the voice recognition feedback data comprises voice streams and/or recognition feedback data; the voice stream is used for the terminal 102 to play voice; the identification feedback data is used for data display of the terminal 102.
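The branch just described might be sketched like this, reusing match_against_voiceprint from the earlier sketch; query_internet and synthesize_speech stand in for the semantic-understanding and speech-synthesis services and are hypothetical names, not APIs defined by this application:

```python
def handle_dialogue(voiceprint, dialogue_mfcc, recognized_text,
                    query_internet, synthesize_speech):
    # "No" match: prompt the user to register a voiceprint instead.
    if not match_against_voiceprint(voiceprint, dialogue_mfcc):
        return {"registration_prompt": "please register your voiceprint"}
    # "Yes" match: answer the recognized query and synthesize the reply.
    answer = query_internet(recognized_text)  # e.g. a weather lookup
    return {"voice_stream": synthesize_speech(answer["text_to_broadcast"]),
            "display_data": answer["data"]}
```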
In this embodiment, audio from users other than the one who woke the terminal is not misrecognized, so the misrecognition rate for abnormal speech, such as speech from non-wake-up users and noise, is effectively reduced, and the accuracy of speech recognition is effectively improved.
In one embodiment, as shown in fig. 12, another speech recognition method is provided, which is described by taking the example that the method is applied to the terminal 102 in fig. 1, and includes the following steps:
step 1202, a voice recognition request initiated by a user through a wake-up audio is received.
Step 1204, in response to the voice recognition request, determining a wake-up word in the wake-up audio.
In step 1206, when the wake-up word in the wake-up audio matches a preset wake-up word, the wake-up audio is sent to a server.
And 1208, receiving a conversation audio, sending the conversation audio to the server, and obtaining voice recognition feedback data by the server according to the conversation voiceprint feature in the conversation audio and the awakening voiceprint feature in the awakening audio.
Step 1210, receiving voice recognition feedback data of the server.
Specifically, the terminal 102 in the standby state may receive a speech recognition request initiated by a user through wake-up audio. In response to the request, it analyzes and recognizes the wake-up word in the wake-up audio and matches it one by one against at least one preset wake-up word. If the wake-up word matches any of the preset wake-up words, the terminal 102 sends the thus verified wake-up audio to the server 104, so that the server 104 can recognize it and collect the wake-up voiceprint feature.
Meanwhile, thanks to the full-duplex voice interaction technology adopted in the present application, the terminal 102 can receive the dialogue audio the user submits next while it is still waiting for the server's feedback on the wake-up audio, and the dialogue voiceprint feature in the dialogue audio is then matched against the wake-up voiceprint feature (the matched wake-up voiceprint identifier). While the server 104 performs this matching, the speech text of the dialogue audio can be recognized by other services; that is, the server 104 can analyze the audio the user submits in real time with multiple services and multiple threads. If the dialogue voiceprint feature analyzed by the server 104 matches the wake-up voiceprint feature (wake-up voiceprint identifier), the dialogue result the user requested can be queried over the Internet using the recognized speech text, and speech synthesis is then performed on the dialogue result to obtain the speech recognition feedback data to be fed back to the server 104.
As for the voice stream and/or recognition feedback data contained in the speech recognition feedback data, the terminal 102 plays the voice stream and displays the recognition feedback data through its interactive interface.
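On the terminal side, the wake-up gate of steps 1202-1206 reduces to a simple check before any audio is uploaded; the preset word list and the upload_to_server callback are illustrative assumptions of this sketch:

```python
PRESET_WAKE_WORDS = {"jingle", "comprehend"}  # factory-set and user-defined examples

def on_wake_audio(recognized_wake_word, wake_audio, upload_to_server):
    if recognized_wake_word in PRESET_WAKE_WORDS:
        upload_to_server(wake_audio)  # server then extracts the wake-up voiceprint
        return True
    return False                      # not a valid wake-up word: stay in standby
```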
In this speech recognition method, by acquiring the wake-up voiceprint feature in the wake-up audio when the terminal is woken up, the speech recognition feedback data for the dialogue audio can be further acquired according to the wake-up voiceprint feature and the dialogue voiceprint feature in the dialogue audio received after the terminal is woken up, and then conveniently sent to the terminal for the terminal to present. With this scheme, audio from users other than the one who woke the terminal is not misrecognized, and the misrecognition rate for abnormal speech, such as speech from non-wake-up users and noise, is effectively reduced, so the accuracy of speech recognition is effectively improved.
It should be understood that although the various steps in the flowcharts of FIGS. 2-6, 8-9 and 11-12 are displayed in the order indicated by the arrows, these steps are not necessarily performed in that order. Unless explicitly stated herein, the order of execution of these steps is not strictly limited, and they may be performed in other orders. Moreover, at least some of the steps in FIGS. 2-6, 8-9 and 11-12 may comprise multiple sub-steps or stages, which are not necessarily completed at the same moment but may be performed at different moments, and which are not necessarily executed sequentially but may be executed in turn or alternately with other steps or with sub-steps or stages of other steps.
The present application also provides an application scenario in which the above speech recognition method is applied. Specifically, the application of the speech recognition method in this scenario is described as follows with reference to fig. 13:
as shown in fig. 13, the speech recognition method can be applied to a full-duplex speech interaction scenario between a user and an intelligent speech device, which is described by taking a timing diagram shown in fig. 13 as an example, and specifically includes the following steps: (1301-1302) after the user submits the awakening audio with the preset awakening word to the intelligent voice device, the intelligent voice device sends the obtained awakening audio to the cloud; (1303 and 1305), the cloud end calls a voice recognition service to acquire awakening voiceprint characteristics of the awakening audio so that the voice recognition service calls the voiceprint service to match the acquired awakening voiceprint characteristics with at least one pre-stored registered voiceprint identifier, namely, whether the awakening audio currently submitted by the user is matched with the pre-stored registered voiceprint of the intelligent voice device is judged, and if the awakening voiceprint identifier is matched, the matched awakening voiceprint identifier is fed back to the voice recognition service; (1306), after the user submits the conversation audio to the intelligent voice device, the intelligent voice device transmits the conversation audio to the voice recognition service through the cloud end; (1309-; (1311) if the matching result obtained by the voice recognition service is yes, feeding back the voice text 'how the weather is today' to the cloud end; if the matching result is obtained to be 'no', feeding the empty text back to the cloud end; (1312-1314) if the cloud acquires the voice text 'how the weather is today', calling the semantic understanding service to identify the voice text so as to acquire data fed back by the semantic understanding service, such as weather data and a text to be broadcasted, and further sending the text to be broadcasted to the voice synthesis service so that the voice synthesis service feeds back a synthesized voice stream after performing voice synthesis on the text to be broadcasted; (1315 + 1316) the cloud sends the acquired voice stream and the acquired weather data to the intelligent voice device, so that the intelligent voice device plays the voice stream, and meanwhile, the weather data is displayed through the interactive interface.
The present application further provides an application scenario in which the above speech recognition method is applied. Specifically, the application of the speech recognition method in this scenario is described as follows with reference to fig. 14:
(1) Scenario one (normal conversation): after the user submits wake-up audio containing the preset wake-up word "jingle" to the intelligent voice device, the device sends the acquired wake-up audio to the cloud; when the cloud receives the dialogue audio the user then submits, "what is the weather today", and recognizes that the voiceprint feature in the dialogue audio matches the voiceprint feature in the wake-up audio, it feeds back to the device data such as "the weather is good today, the temperature is 23°C".
(2) Scenario two (another person's speech mistakenly picked up): building on the full-duplex conversation flow of scenario one, if the voiceprint ID of the audio "what is the weather" currently received by the intelligent voice device is inconsistent with the voiceprint ID at wake-up time, the device gives no feedback.
(3) Scenario three (noise misrecognition): building on the full-duplex conversation flow of scenario one, if noise is mistakenly picked up and recognized by the intelligent voice device, and its voiceprint ID is inconsistent with the voiceprint ID at wake-up time, the device gives no feedback.
These examples illustrate the concrete application of the speech recognition method in different scenarios. With the speech recognition method provided by the present application, audio from users other than the one who woke the device is not misrecognized, the misrecognition rate for abnormal speech such as non-wake-up users' speech and noise is effectively reduced, and the accuracy of speech recognition is effectively improved.
In one embodiment, as shown in fig. 15, there is provided a speech recognition apparatus, which may be implemented in a computer device as a software module, a hardware module, or a combination of the two, and which specifically includes: a feature acquisition module 1502, a data acquisition module 1504 and a data sending module 1506, wherein:
the feature acquisition module 1502 is configured to acquire a wake-up voiceprint feature from the wake-up audio when the terminal is woken up;
the data acquisition module 1504 is configured to acquire speech recognition feedback data for the dialogue audio according to the wake-up voiceprint feature and a dialogue voiceprint feature in the dialogue audio received after the terminal is woken up;
and the data sending module 1506 is configured to send the speech recognition feedback data to the terminal for the terminal to present the speech recognition feedback data.
In an embodiment, the feature acquisition module 1502 is further configured to acquire the wake-up audio received when the terminal is woken up by a preset wake-up word, and to extract the user voiceprint feature in the wake-up audio as the wake-up voiceprint feature.
In an embodiment, the feature acquisition module 1502 is further configured to frame the wake-up audio to obtain at least one wake-up audio frame; window the at least one wake-up audio frame to obtain at least one wake-up windowed audio frame; and extract the mel-frequency cepstral coefficients of the at least one wake-up windowed audio frame as the wake-up voiceprint feature.
In an embodiment, the data acquisition module 1504 is further configured to determine the wake-up voiceprint identifier of the wake-up voiceprint feature, and to acquire the speech recognition feedback data for the dialogue audio according to the wake-up voiceprint identifier and the dialogue voiceprint feature in the dialogue audio received after the terminal is woken up.
In an embodiment, the data acquisition module 1504 is further configured to determine the wake-up voiceprint identifier among at least one pre-stored registered voiceprint identifier according to the wake-up voiceprint feature, each registered voiceprint identifier having a corresponding registered voiceprint feature; the wake-up voiceprint identifier is the registered voiceprint identifier whose registered voiceprint feature matches the wake-up voiceprint feature.
In an embodiment, the data acquisition module 1504 is further configured to determine at least one pre-stored registered voiceprint identifier, each having a registered voiceprint feature; calculate the feature similarity between each registered voiceprint feature and the wake-up voiceprint feature; and determine, as the wake-up voiceprint identifier, the registered voiceprint identifier of the registered voiceprint feature whose feature similarity reaches the preset similarity threshold and is the largest.
In an embodiment, the data acquisition module 1504 is further configured to train a voiceprint classifier using the wake-up voiceprint feature, and to acquire the optimal structure data of the trained voiceprint classifier as the wake-up voiceprint identifier.
In an embodiment, the data acquisition module 1504 is further configured to match the wake-up voiceprint identifier against the dialogue voiceprint feature in the dialogue audio received after the terminal is woken up, through the voiceprint classifier previously trained with the wake-up voiceprint feature, and, if the wake-up voiceprint identifier matches the dialogue voiceprint feature, to acquire text query feedback data for the dialogue text in the dialogue audio as the speech recognition feedback data.
In this embodiment, audio from users other than the one who woke the terminal is not misrecognized, so the misrecognition rate for abnormal speech, such as speech from non-wake-up users and noise, is effectively reduced, and the accuracy of speech recognition is effectively improved.
In one embodiment, as shown in fig. 16, another speech recognition apparatus is provided, which may likewise be implemented in a computer device as a software module, a hardware module, or a combination of the two, and which specifically includes: a recognition request receiving module 1602, a recognition request response module 1604, a wake-up audio sending module 1606, a dialogue audio sending module 1608 and a feedback data receiving module 1610, wherein:
the recognition request receiving module 1602 is configured to receive a speech recognition request initiated by a user through wake-up audio;
the recognition request response module 1604 is configured to determine the wake-up word in the wake-up audio in response to the speech recognition request;
the wake-up audio sending module 1606 is configured to send the wake-up audio to a server when the wake-up word in the wake-up audio matches a preset wake-up word;
the dialogue audio sending module 1608 is configured to receive dialogue audio and send the dialogue audio to the server, so that the server obtains speech recognition feedback data according to the dialogue voiceprint feature in the dialogue audio and the wake-up voiceprint feature in the wake-up audio;
and the feedback data receiving module 1610 is configured to receive the speech recognition feedback data from the server.
In this embodiment, audio from users other than the one who woke the terminal is not misrecognized, so the misrecognition rate for abnormal speech, such as speech from non-wake-up users and noise, is effectively reduced, and the accuracy of speech recognition is effectively improved.
For the specific limitations of the speech recognition apparatus, reference may be made to the limitations of the speech recognition method above, which are not repeated here. The respective modules in the above speech recognition apparatus may be implemented in whole or in part by software, hardware, or a combination of the two. The modules can be embedded in, or independent of, a processor of the computer device in hardware form, or stored in the memory of the computer device in software form, so that the processor can invoke and execute the operations corresponding to each module.
In one embodiment, as shown in fig. 17, a speech recognition system is provided, which specifically includes: a terminal 1702 and a server 1704;
the terminal 1702 is configured to determine the wake-up word in wake-up audio in response to a speech recognition request initiated by a user through the wake-up audio, send the wake-up audio to the server when the wake-up word matches a preset wake-up word, and meanwhile receive dialogue audio and send it to the server, so as to receive the speech recognition feedback data from the server;
the server 1704 is configured to acquire the wake-up voiceprint feature from the wake-up audio when the terminal is woken up, acquire speech recognition feedback data for the dialogue audio according to the wake-up voiceprint feature and the dialogue voiceprint feature in the dialogue audio received after the terminal is woken up, and finally send the speech recognition feedback data to the terminal for the terminal to present.
In this embodiment, audio from users other than the one who woke the terminal is not misrecognized, so the misrecognition rate for abnormal speech, such as speech from non-wake-up users and noise, is effectively reduced, and the accuracy of speech recognition is effectively improved.
For the specific definition of the speech recognition system, reference may be made to the definition of the speech recognition method above, which is not repeated here. The various components of the speech recognition system described above may be implemented in whole or in part by software, hardware, or a combination of the two.
In one embodiment, a computer device is provided, which may be a server; its internal structure diagram may be as shown in fig. 18. The computer device includes a processor, a memory, and a network interface connected by a system bus. The processor provides computing and control capabilities. The memory comprises a non-volatile storage medium and an internal memory; the non-volatile storage medium stores an operating system, a computer program, and a database, and the internal memory provides an environment in which the operating system and the computer program run. The database stores audio data such as voiceprint information. The network interface communicates with external terminals over a network connection. The computer program, when executed by the processor, implements a speech recognition method.
Those skilled in the art will appreciate that the architecture shown in fig. 18 is merely a block diagram of part of the structure related to the present solution and does not limit the computer devices to which the solution applies; a particular computer device may include more or fewer components than shown, combine certain components, or arrange the components differently.
In one embodiment, a computer device is further provided, including a memory and a processor, the memory storing a computer program; when the processor executes the computer program, the steps of the above method embodiments are implemented.
In one embodiment, a computer-readable storage medium is provided, storing a computer program which, when executed by a processor, implements the steps of the above method embodiments.
Those skilled in the art will understand that all or part of the processes in the methods of the above embodiments can be implemented by a computer program instructing the relevant hardware; the program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the above method embodiments. Any reference to memory, storage, a database, or another medium used in the embodiments provided herein can include at least one of non-volatile and volatile memory. Non-volatile memory may include read-only memory (ROM), magnetic tape, floppy disk, flash memory, or optical storage. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM can take many forms, such as static random access memory (SRAM) or dynamic random access memory (DRAM).
The technical features of the above embodiments can be combined arbitrarily. For brevity, not all possible combinations are described, but any combination of these technical features should be considered within the scope of this specification as long as it contains no contradiction.
The above embodiments express only several implementations of the present application; their description is specific and detailed but should not be construed as limiting the scope of the invention. A person skilled in the art can make several variations and modifications without departing from the concept of the present application, all of which fall within its scope of protection. The protection scope of this patent is therefore subject to the appended claims.
Claims (15)
1. A method of speech recognition, the method comprising:
acquiring a wake-up voiceprint feature in wake-up audio when a terminal is woken up;
acquiring voice recognition feedback data for dialogue audio according to the wake-up voiceprint feature and a dialogue voiceprint feature in the dialogue audio after the terminal is woken up; and
sending the voice recognition feedback data to the terminal for the terminal to present the voice recognition feedback data.
2. The method of claim 1, wherein acquiring the wake-up voiceprint feature in the wake-up audio when the terminal is woken up comprises:
acquiring the wake-up audio received when the terminal is woken up by a preset wake-up word; and
extracting a user voiceprint feature from the wake-up audio as the wake-up voiceprint feature.
3. The method of claim 2, wherein extracting the user voiceprint feature from the wake-up audio as the wake-up voiceprint feature comprises:
framing the wake-up audio to obtain at least one wake-up audio frame;
windowing the at least one wake-up audio frame to obtain at least one windowed wake-up audio frame; and
extracting Mel-frequency cepstral coefficients of the at least one windowed wake-up audio frame as the wake-up voiceprint feature.
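As an editorial illustration (not part of the claims), the framing, windowing, and coefficient extraction of claim 3 could be realized with the librosa library (an assumed dependency, not named by the patent); the frame lengths, window type, and frame-averaging pooling are illustrative choices only.

```python
# Illustrative sketch of claim 3, assuming the librosa library; frame and
# window parameters and the frame-averaging pooling are editorial choices.
import librosa
import numpy as np


def wake_voiceprint(path: str, n_mfcc: int = 13) -> np.ndarray:
    # Load the wake-up audio as a mono signal at 16 kHz.
    signal, sr = librosa.load(path, sr=16000)
    # librosa frames the signal into 25 ms frames with a 10 ms hop and
    # applies a Hann window before computing the Mel-frequency cepstral
    # coefficients, mirroring the framing and windowing steps of claim 3.
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc,
                                n_fft=400, hop_length=160, window="hann")
    # Pool over frames into one fixed-length wake-up voiceprint feature.
    return mfcc.mean(axis=1)
```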
4. The method of claim 1, wherein acquiring the voice recognition feedback data for the dialogue audio according to the wake-up voiceprint feature and the dialogue voiceprint feature in the dialogue audio after the terminal is woken up comprises:
determining a wake-up voiceprint identifier of the wake-up voiceprint feature; and
acquiring the voice recognition feedback data for the dialogue audio according to the wake-up voiceprint identifier and the dialogue voiceprint feature in the dialogue audio after the terminal is woken up.
5. The method of claim 4, wherein determining the wake-up voiceprint identifier of the wake-up voiceprint feature comprises:
determining the wake-up voiceprint identifier among at least one pre-stored registered voiceprint identifier according to the wake-up voiceprint feature; wherein each of the at least one registered voiceprint identifier has a corresponding registered voiceprint feature, and the wake-up voiceprint identifier is the registered voiceprint identifier of the registered voiceprint feature that matches the wake-up voiceprint feature.
6. The method of claim 5, wherein determining the wake-up voiceprint identifier among the at least one pre-stored registered voiceprint identifier according to the wake-up voiceprint feature comprises:
determining at least one pre-stored registered voiceprint identifier, each having a corresponding registered voiceprint feature;
calculating the feature similarity between each registered voiceprint feature and the wake-up voiceprint feature; and
determining, as the wake-up voiceprint identifier, the registered voiceprint identifier of the registered voiceprint feature whose feature similarity reaches a preset similarity threshold and is the maximum.
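As an editorial illustration (not part of the claims), a worked sketch of the selection rule in claim 6, under the assumption that feature similarity is cosine similarity and the preset threshold is 0.7 (neither is fixed by the claim):

```python
# Worked sketch of claim 6, assuming cosine similarity as the feature
# similarity and 0.7 as the preset similarity threshold.
import numpy as np


def find_wake_voiceprint_id(wake_vp: np.ndarray,
                            registered: dict[str, np.ndarray],
                            threshold: float = 0.7):
    """Return the registered voiceprint identifier whose feature similarity
    to the wake-up voiceprint reaches the threshold and is the maximum,
    or None when no registered voiceprint qualifies."""
    best_id, best_sim = None, threshold
    for vp_id, reg_vp in registered.items():
        sim = float(np.dot(wake_vp, reg_vp) /
                    (np.linalg.norm(wake_vp) * np.linalg.norm(reg_vp)))
        if sim >= best_sim:
            best_id, best_sim = vp_id, sim
    return best_id
```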
7. The method of claim 4, wherein determining the wake-up voiceprint identifier of the wake-up voiceprint feature comprises:
training a voiceprint classifier with the wake-up voiceprint feature; and
acquiring optimal structure data of the trained voiceprint classifier as the wake-up voiceprint identifier.
8. The method of claim 4, wherein acquiring the voice recognition feedback data for the dialogue audio according to the wake-up voiceprint identifier and the dialogue voiceprint feature in the dialogue audio after the terminal is woken up comprises:
matching the wake-up voiceprint identifier against the dialogue voiceprint feature in the dialogue audio after the terminal is woken up, through a voiceprint classifier trained in advance with the wake-up voiceprint feature; and
if the wake-up voiceprint identifier matches the dialogue voiceprint feature, acquiring text query feedback data for the dialogue text in the dialogue audio as the voice recognition feedback data.
9. The method of claim 7 or 8, wherein the voiceprint classifier comprises at least one of a Gaussian mixture model (GMM) classifier, a convolutional neural network (CNN) classifier, a recurrent neural network (RNN) classifier, a deep neural network (DNN) classifier, and a support vector machine (SVM).
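As an editorial illustration (not part of the claims), one hedged sketch of claims 7 to 9 uses a Gaussian mixture classifier from scikit-learn (an assumed dependency); the component count and the acceptance threshold on the average log-likelihood are illustrative assumptions.

```python
# Hedged sketch of claims 7 to 9 using a Gaussian mixture classifier from
# scikit-learn (an assumed dependency); component count and log-likelihood
# threshold are illustrative assumptions, not values from the patent.
import numpy as np
from sklearn.mixture import GaussianMixture


def train_voiceprint_classifier(wake_mfcc_frames: np.ndarray) -> GaussianMixture:
    """wake_mfcc_frames: (n_frames, n_mfcc) MFCC frames of the wake-up audio."""
    gmm = GaussianMixture(n_components=8, covariance_type="diag",
                          random_state=0)
    gmm.fit(wake_mfcc_frames)
    # The fitted parameters play the role of the classifier's "optimal
    # structure data", i.e. the wake-up voiceprint identifier of claim 7.
    return gmm


def dialogue_matches_wake_user(gmm: GaussianMixture,
                               dialogue_mfcc_frames: np.ndarray,
                               threshold: float = -50.0) -> bool:
    # gmm.score returns the average per-frame log-likelihood of the dialogue
    # voiceprint under the classifier trained on the wake-up voiceprint;
    # accept the dialogue audio when it reaches the assumed threshold.
    return gmm.score(dialogue_mfcc_frames) >= threshold
```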
10. A method of speech recognition, the method comprising:
receiving a voice recognition request initiated by a user through wake-up audio;
determining a wake-up word in the wake-up audio in response to the voice recognition request;
sending the wake-up audio to a server when the wake-up word in the wake-up audio matches a preset wake-up word;
receiving dialogue audio and sending the dialogue audio to the server, so that the server obtains voice recognition feedback data according to a dialogue voiceprint feature in the dialogue audio and a wake-up voiceprint feature in the wake-up audio; and
receiving the voice recognition feedback data from the server.
11. A speech recognition apparatus, characterized in that the apparatus comprises:
a feature acquisition module, configured to acquire a wake-up voiceprint feature in wake-up audio when a terminal is woken up;
a data acquisition module, configured to acquire voice recognition feedback data for dialogue audio according to the wake-up voiceprint feature and a dialogue voiceprint feature in the dialogue audio after the terminal is woken up; and
a data sending module, configured to send the voice recognition feedback data to the terminal for the terminal to present the voice recognition feedback data.
12. A speech recognition apparatus, characterized in that the apparatus comprises:
a recognition request receiving module, configured to receive a voice recognition request initiated by a user through wake-up audio;
a recognition request response module, configured to determine a wake-up word in the wake-up audio in response to the voice recognition request;
a wake-up audio sending module, configured to send the wake-up audio to a server when the wake-up word in the wake-up audio matches a preset wake-up word;
a dialogue audio sending module, configured to receive dialogue audio and send the dialogue audio to the server, so that the server obtains voice recognition feedback data according to a dialogue voiceprint feature in the dialogue audio and a wake-up voiceprint feature in the wake-up audio; and
a feedback data receiving module, configured to receive the voice recognition feedback data from the server.
13. A speech recognition system, the system comprising:
a server and a terminal;
the terminal is configured to determine a wake-up word in wake-up audio in response to a voice recognition request initiated by a user through the wake-up audio, send the wake-up audio to the server when the wake-up word in the wake-up audio matches a preset wake-up word, receive dialogue audio and send the dialogue audio to the server, and receive voice recognition feedback data from the server;
the server is configured to acquire a wake-up voiceprint feature in the wake-up audio when the terminal is woken up, obtain the voice recognition feedback data for the dialogue audio according to the wake-up voiceprint feature and a dialogue voiceprint feature in the dialogue audio after the terminal is woken up, and send the voice recognition feedback data to the terminal for the terminal to present the voice recognition feedback data.
14. A speech recognition device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any one of claims 1 to 10 when executing the computer program.
15. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 10.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010102418.9A CN111210829B (en) | 2020-02-19 | 2020-02-19 | Speech recognition method, apparatus, system, device and computer readable storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111210829A true CN111210829A (en) | 2020-05-29 |
CN111210829B CN111210829B (en) | 2024-07-30 |
Family
ID=70789999
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010102418.9A Active CN111210829B (en) | 2020-02-19 | 2020-02-19 | Speech recognition method, apparatus, system, device and computer readable storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111210829B (en) |
2020-02-19: CN application CN202010102418.9A filed, granted as CN111210829B (status: Active)
Patent Citations (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104143326A (en) * | 2013-12-03 | 2014-11-12 | 腾讯科技(深圳)有限公司 | Voice command recognition method and device |
CN105656887A (en) * | 2015-12-30 | 2016-06-08 | 百度在线网络技术(北京)有限公司 | Artificial intelligence-based voiceprint authentication method and device |
CN106847292A (en) * | 2017-02-16 | 2017-06-13 | 平安科技(深圳)有限公司 | Method for recognizing sound-groove and device |
WO2018149077A1 (en) * | 2017-02-16 | 2018-08-23 | 平安科技(深圳)有限公司 | Voiceprint recognition method, device, storage medium, and background server |
CN107147618A (en) * | 2017-04-10 | 2017-09-08 | 北京猎户星空科技有限公司 | A kind of user registering method, device and electronic equipment |
WO2019000832A1 (en) * | 2017-06-30 | 2019-01-03 | 百度在线网络技术(北京)有限公司 | Method and apparatus for voiceprint creation and registration |
CN107623614A (en) * | 2017-09-19 | 2018-01-23 | 百度在线网络技术(北京)有限公司 | Method and apparatus for pushed information |
CN108510992A (en) * | 2018-03-22 | 2018-09-07 | 北京云知声信息技术有限公司 | The method of voice wake-up device |
CN108766446A (en) * | 2018-04-18 | 2018-11-06 | 上海问之信息科技有限公司 | Method for recognizing sound-groove, device, storage medium and speaker |
CN109215646A (en) * | 2018-08-15 | 2019-01-15 | 北京百度网讯科技有限公司 | Voice interaction processing method, device, computer equipment and storage medium |
CN109448725A (en) * | 2019-01-11 | 2019-03-08 | 百度在线网络技术(北京)有限公司 | A kind of interactive voice equipment awakening method, device, equipment and storage medium |
CN110556126A (en) * | 2019-09-16 | 2019-12-10 | 平安科技(深圳)有限公司 | Voice recognition method and device and computer equipment |
Cited By (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113823288A (en) * | 2020-06-16 | 2021-12-21 | 华为技术有限公司 | Voice wake-up method, electronic equipment, wearable equipment and system |
CN112133453A (en) * | 2020-09-16 | 2020-12-25 | 成都美透科技有限公司 | User consultation auxiliary analysis system based on medical and American data and medical data |
CN112133453B (en) * | 2020-09-16 | 2022-08-26 | 成都美透科技有限公司 | User consultation auxiliary analysis system based on medical and American data and medical data |
CN112185344A (en) * | 2020-09-27 | 2021-01-05 | 北京捷通华声科技股份有限公司 | Voice interaction method and device, computer readable storage medium and processor |
CN112463939A (en) * | 2020-11-12 | 2021-03-09 | 深圳市欢太科技有限公司 | Man-machine conversation method, system, service device and computer storage medium |
CN112463939B (en) * | 2020-11-12 | 2024-05-24 | 深圳市欢太科技有限公司 | Man-machine conversation method, system, service equipment and computer storage medium |
CN112489650A (en) * | 2020-11-26 | 2021-03-12 | 北京小米松果电子有限公司 | Wake-up control method and device, storage medium and terminal |
CN113782034A (en) * | 2021-09-27 | 2021-12-10 | 镁佳(北京)科技有限公司 | Audio identification method and device and electronic equipment |
CN113921016A (en) * | 2021-10-15 | 2022-01-11 | 阿波罗智联(北京)科技有限公司 | Voice processing method, device, electronic equipment and storage medium |
CN113948091A (en) * | 2021-12-20 | 2022-01-18 | 山东贝宁电子科技开发有限公司 | Air-ground communication voice recognition engine for civil aviation passenger plane and application method thereof |
CN115312040A (en) * | 2022-08-08 | 2022-11-08 | 中国电信股份有限公司 | Voice wake-up method and device, electronic equipment and computer readable storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN111210829B (en) | 2024-07-30 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111210829B (en) | Speech recognition method, apparatus, system, device and computer readable storage medium | |
CN110136727B (en) | Speaker identification method, device and storage medium based on speaking content | |
US11875820B1 (en) | Context driven device arbitration | |
US11475881B2 (en) | Deep multi-channel acoustic modeling | |
US11361763B1 (en) | Detecting system-directed speech | |
CN109074806B (en) | Controlling distributed audio output to enable speech output | |
CN110364143B (en) | Voice awakening method and device and intelligent electronic equipment | |
WO2021093449A1 (en) | Wakeup word detection method and apparatus employing artificial intelligence, device, and medium | |
US11308957B2 (en) | Account association with device | |
CN108962255B (en) | Emotion recognition method, emotion recognition device, server and storage medium for voice conversation | |
US10027662B1 (en) | Dynamic user authentication | |
CN110310623B (en) | Sample generation method, model training method, device, medium, and electronic apparatus | |
WO2021159688A1 (en) | Voiceprint recognition method and apparatus, and storage medium and electronic apparatus | |
CN108711429B (en) | Electronic device and device control method | |
CN113168832A (en) | Alternating response generation | |
CN109272991B (en) | Voice interaction method, device, equipment and computer-readable storage medium | |
US20200152195A1 (en) | Natural language speech processing application selection | |
US11862170B2 (en) | Sensitive data control | |
CN111145763A (en) | GRU-based voice recognition method and system in audio | |
JP6915637B2 (en) | Information processing equipment, information processing methods, and programs | |
CN112669822B (en) | Audio processing method and device, electronic equipment and storage medium | |
CN113851136A (en) | Clustering-based speaker recognition method, device, equipment and storage medium | |
CN110808050A (en) | Voice recognition method and intelligent equipment | |
CN109065026B (en) | Recording control method and device | |
US11430435B1 (en) | Prompts for user feedback |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||