CN117456999A - Audio identification method, audio identification device, vehicle, computer device, and medium - Google Patents

Audio identification method, audio identification device, vehicle, computer device, and medium

Info

Publication number
CN117456999A
CN117456999A (application number CN202311800969.2A)
Authority
CN
China
Prior art keywords
probability matrix
pronunciation
audio
mapping
peak
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202311800969.2A
Other languages
Chinese (zh)
Other versions
CN117456999B (en)
Inventor
张辽
余骁捷
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Xiaopeng Motors Technology Co Ltd
Original Assignee
Guangzhou Xiaopeng Motors Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Xiaopeng Motors Technology Co Ltd filed Critical Guangzhou Xiaopeng Motors Technology Co Ltd
Priority to CN202311800969.2A priority Critical patent/CN117456999B/en
Priority claimed from CN202311800969.2A external-priority patent/CN117456999B/en
Publication of CN117456999A publication Critical patent/CN117456999A/en
Application granted granted Critical
Publication of CN117456999B publication Critical patent/CN117456999B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • B PERFORMING OPERATIONS; TRANSPORTING
    • B60 VEHICLES IN GENERAL
    • B60W CONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
    • B60W 40/00 Estimation or calculation of non-directly measurable driving parameters for road vehicle drive control systems not related to the control of a particular sub unit, e.g. by using mathematical models
    • B60W 40/08 Estimation or calculation of non-directly measurable driving parameters for road vehicle drive control systems not related to the control of a particular sub unit, e.g. by using mathematical models related to drivers or passengers
    • B PERFORMING OPERATIONS; TRANSPORTING
    • B60 VEHICLES IN GENERAL
    • B60W CONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
    • B60W 50/00 Details of control systems for road vehicle drive control not related to the control of a particular sub-unit, e.g. process diagnostic or vehicle driver interfaces
    • B60W 50/08 Interaction between the driver and the control system
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 Training
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/28 Constructional details of speech recognition systems
    • G10L 15/30 Distributed recognition, e.g. in client-server systems, for mobile phones or network applications
    • B PERFORMING OPERATIONS; TRANSPORTING
    • B60 VEHICLES IN GENERAL
    • B60W CONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
    • B60W 40/00 Estimation or calculation of non-directly measurable driving parameters for road vehicle drive control systems not related to the control of a particular sub unit, e.g. by using mathematical models
    • B60W 40/08 Estimation or calculation of non-directly measurable driving parameters for road vehicle drive control systems not related to the control of a particular sub unit, e.g. by using mathematical models related to drivers or passengers
    • B60W 2040/089 Driver voice
    • B PERFORMING OPERATIONS; TRANSPORTING
    • B60 VEHICLES IN GENERAL
    • B60W CONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
    • B60W 2540/00 Input parameters relating to occupants
    • B60W 2540/21 Voice
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 2015/223 Execution procedure of a spoken command

Landscapes

  • Engineering & Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Automation & Control Theory (AREA)
  • Transportation (AREA)
  • Mechanical Engineering (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Machine Translation (AREA)

Abstract

The present application discloses an audio identification method, an audio identification device, a vehicle, a computer device, and a medium. The method comprises the following steps: encoding audio to be identified to generate a pronunciation probability matrix; correcting the pronunciation probability matrix through a preset delay error correction model to obtain a corresponding word result and an output probability matrix; performing pronunciation mapping on the word result and the output probability matrix to obtain a mapping probability matrix, and generating a target probability matrix according to the mapping probability matrix and the pronunciation probability matrix; and inputting the target probability matrix into a speech decoding graph for decoding to obtain a recognition result. The pronunciation probability matrix is corrected using the error correction capability of the delay error correction model, so that an accurate word result is obtained and the accuracy of audio recognition is improved; moreover, the delay error correction model effectively saves computing power, storage space, and the like.

Description

Audio identification method, audio identification device, vehicle, computer device, and medium
Technical Field
The present application relates to the field of speech recognition technology, and more particularly, to an audio recognition method, an audio recognition apparatus, a vehicle, a computer device, and a non-volatile computer-readable storage medium.
Background
Currently, in-vehicle speech recognition generally employs an attention-based end-to-end acoustic model scheme. Such a scheme can map input speech signals to output text labels relatively accurately.
However, when the vehicle is in an offline state, the limited computing power of the chip means that the attention-based end-to-end acoustic model cannot obtain the computing power and storage space it requires during operation. As a result, the processing speed and accuracy of speech recognition interaction are low when the vehicle is offline, which affects the user experience.
Disclosure of Invention
Embodiments of the present application provide an audio recognition method, an audio recognition apparatus, a vehicle, a computer device, and a non-volatile computer-readable storage medium. Delay error correction is performed on the audio to be identified through the delay error correction model, and a highly accurate recognition result is decoded based on the error correction result, so that computing power and storage space are effectively saved while the accuracy of audio recognition is ensured.
The audio recognition method comprises the steps of encoding audio to be recognized to generate a pronunciation probability matrix; correcting the pronunciation probability matrix through a preset delay error correction model to obtain a corresponding word result and an output probability matrix; performing pronunciation mapping on the word result and the output probability matrix to obtain a mapping probability matrix, and generating a target probability matrix according to the mapping probability matrix and the pronunciation probability matrix; and inputting the target probability matrix into a voice decoding diagram for decoding so as to obtain a recognition result.
In some embodiments, the encoding the audio to be identified to generate the pronunciation probability matrix includes: encoding the audio to be identified to output a first probability matrix of a corresponding frame through a first output layer of a pre-trained acoustic model and to output a second probability matrix of the corresponding frame through a second output layer of the acoustic model; the first output layer and the second output layer are mutually independent, the loss function of the first output layer is a CTC loss function, and the loss function of the second output layer is a CE loss function; and splicing the first probability matrix and the second probability matrix corresponding to the frame number position according to the first peak path of the first probability matrix and the second peak path of the second probability matrix to generate the pronunciation probability matrix.
In some embodiments, the stitching the first probability matrix and the second probability matrix corresponding to the frame number position according to the first peak path of the first probability matrix and the second peak path of the second probability matrix to generate the pronunciation probability matrix includes: aligning the first peak path with the second peak path, and determining peaks with the same label as a splicing starting point; determining a first peak for splicing in a first peak path and a second peak for splicing in a preset frame number in a second peak path according to the splicing starting point; and splicing the first probability matrix of the frame number position corresponding to the first peak with the second probability matrix of the frame number position corresponding to the second peak to obtain the pronunciation probability matrix.
In some embodiments, the delay error correction model is independent of a preset acoustic model and is obtained by training on preset unsupervised Zhuyin (phonetic annotation) text data.
In some embodiments, the performing pronunciation mapping on the word result and the output probability matrix to obtain a mapping probability matrix, and generating a target probability matrix according to the mapping probability matrix and the pronunciation probability matrix includes: adding probabilities of words corresponding to the same tone as probabilities of the corresponding tone to obtain the mapping probability matrix; and accumulating the probability matrixes corresponding to the same sound in the mapping probability matrix and the pronunciation probability matrix, and then taking an average value to generate the target probability matrix.
In certain embodiments, further comprising: receiving a voice request sent by a user in a vehicle to generate the audio to be identified; or receiving a voice request received by a terminal associated with the vehicle to generate the audio to be identified.
The audio identification device comprises an encoding module, an error correction module, a mapping module and a decoding module, wherein the encoding module is used for encoding the audio to be identified so as to generate a pronunciation probability matrix; the error correction module is used for correcting the pronunciation probability matrix through a preset delay error correction model so as to obtain a corresponding word result and an output probability matrix; the mapping module is used for carrying out pronunciation mapping on the word result and the output probability matrix to obtain a mapping probability matrix, and generating a target probability matrix according to the mapping probability matrix and the pronunciation probability matrix; the decoding module is used for inputting the target probability matrix into the voice decoding graph to decode so as to obtain a recognition result.
The vehicle of the embodiment of the application comprises a processor and a memory; and a computer program stored in the memory and executed by the processor, the computer program including instructions for performing the audio recognition method of any of the above embodiments.
The computer device of the embodiment of the application comprises a processor and a memory; and a computer program stored in the memory and executed by the processor, the computer program including instructions for performing the audio recognition method of any of the above embodiments.
The non-transitory computer readable storage medium of the present embodiment includes a computer program, which when executed by a processor, causes the processor to execute the audio recognition method of any of the above embodiments.
According to the audio identification method, the audio identification device, the vehicle, the computer device, and the non-volatile computer-readable storage medium of the embodiments of the present application, the acquired audio to be identified is encoded to generate a pronunciation probability matrix, and the pronunciation probability matrix is corrected through a preset delay error correction model to obtain a corresponding word result and an output probability matrix. Because the pronunciation probability matrix is corrected using the error correction capability of the delay error correction model, an accurate word result is obtained and the accuracy of audio identification is ensured; moreover, the delay error correction model no longer needs to rely on the existing attention module, so computing power and storage space can be effectively saved.
Then, pronunciation mapping is performed on the word result and the output probability matrix to obtain a mapping probability matrix, and a target probability matrix is generated according to the mapping probability matrix and the pronunciation probability matrix. Since the word result and the output probability matrix have already been error-corrected, the accuracy of the recognition result is improved, and generating the target probability matrix from the mapping probability matrix and the pronunciation probability matrix yields a more accurate probability distribution over the pronunciations. Finally, the target probability matrix is input into a speech decoding graph for decoding, thereby improving the accuracy of the recognition result.
Additional aspects and advantages of embodiments of the application will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of embodiments of the application.
Drawings
The foregoing and/or additional aspects and advantages of the present application will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings, in which:
fig. 1 is a schematic view of an application scenario of an audio recognition method according to some embodiments of the present application;
FIG. 2 is a flow chart of an audio recognition method of some embodiments of the present application;
FIG. 3 is a flow chart of an audio recognition method of some embodiments of the present application;
FIG. 4 is a flow chart of an audio recognition method of some embodiments of the present application;
FIG. 5 is a flow chart of an audio recognition method of some embodiments of the present application;
FIG. 6 is a flow chart of an audio recognition method of some embodiments of the present application;
FIG. 7 is a flow chart of an audio recognition method of some embodiments of the present application;
FIG. 8 is a schematic illustration of a scenario of an audio recognition method of certain embodiments of the present application;
FIG. 9 is a flow chart of an audio recognition method of some embodiments of the present application;
FIG. 10 is a schematic illustration of a scenario of an audio recognition method of certain embodiments of the present application;
FIG. 11 is a block diagram of an audio recognition device according to some embodiments of the present application;
FIG. 12 is a schematic structural diagram of a computer device according to some embodiments of the present application;
fig. 13 is a schematic diagram of a connection state of a non-volatile computer readable storage medium and a processor according to some embodiments of the present application.
Detailed Description
Embodiments of the present application are described in detail below, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar elements or elements having like or similar functions throughout. The embodiments described below by referring to the drawings are exemplary only for explaining the embodiments of the present application and are not to be construed as limiting the embodiments of the present application.
To facilitate an understanding of the present application, the following description of terms appearing in the present application will be provided:
1. CTC loss function: Connectionist Temporal Classification (CTC) is a loss function for sequence learning, commonly used in speech recognition and optical character recognition tasks, where it can be used to train acoustic models. A CTC-based acoustic model is an end-to-end model and typically contains an output layer that maps the input feature sequence to the label sequence. For each time step, the output layer generates a probability distribution representing the label likelihoods for that time step. These probability distributions are used to calculate the CTC loss, which guides the model in learning the correct mapping. In the embodiments of the present application, the CTC loss function is used to output the first probability matrix of the corresponding frame.
2. CE loss function: the cross entropy loss function (CrossEntropyLoss, CE) is a commonly used class loss function, commonly used to measure the difference between model predictions and real labels, i.e. the distance between the probability distribution of the model predictions and the probability distribution of the real labels.
Vehicle speech recognition technology means automatically recognizing and understanding the state, operation, or environment of the vehicle by analyzing sound signals inside or around the vehicle; it can be used in fields such as vehicle safety, driver assistance systems, and in-vehicle entertainment.
Currently, vehicle speech recognition is generally based on deep learning, so as to recognize and understand the speech acquired by the vehicle. Because of limitations such as computing power and latency, it is difficult to deploy a pure neural network solution directly on a vehicle; therefore, a solution in which an acoustic model and a decoding graph work cooperatively is generally adopted, among which the end-to-end acoustic scheme based on the attention mechanism (attention) is the most widely used.
Vehicles often face a network-free environment when in use. In an attention-based end-to-end acoustic scheme, when a vehicle recognizes sound in an offline state, the audio is generally acquired first, the audio signal is output as a pronunciation sequence through the attention-based end-to-end acoustic model, the pronunciation sequence is mapped to a word sequence through a dictionary (l.fst), and finally an optimal word sequence result is output through a language model, so that the vehicle performs subsequent response work and the like according to the recognized speech result. The Encoder of the attention-based end-to-end acoustic model, based on the CTC loss function, processes the acquired audio to extract a feature representation, generate a vector sequence, and capture the important information in the audio. The Decoder, based on the CE loss function, receives the corresponding decoding information to generate the pronunciation sequence; the attention mechanism takes the current position of the Decoder and the output of the Encoder, outputs to the Decoder the decoding information it requires at the current moment, and calculates attention weights at each decoding step to guide the acoustic model to attend to different parts of the input, so that the model can better utilize the context information of the input sequence and improve decoding accuracy.
In the attention over the Encoder, the determination needs to be made according to the spike path of the CTC loss function. The spike path usually skips the blank marks in the input: it first acquires two non-blank vector sequences (spikes) to the right, and then acquires four vector sequences to the left. This can improve recognition accuracy, but it causes delay in speech recognition.
It can be appreciated that the attention mechanism needs to look at a long range of acoustic features, resulting in a huge amount of computation, while when the vehicle is offline, the acoustic model cannot obtain significant computing power and memory space due to chip limitations. In addition, under the attention mechanism over the Encoder, the spike paths of the two CTC loss functions need to be acquired rightward first, so even if computation delay is ignored, delay still exists; the interaction speed is therefore slow, which affects the user experience.
In order to solve the above technical problems, an embodiment of the present application provides an audio recognition method.
An application scenario of the technical solution of the present application is described first, as shown in fig. 1, and the audio recognition method provided in the present application may be applied to the application scenario shown in fig. 1. The audio recognition method is applied to an audio recognition system 1000, which includes a vehicle 100.
The vehicle 100 is any vehicle 100 that can perform audio recognition, such as an automobile, truck, etc.
The vehicle 100 includes a vehicle body 50, a processor 30, and a memory 40, the processor 30 being disposed inside the vehicle body 50.
In one embodiment, the vehicle 100 further includes a memory 40, the memory 40 may be used to store a decoding map of audio, a preset delay error correction matrix, and the like.
In one embodiment, the vehicle 100 further includes a microphone 20, the microphone 20 being disposed inside the vehicle body 50, the microphone 20 being used to collect audio information inside or around the vehicle, a user may issue control instructions to the vehicle via the microphone 20, etc.
In one embodiment, the audio recognition system 1000 of the vehicle further includes a server 200, and the server 200 and the vehicle 100 communicate over a network. The delay error correction model of the vehicle 100 may be deployed locally on the vehicle 100 and/or on the server 200. For example, the delay error correction model of the vehicle 100 is deployed both locally on the vehicle 100 and on the server 200 and performs delay error correction processing on the audio, or the delay error correction model of the vehicle 100 is deployed locally on the vehicle 100 and performs delay error correction processing on the audio in an offline state.
In one embodiment, the server 200 may be a stand-alone physical server 200, or may be a server 200 cluster or a distributed system formed by a plurality of physical servers 200, or may be a cloud server 200 that provides cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDNs, and basic cloud computing services such as big data and artificial intelligence platforms. The embodiments of the present application are not limited in this regard.
In one embodiment, the vehicle 100 may include a display screen (not shown) that may present the recognized audio and the word sequences generated after processing, so that a user can view them, correct the recognition results, and the like. The display screen can also be an interactive touch display screen, and a user can issue control commands by touching the display screen, such as modifying the recognition result, confirming the recognition result, and the like.
In one embodiment, the audio recognition system 1000 further includes a terminal 300, the terminal 300 including a display 301. The terminal 300 can communicate with the vehicle 100 to transmit a voice request or the like received through the terminal 300 to the vehicle 100, and the display 301 of the terminal 300 can also display an audio recognition result or the like.
In one embodiment, the terminal 300 may be, but is not limited to, various personal computers, notebook computers, smart phones, tablet computers, portable wearable devices, and the like.
In one embodiment, the vehicle 100, the server 200, and the terminal 300 all communicate via a network, for example, any two of the vehicle 100, the server 200, and the terminal 300 may communicate via wireless means (such as wireless local area network (wifi) communication, bluetooth communication, infrared communication, etc.).
It is to be understood that the communication among the vehicle 100, the server 200, and the terminal 300 is not limited to the above-described communication method, and is not limited thereto.
In the case of Wi-Fi communication, the vehicle 100 and the terminal 300 each communicate with the server 200, and the communication between the vehicle 100 and the terminal 300 (or the server 200) is then realized through the server 200; when communication is performed by Bluetooth or infrared, the vehicle 100 and the terminal 300 are each provided with a corresponding communication module to communicate with each other directly.
In one embodiment, the audio recognition method may be implemented, in an offline state, by the vehicle 100 alone or by the vehicle 100 and the terminal 300; it may also be implemented, in an online state, by the vehicle 100 alone, by the vehicle 100 and the terminal 300, by the server 200 and the terminal 300, or by the vehicle 100, the server 200, and the terminal 300, etc.
The audio recognition method of the present application will be described in detail below:
referring to fig. 1, 2 and 3, an embodiment of the present application provides an audio recognition method, which includes:
step 011: the audio to be identified is encoded to generate a pronunciation probability matrix.
The audio to be identified may include a request or query sent by a user to the system by voice, and it characterizes a function or action that the user wants the vehicle to execute. In the field of speech recognition and voice control, terms such as "voice instructions", "voice commands", and "voice control instructions" are often used to describe a spoken phrase used by a user to trigger a particular operation or task. These voice instructions may be individual words, phrases, or complete sentences, and are referred to as "queries". That is, the audio to be identified may be a query used by the user in a voice interaction. For example, the audio to be identified may be "hello", "play a song", "navigate to the supermarket", and the like. The pronunciation probability matrix may include the pronunciation probabilities of the corresponding phonemes at specific time steps; for example, the pronunciation probabilities of the respective pronunciation units are calculated at each time frame based on the corresponding pronunciation units, and the pronunciation probabilities of the respective pronunciation units at each time frame are integrated to generate the pronunciation probability matrix.
Specifically, after the vehicle acquires the audio to be identified, the audio to be identified can be preprocessed, for example by noise reduction, audio enhancement, audio segmentation, and other preprocessing methods, to improve the audio quality. The preprocessed audio to be identified is then subjected to feature extraction through an Encoder to obtain feature vectors, for example via the short-time Fourier transform (STFT) or Mel-frequency cepstral coefficients (MFCC); finally, the feature vectors are encoded, for example by a deep neural network (DNN), to generate the corresponding pronunciation probability matrix.
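As an illustration of step 011, the following NumPy sketch shows how per-frame feature vectors could be turned into a pronunciation probability matrix whose rows are per-frame distributions over pronunciation units. The `encode_frame` stand-in, the 2000-unit inventory, and all shapes are assumptions for illustration only.

```python
# Minimal sketch of step 011, assuming a frame-level encoder is already available.
import numpy as np

NUM_UNITS = 2000          # assumed size of the pronunciation-unit inventory

def encode_frame(frame_features: np.ndarray) -> np.ndarray:
    """Stand-in for the acoustic encoder: returns per-unit logits for one frame."""
    rng = np.random.default_rng(0)
    w = rng.standard_normal((frame_features.shape[0], NUM_UNITS))
    return frame_features @ w

def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max())
    return e / e.sum()

def pronunciation_probability_matrix(frames) -> np.ndarray:
    """Stack per-frame pronunciation distributions into a (num_frames, NUM_UNITS) matrix."""
    return np.stack([softmax(encode_frame(f)) for f in frames])

# Toy usage: 16 frames of 39-dim MFCC-like features (e.g. 800 ms split into 16 frames).
frames = [np.random.randn(39) for _ in range(16)]
P = pronunciation_probability_matrix(frames)
assert P.shape == (16, NUM_UNITS) and np.allclose(P.sum(axis=1), 1.0)
```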
Alternatively, referring to fig. 4 and 5, the audio to be identified may be generated by any of the following steps:
step 015: receiving a voice request sent by a user in a vehicle to generate audio to be identified;
step 016: a voice request received by a terminal associated with a vehicle is received to generate audio to be identified.
Specifically, the user may make a voice request through a microphone on the vehicle. The audio to be identified can be generated according to a voice request sent by a user collected by a microphone; or the vehicle is associated with the terminal (such as through network connection, etc.), the user sends out a voice request through the terminal, the terminal sends the received voice request to the vehicle after receiving the voice request, and the vehicle generates the audio to be identified after receiving the voice request.
Step 012: correcting the pronunciation probability matrix through a preset delay error correction model to obtain a corresponding word result and an output probability matrix;
the preset delay error correction model may be a language model capable of correcting the pronunciation probability matrix.
Specifically, the pronunciation probability matrix can be input into the preset delay error correction model by the processor of the vehicle, which outputs the corresponding word result and the error-corrected output probability matrix, so that error correction of the audio's pronunciation probability matrix is completed and an accurate audio recognition result is obtained. In the process of training the preset delay error correction model, one or more similar erroneous pronunciations can be randomly generated with a certain probability by a similar-pronunciation generator. The delay error correction model is trained so that, even when the input pronunciation is wrong, the mapped label data is still correct; that is, the delay error correction model is trained with the label data and the erroneous pronunciations so that it can still output a correct result even when it receives an erroneous pronunciation, giving the delay error correction model error correction capability over the input pronunciation probability matrix.
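A minimal sketch of the similar-pronunciation generator idea described above is given below, assuming a hypothetical confusion table and a 10% corruption rate; the training pair keeps the correct text as the label while the input pronunciations may be corrupted.

```python
# Sketch only: the confusion table, label format, and 0.1 rate are assumptions.
import random

CONFUSABLE = {            # hypothetical near-pronunciation pairs (Pinyin-style units)
    "zh_i_1": ["z_i_1"],
    "n_i_3": ["l_i_3"],
    "h_ao_3": ["h_ao_4"],
}

def corrupt_pronunciations(pron_seq, p_error=0.1, rng=random):
    """Return a possibly-corrupted copy of a pronunciation sequence."""
    noisy = []
    for unit in pron_seq:
        if unit in CONFUSABLE and rng.random() < p_error:
            noisy.append(rng.choice(CONFUSABLE[unit]))
        else:
            noisy.append(unit)
    return noisy

# Training pair: (possibly noisy pronunciations as input, correct text as label).
correct_pron = ["n_i_3", "h_ao_3"]
correct_text = "你好"  # "hello"
sample = (corrupt_pronunciations(correct_pron), correct_text)
```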
It will be appreciated that the delay error correction model does not require the existing attention module. Because the delay error correction model is a language model, it can learn richer language model information during training; compared with the existing Decoder, it can better understand the grammar and semantic structure of sentences, and can therefore correct errors more accurately. In addition, when the delay error correction model is trained, there is a delay between the input of the model and the label data, so the model can better learn the dependency and context relationships between pronunciations, further improving the accuracy and training effect of the delay error correction model. Therefore, the delay error correction model has better error correction capability and language expression capability than the Decoder.
It can be understood that correcting the pronunciation probability matrix through the preset delay error correction model to obtain the corresponding word result and output probability matrix ensures the accuracy of error correction, and the existing attention module is no longer needed, so computing power, storage space, and the like can be effectively saved.
Step 013: and performing pronunciation mapping on the word result and the output probability matrix to obtain a mapping probability matrix, and generating a target probability matrix according to the mapping probability matrix and the pronunciation probability matrix.
Specifically, pronunciation mapping is performed on the word result and the output probability matrix to obtain a mapping probability matrix; for example, the word result and the pronunciation unit corresponding to the word result are found in a preset pronunciation mapping table, and after each word result and the output probability matrix have been mapped according to the pronunciation mapping table, the mapping probability matrix is obtained. Finally, a target probability matrix is generated according to the mapping probability matrix and the pronunciation probability matrix; for example, the mapping probability matrix and the pronunciation probability matrix are accumulated and then averaged to generate the target probability matrix.
It will be appreciated that the pronunciation probability matrix is input into the delay error correction model, which outputs the word result and the error-corrected output probability matrix. Compared with the pronunciation probability matrix, the output probability matrix has been processed with more language model information, so its distribution over sounds may differ and is more accurate. Therefore, in order to further improve recognition accuracy, pronunciation mapping can be performed on the word result and the output probability matrix, which are more accurate than the pronunciation probability matrix, to obtain the mapping probability matrix, and the generated probability distribution can be adjusted more finely according to the mapping probability matrix and the pronunciation probability matrix; that is, a target probability matrix with a more accurate probability distribution is generated, thereby achieving the effect of replacing the attention module.
Step 014: and inputting the target probability matrix into a voice decoding diagram for decoding to obtain a recognition result.
The speech decoding graph is the graph used for the decoding operation of speech recognition; it is constructed by combining a plurality of finite state transducers (FSTs) through a graph composition operation based on weighted finite state transducers (WFSTs), so as to obtain the resulting mapping relationship.
Specifically, the target probability matrix is a probability matrix over pronunciations; it is input into the constructed speech decoding graph for decoding, thereby obtaining the word-level audio recognition result.
In this way, the acquired audio to be identified is encoded to generate the pronunciation probability matrix, and the pronunciation probability matrix is corrected through the preset delay error correction model to obtain the corresponding word result and output probability matrix. Because the pronunciation probability matrix is corrected using the error correction capability of the delay error correction model, an accurate word result is obtained and the accuracy of audio recognition is ensured; moreover, the delay error correction model no longer needs to rely on the existing attention module, so computing power and storage space can be effectively saved.
Then, pronunciation mapping is performed on the word result and the output probability matrix to obtain the mapping probability matrix, and the target probability matrix is generated according to the mapping probability matrix and the pronunciation probability matrix. Since the word result and the output probability matrix have already been error-corrected, the accuracy of the recognition result is improved; generating the target probability matrix from the mapping probability matrix and the pronunciation probability matrix further improves audio recognition accuracy by obtaining a more accurate probability distribution over the pronunciations. Finally, the target probability matrix is input into the speech decoding graph for decoding, thereby improving the accuracy of the recognition result.
Referring to FIG. 6, in some embodiments, step 011: encoding the audio to be identified to generate a pronunciation probability matrix, comprising:
step 0111: encoding the audio to be identified to output a first probability matrix of a corresponding frame through a first output layer of the acoustic model trained in advance and to output a second probability matrix of the corresponding frame through a second output layer of the acoustic model; the first output layer and the second output layer are mutually independent, the loss function of the first output layer is a CTC loss function, and the loss function of the second output layer is a CE loss function;
Step 0112: and splicing the first probability matrix corresponding to the frame number position with the second probability matrix according to the first peak path of the first probability matrix and the second peak path of the second probability matrix to generate a pronunciation probability matrix.
Specifically, the acoustic model in the embodiments of the present application is built through pre-training. The pre-trained acoustic model comprises two mutually independent output layers: a first output layer and a second output layer, which are different from each other. For the same input feature vector, the first output layer and the second output layer output different prediction results, namely a first probability matrix and a second probability matrix, and the first probability matrix differs from the second probability matrix. In addition, the output probability matrices correspond to the modeling unit employed in the acoustic model. For example, the modeling unit may be one or more of a triphone, a syllable, or a silence tag.
The loss function of the first output layer is the CTC loss function, and the loss function of the second output layer is the CE loss function. From the different characteristics of the CTC loss function and the CE loss function when processing audio input and output, it can be seen that the output after CTC processing has higher accuracy but a relatively large delay with respect to the real pronunciation, while the output after CE processing has lower accuracy but relatively little delay. For example, referring to fig. 10, take the acquisition of a "hello" audio clip with a duration of 800 milliseconds (ms): the audio of "hello" is divided into 16 frames, and assuming that the user's real pronunciation of "good" occurs at frame 7 while the CTC loss function only recognizes the peak of "hao3" at frame 11, the CTC loss function has a delay of (800/16) × (11 − 7) = 200 ms, whereas the CE function does not have such a delay.
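The following short sketch reproduces the delay arithmetic of the "hello" example, assuming 800 ms of audio split evenly into 16 frames.

```python
# Reproduces the delay arithmetic in the "hello" example above.
audio_ms = 800        # total audio duration
num_frames = 16       # number of frames
true_frame = 7        # frame where the pronunciation actually occurs
ctc_peak_frame = 11   # frame where the CTC output layer emits the peak

frame_ms = audio_ms / num_frames                    # 50 ms per frame
ctc_delay_ms = frame_ms * (ctc_peak_frame - true_frame)
print(ctc_delay_ms)   # 200.0 ms; the CE output layer has no such peak delay
```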
It will be appreciated that the speech decoding graph used in step 014 (inputting the target probability matrix into the speech decoding graph for decoding to obtain the recognition result) can be the decoding graph of an end-to-end speech recognition system based on the CTC loss function, for example an HCLG decoding graph formed by composing the decoding graphs of the hidden Markov model (HMM, H), the context dependencies (C), the pronunciation dictionary (L), and the language model/grammar (G). That is, the speech decoding graph does not need to be changed to account for the input characteristics of the acoustic model based on the CE loss function; there is no need to design two different sets of speech decoding graphs, and only the one set of decoding graphs of the CTC-based end-to-end speech recognition system is needed. The spliced probability matrix can be input directly to search for the decoding path, so the recognition performance of the speech decoding graph is not affected and the occupation of computing resources is not increased.
Obtaining a corresponding first peak path and frame number positions corresponding to each first peak according to a first probability matrix of the current accumulated frames, namely according to all the audio frames which are processed currently; obtaining a corresponding second peak path and a frame number position corresponding to each second peak according to a second probability matrix of the current accumulated frame; and splicing the first probability matrix and the second probability matrix corresponding to the frame number positions according to the first peak path and the second peak path to obtain a spliced probability matrix.
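As an illustration, a spike path can be read off an accumulated probability matrix roughly as follows; the blank index, the collapsing of repeated labels, and the matrix shape are assumptions of this sketch rather than details stated in the present application.

```python
# Hedged sketch: extract (label, frame) spikes from a (frames, units) probability matrix
# by taking each frame's arg-max and keeping only non-blank, non-repeated labels.
import numpy as np

BLANK = 0  # assumed index of the CTC blank label

def spike_path(prob_matrix: np.ndarray):
    """Return a list of (label_index, frame_idx) spikes in time order."""
    path = []
    prev = BLANK
    for frame_idx, row in enumerate(prob_matrix):
        label = int(row.argmax())
        if label != BLANK and label != prev:
            path.append((label, frame_idx))
        prev = label
    return path

# Toy usage on an 80-frame, 2000-unit matrix.
probs = np.random.dirichlet(np.ones(2000), size=80)
print(spike_path(probs)[:5])
```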
Referring to FIG. 7, in some embodiments, step 0112: splicing the first probability matrix corresponding to the frame number position with the second probability matrix according to the first peak path of the first probability matrix and the second peak path of the second probability matrix to generate a pronunciation probability matrix, comprising:
step 01121: aligning the first peak path with the second peak path, and determining peaks with the same label as a splicing starting point;
step 01122: determining a first peak for splicing in the first peak path and a second peak for splicing in the second peak path of a preset frame number according to the splicing starting point;
step 01123: and splicing the first probability matrix of the frame number position corresponding to the first peak with the second probability matrix of the frame number position corresponding to the second peak to obtain a pronunciation probability matrix.
Specifically, in a preset acoustic model, aligning a first peak path and a second peak path, and determining peaks with the same label as a splicing starting point; determining a first peak for splicing in the first peak path and a second peak for splicing in the second peak path of a preset frame number according to the splicing starting point; and splicing the first probability matrix of the frame number position corresponding to the first peak with the second probability matrix of the frame number position corresponding to the second peak to generate a corresponding pronunciation probability matrix.
It can be understood that the loss function of the first output layer is the CTC loss function and the loss function of the second output layer is the CE loss function. Therefore, the first peak path of the first probability matrix of the corresponding frame output by the first output layer has higher accuracy but is delayed by about two words with respect to the actual pronunciation, while the second peak path of the second probability matrix of the corresponding frame output by the second output layer has lower accuracy but no delay with respect to the actual pronunciation. Splicing the first probability matrix at the frame number position corresponding to the first peak with the second probability matrix at the frame number position corresponding to the second peak therefore does not affect the accuracy of the audio recognition result: by using the delay-free property of the second peak path and the accuracy of the first peak path, a delay-free pronunciation probability matrix can be generated, and because of the error correction capability of the delay error correction model, the lower accuracy of the CE loss function can be further corrected after the pronunciation probability matrix enters the delay error correction model.
For example, referring to fig. 8, when the current time is the 80 th frame, according to the accumulated first probability matrix, a corresponding first peak path sil (40) d_a_3 (56) can be obtained; based on the accumulated second probability matrix, a corresponding second peak path sil (20) d_a_3 (35) k_ai_1 (60) d_i_4 (80), the value in brackets is the corresponding frame number position when the peak outputs, for example, "k_ai_1 (60)" indicates that the peak of the frame is determined to be k_ai_1 at the 60 th frame, and the second probability matrix corresponding to the 60 th frame is the probability matrix of the peak. It is understood that each frame spike in the first spike path and the second spike path has a probability matrix for the corresponding frame.
The first spike path includes "sil d_a_3" and the second spike path includes "sil d_a_3 k_ai_1 d_i_4". The first spike path and the second spike path are aligned from back to front in time (i.e., from right to left), and the last (i.e., right-most) identical label is found; it can be seen that both paths share the same two frames of spikes, "sil" and "d_a_3".
That is, the "sil d_a_3" in the first spike path is taken as a splice start point, and the remaining spikes of the preset number of frames excluding the same spike in the second spike path are spliced after the splice start point. For example, the preset number of frames may be selected from 1 to 3 frames, for example, 2 frames. That is, two frames of peaks in the second peak path, which are repeated with the splicing start point, are deleted, and then the rest 2 frames of peaks are spliced, so that two editing actions are adopted in total. That is, 2 frames "k_ai_1d_i_4" in the second peak path are spliced after 2 frames "sil d_a_3" of the first peak path, and the obtained spliced peak path is "sil d_a_3k_ai_1d_i_4", so that two frames of first probability matrixes corresponding to "sil d_a_3" and two frames of second probability matrixes corresponding to "k_ai_1d_i_4" are spliced, and a splicing probability matrix at the current moment is obtained.
It should be noted that the delay error correction model is independent of the preset acoustic model and is obtained by training on preset unsupervised Zhuyin (phonetic annotation) text data.
Specifically, the preset acoustic model and the delay error correction model are independent of each other: the delay error correction model is trained independently of the preset acoustic model, and it is trained on a large amount of preset unsupervised Zhuyin text data rather than on audio data. Unsupervised phonetically annotated text data is relatively easy to obtain; for example, a large amount of text data can be collected from the Internet, books, articles, and so on, which simplifies the training process and improves training efficiency. Moreover, since the delay error correction model does not depend on audio data, learning the mapping relation between text and its phonetic annotation avoids the influence of audio training set distribution bias on the model's performance.
Referring to fig. 9, in certain embodiments, step 013: performing pronunciation mapping on the word result and the output probability matrix to obtain a mapping probability matrix, and generating a target probability matrix according to the mapping probability matrix and the pronunciation probability matrix, including:
step 0131: adding the probabilities of words corresponding to the same tone as the probability of the corresponding tone to obtain a mapping probability matrix;
Step 0132: and accumulating the probability matrixes corresponding to the same sound in the mapping probability matrix and the pronunciation probability matrix, and taking an average value to generate a target probability matrix.
Specifically, probabilities of words corresponding to the same tone are added as probabilities of the corresponding tone to obtain a mapping probability matrix, and the probabilities of the same tone in the mapping probability matrix and the pronunciation probability matrix are accumulated and averaged to generate a target probability matrix.
For example, referring to fig. 10, take the "hello" audio as an example, where a pronunciation probability matrix over 2000-dimensional sounds and an output probability matrix over 5000-dimensional words are generated. In the preset mapping table, multiple words are mapped to the same sound, and the probabilities of the multiple words with the same sound are added together as the probability of the corresponding sound, so that the 5000-dimensional word output probability matrix is reverse-mapped into a 2000-dimensional sound mapping probability matrix. Then, the probability matrices corresponding to the same sound in the mapping probability matrix and the pronunciation probability matrix are accumulated and averaged to generate the target probability matrix.
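A toy NumPy sketch of this reverse mapping and averaging, using an assumed word-to-sound lookup table and random distributions in place of real model outputs, might look as follows.

```python
# Sketch of step 013 on toy dimensions; the mapping table and sizes are assumptions.
import numpy as np

NUM_SOUNDS, NUM_WORDS, NUM_FRAMES = 2000, 5000, 4
word_to_sound = np.random.randint(0, NUM_SOUNDS, size=NUM_WORDS)  # assumed word->sound table

pron_matrix = np.random.dirichlet(np.ones(NUM_SOUNDS), size=NUM_FRAMES)   # (4, 2000)
output_matrix = np.random.dirichlet(np.ones(NUM_WORDS), size=NUM_FRAMES)  # (4, 5000)

# Reverse mapping: words sharing the same sound have their probabilities added.
mapping_matrix = np.zeros((NUM_FRAMES, NUM_SOUNDS))
for t in range(NUM_FRAMES):
    np.add.at(mapping_matrix[t], word_to_sound, output_matrix[t])

# Target matrix: average the mapped and original distributions per sound.
target_matrix = (mapping_matrix + pron_matrix) / 2.0
```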
Referring to fig. 11, in order to better implement the audio recognition method according to the embodiment of the present application, the embodiment of the present application further provides an audio recognition apparatus 10. The audio identifying apparatus 10 may include an encoding module 11, an error correction module 12, a mapping module 13, and a decoding module 14, where the encoding module 11 is configured to encode audio to be identified to generate a pronunciation probability matrix; the error correction module 12 is configured to correct the pronunciation probability matrix through a preset delay error correction model, so as to obtain a corresponding word result and an output probability matrix; the mapping module 13 is used for performing pronunciation mapping on the word result and the output probability matrix to obtain a mapping probability matrix, and generating a target probability matrix according to the mapping probability matrix and the pronunciation probability matrix; the decoding module 14 is configured to input the target probability matrix to the speech decoding graph for decoding, so as to obtain a recognition result.
In one embodiment, the encoding module 11 is specifically further configured to encode the audio to be identified, so as to output a first probability matrix of the corresponding frame through a first output layer of the pre-trained acoustic model, and output a second probability matrix of the corresponding frame through a second output layer of the acoustic model; the first output layer and the second output layer are mutually independent, the loss function of the first output layer is a CTC loss function, and the loss function of the second output layer is a CE loss function; and splicing the first probability matrix corresponding to the frame number position with the second probability matrix according to the first peak path of the first probability matrix and the second peak path of the second probability matrix to generate a pronunciation probability matrix.
In one embodiment, the encoding module 11 is specifically further configured to align the first spike path with the second spike path, and determine spikes with the same label as the splice starting point; determining a first peak for splicing in the first peak path and a second peak for splicing in the second peak path of a preset frame number according to the splicing starting point; and splicing the first probability matrix of the frame number position corresponding to the first peak with the second probability matrix of the frame number position corresponding to the second peak to obtain a pronunciation probability matrix.
In one embodiment, the mapping module 13 is specifically further configured to add probabilities of words corresponding to the same tone as probabilities of corresponding tones to obtain a mapping probability matrix; and accumulating the probability matrixes corresponding to the same sound in the mapping probability matrix and the pronunciation probability matrix, and taking an average value to generate a target probability matrix.
In one embodiment, the audio recognition device 10 further includes a receiving module 15, where the receiving module 15 is configured to receive a voice request sent by a user in the vehicle, so as to generate audio to be recognized; or a voice request received by a terminal associated with the vehicle to generate audio to be identified.
The audio recognition apparatus 10 is described above in connection with the accompanying drawings from the perspective of functional modules, which may be implemented in hardware, instructions in software, or a combination of hardware and software modules. Specifically, each step of the method embodiments in the embodiments of the present application may be implemented by an integrated logic circuit of hardware in a processor and/or an instruction in software form, and the steps of the method disclosed in connection with the embodiments of the present application may be directly implemented as a hardware encoding processor or implemented by a combination of hardware and software modules in the encoding processor. Alternatively, the software modules may be located in a well-established storage medium in the art such as random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, registers, and the like. The storage medium is located in a memory, and the processor reads information in the memory, and in combination with hardware, performs the steps in the above method embodiments.
Referring again to fig. 1, the vehicle of the present embodiment includes a processor 30, a memory 40, and a computer program, wherein the computer program is stored in the memory 40 and executed by the processor 30, and the computer program includes instructions for executing the audio recognition method of any of the above embodiments.
Referring to fig. 12, a computer device of an embodiment of the present application includes a processor 402, a memory 403, and a computer program, where the computer program is stored in the memory 403 and executed by the processor 402, and the computer program includes instructions for executing the audio recognition method of any of the above embodiments.
In one embodiment, the computer device may be a terminal 400 or a vehicle 100. The internal structure thereof can be shown in fig. 12. The computer device comprises a processor 402, a memory 404, a network interface 404, a display 401 and an input means 405, which are connected by a system bus.
Wherein the processor 402 of the computer device is used to provide computing and control capabilities. The memory 404 of the computer device includes non-volatile storage media, internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The network interface 404 of the computer device is used to communicate with external devices via a network connection. The computer program, when executed by a processor, implements the audio recognition method and presentation method of any of the above embodiments. The display 401 of the computer device may be a liquid crystal display or an electronic ink display, and the input device 405 of the computer device may be a touch layer covered on the display 401, or may be a key, a track ball or a touch pad arranged on a casing of the computer device, or may be an external keyboard, a touch pad or a mouse.
It will be appreciated by those skilled in the art that the structure shown in fig. 12 is merely a block diagram of some of the structures relevant to the present application and does not limit the computer device to which the present application is applied; a particular computer device may include more or fewer components than shown, combine certain components, or have a different arrangement of components.
Referring to fig. 13, the embodiment of the present application further provides a computer readable storage medium 600 on which a computer program 610 is stored. When executed by the processor 620, the computer program 610 implements the steps of the audio recognition method of any of the above embodiments, which, for brevity, are not described again here.
In the description of the present specification, reference to the terms "certain embodiments," "in one example," "illustratively," and the like means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present application. In this specification, schematic representations of these terms do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples, and the different embodiments or examples described in this specification, as well as their features, may be combined by those skilled in the art without contradiction.
Any process or method description in a flowchart, or otherwise described herein, may be understood as representing a module, segment, or portion of code that includes one or more executable instructions for implementing specific logical functions or steps of the process. Further implementations are included within the scope of the preferred embodiments of the present application in which functions may be executed out of the order shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those skilled in the art.
While embodiments of the present application have been shown and described above, it will be understood that the above embodiments are illustrative and are not to be construed as limiting the present application; variations, modifications, substitutions, and alterations may be made to the above embodiments by one of ordinary skill in the art within the scope of the present application.

Claims (10)

1. An audio recognition method, comprising:
encoding the audio to be identified to generate a pronunciation probability matrix;
correcting the pronunciation probability matrix through a preset delay error correction model to obtain a corresponding word result and an output probability matrix;
performing pronunciation mapping on the word result and the output probability matrix to obtain a mapping probability matrix, and generating a target probability matrix according to the mapping probability matrix and the pronunciation probability matrix;
and inputting the target probability matrix into a voice decoding graph for decoding, so as to obtain a recognition result.
2. The audio recognition method of claim 1, wherein encoding the audio to be identified to generate the pronunciation probability matrix comprises:
encoding the audio to be identified to output a first probability matrix of a corresponding frame through a first output layer of a pre-trained acoustic model and to output a second probability matrix of the corresponding frame through a second output layer of the acoustic model; the first output layer and the second output layer are mutually independent, the loss function of the first output layer is a CTC loss function, and the loss function of the second output layer is a CE loss function;
and splicing the first probability matrix and the second probability matrix at the corresponding frame positions according to a first peak path of the first probability matrix and a second peak path of the second probability matrix, to generate the pronunciation probability matrix.
3. The audio recognition method according to claim 2, wherein the splicing of the first probability matrix and the second probability matrix at the corresponding frame positions according to the first peak path of the first probability matrix and the second peak path of the second probability matrix, to generate the pronunciation probability matrix, comprises:
aligning the first peak path with the second peak path, and determining peaks with the same label as a splicing starting point;
determining, according to the splicing starting point, a first peak for splicing in the first peak path and a second peak for splicing within a preset number of frames in the second peak path;
and splicing the first probability matrix at the frame positions corresponding to the first peak with the second probability matrix at the frame positions corresponding to the second peak, to obtain the pronunciation probability matrix.
4. The audio recognition method according to claim 1, wherein the delay error correction model is independent of a preset acoustic model and is obtained by training on preset unsupervised phonetically annotated (ZhuYin) text data.
5. The audio recognition method of claim 1, wherein performing pronunciation mapping on the word result and the output probability matrix to obtain a mapping probability matrix, and generating a target probability matrix according to the mapping probability matrix and the pronunciation probability matrix, comprises:
adding the probabilities of the words corresponding to the same pronunciation as the probability of that pronunciation to obtain the mapping probability matrix;
and accumulating the entries corresponding to the same pronunciation in the mapping probability matrix and the pronunciation probability matrix and then taking an average value to generate the target probability matrix.
6. The audio recognition method of claim 1, further comprising:
receiving a voice request sent by a user in a vehicle to generate the audio to be identified; or
receiving a voice request received by a terminal associated with the vehicle to generate the audio to be identified.
7. An audio recognition apparatus, comprising:
the encoding module is used for encoding the audio to be identified to generate a pronunciation probability matrix;
the error correction module is used for correcting the pronunciation probability matrix through a preset delay error correction model so as to obtain a corresponding word result and an output probability matrix;
the mapping module is used for carrying out pronunciation mapping on the word result and the output probability matrix to obtain a mapping probability matrix, and generating a target probability matrix according to the mapping probability matrix and the pronunciation probability matrix;
and the decoding module is used for inputting the target probability matrix into a voice decoding graph for decoding, so as to obtain a recognition result.
8. A vehicle, characterized by comprising:
a processor, a memory; and
a computer program, wherein the computer program is stored in the memory and executed by the processor, the computer program comprising instructions for performing the audio recognition method of any one of claims 1 to 6.
9. A computer device, comprising:
a processor, a memory; and
a computer program, wherein the computer program is stored in the memory and executed by the processor, the computer program comprising instructions for performing the audio recognition method of any one of claims 1 to 6.
10. A non-transitory computer readable storage medium containing a computer program, characterized in that the computer program, when executed by a processor, causes the processor to perform the audio recognition method of any one of claims 1 to 6.
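Claims 1 to 3 and claim 5 describe a single data flow from the audio to be identified to the recognition result. The sketch below, again purely illustrative, shows how that flow could be wired together in Python/NumPy: the simplified peak-splicing rule, the preset_frames default, and the encode, correct, and decode methods assumed on the model objects are hypothetical stand-ins, and build_target_matrix refers to the sketch given earlier in the description.

import numpy as np

def splice_by_peaks(ctc_probs, ce_probs, blank_id=0, preset_frames=2):
    # Simplified stand-in for claims 2 and 3 (assumed behaviour, not the exact
    # rule of the claims): both matrices are taken to have the same number of
    # frames and the same label inventory. The first frame at which the two
    # greedy peak paths agree on a non-blank label is used as the splicing
    # starting point; the frames of the first (CTC-trained) matrix up to that
    # peak are kept, and the frames of the second (CE-trained) matrix within a
    # preset number of frames after it are appended.
    ctc_path = ctc_probs.argmax(axis=1)
    ce_path = ce_probs.argmax(axis=1)
    common = np.where((ctc_path == ce_path) & (ctc_path != blank_id))[0]
    if common.size == 0:
        return ctc_probs  # no common peak: fall back to the first matrix
    start = int(common[0])
    tail = ce_probs[start + 1 : start + 1 + preset_frames]
    return np.vstack([ctc_probs[: start + 1], tail])

def recognize(audio, acoustic_model, correction_model, decode_graph,
              word_to_pron, num_prons):
    # Claim 2: two independent output layers (CTC loss and CE loss) each
    # produce a per-frame probability matrix for the audio to be identified.
    ctc_probs, ce_probs = acoustic_model.encode(audio)
    # Claim 3: splice the two matrices along their peak paths into the
    # pronunciation probability matrix.
    pron_probs = splice_by_peaks(ctc_probs, ce_probs)
    # Claim 1: correct the pronunciation probability matrix with the preset
    # delay error correction model to obtain word results and an output matrix.
    words, output_probs = correction_model.correct(pron_probs)
    # Claim 5: pronunciation mapping and averaging (see the earlier sketch).
    target_probs = build_target_matrix(output_probs, pron_probs,
                                       word_to_pron, num_prons)
    # Claim 1: decode the target probability matrix on the voice decoding graph.
    return decode_graph.decode(target_probs)

The exact alignment and frame-selection rules are those recited in claims 2 and 3; the sketch above only fixes the overall order of operations.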
CN202311800969.2A 2023-12-25 Audio identification method, audio identification device, vehicle, computer device, and medium Active CN117456999B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311800969.2A CN117456999B (en) 2023-12-25 Audio identification method, audio identification device, vehicle, computer device, and medium

Publications (2)

Publication Number Publication Date
CN117456999A true CN117456999A (en) 2024-01-26
CN117456999B CN117456999B (en) 2024-04-30

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4975959A (en) * 1983-11-08 1990-12-04 Texas Instruments Incorporated Speaker independent speech recognition process
CN114822519A (en) * 2021-01-16 2022-07-29 华为技术有限公司 Chinese speech recognition error correction method and device and electronic equipment
CN115862600A (en) * 2023-01-10 2023-03-28 广州小鹏汽车科技有限公司 Voice recognition method and device and vehicle
CN115910043A (en) * 2023-01-10 2023-04-04 广州小鹏汽车科技有限公司 Voice recognition method and device and vehicle
CN115910044A (en) * 2023-01-10 2023-04-04 广州小鹏汽车科技有限公司 Voice recognition method and device and vehicle
CN116052655A (en) * 2022-12-29 2023-05-02 出门问问信息科技有限公司 Audio processing method, device, electronic equipment and readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant