CN115910044A - Voice recognition method and device and vehicle - Google Patents

Voice recognition method and device and vehicle

Info

Publication number
CN115910044A
CN115910044A (application CN202310032078.0A)
Authority
CN
China
Prior art keywords
probability matrix
splicing
peak
voice
loss function
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202310032078.0A
Other languages
Chinese (zh)
Other versions
CN115910044B (en)
Inventor
Zhang Liao (张辽)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Xiaopeng Motors Technology Co Ltd
Original Assignee
Guangzhou Xiaopeng Motors Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Xiaopeng Motors Technology Co Ltd
Priority to CN202310032078.0A
Publication of CN115910044A
Application granted
Publication of CN115910044B
Legal status: Active (granted)

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T: CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00: Road transport of goods or passengers
    • Y02T 10/10: Internal combustion engine [ICE] based vehicles
    • Y02T 10/40: Engine management systems

Abstract

The application relates to a voice recognition method, a voice recognition device and a vehicle. The method comprises the following steps: receiving a voice request sent by a user in a vehicle cabin; performing feature extraction on the voice request to be recognized to generate a feature vector corresponding to each frame of audio; according to the input feature vector, outputting a first probability matrix of the corresponding frame through a first output layer of a pre-trained acoustic model, and outputting a second probability matrix of the corresponding frame through a second output layer of the acoustic model, wherein the loss function of the first output layer is a CTC loss function and the loss function of the second output layer is a CE loss function; and acquiring a splicing probability matrix of the first probability matrix and the second probability matrix, inputting the splicing probability matrix into a voice decoding graph of the voice recognition system, searching to obtain a decoding path, and outputting a voice recognition text that is displayed on a graphical user interface of the vehicle-mounted system. According to this scheme, the delay of the voice recognition text can be eliminated, recognition efficiency is improved, on-screen display is accelerated, and the user experience is improved.

Description

Voice recognition method and device and vehicle
Technical Field
The present application relates to the field of speech recognition technologies, and in particular, to a speech recognition method, a speech recognition device, and a vehicle.
Background
In the construction of acoustic models, one class of acoustic model is trained with a cross-entropy (CE) loss function. Before training, each frame of audio in a training sample must be forcibly aligned with its phoneme label, so that the input speech-feature frames are mapped frame by frame to the output phoneme label sequence. Because of this frame-level alignment, an acoustic model trained with the cross-entropy loss function can recognize speech without delay. In end-to-end speech recognition systems, another class of acoustic model is trained with a connectionist temporal classification (CTC) loss function; training requires only an input feature sequence and an output phoneme label sequence, without frame-level alignment labeling and without concern for how long each phoneme lasts.
However, owing to the processing characteristics of a CTC-based acoustic model, the phoneme peaks in the predicted sequence lag the actual pronunciation by about 2 to 3 peaks, which accumulates to a delay of roughly 200 ms to 300 ms, so the speech recognition text is output late and recognition efficiency suffers.
Disclosure of Invention
In order to solve or partially solve the problems in the related art, the application provides a voice recognition method, a voice recognition device and a vehicle, which can eliminate the delay of the voice recognition text, improve recognition efficiency, speed up on-screen display and improve the user experience.
A first aspect of the present application provides a speech recognition method, including:
receiving a voice request sent by a user in a vehicle cabin;
performing feature extraction on the voice request to be recognized to generate a feature vector corresponding to each frame of audio;
according to the input feature vector, outputting a first probability matrix of a corresponding frame through a first output layer of a pre-trained acoustic model, and outputting a second probability matrix of the corresponding frame through a second output layer of the acoustic model; wherein the loss function of the first output layer is a CTC loss function, and the loss function of the second output layer is a CE loss function;
and acquiring a splicing probability matrix of the first probability matrix and the second probability matrix, inputting the splicing probability matrix into a voice decoding graph of the voice recognition system, searching and acquiring a decoding path, and outputting a voice recognition text and displaying the voice recognition text on a graphical user interface of the vehicle-mounted system.
In some embodiments, obtaining a splicing probability matrix of the first probability matrix and the second probability matrix comprises:
obtaining corresponding first peak paths and frame number positions corresponding to the first peaks according to the first probability matrix of the current accumulated frame; according to the second probability matrix of the current accumulated frame, obtaining corresponding second peak paths and frame number positions corresponding to each second peak; and splicing the first probability matrix and the second probability matrix corresponding to the frame number position according to the first peak path and the second peak path and a preset rule to obtain the splicing probability matrix.
In some embodiments, the splicing the first probability matrix corresponding to the frame number position and the second probability matrix according to the first peak path and the second peak path and according to a preset rule to obtain the spliced probability matrix includes:
aligning the first peak path and the second peak path, and determining a peak with the same label as a splicing starting point; determining a first peak used for splicing in the first peak path and a second peak used for splicing in the second peak path with a preset frame number according to the splicing starting point; and splicing the first probability matrix of the frame number position corresponding to the first peak and the second probability matrix of the frame number position corresponding to the second peak to obtain the spliced probability matrix.
In some embodiments, after the obtaining the splicing probability matrix, the method further includes:
and rolling back the decoding path of the voice decoding graph to a corresponding decoding state according to the splicing starting point.
In some embodiments, the acoustic model is pre-trained and constructed in the following manner:
acquiring a first phoneme label sequence of training audio based on CTC, and acquiring a second phoneme label sequence of the training audio based on CE and peaking;
inputting the feature sequence of the same training audio and the corresponding first phoneme label sequence and second phoneme label sequence into an acoustic model to be trained for training to obtain a trained acoustic model; the acoustic model is provided with a shared hidden layer and independent first and second output layers, and is subjected to back propagation according to the loss values of a CTC loss function of the first output layer and a CE loss function of the second output layer respectively so as to iteratively adjust the weight parameters of the hidden layer until the loss values of the CTC loss function and the CE loss function reach convergence.
In some embodiments, said obtaining a second sequence of CE-based and peaked phoneme labels for the training audio comprises:
carrying out forced frame alignment on the training audio and the CE-based original phoneme label sequence to obtain an alignment result; and performing spiking processing on the labels of the same phoneme in the original phoneme label sequence, replacing the repeated labels with blank labels and retaining a single phoneme label, so as to obtain the second phoneme label sequence.
In some embodiments, after outputting the speech recognition text, the method further comprises:
when the voice request is not finished, judging the semantic integrity and the field attribute of the current voice recognition text to obtain a judgment result; when the judgment result is non-conforming, rolling back the splicing probability matrix to the splicing starting point, and rolling back the decoding path of the voice decoding graph to the corresponding decoding state; and inputting the newly received first probability matrix into the speech decoding graph from the splicing starting point, searching for a decoding path, and outputting an updated speech recognition text for the NLU model to respond to.
In some embodiments, the method further comprises:
and when the voice request is finished or the judgment result is in accordance, the NLU model responds according to the current voice recognition text.
A second aspect of the present application provides a speech recognition apparatus, comprising:
the voice receiving module is used for receiving a voice request sent by a user in a vehicle cabin;
the feature processing module is used for extracting features of the voice request to be recognized and generating feature vectors corresponding to each frame of audio;
the acoustic processing module is used for outputting a first probability matrix of a corresponding frame through a first output layer of a pre-trained acoustic model according to an input feature vector and outputting a second probability matrix of the corresponding frame through a second output layer of the acoustic model; wherein the loss function of the first output layer is a CTC loss function, and the loss function of the second output layer is a CE loss function;
and the decoding and recognition module is used for acquiring a splicing probability matrix of the first probability matrix and the second probability matrix, inputting the splicing probability matrix into a voice decoding graph of the voice recognition system, searching to obtain a decoding path, and outputting a voice recognition text and displaying the voice recognition text on a graphical user interface of the vehicle-mounted system.
A third aspect of the present application provides a vehicle comprising:
a processor; and
a memory having executable code stored thereon, which when executed by the processor, causes the processor to perform the method as described above.
A fourth aspect of the present application provides a computer readable storage medium having stored thereon executable code which, when executed by a processor of a vehicle, causes the processor to perform a method as described above.
The technical scheme provided by the application can comprise the following beneficial effects:
according to the voice recognition method, two different output layers are designed in the acoustic model, and on the premise that the memory load and the CPU load are not increased, the real-time performance of the acoustic model prediction sequence based on the CE and the convenience of the acoustic model based on the CTC are integrated, so that the splicing probability matrix of each frame of audio in the voice request can be obtained without delay and used as the input of the voice decoding graph, the voice recognition text can be rapidly obtained and displayed on the graphical user interface of the vehicle-mounted system in time, the recognition efficiency is improved, and the user experience is improved.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.
Drawings
The above and other objects, features and advantages of the present application will become more apparent by describing in more detail exemplary embodiments thereof with reference to the attached drawings, in which like reference numerals generally represent like parts throughout.
FIG. 1 is a flow diagram of a speech recognition method shown in the present application;
FIG. 2 is a flow diagram of a method for training an acoustic model of a speech recognition method shown in the present application;
FIG. 3 is a schematic diagram comparing different training labels;
FIG. 4 is a schematic diagram of a comparison of decoding logic of a conventional acoustic model based speech recognition system and the speech recognition system of the present application;
FIG. 5 is a flow chart of a method for stitching a probability matrix of a speech recognition method shown in the present application;
FIG. 6 is another flow chart diagram of a speech recognition method shown in the present application;
fig. 7 is a schematic structural diagram of a speech recognition apparatus shown in the present application;
fig. 8 is another structural diagram of the speech recognition apparatus shown in the present application;
fig. 9 is a schematic structural diagram of a vehicle shown in the present application.
Detailed Description
Embodiments of the present application will be described in more detail below with reference to the accompanying drawings. While embodiments of the present application are illustrated in the accompanying drawings, it should be understood that the present application may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in this application and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It should be understood that although the terms "first," "second," "third," etc. may be used herein to describe various information, these information should not be limited to these terms. These terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of the present application. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of the present application, "a plurality" means two or more unless specifically limited otherwise.
In the related art, when an acoustic model trained with the CTC loss function is used, the phoneme peaks in the predicted sequence lag the actual pronunciation by about 2 to 3 peaks, so the speech recognition result decoded and output by the speech recognition system is delayed.
In view of the above problems, the present application provides a speech recognition method, which can eliminate the delay time of speech recognition text, improve recognition efficiency, accelerate on-screen display, and improve user experience.
The technical scheme of the application is described in detail in the following with the accompanying drawings.
Fig. 1 is a flow chart of a speech recognition method shown in the present application.
Referring to fig. 1, the present application illustrates a speech recognition method comprising:
and S110, receiving a voice request sent by a user in the vehicle cabin.
The application can be applied to voice recognition of a user voice request in a vehicle. In this step, the voice request sent by the user can be collected through a microphone installed in the vehicle, so that the voice request can be subjected to voice recognition through a voice recognition system in the vehicle-mounted intelligent terminal.
And S120, performing feature extraction on the voice request to be recognized, and generating a feature vector corresponding to each frame of audio.
In this step, the speech signal of the voice request to be recognized may be framed according to the related art, and feature extraction may be performed on each frame of the speech signal to obtain the feature vector corresponding to that frame.
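For illustration, a minimal Python sketch of this framing and feature-extraction step is given below; the 16 kHz sample rate, the 25 ms frame length with a 10 ms hop, and the log-power-spectrum features are assumptions for the sketch only, since the present application does not specify the front end.

```python
import numpy as np

def frame_features(signal: np.ndarray, sample_rate: int = 16000,
                   frame_ms: float = 25.0, hop_ms: float = 10.0) -> np.ndarray:
    """Split a speech signal into frames and compute one feature vector per frame.

    The log-power spectrum is only a placeholder front end; the actual acoustic
    features used by the system are not specified here.
    """
    frame_len = int(sample_rate * frame_ms / 1000)   # e.g. 400 samples
    hop_len = int(sample_rate * hop_ms / 1000)       # e.g. 160 samples
    window = np.hamming(frame_len)

    features = []
    for start in range(0, len(signal) - frame_len + 1, hop_len):
        frame = signal[start:start + frame_len] * window
        spectrum = np.abs(np.fft.rfft(frame)) ** 2
        features.append(np.log(spectrum + 1e-10))    # one feature vector per frame
    return np.stack(features)                        # shape: (num_frames, feat_dim)
```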
S130, outputting a first probability matrix of a corresponding frame through a first output layer of a pre-trained acoustic model according to the input feature vector, and outputting a second probability matrix of the corresponding frame through a second output layer of the acoustic model; the loss function of the first output layer is a CTC loss function, and the loss function of the second output layer is a CE loss function.
Different from a traditional acoustic model based on the CTC loss function, which has only one output layer producing one prediction result for an input feature vector, the acoustic model of the application is constructed through pre-training and comprises an input layer, a plurality of hidden layers and two mutually independent output layers, namely the first output layer and the second output layer. For the same input feature vector, the two output layers can output different prediction results, namely the first probability matrix and the second probability matrix. A probability matrix is the probability distribution over the modeling units for the feature vector of the current frame, and the sum of all probabilities is 1. It should be noted that the output probability matrix corresponds to whatever modeling units the acoustic model uses; for example, the modeling units may be one or more of triphones, syllables, or silence labels.
It will be appreciated that, because the two output layers differ, the first probability matrix and the second probability matrix output for the same input feature vector are different.
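A minimal PyTorch sketch of this shared-hidden-layer, dual-output-layer structure is shown below; the LSTM encoder, the layer sizes and the modeling-unit count are assumptions for illustration and not the specific architecture of the present application.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualHeadAcousticModel(nn.Module):
    """Shared hidden layers with two independent output layers.

    head_ctc plays the role of the first output layer (trained with a CTC loss) and
    head_ce the role of the second output layer (trained with a CE loss against
    spiked frame-level labels).
    """

    def __init__(self, feat_dim: int, num_units: int, hidden: int = 512, layers: int = 4):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden, num_layers=layers, batch_first=True)
        self.head_ctc = nn.Linear(hidden, num_units)  # first output layer
        self.head_ce = nn.Linear(hidden, num_units)   # second output layer

    def forward(self, feats: torch.Tensor):
        # feats: (batch, frames, feat_dim)
        enc, _ = self.encoder(feats)
        # Each head yields one probability distribution over the modeling units per
        # frame; every row sums to 1, matching the probability matrices described above.
        p1 = F.softmax(self.head_ctc(enc), dim=-1)    # first probability matrix
        p2 = F.softmax(self.head_ce(enc), dim=-1)     # second probability matrix
        return p1, p2
```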
And S140, acquiring a splicing probability matrix of the first probability matrix and the second probability matrix, inputting the splicing probability matrix into a voice decoding graph of the voice recognition system, searching to obtain a decoding path, and outputting a voice recognition text to be displayed on a graphical user interface of the vehicle-mounted system.
In order to obtain a definite output result, before the path search is performed on the speech decoding graph of the decoder, in this step the first probability matrix and the second probability matrix may be spliced according to a preset rule to obtain the splicing probability matrix. In some embodiments, the corresponding first peak path and the frame number position corresponding to each first peak are obtained according to the first probability matrix of the currently accumulated frames; the corresponding second peak path and the frame number position corresponding to each second peak are obtained according to the second probability matrix of the currently accumulated frames; and the first probability matrix and the second probability matrix corresponding to the frame number positions are spliced according to the first peak path and the second peak path and a preset rule to obtain the splicing probability matrix. In a specific embodiment, the first peak path and the second peak path are aligned, and a peak with the same label is determined as the splicing starting point; the first peaks used for splicing in the first peak path and a preset number of frames of second peaks used for splicing in the second peak path are determined according to the splicing starting point; and the first probability matrix at the frame number position corresponding to each first peak and the second probability matrix at the frame number position corresponding to each second peak are spliced to obtain the splicing probability matrix.
In the present application, the speech decoding graph may be the decoding graph of an end-to-end speech recognition system based on the CTC loss function, such as an HCLG decoding graph formed by composing the decoding graphs of a language model, a pronunciation dictionary, the context and an HMM. That is to say, the speech decoding graph does not need to be modified for the input characteristics of the CE-based acoustic model: two different sets of speech decoding graphs do not need to be designed, and only the one decoding graph of the CTC-based end-to-end speech recognition system is needed, so the splicing probability matrix can be input directly and the decoding path searched, without affecting the recognition performance of the speech decoding graph and without increasing the occupation of computing resources.
As can be seen from this example, the speech recognition method of the present application integrates the real-time property of the CE-based acoustic model prediction sequence and the convenience of the CTC-based acoustic model by designing two different output layers in the acoustic model without increasing the memory load and the CPU load, so that the splice probability matrix of each frame of audio in the speech request can be obtained without delay as the input of the speech decoding graph, and thus the speech recognition text can be quickly obtained and displayed on the graphical user interface of the vehicle-mounted system in time, thereby improving the recognition efficiency and improving the user experience.
Fig. 2 is a flowchart illustrating a method for training an acoustic model of a speech recognition method according to the present application. The training process of the acoustic model of the present application will be described in detail below.
Referring to fig. 2, the present application illustrates a training method of an acoustic model, which includes:
s210, acquiring a first phoneme label sequence of the training audio based on CTC, and acquiring a second phoneme label sequence of the training audio based on CE and peaking.
In this step, in order to enable the acoustic model to fuse the characteristics of different loss functions, two kinds of labels are provided for the training audio used to train the model. The first kind of label, i.e. the first phoneme label sequence, may be, for example, the phoneme label sequence used by a conventional CTC-based acoustic model. The second phoneme label sequence combines the labels of a CE-based acoustic model with the peaked (spiked) label characteristic. In some embodiments, the training audio and the CE-based original phoneme label sequence are subjected to forced frame alignment to obtain an alignment result; then the labels of the same phoneme in the original phoneme label sequence are subjected to spiking processing, so that the repeated labels are replaced with blank labels and a single phoneme label is retained, thereby obtaining the second phoneme label sequence.
That is, the CE-based original phoneme label sequence and the training audio may be subjected to forced frame alignment to determine the phoneme label corresponding to each frame; for example, 90 frames of audio have 90 phoneme labels mapped one to one. It will be appreciated that runs of audio frames share the same phoneme label. Take the original phoneme label sequence of "open map", "d_a_3*15 k_ai_1*25 d_i_4*20 t_u_2*30", as an example: the phoneme labels of the first 15 frames are all "d_a_3", frames 16 to 40 are all "k_ai_1", frames 41 to 60 are all "d_i_4", and frames 61 to 90 are all "t_u_2"; that is, the label "d_a_3" is repeated 15 times, "k_ai_1" 25 times, "d_i_4" 20 times, and "t_u_2" 30 times. For these repeated labels, only the phoneme label of one frame is retained, and the other repeated labels are replaced with a placeholder blank label such as "blank", so as to form a peaking effect. For example, "d_a_3*15" may be replaced with "blank*14 d_a_3*1", that is, the phoneme label of the last frame is retained and the first 14 frames are replaced with blank labels. By analogy, the second phoneme label sequence "blank*14 d_a_3*1 blank*24 k_ai_1*1 blank*19 d_i_4*1 blank*29 t_u_2*1" can be obtained.
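The spiking rule just illustrated, i.e. keep the label of the last frame in each run of identical labels and replace the other frames with "blank", can be expressed as the short Python sketch below; the function name and the list-of-strings representation are assumptions for illustration.

```python
def spike_labels(frame_labels: list[str], blank: str = "blank") -> list[str]:
    """Replace repeated per-frame phoneme labels with blanks, keeping only the
    label of the last frame in each run (the 'peaking' of the CE label sequence)."""
    spiked = []
    for i, label in enumerate(frame_labels):
        is_last_of_run = (i == len(frame_labels) - 1) or (frame_labels[i + 1] != label)
        spiked.append(label if is_last_of_run else blank)
    return spiked

# "open map": d_a_3*15 k_ai_1*25 d_i_4*20 t_u_2*30 -> blank*14 d_a_3 blank*24 k_ai_1 ...
aligned = ["d_a_3"] * 15 + ["k_ai_1"] * 25 + ["d_i_4"] * 20 + ["t_u_2"] * 30
spiked = spike_labels(aligned)
assert spiked[:15] == ["blank"] * 14 + ["d_a_3"]
```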
In other embodiments, to improve the prediction efficiency of the first output layer and reduce delay, a mute label corresponding to mute audio is added to the first phoneme label sequence. It can be understood that, through forced alignment and endpoint detection, the mute duration between every two adjacent words can be determined, and the corresponding mute label "sil" can be labeled according to the different mute durations. For example, each section of mute audio, whatever its length, may be labeled with a single mute label; or each section of mute audio may be divided into multiple sub-sections of a fixed duration according to a rounding-down rule, and each sub-section labeled with a single mute label. Taking "open map" shown in A of fig. 3 as an example, the CTC-based first phoneme label sequence is "d_a_3 k_ai_1 d_i_4 t_u_2"; after mute labels are added according to the first labeling scheme, as shown in B of fig. 3, the first phoneme label sequence becomes "sil d_a_3 k_ai_1 d_i_4 sil t_u_2 sil". If the mute labels are added according to the second labeling scheme with a fixed duration of 60 ms, the new first phoneme label sequence is "sil sil sil sil d_a_3 k_ai_1 d_i_4 sil sil sil sil sil sil t_u_2 sil sil sil sil sil sil sil sil sil". By adding mute labels to the first phoneme label sequence, naturally occurring mute audio forms mute peaks instead of being absorbed into the phoneme peaks of the effective audio, so the output of the effective audio preceding the mute audio is brought forward in time, improving the output efficiency of the first output layer of the acoustic model. It can be understood that, for the newly added mute label, a corresponding self-loop arc is also added to each state node of the CTC-based speech decoding graph; the input label of this self-loop arc is the mute label and its output label is a meaningless placeholder, so that mute labels are filtered out of the decoding result.
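The second silence-labeling scheme mentioned above, one "sil" label per fixed-length sub-section with rounding down, can be sketched as follows; the 60 ms sub-section length matches the example, while the function name is an assumption for illustration.

```python
def silence_labels(duration_ms: float, chunk_ms: float = 60.0) -> list[str]:
    """Label a detected silence segment with one 'sil' per fixed-length chunk,
    rounding the number of chunks down (the second labeling scheme above)."""
    num_chunks = int(duration_ms // chunk_ms)   # floor
    return ["sil"] * num_chunks

# A 250 ms pause between two words yields floor(250 / 60) = 4 mute labels.
print(silence_labels(250.0))   # ['sil', 'sil', 'sil', 'sil']
```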
Accordingly, with reference to the way the mute audio is labeled in the first phoneme label sequence, mute labels may be added to the second phoneme label sequence at the same time. That is, when the frame-by-frame label mapping is performed, the naturally occurring mute audio is labeled with mute labels instead of phoneme labels. For example, as shown in A of fig. 3 and C of fig. 3, after mute labels are added to the CE-based original phoneme label sequence "d_a_3*15 k_ai_1*25 d_i_4*20 t_u_2*30", the frame-aligned sequence becomes "sil*20 d_a_3*15 k_ai_1*25 d_i_4*20 sil*25 t_u_2*30 sil*50"; the spiked second phoneme label sequence shown in D of fig. 3 is then "blank*19 sil*1 blank*14 d_a_3*1 blank*24 k_ai_1*1 blank*19 d_i_4*1 blank*24 sil*1 blank*29 t_u_2*1 blank*49 sil*1".
S220, inputting the feature sequence of the same training audio and the corresponding first phoneme label sequence and second phoneme label sequence into an acoustic model to be trained for training to obtain a trained acoustic model; the acoustic model is provided with a shared hidden layer and a first output layer and a second output layer which are independent, and the acoustic model carries out back propagation according to the loss values of the CTC loss function of the first output layer and the CE loss function of the second output layer respectively so as to iteratively adjust the weight parameters of the hidden layer until the loss values of the CTC loss function and the CE loss function are converged.
In this step, according to the training principle of a neural network model, the feature sequence of the training audio and the corresponding first phoneme label sequence and second phoneme label sequence are input into the acoustic model through the input layer, propagated forward through the shared hidden layers, and then turned into prediction results by the two different output layers. In the first output layer, the first loss value between the prediction result and the ground truth, i.e. the first phoneme label sequence, can be calculated with the CTC loss function; in the second output layer, the second loss value between the prediction result and the ground truth, i.e. the second phoneme label sequence, can be calculated with the CE loss function.
In order to fuse the first loss value and the second loss value, in some embodiments they may be combined into a single loss value according to different weight proportions; preferably, the weight proportion of the first loss value is greater than that of the second loss value. The fused loss value is then propagated backward to the hidden layers along the direction of steepest gradient descent to update the weight parameters of each hidden layer. The weight parameters of the hidden layers are iterated in this way until the first loss value and the second loss value each converge, yielding the trained acoustic model.
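A sketch of one training iteration with the fused loss is given below, building on the DualHeadAcousticModel sketch above and PyTorch's built-in CTC and negative-log-likelihood losses; the 0.6/0.4 weights are only an illustrative choice that satisfies the preference of weighting the first loss value more heavily.

```python
import torch
import torch.nn as nn

model = DualHeadAcousticModel(feat_dim=80, num_units=1000)  # assumed sizes
ctc_loss_fn = nn.CTCLoss(blank=0, zero_infinity=True)       # first output layer
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

def train_step(feats, feat_lens, ctc_targets, ctc_target_lens, spiked_frame_targets):
    """One iteration: forward through the shared hidden layers, compute both
    losses on their respective heads, fuse them with fixed weights, and
    back-propagate the fused loss to update the shared parameters."""
    p1, p2 = model(feats)                      # (batch, frames, units) each
    log_p1 = (p1 + 1e-8).log()
    log_p2 = (p2 + 1e-8).log()

    # CTC loss of the first output layer (expects (frames, batch, units)).
    loss_ctc = ctc_loss_fn(log_p1.transpose(0, 1), ctc_targets, feat_lens, ctc_target_lens)

    # CE loss of the second output layer against the spiked frame-level labels.
    loss_ce = nn.functional.nll_loss(log_p2.reshape(-1, log_p2.size(-1)),
                                     spiked_frame_targets.reshape(-1))

    loss = 0.6 * loss_ctc + 0.4 * loss_ce      # first weight > second weight (assumed values)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss_ctc.item(), loss_ce.item()
```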
As can be seen from this example, by spiking the CE-style phoneme label sequence with blank labels, the acoustic model can obtain the second probability matrix of the recognition result with zero delay during actual prediction, and the speech decoding graph designed for CTC can still be used; compared with the original CTC-based acoustic model architecture, the trained acoustic model adds only one output layer, with essentially no increase in memory load and almost no increase in CPU load.
Based on the trained acoustic models, a method for concatenating the first probability matrix and the second probability matrix in the speech recognition method of the present application will be further described below.
Referring to fig. 5, the present application illustrates a method for splicing probability matrices, which includes:
s310, according to the first probability matrix of the current accumulated frame, the corresponding first peak path and the frame number position corresponding to each first peak are obtained.
In this step, with the continuous reception of the voice request, the first probability matrix corresponding to each frame of audio may be continuously output sequentially through the first output layer of the acoustic model. And the modeling unit corresponding to the maximum probability value in the first probability matrix of each frame is the peak of the current frame. It will be appreciated that the peaks of the first probability matrix per frame may be a blank label, a mute label or some phoneme label.
In this step, peaks that are blank labels are discarded, and the first peak path composed of mute labels and phoneme labels is retained. The frame number position corresponding to each mute label and each phoneme label may be recorded, that is, the frame at which the corresponding label was resolved, so that the corresponding first probability matrix can be determined from the frame number position to form the splicing probability matrix in the following step S330.
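Step S310 (and, symmetrically, step S320) can be sketched as a per-frame argmax that discards blank peaks and records the frame number of each remaining mute or phoneme peak; the function name and the unit-label list are assumptions for illustration.

```python
import numpy as np

def peak_path(prob_matrix: np.ndarray, unit_labels: list[str],
              blank: str = "blank") -> list[tuple[str, int]]:
    """Return the peak path of an accumulated probability matrix as
    (label, frame_number) pairs, discarding frames whose peak is the blank label.

    prob_matrix: (num_frames, num_units), one probability distribution per frame.
    unit_labels: label of each modeling unit, including "blank" and "sil".
    """
    path = []
    for frame_idx, frame_probs in enumerate(prob_matrix):
        label = unit_labels[int(np.argmax(frame_probs))]   # peak of the current frame
        if label != blank:
            path.append((label, frame_idx))                # record frame number position
    return path
```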
And S320, acquiring corresponding second peak paths and frame number positions corresponding to the second peaks according to the second probability matrix of the current accumulated frame.
Similarly, referring to S310, a second peak path formed by the silence label and the peak corresponding to the phoneme label in the second probability matrix of each frame may be obtained, which is not described herein.
Steps S310 and S320 are executed simultaneously to obtain the first peak path and the second peak path at the current time, and the frame number position of the second probability matrix corresponding to each mute label and each phoneme label may likewise be recorded.
And S330, splicing the first probability matrix and the second probability matrix corresponding to the frame number position according to the first peak path and the second peak path and a preset rule to obtain a splicing probability matrix.
In this step, in order to perform splicing quickly, in a specific embodiment, the first peak path and the second peak path are aligned, and a peak having the same label is determined as a splicing start point; determining a first peak used for splicing in the first peak path and a second peak used for splicing in the second peak path by a preset frame number according to the splicing starting point; and splicing the first probability matrix of the frame number position corresponding to the first peak and the second probability matrix of the frame number position corresponding to the second peak to obtain a splicing probability matrix.
For ease of understanding, as shown in fig. 5, suppose the current time is the 80th frame. According to the accumulated first probability matrix, the corresponding first peak path may be obtained as sil(40) d_a_3(56); according to the accumulated second probability matrix, the corresponding second peak path is sil(20) d_a_3(35) k_ai_1(60) d_i_4(80). The value in parentheses is the frame number position at which the peak is output; for example, "k_ai_1(60)" indicates that the peak of the frame is determined to be k_ai_1 at the 60th frame, and the second probability matrix corresponding to the 60th frame is the probability matrix of that peak. It is to be understood that every peak in the first peak path and the second peak path has the probability matrix of its corresponding frame.
The first peak path and the second peak path are aligned in time order as in the following table, and the most recent peaks with the same labels are found; here the two paths share the same two peaks, "sil" and "d_a_3".
First peak path:   sil   d_a_3
Second peak path:  sil   d_a_3   k_ai_1   d_i_4
That is, the "sil d _ a _3" in the first peak path is used as a splicing start point, and the peaks of the remaining preset number of frames excluding the same peak in the second peak path are spliced after the splicing start point. For example, the preset number of frames may be selected from 1 to 3 frames, for example, 2 frames. That is, two repeated peaks of the second peak path from the start point of the splicing are deleted, and then the remaining 2 peaks are spliced, and the total editing operation is performed by two steps. <xnotran> 2 "k _ ai _1 d_i_4" 2 "sil d _ a _3" , "sil d _ a _3 k_ai_1 d_i_4", "sil d _ a _3" "k _ ai _1 d_i_4" , . </xnotran>
It should be noted that limiting the preset number of frames avoids inaccurate prediction results caused by using too many second probability matrices. For example, only the 2 most recent frames of second probability matrices are spliced, and second probability matrices beyond these 2 frames are discarded, so the resulting splicing probability matrix consists of all the first probability matrices plus a preset number of frames of second probability matrices. The spliced second probability matrices compensate for the several-phoneme-peak delay of a conventional CTC acoustic model, so the output of the acoustic model has zero delay; the speech recognition text can then be obtained more quickly through the speech decoding graph, improving the user experience. It is understood that, when the first peak path and the second peak path are aligned and no identical label exists, the first probability matrix alone is used as the input data of the speech decoding graph.
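Combining steps S310 to S330, the sketch below aligns the two peak paths, uses the most recent peaks with identical labels as the splicing starting point, and appends at most a preset number of second probability matrices after all the first probability matrices; it reuses the peak_path sketch above, and the function and parameter names are assumptions for illustration.

```python
import numpy as np

def splice_probability_matrices(p1: np.ndarray, p2: np.ndarray,
                                unit_labels: list[str], max_extra: int = 2) -> np.ndarray:
    """Build the splicing probability matrix for the frames accumulated so far.

    p1, p2: accumulated first/second probability matrices, shape (frames, units).
    max_extra: preset number of second-path peaks to append (1 to 3 in the text).
    """
    path1 = peak_path(p1, unit_labels)   # [(label, frame_number), ...]
    path2 = peak_path(p2, unit_labels)

    # Align the two peak paths and find the last peak with the same label.
    common = 0
    while common < min(len(path1), len(path2)) and path1[common][0] == path2[common][0]:
        common += 1
    if common == 0:
        return p1                        # no splicing starting point: use p1 alone

    # Peaks of the second path beyond the splicing starting point, capped at max_extra.
    extra_peaks = path2[common:common + max_extra]
    extra_rows = [p2[frame] for _, frame in extra_peaks]   # second probability matrices

    # All first probability matrices, followed by the selected second ones.
    return np.vstack([p1] + extra_rows) if extra_rows else p1
```

With the example above, the two paths share "sil d_a_3", so the second probability matrices at the frame positions of "k_ai_1" and "d_i_4" are appended after all the first probability matrices.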
Referring to fig. 6, a speech recognition method shown in the present application will be further described in conjunction with the above embodiments, and includes:
and S410, receiving a voice request sent by a user in the vehicle cabin.
And S420, performing feature extraction on the voice request to be recognized, and generating a feature vector corresponding to each frame of audio.
S430, outputting a first probability matrix of the corresponding frame through a first output layer of the acoustic model and outputting a second probability matrix of the corresponding frame through a second output layer of the acoustic model according to the input feature vector.
The above steps S410 to S430 refer to the corresponding descriptions of S110 to S130, which are not described herein again.
S440, acquiring a splicing starting point of the first probability matrix and the second probability matrix.
In this step, the splicing start point may be obtained according to the related descriptions in S310 to S330, which is not described herein.
And S450, splicing the first probability matrix and the second probability matrix according to a splicing starting point and a preset rule to obtain a splicing probability matrix.
The preset rule in this step is the same as the preset rule, and is not described herein again.
S460, inputting the splicing probability matrix into the voice decoding graph of the voice recognition system to search for a decoding path, obtaining the voice recognition text and displaying it on the graphical user interface of the vehicle-mounted system; and rolling back the decoding path of the voice decoding graph to the corresponding decoding state according to the splicing starting point.
In this step, after the current speech recognition text is obtained, the decoding path can be rolled back to the decoding state corresponding to the splicing starting point, while the first output layer continues to output, frame by frame, the first probability matrices that lag behind the real time point; steps S440 to S460 are then repeated with the most recently output first probability matrices and second probability matrices. In other words, the first probability matrices newly output at the current moment are appended after the first probability matrix corresponding to the previous splicing starting point, a new splicing starting point is determined together with the second probability matrices at the current moment, and the latest splicing probability matrix obtained by splicing is used as the input data of the speech decoding graph to output the latest speech recognition result; the decoding state of the decoding path is then rolled back again, in a loop, until the voice request ends, after which the decoding state of the speech decoding graph no longer needs to be rolled back.
In this step, once the voice recognition text is obtained, it is displayed on screen on the graphical user interface of the vehicle-mounted system immediately, whether or not the voice request has finished.
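The loop of steps S440 to S460, i.e. splice, decode, display, then roll the decoder back to the state saved at the splicing starting point, might look roughly like the following sketch; the decoder wrapper with decode/save_state/restore_state/get_text methods, the gui object, and the helpers splicing_start_frame and second_matrices_after_start are hypothetical names, not an existing API.

```python
def streaming_recognize(decoder, stream, unit_labels, gui, max_extra=2):
    """Hedged sketch of the S440-S460 loop: feed first probability matrices up to
    the splicing starting point, keep a checkpoint there, temporarily decode the
    spliced-in second probability matrices to show text early, then roll back.

    decoder: hypothetical wrapper of the speech decoding graph exposing
             decode(rows), save_state(), restore_state(state), get_text().
    stream:  yields the accumulated (p1, p2) matrices each time new audio arrives.
    """
    fed = 0                                     # first-matrix frames already decoded
    for p1, p2 in stream:
        start = splicing_start_frame(p1, p2, unit_labels)   # assumed helper (see S310-S330)
        decoder.decode(p1[fed:start])           # catch up to the new splicing starting point
        fed = start
        checkpoint = decoder.save_state()

        extra = second_matrices_after_start(p1, p2, unit_labels, max_extra)  # assumed helper
        decoder.decode(extra)                   # zero-delay look-ahead from the CE head
        gui.show(decoder.get_text())            # on-screen immediately

        decoder.restore_state(checkpoint)       # roll back for the next, longer splice
    return decoder.get_text()
```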
And S470, when the voice request is not finished, judging the semantic integrity and the field attribute of the current voice recognition text to obtain a judgment result.
The purpose of obtaining the judgment result in this step is to determine whether the current speech recognition text is reliable enough for the NLU model to respond to. The NLU model can be preset with a discriminator, which judges whether the semantics of the current speech recognition text are complete and whether its domain matches a preset domain. It is understood that a domain refers to data or resources of the same type and the services provided around those data or resources. In the vehicle-mounted system, preset domains with clear intentions are provided, such as vehicle control, stocks, weather, music and radio. It can be understood that a voice request conforming to a preset domain has short and definite semantics. Therefore, when the acoustic model is trained, a large number of samples conforming to the preset domains can be added to the training audio, so that voice requests sent by users that conform to these domains can be satisfied quickly.
This step distinguishes voice requests, produced for example by idle chatting, that do not conform to a preset domain. Because the delayed first probability matrix is supplemented by the zero-delay second probability matrix, a voice request that conforms to a preset domain can be recognized and displayed on screen quickly and without delay, and the NLU model can respond quickly.
If the semantics of the current speech recognition text are complete and the domain conforms, step S490 is performed; if the semantics are incomplete or the domain does not conform, step S480 is performed.
S480, when the judgment result is non-conforming, rolling back the splicing probability matrix to the splicing starting point, and rolling back the decoding path of the voice decoding graph to the corresponding decoding state; and inputting the newly received first probability matrix into the speech decoding graph from the splicing starting point, searching to obtain a decoding path, and outputting an updated speech recognition text for the NLU model to respond to.
It can be appreciated that, when the semantics of the current speech recognition text are incomplete or the domain does not conform, the user's voice request may be complex or lengthy and more complete audio information is needed for recognition. In this step, after the decoding path and the splicing probability matrix have been rolled back, and based on the new voice request received in real time in step S410, the first probability matrix corresponding to the splicing starting point, rather than the previous splicing probability matrix, is taken as the resumption point; that is, the new first probability matrices are used as the latest input to the speech decoding graph, so that the decoding path, restored to the rolled-back decoding state, continues to be extended by the search and an updated speech recognition text is obtained. Accordingly, the original speech recognition text already displayed on screen in step S460 is corrected.
It can be understood that if the judgment result of the updated speech recognition text is in accordance, the NLU model can respond; if the judgment result of the updated speech recognition text is still not accordant, the step is repeatedly executed in a circulating way.
And S490, when the voice request is finished or the judgment result is coincidence, the NLU model responds according to the current voice recognition text.
It can be understood that when the voice request is finished, all the voice recognition texts are completely recognized and output, and the NLU model can respond according to the latest complete voice recognition text.
When the voice request has not ended but the judgment result conforms, that is, the semantics of the current, still incomplete speech recognition text are complete and the domain matches, the NLU model can respond in advance without waiting for the complete speech recognition text to be output, which improves the response speed.
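The response logic of steps S470 to S490 can be summarized as the sketch below; PRESET_DOMAINS, is_semantically_complete, domain_of and the nlu and decoder objects are hypothetical stand-ins for the discriminator and the NLU model described above.

```python
PRESET_DOMAINS = {"vehicle control", "stock", "weather", "music", "radio"}  # assumed set

def handle_partial_result(text, request_ended, nlu, decoder, checkpoint):
    """Respond early when the partial text is judged reliable; otherwise roll back
    and wait for more audio (steps S470-S490)."""
    if request_ended:
        return nlu.respond(text)                      # S490: full text available

    conforming = is_semantically_complete(text) and domain_of(text) in PRESET_DOMAINS
    if conforming:
        return nlu.respond(text)                      # S490: respond ahead of time

    # S480: judgment does not conform - roll back the decoding path, then continue
    # decoding with the newly received first probability matrices.
    decoder.restore_state(checkpoint)
    return None
```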
In summary, in the speech recognition method of the present application, the pre-trained acoustic model uses two different output layers to output the first probability matrix and the second probability matrix respectively, and the zero-delay second probability matrix is spliced with the delayed first probability matrix to compensate for the delay of the first probability matrix; the spliced probability matrix is input into the speech decoding graph to obtain the speech recognition text, enabling extremely fast on-screen display with zero delay. Meanwhile, so that the NLU model can respond in advance, semantic-integrity and domain judgments let it respond quickly to conforming speech recognition text, improving response efficiency; for non-conforming speech recognition text, recognition continues on the further received voice request, the state of the decoding path is rolled back and the new first probability matrices are decoded, and the speech recognition text displayed on screen is corrected accordingly, ensuring that a correct speech recognition text is obtained for the NLU model to respond to.
Corresponding to the embodiment of the application function implementation method, the application also provides a voice recognition device, a vehicle and a corresponding embodiment.
Fig. 7 is a schematic configuration diagram of a speech recognition apparatus shown in the present application.
Referring to fig. 7, the present application illustrates a speech recognition apparatus including a speech receiving module 710, a feature processing module 720, an acoustic processing module 730, and a decoding recognition module 740, wherein:
the voice receiving module 710 is used for receiving a voice request from a user in a vehicle cabin.
The feature processing module 720 is configured to perform feature extraction on the voice request to be recognized, and generate a feature vector corresponding to each frame of audio.
The acoustic processing module 730 is configured to output a first probability matrix of a corresponding frame through a first output layer of a pre-trained acoustic model according to the input feature vector, and output a second probability matrix of the corresponding frame through a second output layer of the acoustic model; the loss function of the first output layer is a CTC loss function, and the loss function of the second output layer is a CE loss function;
the decoding recognition module 740 is configured to obtain a splicing probability matrix of the first probability matrix and the second probability matrix, input the splicing probability matrix into a speech decoding graph of the speech recognition system, search for a decoding path, output a speech recognition text, and display the speech recognition text on a graphical user interface of the vehicle-mounted system.
Referring to fig. 8, in a specific embodiment, the decoding identification module 740 includes a concatenation sub-module 741 and a decoding sub-module 742. The splicing submodule 741 is configured to obtain corresponding first peak paths and frame number positions corresponding to the first peaks according to the first probability matrix of the current accumulated frame; obtaining corresponding second peak paths and frame number positions corresponding to the second peaks according to the second probability matrix of the current accumulated frame; and splicing the first probability matrix and the second probability matrix corresponding to the frame number position according to the first peak path and the second peak path and a preset rule to obtain a splicing probability matrix.
In a specific embodiment, the splicing sub-module 741 is configured to align the first spike path with the second spike path, and determine a spike with the same label as a splicing start point; determining a first peak used for splicing in the first peak path and a second peak used for splicing in the second peak path by a preset frame number according to the splicing starting point; and splicing the first probability matrix of the frame number position corresponding to the first peak and the second probability matrix of the frame number position corresponding to the second peak to obtain a splicing probability matrix.
In a specific embodiment, the decoding sub-module 742 is configured to search for a decoding path in the speech decoding graph according to the input concatenation probability matrix to output the speech recognition text. The decoding sub-module 742 is further configured to roll back the decoding path of the speech decoding graph to a corresponding decoding state according to the splicing start point.
In some embodiments, the speech recognition apparatus further includes a determining module 750, and the determining module 750 is configured to determine the semantic integrity and the domain attribute of the current speech recognition text when the voice request is not finished, and obtain a judgment result. When the judgment result is non-conforming, the decoding sub-module is configured to roll back the splicing probability matrix to the splicing starting point and roll back the decoding path of the voice decoding graph to the corresponding decoding state, and to input the newly received first probability matrix into the voice decoding graph from the splicing starting point, search to obtain a decoding path, and output updated voice recognition text for the NLU model to respond to.
The voice recognition device can rapidly obtain the voice recognition text without time delay and timely display the voice recognition text on the graphical user interface of the vehicle-mounted system, so that the recognition efficiency is improved, and the user experience is improved.
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
Fig. 9 is a schematic structural diagram of a vehicle shown in the present application.
Referring to fig. 9, the vehicle 1000 includes a memory 1010 and a processor 1020.
The Processor 1020 may be a Central Processing Unit (CPU), other general purpose Processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other Programmable logic device, discrete Gate or transistor logic, discrete hardware components, etc. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
The memory 1010 may include various types of storage units, such as system memory, read-only memory (ROM), and a persistent storage device. The ROM may store, among other things, static data or instructions for the processor 1020 or other modules of the computer. The persistent storage device may be a read-write storage device. The persistent storage device may be a non-volatile storage device that does not lose stored instructions and data even after the computer is powered off. In some embodiments, the persistent storage device may employ a mass storage device (e.g., magnetic or optical disk, flash memory). In other embodiments, the persistent storage device may be a removable storage device (e.g., floppy disk, optical drive). The system memory may be a read-write memory device or a volatile read-write memory device, such as dynamic random access memory. The system memory may store instructions and data that some or all of the processors require at run time. Further, the memory 1010 may comprise any combination of computer-readable storage media, including various types of semiconductor memory chips (e.g., DRAM, SRAM, SDRAM, flash, programmable read-only memory); magnetic and/or optical disks may also be employed. In some embodiments, the memory 1010 may include a readable and/or writable removable storage device, such as a compact disc (CD), a read-only digital versatile disc (e.g., DVD-ROM, dual-layer DVD-ROM), a read-only Blu-ray disc, an ultra-density optical disc, a flash memory card (e.g., SD card, mini SD card, micro-SD card, etc.), a magnetic floppy disk, or the like. Computer-readable storage media do not contain carrier waves or transitory electronic signals transmitted by wireless or wired means.
The memory 1010 has stored thereon executable code that, when processed by the processor 1020, may cause the processor 1020 to perform some or all of the methods described above.
Furthermore, the method according to the present application may also be implemented as a computer program or computer program product comprising computer program code instructions for performing some or all of the steps of the above-described method of the present application.
Alternatively, the present application may also be embodied as a computer-readable storage medium (or non-transitory machine-readable storage medium, or machine-readable storage medium) having executable code (or a computer program, or computer instruction code) stored thereon which, when executed by a processor of a vehicle (or a server or the like), causes the processor to perform some or all of the steps of the above-described methods according to the present application.
The foregoing description of the embodiments of the present application has been presented for purposes of illustration and description and is not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen to best explain the principles of the embodiments, the practical application, or improvements made to the technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (11)

1. A speech recognition method, comprising:
receiving a voice request sent by a user in a vehicle cabin;
performing feature extraction on the voice request to be recognized to generate a feature vector corresponding to each frame of audio;
according to the input feature vector, outputting a first probability matrix of a corresponding frame through a first output layer of a pre-trained acoustic model, and outputting a second probability matrix of the corresponding frame through a second output layer of the acoustic model; wherein the loss function of the first output layer is a CTC loss function, and the loss function of the second output layer is a CE loss function;
and acquiring a splicing probability matrix of the first probability matrix and the second probability matrix, inputting the splicing probability matrix into a voice decoding graph of the voice recognition system, searching and acquiring a decoding path, and outputting a voice recognition text and displaying the voice recognition text on a graphical user interface of the vehicle-mounted system.
2. The method of claim 1, wherein obtaining the splicing probability matrix of the first probability matrix and the second probability matrix comprises:
obtaining corresponding first peak paths and frame number positions corresponding to the first peaks according to the first probability matrix of the current accumulated frame;
according to the second probability matrix of the current accumulated frame, obtaining corresponding second peak paths and frame number positions corresponding to each second peak;
and splicing, according to the first peak path, the second peak path and a preset rule, the first probability matrix and the second probability matrix corresponding to the frame number positions, so as to obtain the splicing probability matrix.
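As an illustration of the peak-path idea (again, not part of the claim language), a peak path can be read off a probability matrix by collecting the frames whose argmax is a non-blank label; the sketch below assumes the blank label has index 0 and uses hypothetical names.

```python
# Illustrative only; the blank index and function names are assumptions.
import numpy as np

BLANK_ID = 0

def peak_path(prob_matrix):
    """prob_matrix: (frames, labels) posteriors for the frames accumulated so far.
    Returns the peak labels and the frame number position of each peak."""
    best = prob_matrix.argmax(axis=1)
    labels, positions = [], []
    prev = BLANK_ID
    for t, lab in enumerate(best):
        if lab != BLANK_ID and lab != prev:       # a new non-blank peak starts here
            labels.append(int(lab))
            positions.append(t)
        prev = lab
    return labels, positions

if __name__ == "__main__":
    probs = np.random.default_rng(0).random((50, 10))
    print(peak_path(probs))
```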
3. The method according to claim 2, wherein splicing, according to the first peak path, the second peak path and the preset rule, the first probability matrix and the second probability matrix corresponding to the frame number positions to obtain the splicing probability matrix comprises:
aligning the first peak path and the second peak path, and determining a peak with the same label as a splicing starting point;
determining, according to the splicing starting point, a first peak used for splicing in the first peak path and a second peak used for splicing within a preset number of frames in the second peak path;
and splicing the first probability matrix at the frame number position corresponding to the first peak with the second probability matrix at the frame number position corresponding to the second peak to obtain the splicing probability matrix.
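A hedged sketch of one possible splicing rule: the two peak paths are aligned on a peak carrying the same label, the first (CTC) probability matrix is kept up to that peak's frame, and rows of the second (CE) probability matrix within a preset number of frames after the matching peak are appended. The search order, the preset frame count, and all names below are assumptions, not the patented rule itself.

```python
# Illustrative only; the search order, preset frame count and names are assumptions.
import numpy as np

def splice_matrices(first_probs, second_probs, first_peaks, second_peaks, preset_frames=30):
    """first_peaks / second_peaks: lists of (label, frame) peaks from the two paths.
    Keeps the first matrix up to the matched peak and appends second-matrix rows
    within a preset number of frames after the matching peak."""
    start = None
    for lab1, t1 in reversed(first_peaks):        # try the most recent peaks first
        for lab2, t2 in reversed(second_peaks):
            if lab1 == lab2:                      # same label -> splicing starting point
                start = (t1, t2)
                break
        if start is not None:
            break
    if start is None:
        return first_probs                        # no common peak: nothing to splice
    t1, t2 = start
    head = first_probs[: t1 + 1]                  # frames covered by the first (CTC) matrix
    tail = second_probs[t2 + 1 : t2 + 1 + preset_frames]   # look-ahead from the second matrix
    return np.concatenate([head, tail], axis=0)

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    f, s = rng.random((40, 10)), rng.random((60, 10))
    spliced = splice_matrices(f, s, [(3, 12), (7, 25)], [(7, 20), (9, 33)])
    print(spliced.shape)                          # (56, 10): 26 head rows + 30 tail rows
```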
4. The method of claim 3, wherein after obtaining the splicing probability matrix, the method further comprises:
and rolling back the decoding path of the voice decoding graph to a corresponding decoding state according to the splicing starting point.
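To illustrate the rollback idea (not part of the claim), one simple way to make a decoding-graph search rewindable is to snapshot the decoder state when a splicing starting point is consumed and restore that snapshot on rollback. The decoder below is a trivial stand-in, not a real WFST decoder; its interface is hypothetical.

```python
# Illustrative only; the decoder is a toy stand-in, not a real decoding-graph search.
import copy

class ToyDecoder:
    """Stand-in for a decoding-graph search; its state is just the hypothesis text."""
    def __init__(self):
        self.state = {"hyp": ""}
    def decode(self, symbols):
        self.state["hyp"] += "".join(symbols)
    def save_state(self):
        return copy.deepcopy(self.state)
    def load_state(self, saved):
        self.state = copy.deepcopy(saved)

class RollbackDecoder:
    def __init__(self, decoder):
        self.decoder = decoder
        self.snapshots = {}                        # splicing start -> saved decoder state

    def advance(self, symbols, splice_start=None):
        if splice_start is not None:               # remember the state before the look-ahead
            self.snapshots[splice_start] = self.decoder.save_state()
        self.decoder.decode(symbols)

    def rollback(self, splice_start):
        # Restore the decoding path to the state recorded at the splicing starting point.
        self.decoder.load_state(self.snapshots[splice_start])

if __name__ == "__main__":
    d = RollbackDecoder(ToyDecoder())
    d.advance(["打", "开"])                        # confirmed CTC output
    d.advance(["空", "调"], splice_start=12)       # CE look-ahead spliced at frame 12
    print(d.decoder.state["hyp"])                  # 打开空调
    d.rollback(12)                                 # discard the spliced look-ahead part
    print(d.decoder.state["hyp"])                  # 打开
```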
5. The method of claim 1, wherein the acoustic model is pre-trained according to:
acquiring a first phoneme label sequence of training audio based on CTC, and acquiring a second phoneme label sequence of the training audio based on CE and peaking;
inputting the feature sequence of the same training audio together with the corresponding first phoneme label sequence and second phoneme label sequence into an acoustic model to be trained, so as to obtain a trained acoustic model; wherein the acoustic model is provided with a shared hidden layer and mutually independent first and second output layers, and the acoustic model performs back propagation according to the loss values of the CTC loss function of the first output layer and the CE loss function of the second output layer, respectively, so as to iteratively adjust the weight parameters of the hidden layer until the loss values of the CTC loss function and the CE loss function converge.
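Not part of the claim, but as a sketch of how such joint training might look in PyTorch: the two heads share one encoder, the first head is scored with a CTC loss against the phoneme label sequence, and the second head with a frame-level cross-entropy (here NLLLoss on log-probabilities) against the peaked, frame-aligned labels. The unweighted sum of the two losses, the blank index, and all tensor shapes are assumptions.

```python
# Illustrative only; loss weighting, blank index and tensor shapes are assumptions.
import torch
import torch.nn as nn

ctc_criterion = nn.CTCLoss(blank=0, zero_infinity=True)    # first output layer's loss
ce_criterion = nn.NLLLoss(ignore_index=-100)                # NLL on log-probs == CE

def train_step(model, optimizer, feats, ctc_targets, ctc_target_lens, frame_targets):
    """feats: (B, T, F); ctc_targets: (B, S) phoneme ids; frame_targets: (B, T)
    peaked frame-aligned ids. `model` returns two (B, T, C) log-prob tensors."""
    ctc_logp, ce_logp = model(feats)
    input_lens = torch.full((feats.size(0),), feats.size(1), dtype=torch.long)
    loss_ctc = ctc_criterion(ctc_logp.transpose(0, 1),      # CTCLoss expects (T, B, C)
                             ctc_targets, input_lens, ctc_target_lens)
    loss_ce = ce_criterion(ce_logp.reshape(-1, ce_logp.size(-1)),
                           frame_targets.reshape(-1))
    loss = loss_ctc + loss_ce                                # both gradients reach the shared encoder
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss_ctc.item(), loss_ce.item()
```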
6. The method of claim 5, wherein acquiring the second phoneme label sequence of the training audio based on CE and peaking comprises:
carrying out forced frame alignment on the training audio and the original phoneme label sequence based on the CE to obtain an alignment result;
and performing peaking processing on runs of identical phoneme labels in the aligned original phoneme label sequence, so that the repeated labels are replaced by blank labels and a single phoneme label is retained for each run, thereby obtaining the second phoneme label sequence.
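A minimal sketch (not the claimed procedure itself) of such a peaking step: starting from the per-frame phoneme ids produced by forced alignment, each run of identical non-blank labels keeps a single frame and the remaining frames are set to blank. Which frame of a run is kept (here the middle one) and the blank index are assumptions.

```python
# Illustrative only; the kept frame within a run and the blank index are assumptions.
BLANK_ID = 0

def spike_labels(aligned_labels, keep="middle"):
    """aligned_labels: per-frame phoneme ids after forced alignment."""
    peaked = [BLANK_ID] * len(aligned_labels)
    start = 0
    while start < len(aligned_labels):
        end = start
        while end + 1 < len(aligned_labels) and aligned_labels[end + 1] == aligned_labels[start]:
            end += 1
        if aligned_labels[start] != BLANK_ID:
            keep_at = (start + end) // 2 if keep == "middle" else start
            peaked[keep_at] = aligned_labels[start]   # one spike per phoneme run
        start = end + 1
    return peaked

if __name__ == "__main__":
    #        sil a  a  a  b  b  sil c
    frames = [0, 5, 5, 5, 7, 7, 0, 9]
    print(spike_labels(frames))   # [0, 0, 5, 0, 7, 0, 0, 9] (middle frame kept)
```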
7. The method of claim 1, wherein after outputting the voice recognition text, the method further comprises:
when the voice request has not ended, judging the semantic integrity and the domain attribute of the current voice recognition text to obtain a judgment result;
when the judgment result is non-conforming, rolling back the splicing probability matrix to the splicing starting point, and rolling back the decoding path of the voice decoding graph to the corresponding decoding state;
and inputting the newly received first probability matrix into the voice decoding graph from the splicing starting point to search for a decoding path, so as to output an updated voice recognition text for the NLU model to respond to.
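Purely for illustration of this control flow (every helper and decoder method below is hypothetical): if the request has not ended and the current text fails the semantic-integrity or domain check, the spliced look-ahead is discarded by rolling the decoder back to the splicing starting point and continuing with the newly received first (CTC) probability matrix only.

```python
# Illustrative only; every helper and decoder method below is hypothetical.
def refresh_hypothesis(decoder, text, request_ended, splice_start, new_ctc_rows,
                       is_semantically_complete, domain_of, expected_domains):
    """Decide whether the spliced (look-ahead) hypothesis can stand or must be redone."""
    if request_ended:
        return text                               # final text, handed to the NLU model
    ok = is_semantically_complete(text) and domain_of(text) in expected_domains
    if ok:
        return text                               # early result accepted as-is
    decoder.rollback(splice_start)                # drop the spliced look-ahead frames
    decoder.advance(new_ctc_rows)                 # continue with the first (CTC) matrix only
    return decoder.current_text()                 # updated voice recognition text
```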
8. The method of claim 7, further comprising:
and when the voice request has ended or the judgment result is conforming, responding, by the NLU model, according to the current voice recognition text.
9. A speech recognition apparatus, comprising:
the voice receiving module is used for receiving a voice request sent by a user in a vehicle cabin;
the feature processing module is used for extracting features of the voice request to be recognized and generating feature vectors corresponding to each frame of audio;
the acoustic processing module is used for outputting a first probability matrix of a corresponding frame through a first output layer of a pre-trained acoustic model according to an input feature vector and outputting a second probability matrix of the corresponding frame through a second output layer of the acoustic model; wherein the loss function of the first output layer is a CTC loss function, and the loss function of the second output layer is a CE loss function;
and the decoding and recognition module is used for acquiring a splicing probability matrix of the first probability matrix and the second probability matrix, inputting the splicing probability matrix into a voice decoding graph of the voice recognition system to search for a decoding path, outputting a voice recognition text, and displaying the voice recognition text on a graphical user interface of the vehicle-mounted system.
10. A vehicle, characterized by comprising:
a processor; and
a memory having executable code stored thereon, which when executed by the processor, causes the processor to perform the method of any one of claims 1-8.
11. A computer readable storage medium having stored thereon executable code which, when executed by a processor of a vehicle, causes the processor to perform the method of any one of claims 1-8.
CN202310032078.0A 2023-01-10 2023-01-10 Voice recognition method and device and vehicle Active CN115910044B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310032078.0A CN115910044B (en) 2023-01-10 2023-01-10 Voice recognition method and device and vehicle

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310032078.0A CN115910044B (en) 2023-01-10 2023-01-10 Voice recognition method and device and vehicle

Publications (2)

Publication Number Publication Date
CN115910044A true CN115910044A (en) 2023-04-04
CN115910044B (en) 2023-06-30

Family

ID=86474736

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310032078.0A Active CN115910044B (en) 2023-01-10 2023-01-10 Voice recognition method and device and vehicle

Country Status (1)

Country Link
CN (1) CN115910044B (en)

Patent Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170011738A1 (en) * 2015-07-09 2017-01-12 Google Inc. Generating acoustic models
CN108269568A (en) * 2017-01-03 2018-07-10 中国科学院声学研究所 A kind of acoustic training model method based on CTC
US20200043483A1 (en) * 2018-08-01 2020-02-06 Google Llc Minimum word error rate training for attention-based sequence-to-sequence models
US20200135174A1 (en) * 2018-10-24 2020-04-30 Tencent America LLC Multi-task training architecture and strategy for attention-based speech recognition system
US20210035563A1 (en) * 2019-07-30 2021-02-04 Dolby Laboratories Licensing Corporation Per-epoch data augmentation for training acoustic models
US20210312905A1 (en) * 2020-04-03 2021-10-07 Microsoft Technology Licensing, Llc Pre-Training With Alignments For Recurrent Neural Network Transducer Based End-To-End Speech Recognition
US20210312923A1 (en) * 2020-04-06 2021-10-07 Microsoft Technology Licensing, Llc Sequence-to-sequence speech recognition with latency threshold
CN111968629A (en) * 2020-07-08 2020-11-20 重庆邮电大学 Chinese speech recognition method combining Transformer and CNN-DFSMN-CTC
WO2022024202A1 (en) * 2020-07-27 2022-02-03 日本電信電話株式会社 Learning device, speech recognition device, learning method, speech recognition method, learning program, and speech recognition program
US20220157294A1 (en) * 2020-11-16 2022-05-19 Kabushiki Kaisha Toshiba Speech recognition systems and methods
CN113129868A (en) * 2021-03-12 2021-07-16 北京百度网讯科技有限公司 Method for obtaining speech recognition model, speech recognition method and corresponding device
CN114203170A (en) * 2021-12-16 2022-03-18 董立波 Streaming voice recognition system and method based on non-autoregressive model
CN114299937A (en) * 2021-12-31 2022-04-08 镁佳(北京)科技有限公司 DNN model training method and voice recognition method and device

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
KWANGYOUN KIM: "Attention Based On-Device Streaming Speech Recognition with Large Speech Corpus", 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) *
ZHANG Jimin: "Research and Development of End-to-End Speech Recognition Technology for the News Domain", China Master's Theses Full-text Database

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116312485A (en) * 2023-05-23 2023-06-23 广州小鹏汽车科技有限公司 Voice recognition method and device and vehicle
CN116312485B (en) * 2023-05-23 2023-08-25 广州小鹏汽车科技有限公司 Voice recognition method and device and vehicle
CN117456999A (en) * 2023-12-25 2024-01-26 广州小鹏汽车科技有限公司 Audio identification method, audio identification device, vehicle, computer device, and medium
CN117524199A (en) * 2024-01-04 2024-02-06 广州小鹏汽车科技有限公司 Voice recognition method and device and vehicle
CN117524199B (en) * 2024-01-04 2024-04-16 广州小鹏汽车科技有限公司 Voice recognition method and device and vehicle

Also Published As

Publication number Publication date
CN115910044B (en) 2023-06-30

Similar Documents

Publication Publication Date Title
CN115910044B (en) Voice recognition method and device and vehicle
CN108877778B (en) Sound end detecting method and equipment
CN111429887B (en) Speech keyword recognition method, device and equipment based on end-to-end
CN115910043B (en) Voice recognition method and device and vehicle
CN115862600B (en) Voice recognition method and device and vehicle
CN112492343B (en) Video live broadcast monitoring method and related device
CN108307250B (en) Method and device for generating video abstract
CN108039175B (en) Voice recognition method and device and server
CN109979440B (en) Keyword sample determination method, voice recognition method, device, equipment and medium
CN111178081B (en) Semantic recognition method, server, electronic device and computer storage medium
CN110942764B (en) Stream type voice recognition method
CN114598933B (en) Video content processing method, system, terminal and storage medium
CN113160854A (en) Voice interaction system, related method, device and equipment
CN112863496B (en) Voice endpoint detection method and device
CN116778916A (en) Audio data acquisition method, acoustic model, device and vehicle
US11288513B1 (en) Predictive image analysis
CN111402865A (en) Method for generating speech recognition training data and method for training speech recognition model
CN114118080B (en) Method and system for automatically identifying client intention from sales session
CN112397053B (en) Voice recognition method and device, electronic equipment and readable storage medium
CN114780757A (en) Short media label extraction method and device, computer equipment and storage medium
CN117524199B (en) Voice recognition method and device and vehicle
CN116312485B (en) Voice recognition method and device and vehicle
CN114648984B (en) Audio sentence-breaking method and device, computer equipment and storage medium
US20060092327A1 (en) Story segmentation method for video
US11837229B1 (en) Interaction data and processing natural language inputs

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant