CN109637525B - Method and apparatus for generating an on-board acoustic model - Google Patents

Method and apparatus for generating an on-board acoustic model

Info

Publication number
CN109637525B
CN109637525B (application CN201910075039.2A)
Authority
CN
China
Prior art keywords
vehicle
sample
acoustic model
voice data
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910075039.2A
Other languages
Chinese (zh)
Other versions
CN109637525A (en)
Inventor
孙建伟
李超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201910075039.2A
Publication of CN109637525A
Application granted
Publication of CN109637525B
Legal status: Active

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 Training

Landscapes

  • Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Telephone Function (AREA)
  • Measurement Of Mechanical Vibrations Or Ultrasonic Waves (AREA)

Abstract

Embodiments of the present disclosure disclose methods and apparatus for generating an in-vehicle acoustic model. One embodiment of the method comprises: selecting an acoustic model from a pre-trained acoustic model group as an initial acoustic model; acquiring a pre-generated training sample set, wherein each training sample comprises sample vehicle-mounted voice data and a sample vehicle-mounted voice recognition result corresponding to the sample vehicle-mounted voice data; and, based on the initial acoustic model, taking the sample vehicle-mounted voice data in the training samples in the training sample set as input and the sample vehicle-mounted voice recognition result corresponding to the input sample vehicle-mounted voice data as expected output, and training to obtain a vehicle-mounted acoustic model. This embodiment enriches the ways in which such models can be generated.

Description

Method and apparatus for generating an on-board acoustic model
Technical Field
The embodiment of the disclosure relates to the technical field of computers, in particular to a method and a device for generating a vehicle-mounted acoustic model.
Background
In the related art, before an acoustic model is trained, a large amount of real voice data must be manually collected, during the training sample preparation stage, from the real scenes in which the acoustic model may be used, so that the acoustic model can then be trained on the collected real voice data.
Disclosure of Invention
Embodiments of the present disclosure propose methods and apparatuses for generating an in-vehicle acoustic model.
In a first aspect, an embodiment of the present disclosure provides a method for generating a vehicle-mounted acoustic model, the method including: selecting an acoustic model from a pre-trained acoustic model group as an initial acoustic model; acquiring a pre-generated training sample set, wherein the training sample comprises sample vehicle-mounted voice data and a sample vehicle-mounted voice recognition result corresponding to the sample vehicle-mounted voice data; based on the initial acoustic model, taking sample vehicle-mounted voice data in training samples in the training sample set as input, taking a sample vehicle-mounted voice recognition result corresponding to the input sample vehicle-mounted voice data as expected output, and training to obtain a vehicle-mounted acoustic model.
In some embodiments, training the vehicle-mounted acoustic model based on the initial acoustic model, with sample vehicle-mounted voice data in training samples in the training sample set as input and the sample vehicle-mounted voice recognition result corresponding to the input sample vehicle-mounted voice data as expected output, includes: selecting a training sample from the training sample set, and executing the following training step: inputting the sample vehicle-mounted voice data in the selected training sample into the initial acoustic model to obtain an actual output; adjusting parameters of the initial acoustic model according to the difference between the sample vehicle-mounted voice recognition result corresponding to the input sample vehicle-mounted voice data and the obtained actual output, to obtain an adjusted initial acoustic model; determining whether unselected training samples exist in the training sample set; in response to determining that none exist, determining the adjusted initial acoustic model as the vehicle-mounted acoustic model; and, in response to determining that unselected training samples exist, selecting an unselected training sample from the training sample set, using the adjusted initial acoustic model as the initial acoustic model, and continuing the training step.
In some embodiments, adjusting parameters of the initial acoustic model based on a difference between a sample in-vehicle speech recognition result corresponding to the input sample in-vehicle speech data and the obtained actual output includes: inputting a sample vehicle-mounted voice recognition result corresponding to the input sample vehicle-mounted voice data and the obtained actual output into a predetermined loss function to obtain a loss value; in response to determining that the resulting loss value is greater than a preset loss threshold, parameters of the initial acoustic model are adjusted.
In some embodiments, selecting an acoustic model from a pre-trained set of acoustic models comprises: acquiring a pre-stored test sample set, wherein the test sample comprises real vehicle-mounted voice data and a real vehicle-mounted voice recognition result corresponding to the real vehicle-mounted voice data; for the acoustic models in the acoustic model group, the following statistical steps are performed: for the test samples in the test sample set, inputting real vehicle-mounted voice data in the test samples into the acoustic model to obtain a model output result; determining the recognition rate of the acoustic model to the test sample according to the real vehicle-mounted voice recognition result in the test sample and the obtained model output result; storing the determined recognition rate into a recognition rate set; and determining the acoustic model from the acoustic model group according to the recognition rate set group corresponding to the acoustic model group.
In some embodiments, the training sample set is generated by: acquiring voice data under at least one voice interaction scene to obtain a voice data set; for voice data in the voice data set, generating simulated voice data corresponding to the voice data as sample vehicle-mounted voice data based on a predetermined vehicle-mounted impulse response data set and a predetermined vehicle-mounted noise data set, and storing the sample vehicle-mounted voice data into a sample vehicle-mounted voice data set; for sample vehicle-mounted voice data in the sample vehicle-mounted voice data set, marking the sample vehicle-mounted voice data to obtain a sample vehicle-mounted voice recognition result; and taking the sample vehicle-mounted voice data and the obtained sample vehicle-mounted voice recognition result as training samples and storing the training samples into a training sample set.
In a second aspect, an embodiment of the present disclosure provides a method for recognizing speech, the method including: receiving vehicle-mounted voice data; and inputting the vehicle-mounted voice data into the vehicle-mounted acoustic model generated by adopting the method described in any embodiment of the first aspect to obtain a vehicle-mounted voice recognition result corresponding to the vehicle-mounted voice data.
In a third aspect, an embodiment of the present disclosure provides an apparatus for generating a vehicle-mounted acoustic model, the apparatus including: a model selection unit configured to select an acoustic model from a pre-trained acoustic model group as an initial acoustic model; a sample acquisition unit configured to acquire a pre-generated training sample set, wherein each training sample includes sample vehicle-mounted voice data and a sample vehicle-mounted voice recognition result corresponding to the sample vehicle-mounted voice data; and a model training unit configured to, based on the initial acoustic model, take the sample vehicle-mounted voice data in the training samples in the training sample set as input and the sample vehicle-mounted voice recognition result corresponding to the input sample vehicle-mounted voice data as expected output, and train to obtain the vehicle-mounted acoustic model.
In some embodiments, the model training unit is further configured to: select a training sample from the training sample set, and execute the following training step: inputting the sample vehicle-mounted voice data in the selected training sample into the initial acoustic model to obtain an actual output; adjusting parameters of the initial acoustic model according to the difference between the sample vehicle-mounted voice recognition result corresponding to the input sample vehicle-mounted voice data and the obtained actual output, to obtain an adjusted initial acoustic model; determining whether unselected training samples exist in the training sample set; in response to determining that none exist, determining the adjusted initial acoustic model as the vehicle-mounted acoustic model; and, in response to determining that unselected training samples exist, selecting an unselected training sample from the training sample set, using the adjusted initial acoustic model as the initial acoustic model, and continuing the training step.
In some embodiments, the adjusting, in the model training unit, the parameters of the initial acoustic model according to the difference between the sample in-vehicle speech recognition result corresponding to the input sample in-vehicle speech data and the obtained actual output includes: inputting a sample vehicle-mounted voice recognition result corresponding to the input sample vehicle-mounted voice data and the obtained actual output into a predetermined loss function to obtain a loss value; in response to determining that the resulting loss value is greater than a preset loss threshold, parameters of the initial acoustic model are adjusted.
In some embodiments, the model selection unit is further configured to: acquiring a pre-stored test sample set, wherein the test sample comprises real vehicle-mounted voice data and a real vehicle-mounted voice recognition result corresponding to the real vehicle-mounted voice data; for the acoustic models in the acoustic model group, the following statistical steps are performed: for the test samples in the test sample set, inputting real vehicle-mounted voice data in the test samples into the acoustic model to obtain a model output result; determining the recognition rate of the acoustic model to the test sample according to the real vehicle-mounted voice recognition result in the test sample and the obtained model output result; storing the determined recognition rate into a recognition rate set; and determining the acoustic model from the acoustic model group according to the recognition rate set group corresponding to the acoustic model group.
In some embodiments, in the sample acquiring unit, the training sample set is generated by: acquiring voice data under at least one voice interaction scene to obtain a voice data set; for voice data in the voice data set, generating simulated voice data corresponding to the voice data as sample vehicle-mounted voice data based on a predetermined vehicle-mounted impulse response data set and a predetermined vehicle-mounted noise data set, and storing the sample vehicle-mounted voice data into a sample vehicle-mounted voice data set; for sample vehicle-mounted voice data in the sample vehicle-mounted voice data set, marking the sample vehicle-mounted voice data to obtain a sample vehicle-mounted voice recognition result; and taking the sample vehicle-mounted voice data and the obtained sample vehicle-mounted voice recognition result as training samples and storing the training samples into a training sample set.
In a fourth aspect, an embodiment of the present disclosure provides an apparatus for recognizing speech, the apparatus including: a voice receiving unit configured to receive in-vehicle voice data; and a voice recognition unit configured to input the vehicle-mounted voice data into the vehicle-mounted acoustic model generated by adopting the method described in any one of the embodiments of the first aspect, and obtain a vehicle-mounted voice recognition result corresponding to the vehicle-mounted voice data.
In a fifth aspect, an embodiment of the present disclosure provides an electronic device, including: one or more processors; and storage means for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the method as described in any of the embodiments of the first and second aspects above.
In a sixth aspect, embodiments of the present disclosure provide a computer-readable medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the method as described in any of the embodiments of the first and second aspects above.
The method and apparatus for generating a vehicle-mounted acoustic model provided by the embodiments of the present disclosure may select an acoustic model from a pre-trained acoustic model group as an initial acoustic model. Then, a pre-generated training sample set is obtained, where each training sample comprises sample vehicle-mounted voice data and a sample vehicle-mounted voice recognition result corresponding to the sample vehicle-mounted voice data. Finally, based on the initial acoustic model, the sample vehicle-mounted voice data in the training samples in the training sample set is taken as input, the sample vehicle-mounted voice recognition result corresponding to the input sample vehicle-mounted voice data is taken as expected output, and the vehicle-mounted acoustic model is obtained through training. By selecting the initial acoustic model for training from a pre-trained acoustic model group, the method and apparatus enrich the ways in which vehicle-mounted acoustic models can be generated.
Drawings
Other features, objects and advantages of the disclosure will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings in which:
FIG. 1 is an exemplary system architecture diagram in which one embodiment of the present disclosure may be applied;
FIG. 2 is a flow diagram of one embodiment of a method for generating a vehicle-mounted acoustic model according to the present disclosure;
FIG. 3 is a schematic diagram of one application scenario of a method for generating an in-vehicle acoustic model according to an embodiment of the present disclosure;
FIG. 4 is a schematic block diagram of one embodiment of an apparatus for generating an on-board acoustic model according to the present disclosure;
FIG. 5 is a flow diagram for one embodiment of a method for recognizing speech according to the present disclosure;
FIG. 6 is a schematic block diagram of one embodiment of an apparatus for recognizing speech according to the present disclosure;
FIG. 7 is a schematic diagram of an electronic device suitable for use in implementing the disclosed embodiments.
Detailed Description
The present disclosure is described in further detail below with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of the invention. It should be noted that, for convenience of description, only the portions related to the related invention are shown in the drawings.
It should be noted that, in the present disclosure, the embodiments and features of the embodiments may be combined with each other without conflict. The present disclosure will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
Fig. 1 illustrates an exemplary system architecture 100 of a method for generating an in-vehicle acoustic model or an apparatus for generating an in-vehicle acoustic model to which embodiments of the present disclosure may be applied.
As shown in FIG. 1, system architecture 100 may include a database server 101, a network 102, and a server 103. Network 102 is the medium used to provide communication links between database server 101 and server 103. Network 102 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
Database server 101 may interact with server 103 over network 102 to receive or send messages and the like. Database server 101 may be implemented as a distributed cluster of servers that provide various data storage services, or as a single server, for example a server storing a set of training samples. The database server 101 may send the stored training sample set to the server 103.
Server 103 may interact with database server 101 through network 102 to receive or send messages and the like. The server 103 may be hardware or software. When the server 103 is hardware, it may be implemented as a distributed server cluster providing various information processing services, or as a single server, for example a server that trains an acoustic model using a training sample set. The server may select an acoustic model from a pre-trained set of acoustic models as the initial acoustic model. Then, a pre-generated training sample set is obtained; each training sample comprises sample vehicle-mounted voice data and a sample vehicle-mounted voice recognition result corresponding to the sample vehicle-mounted voice data. Finally, based on the initial acoustic model, the sample vehicle-mounted voice data in the training samples in the training sample set is taken as input, the sample vehicle-mounted voice recognition result corresponding to the input sample vehicle-mounted voice data is taken as expected output, and the vehicle-mounted acoustic model is obtained through training. When the server 103 is software, it may be installed in the above-listed servers. It may be implemented as multiple pieces of software or software modules (e.g., to provide distributed services) or as a single piece of software or software module. It is not particularly limited herein.
It should be noted that the method for generating the vehicle-mounted acoustic model provided by the embodiment of the present disclosure is generally performed by the server 103, and accordingly, the apparatus for generating the vehicle-mounted acoustic model is generally disposed in the server 103. It should be noted that the training sample set may also be stored locally on server 103, and the training sample set may be extracted locally by server 103, in which case exemplary system architecture 100 may not include database server 101 and network 102.
It should be understood that the number of database servers, networks, and servers in FIG. 1 is illustrative only. There may be any number of database servers, networks, and servers, as desired for an implementation.
With continued reference to FIG. 2, a flow 200 of one embodiment of a method for generating a vehicle-mounted acoustic model according to the present disclosure is shown. The method for generating the vehicle-mounted acoustic model comprises the following steps:
step 201, selecting an acoustic model from a pre-trained acoustic model group as an initial acoustic model.
In the present embodiment, an execution subject (e.g., the server 103 shown in fig. 1) of the method for generating a vehicle-mounted acoustic model may select an acoustic model from a pre-trained acoustic model group as the initial acoustic model in various ways. As an example, the execution subject may randomly select one acoustic model from the acoustic model group as the initial acoustic model. The acoustic models in the acoustic model group are generally used for characterizing the correspondence between speech data and speech recognition results. An acoustic model may be a correspondence table that is generated based on statistics over a large amount of speech data and recognition results and stores a plurality of correspondences between speech data and speech recognition results. An acoustic model may also be a model obtained by training an initial model (for example, a convolutional neural network (CNN), a residual network (ResNet), or the like) by a machine learning method based on training samples.
In some optional implementations of this embodiment, selecting an acoustic model from a pre-trained acoustic model group may include:
in a first step, a pre-stored set of test samples is obtained. The test sample comprises real vehicle-mounted voice data and a real vehicle-mounted voice recognition result corresponding to the real vehicle-mounted voice data. Here, the real in-vehicle voice data is generally voice data collected in an in-vehicle environment. The real vehicle-mounted voice recognition result usually records the actual content of the real vehicle-mounted voice data.
Secondly, for each acoustic model in the acoustic model group, the following statistical steps are performed. First, for a test sample in the test sample set, the real vehicle-mounted voice data in the test sample is input into the acoustic model to obtain a model output result. Here, the execution subject may input each test sample in the test sample set into the acoustic model to obtain a model output result of the acoustic model for each test sample. Then, the recognition rate of the acoustic model for the test sample is determined according to the real vehicle-mounted voice recognition result in the test sample and the obtained model output result. Here, the execution subject may compare the model output result of the acoustic model for the test sample with the real vehicle-mounted speech recognition result in the test sample. If the model output result is the same as the real vehicle-mounted voice recognition result, the recognition is considered correct, in which case the recognition rate may be 100%. If not, the recognition is incorrect, in which case the recognition rate may be 0. Optionally, the execution subject may instead calculate the similarity between the model output result and the real vehicle-mounted speech recognition result using a similarity calculation formula, and use this similarity as the recognition rate. As an example, the similarity calculation formula may be a cosine similarity calculation formula. Finally, the determined recognition rate is stored into a recognition rate set.
And thirdly, determining the acoustic model from the acoustic model group according to the recognition rate set group corresponding to the acoustic model group. Here, each acoustic model in the set of acoustic models corresponds to a set of recognition rates. The executing subject may first calculate a sum of the recognition rates in each recognition rate set, and then select the acoustic model corresponding to the largest sum from the acoustic model group.
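To make the selection procedure concrete, the following is a minimal Python sketch of choosing the initial acoustic model by summed recognition rate. It assumes, purely for illustration, that each candidate model exposes a recognize(audio) method and that test samples are (audio, transcript) pairs; the patent does not prescribe any such API.

```python
# Minimal sketch of the statistical selection step. Assumes, for illustration
# only, that each candidate model exposes recognize(audio) -> str and that
# test_samples is a list of (audio, transcript) pairs.

def recognition_rate(model, audio, transcript):
    """1.0 (i.e., 100%) if the model output matches the reference exactly, else 0.0."""
    return 1.0 if model.recognize(audio) == transcript else 0.0

def select_initial_model(models, test_samples):
    """Return the model whose summed recognition rate over the test set is largest."""
    best_model, best_score = None, float("-inf")
    for model in models:
        # One recognition-rate set per model; its sum is the model's score.
        score = sum(recognition_rate(model, audio, transcript)
                    for audio, transcript in test_samples)
        if score > best_score:
            best_model, best_score = model, score
    return best_model
```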
Step 202, a pre-generated training sample set is obtained.
The training sample comprises sample vehicle-mounted voice data and a sample vehicle-mounted voice recognition result corresponding to the sample vehicle-mounted voice data.
It should be noted that the training sample set may be directly stored locally, or may be stored in other electronic devices communicatively connected to the executing entity. When the training sample set is stored locally, the executing agent may directly extract the locally stored training sample set for processing. When the training sample set is stored in other electronic equipment in communication connection with the execution subject, the execution subject may acquire the training sample set for processing through a wired connection manner or a wireless connection manner.
In some optional implementations of this embodiment, the training sample set may be generated by the execution subject, or by another execution subject used for generating training sample sets, through the following steps:
the method comprises the steps of firstly, obtaining voice data under at least one voice interaction scene to obtain a voice data set. Here, the execution subject may acquire voice data in a plurality of voice interaction scenarios through a microphone, a mobile phone, or other devices capable of collecting voice. The voice interaction scene may refer to various scenes in which a person performs voice interaction with a machine through voice. As an example, the voice interaction scenario may be a scenario in which a person performs voice interaction with a map application through voice to realize map navigation. As yet another example, the voice interaction scene may also be a scene in which a person performs voice interaction with a sound box through voice to realize music playing. It should be noted that the obtained voice data in the voice interaction scenario generally refers to voice data uttered by a person in the voice interaction scenario.
And secondly, for the voice data in the voice data set, generating simulated voice data corresponding to the voice data as sample vehicle-mounted voice data based on a predetermined vehicle-mounted impulse response data set and a predetermined vehicle-mounted noise data set, and storing the simulated voice data into the sample vehicle-mounted voice data set. Here, the vehicle-mounted impulse response data is the output signal generated at a receiving point after an impulse function, as an input signal, is propagated and reflected in the space formed by the vehicle. The impulse function is a function whose signal intensity is zero at every point other than zero and whose integral over the entire domain is equal to 1. The vehicle-mounted noise data may be noise data measured in a vehicle under different application scenarios. As an example, the application scenario may be a sunny scenario or a rainy scenario. The simulated voice data in the simulated voice data set generally refers to voice data that is synthesized using speech synthesis technology to simulate speech in a vehicle-mounted environment.
Optionally, the execution subject may generate the simulated voice data corresponding to the voice data as follows. For each item of vehicle-mounted impulse response data in the vehicle-mounted impulse response data set, the following selection step is executed: select vehicle-mounted noise data from the vehicle-mounted noise data set, and execute the following storage step: substitute the vehicle-mounted impulse response data, the voice data, and the selected vehicle-mounted noise data into a predetermined simulation data determination function to generate simulated voice data, and store the simulated voice data into the simulated voice data set; then determine whether unselected vehicle-mounted noise data exist in the vehicle-mounted noise data set; in response to determining that unselected vehicle-mounted noise data exist, select unselected vehicle-mounted noise data from the vehicle-mounted noise data set and continue the storage step.
The simulation data determination function is used for representing the correspondence among the voice data, the vehicle-mounted impulse response data, the vehicle-mounted noise data, and the simulated voice data. Alternatively, the expression of the simulation data determination function may be y = h * x + u, where y is the simulated voice data, h is the vehicle-mounted impulse response data, x is the voice data, * is the convolution operation, + is the signal superposition operation, and u is the vehicle-mounted noise data.
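As a rough illustration of this function, the following sketch convolves clean speech with a cabin impulse response and superimposes cabin noise using NumPy. The sample rates, array shapes, and scaling are simplifying assumptions; the patent only specifies the relation y = h * x + u.

```python
import numpy as np

def simulate_in_vehicle_speech(x, h, u):
    """Sketch of y = h * x + u: convolve clean speech x with a vehicle
    impulse response h, then superimpose vehicle noise u. All signals are
    assumed to be 1-D float arrays at the same sample rate; scaling and
    alignment details are omitted."""
    y = np.convolve(x, h)         # h * x: cabin reverberation via convolution
    n = min(len(y), len(u))
    return y[:n] + u[:n]          # + u: additive cabin noise

# One simulated sample per (impulse response, noise) combination, as in the
# selection/storage steps described above:
# samples = [simulate_in_vehicle_speech(x, h, u)
#            for h in impulse_responses for u in noise_clips]
```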
And thirdly, for the sample vehicle-mounted voice data in the sample vehicle-mounted voice data set, firstly, marking the sample vehicle-mounted voice data to obtain a sample vehicle-mounted voice recognition result. And then, storing the sample vehicle-mounted voice data and the obtained sample vehicle-mounted voice recognition result as training samples into a training sample set.
It is to be noted that, in the respective embodiments of the present disclosure, the vehicle described above may be any of various vehicles, such as an unmanned vehicle, or various other craft, such as an aircraft or a ship.
And 203, based on the initial acoustic model, taking the sample vehicle-mounted voice data in the training samples in the training sample set as input, taking the sample vehicle-mounted voice recognition result corresponding to the input sample vehicle-mounted voice data as expected output, and training to obtain the vehicle-mounted acoustic model.
The vehicle-mounted acoustic model is used for representing the corresponding relation between the voice data and the voice recognition result.
In this embodiment, the execution subject may train to obtain the vehicle-mounted acoustic model by:
the method comprises the following steps of firstly, selecting samples from a training sample set, and executing the following training steps: firstly, inputting sample vehicle-mounted voice data in a selected training sample into an initial acoustic model to obtain actual output. Then, the sample in-vehicle voice recognition result corresponding to the input sample in-vehicle voice data is compared with the obtained actual output. And determining whether the initial acoustic model is trained according to the comparison result. Here, the comparison between the sample in-vehicle voice recognition result corresponding to the input sample in-vehicle voice data and the obtained actual output may be a comparison of whether or not a difference between the two is smaller than a set difference threshold. If the value is less than the predetermined value, the training is considered to be completed. If not, it can be considered that no training is completed. In addition, the above-described comparison between the sample in-vehicle speech recognition result corresponding to the input sample in-vehicle speech data and the obtained actual output may be performed to compare whether or not the similarity between the two exceeds a set similarity threshold. If so, the training may be considered complete. If not, it may be considered that no training is complete. Finally, in response to determining that the initial acoustic model training is complete, the initial acoustic model is taken as the in-vehicle acoustic model.
The second step, in response to determining that training of the initial acoustic model is not complete, is to adjust the parameters of the initial acoustic model, reselect a sample from the training sample set, use the adjusted initial acoustic model as the initial acoustic model, and continue the training step. Here, the execution subject may adjust the parameters of the initial acoustic model in a preset parameter adjustment manner. As an example, the execution subject may increase a parameter by a set magnitude at each adjustment: if a parameter of the model is m before adjustment, it becomes m + h after one adjustment, m + h + h after the next, and so on.
In some optional implementations of this embodiment, the executing subject may further train to obtain the vehicle-mounted acoustic model by:
firstly, selecting training samples from a training sample set, and executing the following training steps: firstly, inputting sample vehicle-mounted voice data in a selected training sample into an initial acoustic model to obtain actual output. Then, parameters of the initial acoustic model are adjusted according to a sample onboard voice recognition result corresponding to the input sample onboard voice data and the obtained difference of the actual output. Here, the execution subject may adjust the parameters of the initial acoustic model in a parameter adjustment manner set in advance. As an example, the executive may adjust the parameters of the initial acoustic model by a set magnitude per reduction. For example, if a parameter of the model is m before tuning, it becomes m-h after tuning. And when the adjustment is performed again, adjusting the adjustment to m-h-h. And so on. And then, determining whether the training sample set has the training samples which are not selected, and determining the adjusted initial acoustic model as the vehicle-mounted acoustic model in response to the determination that the training samples do not exist.
Optionally, adjusting the parameters of the initial acoustic model according to the difference between the sample vehicle-mounted voice recognition result corresponding to the input sample vehicle-mounted voice data and the obtained actual output includes the following. First, the sample vehicle-mounted voice recognition result corresponding to the input sample vehicle-mounted voice data and the obtained actual output are input into a predetermined loss function to obtain a loss value. The loss function is generally a function describing the degree of inconsistency between the actual output and the expected output. Alternatively, the loss function may be a connectionist temporal classification (CTC) loss function. It is to be noted that, when the loss function is a CTC loss function, the execution subject may obtain the difference between the sample vehicle-mounted voice recognition result and the obtained actual output by counting the difference between each sentence in the actual output and the sentence at the corresponding position of the corresponding sample vehicle-mounted voice recognition result. Compared with the prior art, which counts differences between the phonemes within sentences, counting differences between whole sentences takes less comparison time. Therefore, the model training method in this embodiment can save the training time consumed during model training. Then, in response to determining that the obtained loss value is greater than a preset loss threshold, the parameters of the initial acoustic model are adjusted. Here, when the obtained loss value is greater than the preset loss threshold, the execution subject may adjust the parameters of the initial acoustic model in various ways. As an example, the execution subject may adopt a gradient descent method: calculate the gradient of the loss function with respect to each parameter of the initial acoustic model, determine the variation of each parameter according to the gradient, and superimpose the parameter and its variation to form the adjusted parameter. The preset loss threshold may be a value preset by a technician.
And secondly, in response to determining that unselected training samples exist, selecting an unselected training sample from the training sample set, using the adjusted initial acoustic model as the initial acoustic model, and continuing the training step.
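For concreteness, the loss-driven training step above might look roughly like the following PyTorch sketch using the built-in CTC loss and gradient descent. The tiny model, feature dimensions, threshold value, and the training_samples iterable are all illustrative assumptions; the patent does not prescribe a framework, architecture, or concrete values.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for the selected initial acoustic model: any module
# mapping feature frames to per-frame log-probabilities over output labels.
class TinyAcousticModel(nn.Module):
    def __init__(self, n_feats=40, n_labels=30):
        super().__init__()
        self.rnn = nn.GRU(n_feats, 64)
        self.out = nn.Linear(64, n_labels)

    def forward(self, feats):                   # feats: (T, N, n_feats)
        h, _ = self.rnn(feats)
        return self.out(h).log_softmax(dim=-1)  # (T, N, n_labels)

model = TinyAcousticModel()
ctc_loss = nn.CTCLoss()                         # connectionist temporal classification loss
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)  # gradient descent
loss_threshold = 0.1                            # preset loss threshold (assumed value)

# training_samples is assumed to yield prepared batches: feats (T, N, n_feats),
# integer label targets, and the per-item lengths that the CTC loss requires.
for feats, targets, input_lengths, target_lengths in training_samples:
    log_probs = model(feats)                    # actual output of the initial model
    loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
    if loss.item() > loss_threshold:            # adjust parameters only while the
        optimizer.zero_grad()                   # loss exceeds the preset threshold
        loss.backward()                         # gradient of loss w.r.t. parameters
        optimizer.step()                        # superimpose parameters and updates
```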
With continued reference to fig. 3, fig. 3 is a schematic diagram of an application scenario of the method for generating an in-vehicle acoustic model according to the present embodiment.
In the application scenario of fig. 3, first, the server 301 selects an acoustic model M2 as an initial acoustic model from a pre-trained set of acoustic models { M1, M2, M3 }.
Then, the server 301 obtains a training sample set { (X1, Y1), (X2, Y2), (X3, Y3), (X4, Y4) } from the database server 302, where X1, X2, X3, and X4 are sample vehicle-mounted voice data, respectively, and Y1, Y2, Y3, and Y4 are the sample vehicle-mounted speech recognition results corresponding to X1, X2, X3, and X4, respectively.
Finally, the server 301 trains sample onboard speech data X1, X2, X3, and X4 in training samples in the training sample set { (X1, Y1), (X2, Y2), (X3, Y3), (X4, Y4) } as input, and sample onboard speech recognition results Y1, Y2, Y3, and Y4 corresponding to the input sample onboard speech data as desired output, based on the acoustic model M2, to obtain an onboard acoustic model.
The method for generating the vehicle-mounted acoustic model provided by the above-mentioned embodiment of the present disclosure may select an acoustic model from a pre-trained acoustic model group as the initial acoustic model. Then, a pre-generated training sample set is obtained. The training sample comprises sample vehicle-mounted voice data and a sample vehicle-mounted voice recognition result corresponding to the sample vehicle-mounted voice data. And finally, based on the initial acoustic model, taking the sample vehicle-mounted voice data in the training samples in the training sample set as input, taking the sample vehicle-mounted voice recognition result corresponding to the input sample vehicle-mounted voice data as expected output, and training to obtain the vehicle-mounted acoustic model. According to the method, the acoustic model is selected from the acoustic model group trained in advance to serve as the initial acoustic model for training, so that the vehicle-mounted acoustic model is generated, and the generation mode of the model is enriched.
With further reference to fig. 4, as an implementation of the methods shown in the above figures, the present disclosure provides an embodiment of an apparatus for generating an acoustic model for a vehicle, which corresponds to the method embodiment shown in fig. 2, and which is particularly applicable in various electronic devices.
As shown in fig. 4, the apparatus 400 for generating a vehicle-mounted acoustic model of the present embodiment includes: a model selection unit 401 configured to select an acoustic model from a pre-trained acoustic model group as an initial acoustic model; a sample acquisition unit 402 configured to acquire a pre-generated training sample set, wherein the training samples include sample in-vehicle voice data and a sample in-vehicle voice recognition result corresponding to the sample in-vehicle voice data; and a model training unit 403 configured to train a vehicle-mounted acoustic model by taking, as an input, sample vehicle-mounted voice data in training samples in the training sample set and taking, as an expected output, a sample vehicle-mounted voice recognition result corresponding to the input sample vehicle-mounted voice data, based on the initial acoustic model.
In some optional implementations of this embodiment, the model training unit 403 may be further configured to perform the following. The first step: select a training sample from the training sample set, and execute the following training step. First, input the sample vehicle-mounted voice data in the selected training sample into the initial acoustic model to obtain an actual output. Then, adjust the parameters of the initial acoustic model according to the difference between the sample vehicle-mounted voice recognition result corresponding to the input sample vehicle-mounted voice data and the obtained actual output, to obtain an adjusted initial acoustic model. Finally, determine whether unselected training samples exist in the training sample set; in response to determining that none exist, determine the adjusted initial acoustic model as the vehicle-mounted acoustic model. The second step: in response to determining that unselected training samples exist, select an unselected training sample from the training sample set, use the adjusted initial acoustic model as the initial acoustic model, and continue the training step.
In some optional implementations of this embodiment, in the model training unit 403, the adjusting parameters of the initial acoustic model according to the difference between the sample vehicle-mounted speech recognition result corresponding to the input sample vehicle-mounted speech data and the obtained actual output may include: first, a sample vehicle-mounted voice recognition result corresponding to input sample vehicle-mounted voice data and the obtained actual output are input to a predetermined loss function, and a loss value is obtained. Then, in response to determining that the resulting loss value is greater than a preset loss threshold, parameters of the initial acoustic model are adjusted.
In some optional implementations of this embodiment, the model selecting unit 401 may be further configured to: in a first step, a pre-stored set of test samples is obtained. The test sample comprises real vehicle-mounted voice data and a real vehicle-mounted voice recognition result corresponding to the real vehicle-mounted voice data. Secondly, for the acoustic models in the acoustic model group, the following statistical steps are performed: firstly, for a test sample in a test sample set, inputting real vehicle-mounted voice data in the test sample into the acoustic model to obtain a model output result. And then, determining the recognition rate of the acoustic model to the test sample according to the real vehicle-mounted voice recognition result in the test sample and the obtained model output result. Finally, the determined recognition rate is stored in a recognition rate set. And thirdly, determining the acoustic model from the acoustic model group according to the recognition rate set group corresponding to the acoustic model group.
In some optional implementations of this embodiment, in the sample obtaining unit 402, the training sample set may be generated by: the method comprises the steps of firstly, obtaining voice data under at least one voice interaction scene to obtain a voice data set. And secondly, generating simulated voice data corresponding to the voice data as sample vehicle-mounted voice data based on a predetermined vehicle-mounted impulse response data set and a predetermined vehicle-mounted noise data set for the voice data in the voice data set, and storing the simulated voice data into the sample vehicle-mounted voice data set. And thirdly, labeling the sample vehicle-mounted voice data in the sample vehicle-mounted voice data set to obtain a sample vehicle-mounted voice recognition result. And taking the sample vehicle-mounted voice data and the obtained sample vehicle-mounted voice recognition result as training samples and storing the training samples into a training sample set.
It will be understood that the elements described in the apparatus 400 correspond to various steps in the method described with reference to fig. 2. Thus, the operations, features and resulting advantages described above with respect to the method are also applicable to the apparatus 400 and the units included therein, and will not be described herein again.
Referring to fig. 5, a flow 500 of one embodiment of a method for recognizing speech provided by the present disclosure is shown. The method for recognizing speech may include the steps of:
step 501, vehicle-mounted voice data is received.
In the present embodiment, the execution subject of the method for recognizing speech (for example, a vehicle-mounted terminal) may receive vehicle-mounted voice data through a device with a voice receiving function, such as a microphone, that the execution subject itself provides or that is communicatively connected to it. The vehicle-mounted voice data is voice data corresponding to speech uttered by a user in a vehicle-mounted environment.
Step 502, inputting the vehicle-mounted voice data into the vehicle-mounted acoustic model to obtain a vehicle-mounted voice recognition result corresponding to the vehicle-mounted voice data.
In this embodiment, the vehicle-mounted acoustic model is used to represent the corresponding relationship between the vehicle-mounted voice data and the vehicle-mounted voice recognition result. After the execution main body inputs the received vehicle-mounted voice data into the vehicle-mounted acoustic model, a vehicle-mounted voice recognition result corresponding to the vehicle-mounted voice data can be obtained.
In the present embodiment, the vehicle-mounted acoustic model may be generated by using the method described in the embodiment of fig. 2. For a specific generation process, reference may be made to the related description of the embodiment in fig. 2, which is not described herein again.
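Purely as an illustration of this inference step, the sketch below runs the generated model over received audio. The feature-extraction and decoding helpers are hypothetical placeholders passed in as parameters, since the patent does not define them.

```python
# Hypothetical inference sketch for the method above. extract_features and
# greedy_decode are illustrative placeholders, not APIs defined by the patent.
def recognize_in_vehicle_speech(audio, model, extract_features, greedy_decode):
    feats = extract_features(audio)   # e.g., filterbank features of the audio
    log_probs = model(feats)          # vehicle-mounted acoustic model output
    return greedy_decode(log_probs)   # vehicle-mounted speech recognition result
```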
It should be noted that the method for recognizing speech of the present embodiment can be used to test the vehicle-mounted acoustic model generated according to the above embodiments, so that the vehicle-mounted acoustic model can be continuously optimized according to the test results. The method may also be a practical application of the vehicle-mounted acoustic model generated by the above embodiments. Using that model for voice recognition improves the performance of voice recognition in a vehicle-mounted environment; for example, the vehicle-mounted voice data can be recognized more quickly.
It should be noted that, generally, before an acoustic model is trained, a large amount of real voice data is often collected manually, during the training sample preparation stage, from the real scenes in which the acoustic model may be used, so that the acoustic model can be trained on the collected real voice data. The more training samples there are, the more accurate the resulting acoustic model's speech recognition. However, at present, voice recognition products for vehicle-mounted scenes have only been in development for a short time, and users have not been exposed to such products for long, so the real voice data accumulated from real scenes at this stage is extremely limited. The method provided by the above embodiments of the present disclosure may use voice data from at least one voice interaction scenario, and then generate simulated voice data corresponding to each item of voice data using the voice data from multiple voice interaction scenarios, a predetermined vehicle-mounted impulse response data set, and a predetermined vehicle-mounted noise data set. Abundant training samples close to the vehicle-mounted scene are thereby obtained. In addition, according to the method provided by the embodiments of the present disclosure, the acoustic model with the highest recognition rate for voice data in vehicle-mounted scenes is selected from the pre-trained acoustic model group as the initial acoustic model for training, which helps accelerate model training. Through the above analysis, the vehicle-mounted acoustic model adopted in this embodiment can recognize vehicle-mounted voice data more accurately.
With continuing reference to FIG. 6, as an implementation of the method illustrated in FIG. 5 above, the present disclosure provides one embodiment of an apparatus for recognizing speech. The embodiment of the device corresponds to the embodiment of the method shown in fig. 5, and the device can be applied to various electronic devices.
As shown in fig. 6, the apparatus 600 for recognizing a speech of the present embodiment may include: a voice receiving unit 601 configured to receive in-vehicle voice data; and a voice recognition unit 602 configured to input the vehicle-mounted voice data into the vehicle-mounted acoustic model generated by the method described in the embodiment of fig. 2, and obtain a vehicle-mounted voice recognition result corresponding to the vehicle-mounted voice data.
It will be understood that the elements described in the apparatus 600 correspond to various steps in the method described with reference to fig. 5. Thus, the operations, features and resulting advantages described above with respect to the method are also applicable to the apparatus 600 and the units included therein, and are not described herein again.
Referring now to FIG. 7, a block diagram of an electronic device (e.g., the server of FIG. 1) 700 suitable for use in implementing embodiments of the present disclosure is shown. The electronic device shown in fig. 7 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present disclosure.
As shown in fig. 7, the electronic device 700 may include a processing means (e.g., a central processing unit (CPU), a graphics processor, etc.) 701 that may perform various appropriate actions and processes in accordance with a program stored in a read-only memory (ROM) 702 or a program loaded from a storage means 708 into a random access memory (RAM) 703. In the RAM 703, various programs and data necessary for the operation of the electronic device 700 are also stored. The processing device 701, the ROM 702, and the RAM 703 are connected to each other by a bus 704. An input/output (I/O) interface 705 is also connected to bus 704.
Generally, the following devices may be connected to the I/O interface 705: input devices 706 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; an output device 707 including, for example, a Liquid Crystal Display (LCD), a speaker, a vibrator, and the like; storage 708 including, for example, magnetic tape, hard disk, etc.; and a communication device 709. The communication means 709 may allow the electronic device 700 to communicate wirelessly or by wire with other devices to exchange data. While fig. 7 illustrates an electronic device 700 having various means, it is to be understood that not all illustrated means are required to be implemented or provided. More or fewer devices may alternatively be implemented or provided. Each block shown in fig. 7 may represent one device or may represent multiple devices as desired.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such embodiments, the computer program may be downloaded and installed from a network via the communication means 709, or may be installed from the storage means 708, or may be installed from the ROM 702. The computer program, when executed by the processing device 701, performs the above-described functions defined in the methods of embodiments of the present disclosure. It should be noted that the computer readable medium of the embodiments of the present disclosure may be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In embodiments of the disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In embodiments of the present disclosure, however, a computer readable signal medium may comprise a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.
The computer readable medium may be embodied in the electronic device; or may exist separately without being assembled into the electronic device. The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: selecting an acoustic model from a pre-trained acoustic model group as an initial acoustic model; acquiring a pre-generated training sample set, wherein the training sample comprises sample vehicle-mounted voice data and a sample vehicle-mounted voice recognition result corresponding to the sample vehicle-mounted voice data; based on the initial acoustic model, taking sample vehicle-mounted voice data in training samples in the training sample set as input, taking a sample vehicle-mounted voice recognition result corresponding to the input sample vehicle-mounted voice data as expected output, and training to obtain a vehicle-mounted acoustic model.
Further, the one or more programs, when executed by the electronic device, may further cause the electronic device to: receiving vehicle-mounted voice data; and inputting the vehicle-mounted voice data into the vehicle-mounted acoustic model to obtain a vehicle-mounted voice recognition result corresponding to the vehicle-mounted voice data. The vehicle-mounted acoustic model may be generated by using the method for generating the vehicle-mounted acoustic model described in the foregoing embodiments.
Computer program code for carrying out operations for embodiments of the present disclosure may be written in any combination of one or more programming languages, including an object-oriented programming language such as Java, Smalltalk, or C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in the embodiments of the present disclosure may be implemented by software or hardware. The described units may also be provided in a processor, which may be described as: a processor comprising a model selection unit, a sample acquisition unit, and a model training unit. The names of these units do not, in some cases, limit the units themselves; for example, the model selection unit may also be described as "a unit that selects an acoustic model from a pre-trained acoustic model group as an initial acoustic model".
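By way of illustration only, this unit decomposition could be expressed in software roughly as follows, reusing the AcousticModel stand-in from the earlier sketch. The class names and trivial bodies are assumptions of the sketch; the actual selection and training logic is set out in the claims below.

class ModelSelectionUnit:
    def select(self, model_group):
        # Stand-in for recognition-rate-based selection (claim 1).
        return model_group[0]

class SampleAcquisitionUnit:
    def acquire(self, training_samples):
        # Stand-in for fetching the pre-generated training sample set.
        return list(training_samples)

class ModelTrainingUnit:
    def train(self, initial_model, training_samples):
        # Stand-in for the supervised training loop (claim 2).
        for speech, expected in training_samples:
            initial_model.update(speech, expected)
        return initial_model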
The foregoing description presents only preferred embodiments of the present disclosure and illustrates the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the invention in the present disclosure is not limited to technical solutions formed by the specific combination of the above features, and also covers other technical solutions formed by any combination of the above features or their equivalents without departing from the inventive concept, for example, technical solutions formed by replacing the above features with (but not limited to) features having similar functions disclosed in the present disclosure.

Claims (12)

1. A method for generating an in-vehicle acoustic model, comprising:
selecting an acoustic model from a pre-trained acoustic model group as an initial acoustic model, wherein the acoustic models in the pre-trained acoustic model group are used for representing the correspondence between voice data and voice recognition results;
acquiring a pre-generated training sample set, wherein each training sample comprises sample vehicle-mounted voice data and a sample vehicle-mounted voice recognition result corresponding to the sample vehicle-mounted voice data;
based on the initial acoustic model, taking sample vehicle-mounted voice data in training samples in the training sample set as input, taking a sample vehicle-mounted voice recognition result corresponding to the input sample vehicle-mounted voice data as expected output, and training to obtain a vehicle-mounted acoustic model;
wherein the selecting an acoustic model from a pre-trained acoustic model group comprises: acquiring a pre-stored test sample set, wherein each test sample comprises real vehicle-mounted voice data and a real vehicle-mounted voice recognition result corresponding to the real vehicle-mounted voice data; for each acoustic model in the acoustic model group, performing the following statistical steps: for each test sample in the test sample set, inputting the real vehicle-mounted voice data in the test sample into the acoustic model to obtain a model output result; determining the recognition rate of the acoustic model for the test sample according to the real vehicle-mounted voice recognition result in the test sample and the obtained model output result; and storing the determined recognition rate into a recognition rate set; and determining the acoustic model from the acoustic model group according to the group of recognition rate sets corresponding to the acoustic model group.
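One natural reading of this selection step, sketched in Python with the stand-ins from the sketches above: compute a per-sample recognition rate for every candidate model, then keep the model whose aggregated rate is highest. The exact-match scoring and the averaging are assumptions of the sketch; the claim only fixes that selection follows the recognition rate sets.

def select_initial_model(model_group, test_samples):
    """Return the acoustic model with the best recognition rate on the test set."""
    best_model, best_rate = None, -1.0
    for model in model_group:
        # Statistical step: one recognition rate per test sample.
        rates = [
            1.0 if model.recognize(audio) == expected else 0.0
            for audio, expected in test_samples
        ]
        rate = sum(rates) / len(rates)  # aggregate the recognition rate set
        if rate > best_rate:
            best_model, best_rate = model, rate
    return best_model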
2. The method of claim 1, wherein training a vehicle-mounted acoustic model based on the initial acoustic model using, as input, sample vehicle-mounted speech data in training samples in the training sample set and using, as an expected output, a sample vehicle-mounted speech recognition result corresponding to the input sample vehicle-mounted speech data comprises:
selecting a training sample from the training sample set, and executing the following training step: inputting the sample vehicle-mounted voice data in the selected training sample into the initial acoustic model to obtain an actual output; adjusting parameters of the initial acoustic model according to the difference between the sample vehicle-mounted voice recognition result corresponding to the input sample vehicle-mounted voice data and the obtained actual output, to obtain an adjusted initial acoustic model; determining whether any unselected training sample exists in the training sample set; and in response to determining that no unselected training sample exists, determining the adjusted initial acoustic model as the vehicle-mounted acoustic model;
in response to determining that an unselected training sample exists, taking the adjusted initial acoustic model as the initial acoustic model, selecting an unselected training sample from the training sample set, and continuing to perform the training step.
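Sketched with the same stand-ins, this training step amounts to the following loop. Adjusting only on a mismatch is a simplification assumed here, since the claim leaves the adjustment rule to claim 3.

def train_in_vehicle_model(initial_model, training_samples):
    remaining = list(training_samples)       # the unselected training samples
    while remaining:                         # repeat the training step
        speech, expected = remaining.pop(0)  # select an unselected sample
        actual = initial_model.recognize(speech)
        if actual != expected:               # the difference drives adjustment
            initial_model.update(speech, expected)
    return initial_model                     # the vehicle-mounted acoustic model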
3. The method of claim 2, wherein the adjusting parameters of the initial acoustic model according to the difference between the sample vehicle-mounted voice recognition result corresponding to the input sample vehicle-mounted voice data and the obtained actual output comprises:
inputting a sample vehicle-mounted voice recognition result corresponding to the input sample vehicle-mounted voice data and the obtained actual output into a predetermined loss function to obtain a loss value;
in response to determining that the resulting loss value is greater than a preset loss threshold, parameters of the initial acoustic model are adjusted.
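As a worked illustration of claim 3, the check below uses squared error as the predetermined loss function and an arbitrary threshold value; both choices are assumptions of the sketch, since the claim names neither.

LOSS_THRESHOLD = 0.01  # illustrative preset loss threshold

def squared_error(expected, actual):
    return (expected - actual) ** 2

def should_adjust(expected_score, actual_score):
    """Return True when the loss exceeds the preset threshold."""
    return squared_error(expected_score, actual_score) > LOSS_THRESHOLD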
4. The method according to one of claims 1-3, wherein the training sample set is generated by:
acquiring voice data in at least one voice interaction scenario to obtain a voice data set;
for voice data in the voice data set, generating simulated voice data corresponding to the voice data as sample vehicle-mounted voice data based on a predetermined vehicle-mounted impulse response data set and a predetermined vehicle-mounted noise data set, and storing the sample vehicle-mounted voice data into a sample vehicle-mounted voice data set;
for sample vehicle-mounted voice data in the sample vehicle-mounted voice data set, marking the sample vehicle-mounted voice data to obtain a sample vehicle-mounted voice recognition result; and taking the sample vehicle-mounted voice data and the obtained sample vehicle-mounted voice recognition result as training samples and storing the training samples into a training sample set.
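A minimal numpy sketch of this simulation step, assuming the impulse response and noise are already available as arrays: convolve clean speech with an in-vehicle impulse response, then mix in cabin noise. The mixing gain and toy signals are illustrative choices, not values from the disclosure.

import numpy as np

def simulate_in_vehicle_speech(clean, impulse_response, noise, noise_gain=0.1):
    """Turn clean speech into simulated in-vehicle speech."""
    reverberant = np.convolve(clean, impulse_response)  # apply cabin acoustics
    noise = np.resize(noise, reverberant.shape)         # match signal length
    return reverberant + noise_gain * noise             # add cabin noise

clean = np.random.randn(16000)        # 1 s of toy "speech" at 16 kHz
rir = np.array([1.0, 0.6, 0.3, 0.1])  # toy in-vehicle impulse response
noise = np.random.randn(16000)        # toy cabin noise
sample = simulate_in_vehicle_speech(clean, rir, noise)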
5. A method for recognizing speech, comprising:
receiving vehicle-mounted voice data;
inputting the vehicle-mounted voice data into a vehicle-mounted acoustic model generated by the method according to one of claims 1-4, to obtain a vehicle-mounted voice recognition result corresponding to the vehicle-mounted voice data.
6. An apparatus for generating an in-vehicle acoustic model, comprising:
a model selection unit configured to select an acoustic model from a pre-trained acoustic model group as an initial acoustic model, wherein the acoustic models in the pre-trained acoustic model group are used for representing the correspondence between voice data and voice recognition results;
a sample acquisition unit configured to acquire a pre-generated training sample set, wherein each training sample includes sample vehicle-mounted voice data and a sample vehicle-mounted voice recognition result corresponding to the sample vehicle-mounted voice data;
a model training unit configured to train a vehicle-mounted acoustic model by taking sample vehicle-mounted voice data in training samples in the training sample set as input and taking a sample vehicle-mounted voice recognition result corresponding to the input sample vehicle-mounted voice data as expected output based on the initial acoustic model;
wherein the model selection unit is further configured to: acquire a pre-stored test sample set, wherein each test sample comprises real vehicle-mounted voice data and a real vehicle-mounted voice recognition result corresponding to the real vehicle-mounted voice data; for each acoustic model in the acoustic model group, perform the following statistical steps: for each test sample in the test sample set, input the real vehicle-mounted voice data in the test sample into the acoustic model to obtain a model output result; determine the recognition rate of the acoustic model for the test sample according to the real vehicle-mounted voice recognition result in the test sample and the obtained model output result; and store the determined recognition rate into a recognition rate set; and determine the acoustic model from the acoustic model group according to the group of recognition rate sets corresponding to the acoustic model group.
7. The apparatus of claim 6, wherein the model training unit is further configured to:
selecting a training sample from the training sample set, and executing the following training step: inputting the sample vehicle-mounted voice data in the selected training sample into the initial acoustic model to obtain an actual output; adjusting parameters of the initial acoustic model according to the difference between the sample vehicle-mounted voice recognition result corresponding to the input sample vehicle-mounted voice data and the obtained actual output, to obtain an adjusted initial acoustic model; determining whether any unselected training sample exists in the training sample set; and in response to determining that no unselected training sample exists, determining the adjusted initial acoustic model as the vehicle-mounted acoustic model;
in response to determining that an unselected training sample exists, taking the adjusted initial acoustic model as the initial acoustic model, selecting an unselected training sample from the training sample set, and continuing to perform the training step.
8. The apparatus of claim 7, wherein the adjusting parameters of the initial acoustic model according to the difference between the sample vehicle-mounted voice recognition result corresponding to the input sample vehicle-mounted voice data and the obtained actual output comprises:
inputting a sample vehicle-mounted voice recognition result corresponding to the input sample vehicle-mounted voice data and the obtained actual output into a predetermined loss function to obtain a loss value;
in response to determining that the resulting loss value is greater than a preset loss threshold, parameters of the initial acoustic model are adjusted.
9. The apparatus according to one of claims 6-8, wherein in the sample acquisition unit, the training sample set is generated by:
acquiring voice data in at least one voice interaction scenario to obtain a voice data set;
for voice data in the voice data set, generating simulated voice data corresponding to the voice data as sample vehicle-mounted voice data based on a predetermined vehicle-mounted impulse response data set and a predetermined vehicle-mounted noise data set, and storing the sample vehicle-mounted voice data into a sample vehicle-mounted voice data set;
for sample vehicle-mounted voice data in the sample vehicle-mounted voice data set, marking the sample vehicle-mounted voice data to obtain a sample vehicle-mounted voice recognition result; and taking the sample vehicle-mounted voice data and the obtained sample vehicle-mounted voice recognition result as training samples and storing the training samples into a training sample set.
10. An apparatus for recognizing speech, comprising:
a voice receiving unit configured to receive in-vehicle voice data;
a speech recognition unit configured to input the vehicle-mounted speech data into a vehicle-mounted acoustic model generated using the method according to one of claims 1 to 4, resulting in a vehicle-mounted speech recognition result corresponding to the vehicle-mounted speech data.
11. An electronic device, comprising:
one or more processors;
a storage device having one or more programs stored thereon,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method according to any one of claims 1-5.
12. A computer-readable medium, on which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1-5.
CN201910075039.2A 2019-01-25 2019-01-25 Method and apparatus for generating an on-board acoustic model Active CN109637525B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910075039.2A CN109637525B (en) 2019-01-25 2019-01-25 Method and apparatus for generating an on-board acoustic model

Publications (2)

Publication Number Publication Date
CN109637525A CN109637525A (en) 2019-04-16
CN109637525B (en) 2020-06-09

Family

ID=66063758

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910075039.2A Active CN109637525B (en) 2019-01-25 2019-01-25 Method and apparatus for generating an on-board acoustic model

Country Status (1)

Country Link
CN (1) CN109637525B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111862945A (en) * 2019-05-17 2020-10-30 北京嘀嘀无限科技发展有限公司 Voice recognition method and device, electronic equipment and storage medium
CN110767222B (en) * 2019-06-19 2021-03-09 北京嘀嘀无限科技发展有限公司 Order receiving method and device
CN110767215A (en) * 2019-08-01 2020-02-07 北京嘀嘀无限科技发展有限公司 Method and device for training voice recognition model and recognizing voice
CN111292766B (en) * 2020-02-07 2023-08-08 抖音视界有限公司 Method, apparatus, electronic device and medium for generating voice samples
CN111862933A (en) * 2020-07-20 2020-10-30 北京字节跳动网络技术有限公司 Method, apparatus, device and medium for generating synthesized speech
CN112992168B (en) * 2021-02-26 2024-04-19 平安科技(深圳)有限公司 Speech noise reducer training method, device, computer equipment and storage medium
CN113593531B (en) * 2021-07-30 2024-05-03 思必驰科技股份有限公司 Voice recognition model training method and system

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108242234A (en) * 2018-01-10 2018-07-03 腾讯科技(深圳)有限公司 Speech recognition modeling generation method and its equipment, storage medium, electronic equipment
CN108630190A (en) * 2018-05-18 2018-10-09 百度在线网络技术(北京)有限公司 Method and apparatus for generating phonetic synthesis model
CN108766428A (en) * 2018-06-01 2018-11-06 安徽江淮汽车集团股份有限公司 A kind of voice broadcast control method and system

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP3826032B2 (en) * 2001-12-28 2006-09-27 株式会社東芝 Speech recognition apparatus, speech recognition method, and speech recognition program

Also Published As

Publication number Publication date
CN109637525A (en) 2019-04-16

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant