CN111862945A - Voice recognition method and device, electronic equipment and storage medium

Info

Publication number: CN111862945A
Authority: CN (China)
Prior art keywords: voice, service provider, model, trained, sample
Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Application number: CN201910412514.0A
Other languages: Chinese (zh)
Inventors: 赵帅江, 赵茜, 罗讷
Current Assignee: Beijing Didi Infinity Technology and Development Co Ltd (the listed assignees may be inaccurate; Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list)
Original Assignee: Beijing Didi Infinity Technology and Development Co Ltd
Application filed by Beijing Didi Infinity Technology and Development Co Ltd
Priority to CN201910412514.0A
Publication of CN111862945A

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/02: Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/06: Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063: Training
    • G10L15/08: Speech classification or search
    • G10L15/16: Speech classification or search using artificial neural networks
    • G10L15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L15/26: Speech to text systems
    • G10L19/00: Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/008: Multichannel audio signal coding or decoding using interchannel correlation to reduce redundancy, e.g. joint-stereo, intensity-coding or matrixing

Abstract

The present application relates to the field of vehicle speech recognition, and in particular to a speech recognition method, apparatus, electronic device, and storage medium. The method includes: acquiring a sample voice signal of a target service provider and the recognized text corresponding to the sample voice signal; retraining a pre-trained voice recognition model based on the sample voice signal of the target service provider and the corresponding recognized text to obtain a voice recognition model dedicated to the target service provider; and, after a voice signal to be recognized of the target service provider is obtained, inputting the voice signal to be recognized into the dedicated voice recognition model to obtain a recognized text. With this scheme, targeted voice recognition can be performed for a target driver under specific environmental factors based on the dedicated voice recognition model, with higher recognition accuracy.

Description

Voice recognition method and device, electronic equipment and storage medium
Technical Field
The present application relates to the field of vehicle voice recognition technologies, and in particular, to a voice recognition method, an apparatus, an electronic device, and a storage medium.
Background
With the progress of science and technology, voice recognition technology has steadily spread from consumer electronics into the automobile industry, helping drivers avoid the inconvenience of manual operation while driving, for tasks such as answering and making calls or controlling navigation, and thereby providing a safer and quicker way of operating. The network car booking service is a prominent application of voice recognition technology in the automobile industry, and voice recognition research targeting this application is currently underway.
The related voice recognition technology needs to receive a voice signal input by a driver through a microphone, recognize the received voice signal, convert it into character string data, and execute the corresponding automobile control operation according to the converted character string data. It is therefore necessary to recognize the driver's voice signal with high accuracy, so as to avoid command execution errors that would affect safe driving.
Disclosure of Invention
In view of this, an object of the present application is to provide a voice recognition method, apparatus, electronic device, and storage medium that can perform targeted recognition on the voice of a target driver with high recognition accuracy.
The present application mainly comprises the following aspects:
in a first aspect, an embodiment of the present application provides a speech recognition method, where the method includes:
acquiring a sample voice signal of a target service provider and an identified text corresponding to the sample voice signal;
retraining the pre-trained voice recognition model based on the sample voice signal of the target service provider and the corresponding recognized text to obtain a special voice recognition model of the target service provider; the pre-trained voice recognition model is obtained by training based on sample voice signals of a plurality of service providers and corresponding recognized texts and sample voice signals of a plurality of service requesters and corresponding recognized texts;
after the voice signal to be recognized of the target service provider is obtained, the voice signal to be recognized is input into a voice recognition model special for the target service provider, and a recognition text is obtained.
In one embodiment, before the retraining the pre-trained speech recognition model based on the sample speech signal of the target service provider and the corresponding recognized text to obtain the speech recognition model specific to the target service provider, the method further includes:
acquiring at least one type of in-vehicle environment information of a target vehicle used by the target service provider;
performing voice processing on the sample voice signal of the target service provider based on the acquired at least one type of in-vehicle environment information to obtain a processed sample voice signal;
the retraining the pre-trained speech recognition model based on the sample speech signal of the target service provider and the corresponding recognized text to obtain the speech recognition model special for the target service provider comprises:
and retraining the pre-trained voice recognition model based on the processed sample voice signal and the recognized text corresponding to the sample voice signal to obtain the voice recognition model special for the target service provider.
In some embodiments, the in-vehicle environmental information includes one of the following:
noise information generated by the target vehicle in a driving state;
actual position information of each vehicle component and relative position information among the vehicle components, which are arranged in the target vehicle;
and the position information of a voice receiving device arranged in the target vehicle.
In another embodiment, training a speech recognition model based on sample speech signals and corresponding recognized text of a plurality of service providers and sample speech signals and corresponding recognized text of a plurality of service requesters comprises:
extracting voice features from the sample voice signals of the service provider and the service requester;
and taking the voice features extracted from the sample voice signal as the input of a voice recognition model to be trained, taking the recognized text corresponding to the sample voice signal as the output of the voice recognition model to be trained, and training to obtain the voice recognition model.
In yet another embodiment, the speech recognition model includes an encoder model, an attention model, and a decoder model; the training of the speech recognition model by using the speech features extracted from the sample speech signal as the input of the speech recognition model to be trained and using the recognized text corresponding to the sample speech signal as the output of the speech recognition model to be trained comprises:
inputting voice features extracted from a sample voice signal into an encoder model to be trained to obtain a voice encoding vector corresponding to the voice features;
inputting the voice coding vector corresponding to the voice feature into an attention model to be trained, and outputting an attention feature vector according to a weighted summation operation result between an attention parameter value of the attention model to be trained and the voice coding vector;
inputting the attention feature vector into a decoder model to be trained, and obtaining a recognized text output by the decoder model to be trained according to the attention feature vector and historical node state information of the decoder model; and comparing the output recognized text with the recognized text corresponding to the sample voice signal until the matching degree between the output recognized text and the recognized text corresponding to the sample voice signal reaches a preset threshold value, and stopping training.
In some embodiments, the outputting an attention feature vector according to a result of a weighted summation operation between an attention parameter value of the attention model to be trained and the speech coding vector includes:
for each speech coding value in the speech coding vector, determining an attention parameter value of the speech coding value from the attention parameter values of the attention model to be trained;
carrying out weighted summation operation on each voice coding value and the attention parameter value of each voice coding value to obtain an attention characteristic value corresponding to each output word of the decoder model;
and combining the attention characteristic values corresponding to the output words of the decoder model in sequence to obtain an attention characteristic vector.
In some embodiments, the obtaining the recognized text output by the decoder model to be trained according to the attention feature vector and the historical node state information of the decoder model includes:
aiming at each word to be output of the decoder model to be trained, obtaining the word output by the decoder model to be trained according to the attention characteristic value corresponding to the word to be output and the historical node state information of the decoder model corresponding to the word to be output;
and sequentially combining the output words to obtain the recognized text output by the decoder model to be trained.
In yet another embodiment, the method further comprises:
extracting voice characteristics from the obtained sample voice signal of the service provider to be authenticated;
and inputting the voice characteristics into an identity authentication model which is trained in advance for a target service provider, performing identity authentication on a service provider to be authenticated, and determining the service provider to be authenticated as the target service provider after the authentication is passed.
In some embodiments, there are a plurality of sample voice signals of the target service provider; training the identity authentication model according to the following steps:
extracting voice features from each sample voice signal of a target service provider;
and sequentially taking each voice characteristic as the input of the identity authentication model to be trained, and taking the identity recognition determination result corresponding to the voice characteristic as the output of the identity authentication model to be trained to obtain the trained identity authentication model.
In a second aspect, an embodiment of the present application further provides a speech recognition method, where the method includes:
acquiring a sample voice signal of a target service provider and an identified text corresponding to the sample voice signal;
retraining the pre-trained voice recognition model based on the sample voice signal of the target service provider and the corresponding recognized text to obtain a special voice recognition model of the target service provider; the pre-trained voice recognition model is obtained by training based on sample voice signals of a plurality of service providers and corresponding recognized texts and sample voice signals of a plurality of service requesters and corresponding recognized texts;
and sending the voice recognition model to the electronic equipment needing voice recognition, so that the electronic equipment can recognize the voice signal to be recognized based on the voice recognition model to obtain a recognized text.
In a third aspect, an embodiment of the present application further provides a speech recognition apparatus, where the apparatus includes:
the system comprises an acquisition module, a recognition module and a processing module, wherein the acquisition module is used for acquiring a sample voice signal of a target service provider and a recognized text corresponding to the sample voice signal;
the training module is used for retraining a pre-trained voice recognition model based on the sample voice signal of the target service provider and the corresponding recognized text to obtain a special voice recognition model of the target service provider; the pre-trained voice recognition model is obtained by training based on sample voice signals of a plurality of service providers and corresponding recognized texts and sample voice signals of a plurality of service requesters and corresponding recognized texts;
and the recognition module is used for inputting the voice signal to be recognized into a special voice recognition model of the target service provider after the voice signal to be recognized of the target service provider is obtained, so as to obtain a recognition text.
In an embodiment, the training module is specifically configured to:
acquiring at least one type of in-vehicle environment information of a target vehicle used by the target service provider before retraining a pre-trained voice recognition model based on the sample voice signal of the target service provider and a corresponding recognized text to obtain a dedicated voice recognition model of the target service provider;
performing voice processing on the sample voice signal of the target service provider based on the acquired at least one type of in-vehicle environment information to obtain a processed sample voice signal;
and retraining the pre-trained voice recognition model based on the processed sample voice signal and the recognized text corresponding to the sample voice signal to obtain the voice recognition model special for the target service provider.
In some embodiments, the in-vehicle environmental information includes one of the following:
noise information generated by the target vehicle in a driving state;
actual position information of each vehicle component and relative position information among the vehicle components, which are arranged in the target vehicle;
and the position information of a voice receiving device arranged in the target vehicle.
In another embodiment, the training module is further configured to:
extracting voice features from the sample voice signals of the service provider and the service requester;
and taking the voice features extracted from the sample voice signal as the input of a voice recognition model to be trained, taking the recognized text corresponding to the sample voice signal as the output of the voice recognition model to be trained, and training to obtain the voice recognition model.
In yet another embodiment, the speech recognition model includes an encoder model, an attention model, and a decoder model;
the training module is specifically configured to:
inputting voice features extracted from a sample voice signal into an encoder model to be trained to obtain a voice encoding vector corresponding to the voice features;
inputting the voice coding vector corresponding to the voice feature into an attention model to be trained, and outputting an attention feature vector according to a weighted summation operation result between an attention parameter value of the attention model to be trained and the voice coding vector;
inputting the attention feature vector into a decoder model to be trained, and obtaining a recognized text output by the decoder model to be trained according to the attention feature vector and historical node state information of the decoder model; and comparing the output recognized text with the recognized text corresponding to the sample voice signal until the matching degree between the output recognized text and the recognized text corresponding to the sample voice signal reaches a preset threshold value, and stopping training.
In some embodiments, the training module is specifically configured to:
for each speech coding value in the speech coding vector, determining an attention parameter value of the speech coding value from the attention parameter values of the attention model to be trained;
carrying out weighted summation operation on each voice coding value and the attention parameter value of each voice coding value to obtain an attention characteristic value corresponding to each output word of the decoder model;
and combining the attention characteristic values corresponding to the output words of the decoder model in sequence to obtain an attention characteristic vector.
In some embodiments, the training module is specifically configured to:
aiming at each word to be output of the decoder model to be trained, obtaining the word output by the decoder model to be trained according to the attention characteristic value corresponding to the word to be output and the historical node state information of the decoder model corresponding to the word to be output;
and sequentially combining the output words to obtain the recognized text output by the decoder model to be trained.
In yet another embodiment, the apparatus further comprises:
the authentication module is used for extracting voice characteristics from the acquired sample voice signal of the service provider to be authenticated;
and inputting the voice characteristics into an identity authentication model which is trained in advance for a target service provider, performing identity authentication on a service provider to be authenticated, and determining the service provider to be authenticated as the target service provider after the authentication is passed.
In some embodiments, there are a plurality of sample voice signals of the target service provider;
the authentication module is further used for extracting voice features from each sample voice signal of the target service provider;
and sequentially taking each voice characteristic as the input of the identity authentication model to be trained, and taking the identity recognition determination result corresponding to the voice characteristic as the output of the identity authentication model to be trained to obtain the trained identity authentication model.
In a fourth aspect, an embodiment of the present application further provides a speech recognition apparatus, where the apparatus includes:
the system comprises an acquisition module, a recognition module and a processing module, wherein the acquisition module is used for acquiring a sample voice signal of a target service provider and a recognized text corresponding to the sample voice signal;
the training module is used for retraining a pre-trained voice recognition model based on the sample voice signal of the target service provider and the corresponding recognized text to obtain a special voice recognition model of the target service provider; the pre-trained voice recognition model is obtained by training based on sample voice signals of a plurality of service providers and corresponding recognized texts and sample voice signals of a plurality of service requesters and corresponding recognized texts;
and the recognition module is used for sending the voice recognition model to the electronic equipment needing voice recognition, so that the electronic equipment can recognize the voice signal to be recognized based on the voice recognition model to obtain a recognized text.
In a fifth aspect, an embodiment of the present application further provides an electronic device, including: a processor, a storage medium and a bus, the storage medium storing machine-readable instructions executable by the processor, the processor and the storage medium communicating via the bus when the electronic device is operating, the processor executing the machine-readable instructions to perform the steps of the speech recognition method according to the first aspect.
In a sixth aspect, the present application further provides a computer-readable storage medium, on which a computer program is stored, where the computer program is executed by a processor to perform the steps of the speech recognition method according to the first aspect.
With this scheme, the pre-trained general voice recognition model can be retrained using the acquired sample voice signal of the target service provider and the recognized text corresponding to that sample voice signal, to obtain a voice recognition model dedicated to the target service provider, so that voice recognition can be performed on the target service provider in a targeted manner based on the dedicated model. For example, considering the influence of the specific environmental factors of different drivers, the voice signals of a target driver collected while driving a commonly driven vehicle, together with the corresponding recognized texts provided for that driver, can be used to obtain a voice recognition model dedicated to the target driver. Targeted voice recognition can then be performed on the target driver under those specific environmental factors based on the dedicated model, and the recognition accuracy is high.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings required by the embodiments are briefly described below. It should be understood that the following drawings illustrate only some embodiments of the present application and therefore should not be considered as limiting the scope; for those skilled in the art, other related drawings can be obtained from these drawings without inventive effort.
Fig. 1 is a schematic architecture diagram of a service system provided in an embodiment of the present application;
Fig. 2 is a flowchart of a speech recognition method provided in the first embodiment of the present application;
Fig. 3 is a flowchart of a speech recognition method provided in the second embodiment of the present application;
Fig. 4 is a flowchart of a speech recognition method provided in the third embodiment of the present application;
Fig. 5 is a flowchart of another speech recognition method provided in the third embodiment of the present application;
Fig. 6 is a flowchart of a speech recognition method provided in the fourth embodiment of the present application;
Fig. 7 is a flowchart of a speech recognition method provided in the fifth embodiment of the present application;
Fig. 8 is a schematic structural diagram of a speech recognition apparatus provided in the sixth embodiment of the present application;
Fig. 9 is a schematic structural diagram of another speech recognition apparatus provided in the sixth embodiment of the present application;
Fig. 10 is a schematic structural diagram of an electronic device provided in the seventh embodiment of the present application;
Fig. 11 is a schematic structural diagram of another electronic device provided in the seventh embodiment of the present application.
Detailed Description
In order to make the purpose, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it should be understood that the drawings in the present application are for illustrative and descriptive purposes only and are not used to limit the scope of protection of the present application. Additionally, it should be understood that the schematic drawings are not necessarily drawn to scale. The flowcharts used in this application illustrate operations implemented according to some embodiments of the present application. It should be understood that the operations of the flow diagrams may be performed out of order, and steps without logical context may be performed in reverse order or simultaneously. One skilled in the art, under the guidance of this application, may add one or more other operations to, or remove one or more operations from, the flowchart.
In addition, the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. The components of the embodiments of the present application, generally described and illustrated in the figures herein, can be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present application, presented in the accompanying drawings, is not intended to limit the scope of the claimed application, but is merely representative of selected embodiments of the application. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present application without making any creative effort, shall fall within the protection scope of the present application.
In order to enable a person skilled in the art to use the present disclosure, the following embodiments are given in connection with a specific application scenario, "speech recognition in the network car booking service". It will be apparent to those skilled in the art that the general principles defined herein may be applied to other embodiments and applications without departing from the spirit and scope of the application. Although the present application is described primarily in the context of speech recognition in the network car booking service, it should be understood that this is merely one exemplary embodiment.
It should be noted that the embodiment of the present application may be applied to the application scenario of voice recognition in the network car booking service, and may also be applied to application scenarios of voice recognition in other services (such as order delivery service, logistics delivery service, and the like), which are not described in detail herein again. Furthermore, the term "comprising" will be used in the embodiments of the present application to indicate the presence of the features claimed hereinafter, but not to exclude the addition of further features.
The terms "passenger," "requestor," "service requestor," and "customer" are used interchangeably in this application to refer to an individual, entity, or tool that can request or order a service. The terms "driver," "provider," "service provider," and "provider" are used interchangeably in this application to refer to an individual, entity, or tool that can provide a service. The term "user" in this application may refer to an individual, entity or tool that requests a service, subscribes to a service, provides a service, or facilitates the provision of a service. For example, the user may be a passenger, a driver, an operator, etc., or any combination thereof. In the present application, "passenger" and "passenger terminal" may be used interchangeably, and "driver" and "driver terminal" may be used interchangeably.
The terms "service request" and "order" are used interchangeably herein to refer to a request initiated by a passenger, a service requester, a driver, a service provider, or a supplier, the like, or any combination thereof. Accepting the "service request" or "order" may be a passenger, a service requester, a driver, a service provider, a supplier, or the like, or any combination thereof. The service request may be charged or free.
One aspect of the present application relates to a service system. The system can retrain the pre-trained universal speech recognition model using the acquired sample speech signal of the target service provider and the recognized text corresponding to that sample speech signal, so as to obtain the speech recognition model dedicated to the target service provider.
It is noted that, before the filing of the present application, the speech recognition technology in the related art had low recognition accuracy. The service system provided by the present application, however, can perform targeted recognition of the driver's voice using the trained voice recognition model dedicated to the target service provider, and the recognition accuracy is high.
Fig. 1 is a schematic architecture diagram of a service system according to an embodiment of the present application. For example, the service system may be an online transportation service platform for transportation services such as taxi cab, designated drive service, express, carpool, bus service, driver rental, or regular service, or any combination thereof. The service system 100 may include one or more of a server 101, a network 102, a service requester terminal 103, a service provider terminal 104, and a database 105.
In some embodiments, the server 101 may include a processor. The processor may process information and/or data related to the service request to perform one or more of the functions described herein. For example, the processor may determine the target vehicle based on a service request obtained from the service requester terminal 103. In some embodiments, a processor may include one or more processing cores (e.g., a single-core or multi-core processor). Merely by way of example, a processor may include a Central Processing Unit (CPU), an Application Specific Integrated Circuit (ASIC), an Application Specific Instruction Set Processor (ASIP), a Graphics Processing Unit (GPU), a Physical Processing Unit (PPU), a Digital Signal Processor (DSP), a Field Programmable Gate Array (FPGA), a Programmable Logic Device (PLD), a controller, a microcontroller unit, a Reduced Instruction Set Computer (RISC), a microprocessor, or the like, or any combination thereof.
In some embodiments, the device types corresponding to the service requester terminal 103 and the service provider terminal 104 may be mobile devices, such as smart home devices, wearable devices, smart mobile devices, virtual reality devices, augmented reality devices, and the like, and may also be tablet computers, laptop computers, built-in devices in motor vehicles, and the like.
In some embodiments, a database 105 may be connected to the network 102 to communicate with one or more components in the service system (e.g., the server 101, the service requester terminal 103, the service provider terminal 104, etc.). One or more components in the service system may access data or instructions stored in database 105 via network 102. In some embodiments, the database 105 may be directly connected to one or more components in the service system, or the database 105 may be part of the server 101.
The following describes the speech recognition method provided in the embodiment of the present application in detail with reference to the content described in the service system shown in fig. 1.
Example one
Referring to fig. 2, a flow diagram of a speech recognition method provided in an embodiment of the present application is shown, where the method may be executed by a server in a service system, and may also be executed by a service provider terminal in the service system, and the specific execution process is as follows:
S201, obtaining a sample voice signal of a target service provider and a recognized text corresponding to the sample voice signal.
Here, considering that the voice recognition method provided by the embodiment of the present application may be applied to the scenario of voice recognition in the network car booking service, the target service provider may be the service provider whose voice is to be recognized and may correspond to a fixed vehicle (i.e., the target vehicle). The sample voice signal of the target service provider may be obtained from a specific number protection platform, from a driving recorder that records the driving condition of the vehicle, or from other channels, which is not specifically limited in the embodiment of the present application.
As for the number protection platform, it may be integrated into the network car booking service platform or deployed independently. It hides the real mobile phone numbers of the driver and the passenger through schemes such as direct dialing, call-back, intermediate numbers, and virtual numbers, protecting the privacy of both parties. Moreover, since the driver and the passenger complete voice communication through the number protection platform, and the platform safeguards the rights and interests of both parties through call recording, the number protection platform can obtain sample voice signals of the relevant target service provider from these call recordings.
As for the driving recorder, it can collect the voice signal uttered by the driver using a built-in microphone; that is, the sample voice signal of the target service provider can be acquired through a sound receiving device such as a microphone.
Whether obtained from the number protection platform or from the driving recorder, the sample voice signal can be a historically collected voice signal. For such historically collected speech, the embodiment of the present application determines the recognized text corresponding to the sample voice signal: a manually labeled recognized text may be used directly, or the recognized text may be determined using an existing speech recognition technology.
S202, based on the sample voice signal of the target service provider and the corresponding recognized text, retraining the pre-trained voice recognition model to obtain the special voice recognition model of the target service provider.
Here, consider the specific application scenario of the voice recognition method provided by the embodiment of the present application: in the cab where a target service provider delivers the network car booking service, the accuracy of voice recognition is often reduced by background sounds such as passengers' voices, audio playback, and noise during vehicle driving. It is to solve this problem that the embodiment of the present application provides a training scheme for a speech recognition model dedicated to the target service provider. Specifically, the voice features extracted from the sample voice signal of the target service provider may be used as the input of the pre-trained voice recognition model, and the recognized text corresponding to the sample voice signal as its output, to retrain the model and obtain the voice recognition model dedicated to the target service provider.
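As a concrete illustration, this retraining step can be read as a standard fine-tuning loop. The sketch below is a minimal, hypothetical PyTorch implementation; the model interface, dataset format, learning rate, and the use of teacher forcing are assumptions for illustration and are not specified by the patent.

```python
import torch
from torch.utils.data import DataLoader

def finetune_for_provider(model, provider_dataset, epochs=5, lr=1e-4):
    """Adapt a pre-trained generic ASR model to one target service provider.

    `model` is assumed to map (features, target_tokens) -> per-step logits;
    `provider_dataset` yields (features, target_tokens) pairs built from the
    provider's sample voice signals and their recognized texts.
    """
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)  # small LR: adapt, don't relearn
    loss_fn = torch.nn.CrossEntropyLoss(ignore_index=0)      # 0 assumed to be padding
    loader = DataLoader(provider_dataset, batch_size=8, shuffle=True)
    model.train()
    for _ in range(epochs):
        for features, target_tokens in loader:
            logits = model(features, target_tokens[:, :-1])  # teacher forcing
            loss = loss_fn(logits.reshape(-1, logits.size(-1)),
                           target_tokens[:, 1:].reshape(-1))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model
```

Starting from the pre-trained weights with a small learning rate is what allows the dedicated model to be obtained from a relatively small amount of provider-specific data.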
In this embodiment of the application, the pre-trained speech recognition model may be trained based on sample speech signals of a plurality of service providers and corresponding recognized texts, and sample speech signals of a plurality of service requesters and corresponding recognized texts.
It should be noted that the sample voice signals of the service providers and the service requesters can be acquired in the manner described above for the sample voice signal of the target service provider, and details thereof are not repeated herein.
In the speech recognition model training phase, the speech features extracted from the sample speech signals of the plurality of service providers may be used as input of the speech recognition model to be trained, with the recognized texts corresponding to those sample speech signals as output; likewise, the speech features extracted from the sample speech signals of the plurality of service requesters may be used as input, with their corresponding recognized texts as output. Training then yields the parameter information of the speech recognition model, i.e., the trained speech recognition model. An end-to-end (Encoder-Decoder) model can be adopted as the voice recognition model in the embodiment of the present application.
In particular implementations, the speech recognition model maps a speech feature to a text sequence (i.e., the recognized text). The embodiment of the present application can adopt a special type of Encoder-Decoder model, namely a combination of a Recurrent Neural Network (RNN) model and an Attention model. On one hand, the RNN model can predict the acoustic data (i.e., the speech features) into speech coding vectors, ensuring prediction accuracy and making the output convenient for the attention model to use. On the other hand, the attention model can align acoustic frames with recognized symbols: it uses internal representations of the network to find the inputs from the Encoder that are related to the predicted output, assigning larger weight values to more closely related inputs, so that the RNN model used by the Decoder obtains an additional vector helpful for the current prediction, thereby avoiding the forgetting problem of long sequences. Through repeated iterative learning, the combined model gradually masters the required knowledge and finally learns how to generate the corresponding text sequence from a speech feature.
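The combination model described above might be sketched as follows. This is a minimal PyTorch illustration under assumed layer sizes and vocabulary; the patent does not fix these details, and a production model would differ.

```python
import torch
import torch.nn as nn

class EncoderDecoderASR(nn.Module):
    def __init__(self, feat_dim=80, hidden=256, vocab_size=4000):
        super().__init__()
        self.hidden = hidden
        self.encoder = nn.LSTM(feat_dim, hidden, num_layers=2, batch_first=True)
        self.embed = nn.Embedding(vocab_size, hidden)
        self.decoder_cell = nn.LSTMCell(hidden * 2, hidden)
        # additive (Bahdanau-style) attention parameters
        self.att_enc = nn.Linear(hidden, hidden)
        self.att_dec = nn.Linear(hidden, hidden)
        self.att_v = nn.Linear(hidden, 1)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, feats, targets):
        enc_out, _ = self.encoder(feats)                  # (B, T, H): speech coding vectors
        B = feats.size(0)
        h = feats.new_zeros(B, self.hidden)
        c = feats.new_zeros(B, self.hidden)
        step_logits = []
        for t in range(targets.size(1)):                  # one step per output word
            score = self.att_v(torch.tanh(self.att_enc(enc_out) +
                                          self.att_dec(h).unsqueeze(1)))  # (B, T, 1)
            weights = torch.softmax(score, dim=1)         # attention parameter values
            context = (weights * enc_out).sum(dim=1)      # weighted sum -> attention feature
            step_in = torch.cat([self.embed(targets[:, t]), context], dim=-1)
            h, c = self.decoder_cell(step_in, (h, c))     # node state carried word to word
            step_logits.append(self.out(h))
        return torch.stack(step_logits, dim=1)            # (B, U, vocab_size)
```

In this reading, `enc_out` plays the role of the speech coding vectors, `weights` the attention parameter values, and `(h, c)` the decoder's historical node state.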
Similarly, the process of retraining the trained speech recognition model based on the sample speech signal of the target service provider and the corresponding recognized text is similar to the training process described above, and is not described herein again.
S203, after the voice signal to be recognized of the target service provider is obtained, inputting the voice signal to be recognized into a voice recognition model special for the target service provider to obtain a recognition text.
Here, in the embodiment of the present application, the corresponding voice features are obtained by performing feature extraction on the acquired to-be-recognized voice signal of the target service provider. These features are input into the trained voice recognition model dedicated to the target service provider, which outputs the text sequence, i.e., the recognized text, corresponding to the to-be-recognized voice signal. Therefore, the pre-trained voice recognition model can be used to quickly recognize the voice of the target service provider, with high recognition efficiency and accuracy.
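For illustration, recognition with the dedicated model can be sketched as greedy decoding over the model class sketched earlier; the start/end token IDs and the feature shape are assumptions, not part of the disclosure.

```python
import torch

@torch.no_grad()
def recognize(model, feats, sos_id=1, eos_id=2, max_len=100):
    """feats: (1, T, feat_dim) speech features of the signal to be recognized."""
    model.eval()
    tokens = [sos_id]
    for _ in range(max_len):
        prefix = torch.tensor([tokens])
        logits = model(feats, prefix)         # decoder re-run on the growing prefix
        next_id = int(logits[0, -1].argmax()) # greedily take the most likely next word
        if next_id == eos_id:
            break
        tokens.append(next_id)
    return tokens[1:]                          # token IDs of the recognized text
```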
When the speech recognition method provided by the embodiment of the present application is applied to the network car booking scenario, speech recognition accuracy may be affected by various complex environments, such as noise generated by vehicle driving. Precisely in consideration of the influence of such complex environments on the voice recognition accuracy for the target service provider, the embodiment of the present application provides a scheme for processing the sample voice signal of the target service provider using in-vehicle environment information, so as to expand the input samples while comprehensively accounting for those influences. The following embodiment explains this speech processing.
Example two
As shown in fig. 3, a method for processing speech provided in the embodiment of the present application specifically includes the following steps:
S301, acquiring at least one type of in-vehicle environment information of a target vehicle used by a target service provider;
S302, carrying out voice processing on the sample voice signal of the target service provider based on the acquired at least one type of in-vehicle environment information to obtain a processed sample voice signal.
Here, the in-vehicle environment information in the embodiment of the present application may be noise information generated by the target vehicle in a traveling state, may also be actual position information of each vehicle component provided in the target vehicle and relative position information between the vehicle components, may also be position information of a voice receiving device provided in the target vehicle, and may also be other in-vehicle environment information. In this way, at least one piece of in-vehicle environment information may be any one of the above information, or may be a combination of any two or three of the above information, which is not specifically limited in this embodiment of the application.
The relevant noise information may refer to various noises generated by the target vehicle during driving, such as tire noise, and may also be human-related, such as noise caused by a passenger speaking while the sample voice of the target service provider is recorded. The information about actual positions and relative positions is mainly used for simulating the reverberation produced by voice interaction in the vehicle, so as to determine its influence on the voice signal; the reverberation reflects the spatial characteristics of the cab environment. The position information of the voice receiving device is used for determining the distance between the position where the target service provider speaks and the voice receiving device, so as to determine the influence of that distance on the voice signal.
It is worth mentioning that, considering the influence of vehicle type information on the in-vehicle environment information, the in-vehicle environment information of the target vehicle can be determined based on the vehicle type information of the target vehicle.
In the embodiment of the present application, the extracted sample speech signals of the driver can be augmented according to the in-vehicle environment information, simulating the driver's speech under different conditions, for example under different vehicle noises and at different in-vehicle positions. In addition, to extend the speech data as far as possible, in-vehicle environment information under different road conditions, different routes, different vehicle speeds, and the like can also be acquired, which greatly alleviates the problem of insufficient training data in the embodiment of the present application. The simulated sample voice signals are then fed into the trained voice recognition model for retraining, optimizing and fine-tuning the model, so that a personalized voice recognition model for a specific driver in a specific in-vehicle environment can be obtained with less voice data, further improving the accuracy of voice recognition for that driver.
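A minimal sketch of such augmentation, assuming NumPy waveforms at a common sample rate, recorded vehicle noise, and a measured or simulated cabin room impulse response (RIR); the exact augmentation pipeline is not prescribed by the patent.

```python
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Overlay vehicle noise (tire/engine/passenger) onto clean driver speech."""
    noise = np.resize(noise, speech.shape)            # loop/trim noise to match length
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise

def add_cabin_reverb(speech: np.ndarray, rir: np.ndarray) -> np.ndarray:
    """Simulate reverberation for a given seat/microphone geometry via an RIR."""
    wet = np.convolve(speech, rir)[: len(speech)]
    return wet / (np.max(np.abs(wet)) + 1e-12)        # renormalize amplitude

# Hypothetical usage (all inputs are assumed to be available):
# augmented = add_cabin_reverb(mix_at_snr(sample, tire_noise, snr_db=10.0), cabin_rir)
```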
Considering that the speech recognition model dedicated to the target service provider in the embodiment of the present application is obtained by retraining a general speech recognition model, the training method of the general speech recognition model is described in detail in the following third embodiment.
EXAMPLE III
As shown in fig. 4, a flowchart of a training method of a speech recognition model provided in the present application is provided, where the training method includes the following steps:
S401, extracting voice features from sample voice signals of a service provider and a service requester;
S402, taking the voice features extracted from the sample voice signals as the input of the voice recognition model to be trained, taking the recognized text corresponding to the sample voice signals as the output of the voice recognition model to be trained, and training to obtain the voice recognition model.
Here, in the embodiment of the present application, the speech features may be used to characterize the sample speech signal. The sample speech signal may first be divided into frames, after which spectral analysis may be performed. For each sample speech sub-signal obtained by framing, the corresponding frequency-domain signal can be split into the product of two parts, namely the envelope of the spectrum and the detail of the spectrum, where the envelope corresponds to the low-frequency information of the spectrum and the detail corresponds to the high-frequency information; the resulting speech features can then characterize the sample speech signal.
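The framing and envelope/detail separation described above can be sketched with a short cepstral analysis, one standard way to split a log-spectrum into a smooth envelope and fine detail. Frame sizes and the quefrency cutoff below are illustrative assumptions.

```python
import numpy as np

def frame_signal(x, frame_len=400, hop=160):          # 25 ms / 10 ms at 16 kHz
    n = 1 + max(0, (len(x) - frame_len) // hop)
    return np.stack([x[i * hop : i * hop + frame_len] for i in range(n)])

def envelope_and_detail(frame, cutoff=30):
    spec = np.fft.rfft(frame * np.hamming(len(frame)))
    log_mag = np.log(np.abs(spec) + 1e-10)            # log-magnitude spectrum
    cep = np.fft.irfft(log_mag)                       # real cepstrum
    lifter = np.zeros_like(cep)
    lifter[:cutoff] = 1                               # keep low-quefrency bins
    lifter[-cutoff + 1:] = 1                          # (and their symmetric mirror)
    envelope = np.fft.rfft(cep * lifter).real         # smooth spectral envelope
    detail = log_mag - envelope                       # high-frequency remainder
    return envelope, detail
```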
Training a speech recognition model is the process of training its internal parameters. Training in the embodiment of the present application is a cyclic process: at least one round of model training is required, and training stops when the recognized text output by the model is consistent with the recognized text corresponding to the sample speech signal, or after a preset number of convergence rounds is reached.
In each round of model training, the embodiment of the present application may first input the speech features extracted from the sample speech signal into the speech recognition model to be trained and output a recognized text. It then determines whether the output recognized text is consistent with the actual recognized text corresponding to the sample speech signal; if not, the internal parameters of the speech recognition model are adjusted and the next round of model training is performed based on the adjusted parameters. This loops until a preset convergence condition is reached (e.g., the recognized text output by the model is consistent with the actual recognized text, or the number of training rounds reaches the preset limit), yielding the speech recognition model.
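The stopping rule can be made concrete with a small helper. Token-level matching is an assumption here; the patent speaks only of a "matching degree" reaching a preset threshold without fixing the measure.

```python
def matching_degree(predicted: list, reference: list) -> float:
    """Fraction of positions where the predicted and reference words agree."""
    if not reference:
        return 0.0
    hits = sum(p == r for p, r in zip(predicted, reference))
    return hits / max(len(predicted), len(reference))

def should_stop(predicted, reference, round_idx, threshold=0.95, max_rounds=50):
    """Stop once the matching degree reaches the threshold or rounds run out."""
    return matching_degree(predicted, reference) >= threshold or round_idx + 1 >= max_rounds
```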
In a specific application, the speech recognition model may include an encoder model, an attention model and a decoder model, so that the built-in parameters of the three models can be jointly trained. As shown in fig. 5, a flowchart of a training method for a speech recognition model provided in an embodiment of the present application is provided, where the training method specifically includes the following steps:
S501, inputting voice features extracted from a sample voice signal into an encoder model to be trained to obtain voice encoding vectors corresponding to the voice features;
S502, inputting the voice coding vector corresponding to the voice feature into an attention model to be trained, and outputting an attention feature vector according to a weighted summation operation result between an attention parameter value of the attention model to be trained and the voice coding vector;
S503, inputting the attention feature vector into a decoder model to be trained, and obtaining a recognized text output by the decoder model to be trained according to the attention feature vector and the historical node state information of the decoder model; and comparing the output recognized text with the recognized text corresponding to the sample voice signal until the matching degree between the output recognized text and the recognized text corresponding to the sample voice signal reaches a preset threshold value, and stopping training.
Here, in the embodiment of the present application, the speech features extracted from the sample speech signal may first be input into the encoder model to obtain the speech coding vector corresponding to the speech features. The speech coding vector is then input into the attention model, which outputs the attention feature vector according to the result of the weighted summation between the attention parameter values of the attention model and the speech coding vector. Finally, the attention feature vector is input into the decoder model, and the recognized text output by the decoder model is obtained according to the attention feature vector and the historical node state information of the decoder model. When the matching degree between the output recognized text and the recognized text corresponding to the sample speech signal is determined to reach the preset threshold, training may be stopped.
In determining the internal parameters, the embodiment of the present application considers the degree of association between the text words included in the recognized text and each frame of the sample speech sub-signals. Since the sample speech sub-signals are encoded by the encoder, this degree of association can be determined from the association between the words of the recognized text and each speech coding value; that is, an attention mechanism is introduced. The words of the recognized text can thus selectively attend to multiple speech coding values, i.e., to the speech features corresponding to multiple frames of sample speech sub-signals, from which the attention feature vector is obtained.
In the embodiment of the present application, the attention feature vector may be determined as follows.
Step one, aiming at each voice coding value in the voice coding vector, determining an attention parameter value of the voice coding value from the attention parameter value of the attention model to be trained;
step two, carrying out weighted summation operation on each voice coding value and the attention parameter value of each voice coding value to obtain an attention characteristic value corresponding to each output word of the decoder model;
and thirdly, combining the attention characteristic values corresponding to the output words of the decoder model in sequence to obtain an attention characteristic vector.
In the embodiment of the present application, for each speech coding value in the speech coding vector, the attention parameter value of that speech coding value is first determined from the attention parameter values of the attention model to be trained; that is, an attention parameter value is determined for each speech coding value. A weighted summation is then performed over the speech coding values and their attention parameter values to obtain the attention feature value corresponding to each output word of the decoder model, and finally the attention feature values corresponding to the output words are combined in sequence to obtain the attention feature vector. In other words, for each output word, the attention parameter value of each speech coding value corresponding to that word can be determined, and the corresponding attention feature obtained through weighted summation. It should be noted that different output words may have different distributions of attention parameter values over the speech coding values; that is, the initially input sample speech sub-signals may influence the words of the finally output recognized text to different degrees, which further improves the accuracy of speech recognition.
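In NumPy terms, steps one to three above amount to normalizing one row of attention parameter values per output word and taking a weighted sum of the speech coding values; the shapes below are assumptions for illustration.

```python
import numpy as np

def attention_features(enc_vectors: np.ndarray, att_scores: np.ndarray) -> np.ndarray:
    """enc_vectors: (T, H) speech coding values;
    att_scores: (U, T), one row of raw attention parameter values per output word."""
    weights = np.exp(att_scores - att_scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)      # normalize per output word
    return weights @ enc_vectors                       # (U, H): one attention feature per word

# Different output words get different weight rows, so each word can focus on
# different input frames, matching the description above.
```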
In the embodiment of the present application, the recognized text output by the decoder model to be trained may be determined as follows.
Step one, aiming at each word to be output of the decoder model to be trained, obtaining the word output by the decoder model to be trained according to the attention characteristic value corresponding to the word to be output and the historical node state information of the decoder model corresponding to the word to be output;
and step two, sequentially combining the output words to obtain the recognized text output by the decoder model to be trained.
In this embodiment of the application, for each word to be output of the to-be-trained decoder model, the word output by the to-be-trained decoder model may be obtained according to the attention feature value corresponding to the word to be output and the historical node state information of the decoder model corresponding to the word to be output, and then the output words are combined in sequence to obtain the recognized text output by the to-be-trained decoder model.
In the embodiment of the present application, each word to be output corresponds to the historical node state information of the decoder model. Taking three words to be output as an example: for the 1st word, the historical node state information is the initial node state information of the decoder; for the 2nd word, it is the node state information of the decoder after the 1st word is output; and similarly, for the 3rd word, it is the node state information of the decoder after the 2nd word is output.
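Assuming an LSTM-cell decoder (one plausible reading of "node state"), the three-word example reads as follows: the historical node state for word k is simply the (h, c) state left after emitting word k-1.

```python
import torch
import torch.nn as nn

hidden = 256
cell = nn.LSTMCell(hidden, hidden)
out_proj = nn.Linear(hidden, 4000)                    # vocabulary size is an assumption

h = torch.zeros(1, hidden)                            # initial node state (used for word 1)
c = torch.zeros(1, hidden)
attention_values = torch.randn(3, 1, hidden)          # stand-in attention features, 3 words
words = []
for k in range(3):
    h, c = cell(attention_values[k], (h, c))          # state after word k feeds word k+1
    words.append(int(out_proj(h).argmax()))
# `words` holds the three output word IDs, combined in order into the recognized text.
```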
It should be noted that, when performing voice recognition of a target service provider with the dedicated voice recognition model, the embodiment of the present application may also make full use of the communication content between driver and passenger (for example, the text communication conducted between the service provider terminal and the service requester) to verify the output recognized text. That is, on the basis of the current dedicated voice recognition model, a text recognition model may additionally be combined when performing voice recognition for the target service provider, thereby further improving the accuracy of voice recognition.
In order to further ensure the accuracy of the dedicated voice recognition model trained for the target service provider, the embodiment of the present application may also authenticate the identity of the target service provider. Besides voiceprint authentication, face authentication, and the like, the identity may also be authenticated based on a pre-trained identity authentication model, which is illustrated by the following Example four.
Example four
As shown in fig. 6, an embodiment of the present application provides an identity authentication method, which specifically includes the following steps:
S601, extracting voice features from the acquired sample voice signal of the service provider to be authenticated;
S602, inputting the voice features into an identity authentication model trained in advance for the target service provider to perform identity authentication on the service provider to be authenticated, and determining the service provider to be authenticated as the target service provider after the authentication is passed.
Here, the voice features may first be extracted from the sample voice signal of the service provider to be authenticated, and the extracted voice features may then be input into an identity authentication model trained in advance for the target service provider to perform identity authentication. In the application scenario of voice recognition in the online car booking service, the service provider to be authenticated may be a service provider that has undergone preliminary information authentication (such as mobile phone number recognition and platform account recognition) after accessing the online car booking service platform, so that accurate identity authentication of the target service provider can be performed within a reduced search range of service providers.
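A possible form of the authentication decision is an embedding comparison, sketched below. The embed method, the enrolled target_embedding, and the 0.7 threshold are hypothetical, since the application does not fix the internal form of the identity authentication model (whose training is described next).

    import numpy as np

    def authenticate(model, features, threshold=0.7):
        # Embed the to-be-authenticated provider's voice features and compare
        # them with the target provider's enrolled embedding (cosine similarity).
        emb = model.embed(features)              # hypothetical interface
        target = model.target_embedding          # hypothetical enrolled vector
        score = emb @ target / (np.linalg.norm(emb) * np.linalg.norm(target) + 1e-12)
        return score >= threshold                # pass -> treat as the target provider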
The identity authentication model can be trained according to the following steps:
Step one: for each sample voice signal of the target service provider, extract voice features from the sample voice signal;
Step two: take each voice feature in turn as the input of the identity authentication model to be trained, and take the identity recognition determination result corresponding to the voice feature as the output of the identity authentication model to be trained, to obtain the trained identity authentication model.
Here, the voice features extracted from each sample voice signal of the target service provider may include FBank features (obtained by performing a fast Fourier transform on the voice signal and applying mel filter banks) and MFCC features (obtained by performing a discrete cosine transform after the fast Fourier transform), and the like, so as to embody the speaking manner of the target service provider. Identity authentication of the service provider to be authenticated can therefore be performed with the identity authentication model trained for the target service provider. It should be noted that, since the identity authentication model is trained specifically for the target service provider, noise and the voices of other passengers can be directly eliminated and the voice belonging to the driver screened out, which better safeguards recognition for the driver.
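Extraction of such FBank and MFCC features can be sketched with the librosa library, assumed to be available here; the file name, sampling rate, and frame parameters below are illustrative only.

    import numpy as np
    import librosa

    y, sr = librosa.load("driver_sample.wav", sr=16000)  # hypothetical sample voice signal

    # FBank-style features: FFT-based mel filter-bank energies, log-compressed.
    fbank = librosa.power_to_db(
        librosa.feature.melspectrogram(y=y, sr=sr, n_fft=400, hop_length=160, n_mels=40))

    # MFCC features: a discrete cosine transform applied after the FFT/mel stage.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=400, hop_length=160)

    features = np.concatenate([fbank, mfcc])   # (40 + 13, frames) feature matrix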
Another speech recognition method provided by the embodiment of the present application is described in detail below with reference to the service system shown in fig. 1.
EXAMPLE five
Referring to fig. 7, a schematic flow chart of a speech recognition method provided in the fifth embodiment of the present application is shown, where the method may be executed by a server in a service system, and the specific execution process is as follows:
S701, acquiring a sample voice signal of a target service provider and a recognized text corresponding to the sample voice signal;
S702, retraining a pre-trained voice recognition model based on the sample voice signal of the target service provider and the corresponding recognized text to obtain a voice recognition model dedicated to the target service provider;
S703, sending the dedicated voice recognition model to the electronic device that needs to perform voice recognition, so that the electronic device recognizes the voice signal to be recognized based on the voice recognition model to obtain a recognized text.
Here, similar to the speech recognition method provided in the first embodiment of the present application, the method of this embodiment may also retrain a pre-trained speech recognition model based on the obtained sample speech signals of the target service provider and the corresponding recognized texts, so as to obtain a speech recognition model dedicated to the target service provider. The pre-trained speech recognition model is obtained by training based on the sample speech signals and corresponding recognized texts of a plurality of service providers and of a plurality of service requesters; the specific training process is described in the first embodiment and is not repeated here. Different from the method of the first embodiment, after the server trains the dedicated voice recognition model for the target service provider, it may send the dedicated model to the electronic device that needs to perform voice recognition (for example, the target service provider terminal), so that the electronic device recognizes the to-be-recognized voice signal of the target service provider based on the model to obtain the recognized text.
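The server-side retraining and model delivery can be sketched as follows. This is a PyTorch sketch under stated assumptions: the model object is assumed to return its training loss when called with a batch of speech and its target texts, and the function and file names are illustrative, not part of the application.

    import torch

    def personalize_and_export(model, loader, epochs=3, lr=1e-4, out_path="asr_provider.pt"):
        # Retrain the pre-trained model on one target service provider's
        # (sample speech, recognized text) pairs; a small learning rate adapts
        # the general model without discarding what it already knows.
        opt = torch.optim.Adam(model.parameters(), lr=lr)
        model.train()
        for _ in range(epochs):
            for speech, text in loader:
                loss = model(speech, targets=text)  # assumed interface: forward returns loss
                opt.zero_grad()
                loss.backward()
                opt.step()
        # The server then sends this file to the electronic device (e.g. the
        # target service provider terminal) that performs recognition locally.
        torch.save(model.state_dict(), out_path)
        return out_path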
Based on the same inventive concept, an embodiment of the present application further provides a speech recognition apparatus corresponding to the above speech recognition method. Since the principle by which the apparatus solves the problem is similar to that of the speech recognition method of the embodiment of the present application, the implementation of the apparatus may refer to the implementation of the method, and repeated details are not described again.
EXAMPLE six
Referring to fig. 8, which is a schematic diagram of a speech recognition apparatus according to the sixth embodiment of the present application, the apparatus includes:
an obtaining module 801, configured to obtain a sample voice signal of a target service provider and a recognized text corresponding to the sample voice signal;
a training module 802, configured to retrain a pre-trained speech recognition model based on the sample speech signal of the target service provider and the corresponding recognized text, to obtain a speech recognition model dedicated to the target service provider; the pre-trained speech recognition model is obtained by training based on sample speech signals of a plurality of service providers and corresponding recognized texts, and sample speech signals of a plurality of service requesters and corresponding recognized texts;
the recognition module 803 is configured to, after obtaining the to-be-recognized voice signal of the target service provider, input the to-be-recognized voice signal into a voice recognition model dedicated to the target service provider, so as to obtain a recognition text.
In an embodiment, the training module 802 is specifically configured to:
acquiring at least one type of in-vehicle environment information of the target vehicle used by the target service provider before retraining the pre-trained voice recognition model based on the sample voice signal of the target service provider and the corresponding recognized text to obtain the dedicated voice recognition model of the target service provider;
performing voice processing on the sample voice signal of the target service provider based on the acquired at least one type of in-vehicle environment information to obtain a processed sample voice signal (one possible noise-based processing is sketched after the list of environment information below);
and retraining the pre-trained voice recognition model based on the processed sample voice signal and the recognized text corresponding to the sample voice signal to obtain the voice recognition model dedicated to the target service provider.
In some embodiments, the in-vehicle environmental information includes one of the following:
noise information generated by the target vehicle in a driving state;
actual position information of each vehicle component and relative position information among the vehicle components, which are arranged in the target vehicle;
and the position information of a voice receiving device arranged in the target vehicle.
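One way to realize the voice processing mentioned above is to mix recorded driving noise of the target vehicle into the sample voice signals at a controlled signal-to-noise ratio, so the retraining data matches the target vehicle's acoustic environment. The following numpy sketch assumes both signals are float arrays at the same sampling rate; the 10 dB default is illustrative.

    import numpy as np

    def add_vehicle_noise(speech, noise, snr_db=10.0):
        # Loop or trim the recorded in-vehicle noise to the speech length.
        noise = np.resize(noise, speech.shape)
        p_speech = np.mean(speech ** 2)
        p_noise = np.mean(noise ** 2) + 1e-12
        # Scale the noise so that the mixture has the requested SNR.
        scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10.0)))
        return speech + scale * noise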
In another embodiment, the training module 802 is further configured to:
extracting voice features from the sample voice signals of the service provider and the service requester;
and taking the voice features extracted from the sample voice signal as the input of a voice recognition model to be trained, taking the recognized text corresponding to the sample voice signal as the output of the voice recognition model to be trained, and training to obtain the voice recognition model.
In yet another embodiment, the speech recognition model includes an encoder model, an attention model, and a decoder model;
the training module 802 is specifically configured to:
inputting voice features extracted from a sample voice signal into an encoder model to be trained to obtain a voice encoding vector corresponding to the voice features;
inputting the voice coding vector corresponding to the voice feature into an attention model to be trained, and outputting an attention feature vector according to a weighted summation operation result between an attention parameter value of the attention model to be trained and the voice coding vector;
inputting the attention feature vector into a decoder model to be trained, and obtaining the recognized text output by the decoder model to be trained according to the attention feature vector and the historical node state information of the decoder model; and comparing the output recognized text with the recognized text corresponding to the sample voice signal, stopping training when the matching degree between the two reaches a preset threshold value (one possible matching-degree criterion is sketched below).
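The matching-degree stopping rule can be realized, for example, with a character-level similarity ratio. The use of difflib.SequenceMatcher and the 0.95 threshold are assumptions of this sketch, since the application does not fix the matching metric.

    from difflib import SequenceMatcher

    def matching_degree(hyp: str, ref: str) -> float:
        # Character-level matching degree between the decoder's output text
        # and the recognized text corresponding to the sample voice signal.
        return SequenceMatcher(None, hyp, ref).ratio()

    def should_stop(hyps, refs, threshold=0.95):
        # Stop training once the average matching degree reaches the preset threshold.
        avg = sum(matching_degree(h, r) for h, r in zip(hyps, refs)) / len(hyps)
        return avg >= threshold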
In some embodiments, the training module 802 is specifically configured to:
for each speech coding value in the speech coding vector, determining an attention parameter value of the speech coding value from the attention parameter values of the attention model to be trained;
carrying out weighted summation operation on each voice coding value and the attention parameter value of each voice coding value to obtain an attention characteristic value corresponding to each output word of the decoder model;
and combining the attention characteristic values corresponding to the output words of the decoder model in sequence to obtain an attention characteristic vector.
In some embodiments, the training module 802 is specifically configured to:
aiming at each word to be output of the decoder model to be trained, obtaining the word output by the decoder model to be trained according to the attention characteristic value corresponding to the word to be output and the historical node state information of the decoder model corresponding to the word to be output;
and sequentially combining the output words to obtain the recognized text output by the decoder model to be trained.
In yet another embodiment, the apparatus further comprises:
The authentication module 804 is used for extracting voice features from the acquired sample voice signals of the service provider to be authenticated;
and inputting the voice characteristics into an identity authentication model which is trained in advance for a target service provider, performing identity authentication on a service provider to be authenticated, and determining the service provider to be authenticated as the target service provider after the authentication is passed.
In some embodiments, there are a plurality of sample voice signals of the target service provider;
the authentication module 804 is further configured to, for each sample voice signal of the target service provider, extract a voice feature from the sample voice signal;
and sequentially taking each voice characteristic as the input of the identity authentication model to be trained, and taking the identity recognition determination result corresponding to the voice characteristic as the output of the identity authentication model to be trained to obtain the trained identity authentication model.
According to the embodiment of the present application, the pre-trained general voice recognition model can be retrained with the acquired sample voice signal of the target service provider and the recognized text corresponding to the sample voice signal, to obtain a voice recognition model dedicated to the target service provider, so that targeted voice recognition can be performed for the target service provider based on the dedicated model with high recognition accuracy.
Referring to fig. 9, which is a schematic diagram of another speech recognition apparatus provided in the sixth embodiment of the present application, the apparatus includes:
an obtaining module 901, configured to obtain a sample voice signal of a target service provider and a recognized text corresponding to the sample voice signal;
a training module 902, configured to retrain a pre-trained speech recognition model based on the sample speech signal of the target service provider and the corresponding recognized text, to obtain a speech recognition model dedicated to the target service provider; the pre-trained speech recognition model is obtained by training based on sample speech signals of a plurality of service providers and corresponding recognized texts, and sample speech signals of a plurality of service requesters and corresponding recognized texts;
the recognition module 903 is configured to send the speech recognition model to an electronic device that needs to perform speech recognition, so that the electronic device recognizes a speech signal to be recognized based on the speech recognition model to obtain a recognition text.
For the processing flow of each module in the apparatus and the interaction flow between the modules, reference may be made to the related description in the above method embodiments; details are not repeated here.
EXAMPLE seven
An electronic device according to the seventh embodiment of the present application is shown in fig. 10, which is a schematic structural diagram of the electronic device, including: a processor 1001, a storage medium 1002, and a bus 1003. The storage medium 1002 stores machine-readable instructions executable by the processor 1001 (for example, execution instructions corresponding to the obtaining module 801, the training module 802, and the recognition module 803 in the apparatus of fig. 8). When the electronic device runs, the processor 1001 communicates with the storage medium 1002 via the bus 1003, and the machine-readable instructions, when executed by the processor 1001, perform the following processes:
acquiring a sample voice signal of a target service provider and a recognized text corresponding to the sample voice signal;
retraining the pre-trained voice recognition model based on the sample voice signal of the target service provider and the corresponding recognized text to obtain a special voice recognition model of the target service provider; the pre-trained voice recognition model is obtained by training based on sample voice signals of a plurality of service providers and corresponding recognized texts and sample voice signals of a plurality of service requesters and corresponding recognized texts;
after the voice signal to be recognized of the target service provider is obtained, inputting the voice signal to be recognized into the special voice recognition model of the target service provider to obtain a recognized text.
In an embodiment, before the retraining the pre-trained speech recognition model based on the sample speech signal of the target service provider and the corresponding recognized text to obtain the speech recognition model dedicated to the target service provider, the instructions executed by the processor 1001 further include:
acquiring at least one type of in-vehicle environment information of a target vehicle used by the target service provider;
performing voice processing on the sample voice signal of the target service provider based on the acquired at least one type of in-vehicle environment information to obtain a processed sample voice signal;
in the instructions executed by the processor 1001, the retraining the pre-trained speech recognition model based on the sample speech signal of the target service provider and the corresponding recognized text to obtain the speech recognition model dedicated to the target service provider includes:
and retraining the pre-trained voice recognition model based on the processed sample voice signal and the recognized text corresponding to the sample voice signal to obtain the voice recognition model special for the target service provider.
In some embodiments, the in-vehicle environmental information includes one of the following:
noise information generated by the target vehicle in a driving state;
actual position information of each vehicle component and relative position information among the vehicle components, which are arranged in the target vehicle;
and the position information of a voice receiving device arranged in the target vehicle.
In another embodiment, in the instructions executed by the processor 1001, training the speech recognition model based on the sample speech signals and corresponding recognized texts of the plurality of service providers and the sample speech signals and corresponding recognized texts of the plurality of service requesters includes:
extracting voice features from the sample voice signals of the service provider and the service requester;
and taking the voice features extracted from the sample voice signal as the input of a voice recognition model to be trained, taking the recognized text corresponding to the sample voice signal as the output of the voice recognition model to be trained, and training to obtain the voice recognition model.
In yet another embodiment, the speech recognition model includes an encoder model, an attention model, and a decoder model; in the instructions executed by the processor 1001, the training of the speech recognition model by using the speech features extracted from the sample speech signal as the input of the speech recognition model to be trained and using the recognized text corresponding to the sample speech signal as the output of the speech recognition model to be trained includes:
inputting voice features extracted from a sample voice signal into an encoder model to be trained to obtain a voice encoding vector corresponding to the voice features;
inputting the voice coding vector corresponding to the voice feature into an attention model to be trained, and outputting an attention feature vector according to a weighted summation operation result between an attention parameter value of the attention model to be trained and the voice coding vector;
inputting the attention feature vector into a decoder model to be trained, and obtaining the recognized text output by the decoder model to be trained according to the attention feature vector and the historical node state information of the decoder model; and comparing the output recognized text with the recognized text corresponding to the sample voice signal, stopping training when the matching degree between the two reaches a preset threshold value.
In some embodiments, in the instructions executed by the processor 1001, the outputting an attention feature vector according to a result of a weighted summation operation between an attention parameter value of the attention model to be trained and the speech coding vector includes:
for each speech coding value in the speech coding vector, determining an attention parameter value of the speech coding value from the attention parameter values of the attention model to be trained;
carrying out weighted summation operation on each voice coding value and the attention parameter value of each voice coding value to obtain an attention characteristic value corresponding to each output word of the decoder model;
and combining the attention characteristic values corresponding to the output words of the decoder model in sequence to obtain an attention characteristic vector.
In some embodiments, in the instructions executed by the processor 1001, the obtaining, according to the attention feature vector and the historical node state information of the decoder model, the recognized text output by the decoder model to be trained includes:
aiming at each word to be output of the decoder model to be trained, obtaining the word output by the decoder model to be trained according to the attention characteristic value corresponding to the word to be output and the historical node state information of the decoder model corresponding to the word to be output;
and sequentially combining the output words to obtain the recognized text output by the decoder model to be trained.
In another embodiment, the instructions executed by the processor 1001 further include:
extracting voice characteristics from the obtained sample voice signal of the service provider to be authenticated;
and inputting the voice characteristics into an identity authentication model which is trained in advance for a target service provider, performing identity authentication on a service provider to be authenticated, and determining the service provider to be authenticated as the target service provider after the authentication is passed.
In some embodiments, there are a plurality of sample voice signals of the target service provider; in the instructions executed by the processor 1001, the identity authentication model is trained according to the following steps:
extracting voice features from each sample voice signal of a target service provider;
and sequentially taking each voice characteristic as the input of the identity authentication model to be trained, and taking the identity recognition determination result corresponding to the voice characteristic as the output of the identity authentication model to be trained to obtain the trained identity authentication model.
An electronic device is further provided in the seventh embodiment of the present application, as shown in fig. 11, which is a schematic structural diagram of the electronic device, including: a processor 1101, a storage medium 1102, and a bus 1103. The storage medium 1102 stores machine-readable instructions executable by the processor 1101 (for example, execution instructions corresponding to the obtaining module 901, the training module 902, and the recognition module 903 in the apparatus of fig. 9). When the electronic device runs, the processor 1101 communicates with the storage medium 1102 via the bus 1103, and the machine-readable instructions, when executed by the processor 1101, perform the following processes:
acquiring a sample voice signal of a target service provider and a recognized text corresponding to the sample voice signal;
retraining the pre-trained voice recognition model based on the sample voice signal of the target service provider and the corresponding recognized text to obtain a special voice recognition model of the target service provider; the pre-trained voice recognition model is obtained by training based on sample voice signals of a plurality of service providers and corresponding recognized texts and sample voice signals of a plurality of service requesters and corresponding recognized texts;
and sending the voice recognition model to the electronic equipment needing voice recognition, so that the electronic equipment can recognize the voice signal to be recognized based on the voice recognition model to obtain a recognized text.
An embodiment of the present application further provides a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and when the computer program is executed by the processor 1001 or the processor 1101, the computer program performs the steps of the foregoing speech recognition method.
Specifically, the storage medium may be a general storage medium, such as a removable disk or a hard disk. When the computer program on the storage medium is run, the above voice recognition method can be executed, so that targeted voice recognition of the target driver under specific environmental factors is achieved with high recognition accuracy.
It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working processes of the system and the apparatus described above may refer to the corresponding processes in the method embodiments, and are not described in detail again in this application. In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus, and method may be implemented in other ways. The apparatus embodiments described above are merely illustrative: the division of the modules is merely a logical division, and there may be other divisions in actual implementation; for example, a plurality of modules or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections of devices or modules through some communication interfaces, and may be electrical, mechanical, or in other forms.
The modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
If the functions are implemented in the form of software functional units and sold or used as a stand-alone product, they may be stored in a processor-executable non-volatile computer-readable storage medium. Based on such understanding, the technical solution of the present application, in essence, or the portion contributing to the prior art, may be embodied in the form of a software product; the software product is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the methods of the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash disk, a removable hard disk, a ROM, a RAM, a magnetic disk, or an optical disk.
The above description covers only specific embodiments of the present application, but the protection scope of the present application is not limited thereto. Any change or substitution that can be easily conceived by a person skilled in the art within the technical scope disclosed in the present application shall fall within the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (14)

1. A method of speech recognition, the method comprising:
acquiring a sample voice signal of a target service provider and a recognized text corresponding to the sample voice signal;
retraining the pre-trained voice recognition model based on the sample voice signal of the target service provider and the corresponding recognized text to obtain a special voice recognition model of the target service provider; the pre-trained voice recognition model is obtained by training based on sample voice signals of a plurality of service providers and corresponding recognized texts and sample voice signals of a plurality of service requesters and corresponding recognized texts;
after the voice signal to be recognized of the target service provider is obtained, the voice signal to be recognized is input into a voice recognition model special for the target service provider, and a recognition text is obtained.
2. The method according to claim 1, further comprising, before the retraining the pre-trained speech recognition model based on the sample speech signal and the corresponding recognized text of the target service provider to obtain the speech recognition model specific to the target service provider:
acquiring at least one type of in-vehicle environment information of a target vehicle used by the target service provider;
performing voice processing on the sample voice signal of the target service provider based on the acquired at least one type of in-vehicle environment information to obtain a processed sample voice signal;
the retraining the pre-trained speech recognition model based on the sample speech signal of the target service provider and the corresponding recognized text to obtain the speech recognition model special for the target service provider comprises:
and retraining the pre-trained voice recognition model based on the processed sample voice signal and the recognized text corresponding to the sample voice signal to obtain the voice recognition model special for the target service provider.
3. The method of claim 2, wherein the in-vehicle environmental information comprises one of:
noise information generated by the target vehicle in a driving state;
actual position information of each vehicle component and relative position information among the vehicle components, which are arranged in the target vehicle;
and the position information of a voice receiving device arranged in the target vehicle.
4. The method of claim 1, wherein training a speech recognition model based on sample speech signals and corresponding recognized text for a plurality of service providers and sample speech signals and corresponding recognized text for a plurality of service requesters comprises:
extracting voice features from the sample voice signals of the service provider and the service requester;
and taking the voice features extracted from the sample voice signal as the input of a voice recognition model to be trained, taking the recognized text corresponding to the sample voice signal as the output of the voice recognition model to be trained, and training to obtain the voice recognition model.
5. The method of claim 4, wherein the speech recognition models include an encoder model, an attention model, and a decoder model; the training of the speech recognition model by using the speech features extracted from the sample speech signal as the input of the speech recognition model to be trained and using the recognized text corresponding to the sample speech signal as the output of the speech recognition model to be trained comprises:
inputting voice features extracted from a sample voice signal into an encoder model to be trained to obtain a voice encoding vector corresponding to the voice features;
inputting the voice coding vector corresponding to the voice feature into an attention model to be trained, and outputting an attention feature vector according to a weighted summation operation result between an attention parameter value of the attention model to be trained and the voice coding vector;
inputting the attention feature vector into a decoder model to be trained, and obtaining a recognized text output by the decoder model to be trained according to the attention feature vector and historical node state information of the decoder model; and comparing the output recognized text with the recognized text corresponding to the sample voice signal until the matching degree between the output recognized text and the recognized text corresponding to the sample voice signal reaches a preset threshold value, and stopping training.
6. The method of claim 5, wherein outputting an attention feature vector according to a result of a weighted summation operation between an attention parameter value of the attention model to be trained and the speech coding vector comprises:
for each speech coding value in the speech coding vector, determining an attention parameter value of the speech coding value from the attention parameter values of the attention model to be trained;
Carrying out weighted summation operation on each voice coding value and the attention parameter value of each voice coding value to obtain an attention characteristic value corresponding to each output word of the decoder model;
and combining the attention characteristic values corresponding to the output words of the decoder model in sequence to obtain an attention characteristic vector.
7. The method of claim 5, wherein obtaining the recognized text output by the decoder model to be trained according to the attention feature vector and the historical node state information of the decoder model comprises:
aiming at each word to be output of the decoder model to be trained, obtaining the word output by the decoder model to be trained according to the attention characteristic value corresponding to the word to be output and the historical node state information of the decoder model corresponding to the word to be output;
and sequentially combining the output words to obtain the recognized text output by the decoder model to be trained.
8. The method according to any one of claims 1 to 7, further comprising:
extracting voice characteristics from the obtained sample voice signal of the service provider to be authenticated;
and inputting the voice characteristics into an identity authentication model which is trained in advance for a target service provider, performing identity authentication on a service provider to be authenticated, and determining the service provider to be authenticated as the target service provider after the authentication is passed.
9. The method of claim 8, wherein there are a plurality of sample voice signals of the target service provider; training the identity authentication model according to the following steps:
extracting voice features from each sample voice signal of a target service provider;
and sequentially taking each voice characteristic as the input of the identity authentication model to be trained, and taking the identity recognition determination result corresponding to the voice characteristic as the output of the identity authentication model to be trained to obtain the trained identity authentication model.
10. A method of speech recognition, the method comprising:
acquiring a sample voice signal of a target service provider and a recognized text corresponding to the sample voice signal;
retraining the pre-trained voice recognition model based on the sample voice signal of the target service provider and the corresponding recognized text to obtain a special voice recognition model of the target service provider; the pre-trained voice recognition model is obtained by training based on sample voice signals of a plurality of service providers and corresponding recognized texts and sample voice signals of a plurality of service requesters and corresponding recognized texts;
and sending the voice recognition model to the electronic equipment needing voice recognition, so that the electronic equipment can recognize the voice signal to be recognized based on the voice recognition model to obtain a recognized text.
11. A speech recognition apparatus, characterized in that the apparatus comprises:
the system comprises an acquisition module, a recognition module and a processing module, wherein the acquisition module is used for acquiring a sample voice signal of a target service provider and a recognized text corresponding to the sample voice signal;
the training module is used for retraining a pre-trained voice recognition model based on the sample voice signal of the target service provider and the corresponding recognized text to obtain a special voice recognition model of the target service provider; the pre-trained voice recognition model is obtained by training based on sample voice signals of a plurality of service providers and corresponding recognized texts and sample voice signals of a plurality of service requesters and corresponding recognized texts;
and the recognition module is used for inputting the voice signal to be recognized into a special voice recognition model of the target service provider after the voice signal to be recognized of the target service provider is obtained, so as to obtain a recognition text.
12. A speech recognition apparatus, characterized in that the apparatus comprises:
the system comprises an acquisition module, a recognition module and a processing module, wherein the acquisition module is used for acquiring a sample voice signal of a target service provider and a recognized text corresponding to the sample voice signal;
the training module is used for retraining a pre-trained voice recognition model based on the sample voice signal of the target service provider and the corresponding recognized text to obtain a special voice recognition model of the target service provider; the pre-trained voice recognition model is obtained by training based on sample voice signals of a plurality of service providers and corresponding recognized texts and sample voice signals of a plurality of service requesters and corresponding recognized texts;
and the recognition module is used for sending the voice recognition model to the electronic equipment needing voice recognition so that the electronic equipment can recognize the voice signal to be recognized based on the voice recognition model to obtain a recognition text.
13. An electronic device, comprising: a processor, a storage medium and a bus, the storage medium storing machine-readable instructions executable by the processor, the processor and the storage medium communicating via the bus when the electronic device is operating, the processor executing the machine-readable instructions to perform the steps of the speech recognition method according to any one of claims 1 to 10.
14. A computer-readable storage medium, characterized in that a computer program is stored on the computer-readable storage medium, which computer program, when being executed by a processor, performs the steps of the speech recognition method as claimed in any one of the claims 1 to 10.
CN201910412514.0A 2019-05-17 2019-05-17 Voice recognition method and device, electronic equipment and storage medium Pending CN111862945A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910412514.0A CN111862945A (en) 2019-05-17 2019-05-17 Voice recognition method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910412514.0A CN111862945A (en) 2019-05-17 2019-05-17 Voice recognition method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN111862945A true CN111862945A (en) 2020-10-30

Family

ID=72965987

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910412514.0A Pending CN111862945A (en) 2019-05-17 2019-05-17 Voice recognition method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111862945A (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150161999A1 (en) * 2013-12-09 2015-06-11 Ravi Kalluri Media content consumption with individualized acoustic speech recognition
CN107545889A (en) * 2016-06-23 2018-01-05 华为终端(东莞)有限公司 Suitable for the optimization method, device and terminal device of the model of pattern-recognition
US20190005946A1 (en) * 2017-06-28 2019-01-03 Baidu Online Network Technology (Beijing) Co., Ltd. Method and apparatus for correcting speech recognition result, device and computer-readable storage medium
CN108417202A (en) * 2018-01-19 2018-08-17 苏州思必驰信息科技有限公司 Audio recognition method and system
CN109192199A (en) * 2018-06-30 2019-01-11 中国人民解放军战略支援部队信息工程大学 A kind of data processing method of combination bottleneck characteristic acoustic model
CN109273011A (en) * 2018-09-04 2019-01-25 国家电网公司华东分部 A kind of the operator's identification system and method for automatically updated model

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112614484A (en) * 2020-11-23 2021-04-06 北京百度网讯科技有限公司 Feature information mining method and device and electronic equipment
CN112614484B (en) * 2020-11-23 2022-05-20 北京百度网讯科技有限公司 Feature information mining method and device and electronic equipment
CN112652300A (en) * 2020-12-24 2021-04-13 百果园技术(新加坡)有限公司 Multi-party speech sound identification method, device, equipment and storage medium
CN112652300B (en) * 2020-12-24 2024-05-17 百果园技术(新加坡)有限公司 Multiparty speech sound recognition method, device, equipment and storage medium
WO2022141867A1 (en) * 2020-12-29 2022-07-07 平安科技(深圳)有限公司 Speech recognition method and apparatus, and electronic device and readable storage medium
CN113066485A (en) * 2021-03-25 2021-07-02 支付宝(杭州)信息技术有限公司 Voice data processing method, device and equipment
CN113066485B (en) * 2021-03-25 2024-05-17 支付宝(杭州)信息技术有限公司 Voice data processing method, device and equipment
CN113313231A (en) * 2021-04-28 2021-08-27 上海淇玥信息技术有限公司 Anti-fraud method and system based on LipNet algorithm and electronic equipment


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination