CN118155604A - Speech recognition method, system, device, vehicle, electronic equipment and storage medium - Google Patents
Speech recognition method, system, device, vehicle, electronic equipment and storage medium
- Publication number: CN118155604A (application CN202211509971.XA)
- Authority: CN (China)
- Prior art keywords: audio, dialect, acoustic recognition, acoustic, model
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G10L15/005 — Speech recognition; language recognition
- G10L15/063 — Speech recognition; training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/26 — Speech recognition; speech to text systems
Abstract
The embodiments of the present application provide a speech recognition method, system and apparatus, a vehicle, an electronic device, and a storage medium, and relate to the technical field of computers. After a user authorizes information acquisition and enables the voice human-machine interaction function, an embodiment of the present application can acquire audio to be recognized, determine an acoustic recognition result corresponding to the audio to be recognized according to a pre-trained acoustic recognition model, and then determine a speech recognition result corresponding to the acoustic recognition result according to a pre-trained language processing model. In this process, multiple dialects can be recognized without a separately configured classification algorithm, so the embodiments of the present application can improve both the accuracy of speech recognition and its response speed.
Description
Technical Field
The present application relates to the field of computer technology, and in particular to a speech recognition method, system and apparatus, as well as to a vehicle, an electronic device, and a storage medium.
Background
With the development of computer technology, devices have become increasingly intelligent, and the accuracy and response speed of speech recognition during voice interaction are important factors in how intelligent a device is.
In the related art, human-machine interaction devices recognize Mandarin with high accuracy, but they cannot recognize dialects accurately and efficiently. Therefore, when a user interacts with a human-machine interaction device in a dialect, problems such as low speech recognition accuracy and slow speech recognition response may occur.
Disclosure of Invention
In view of the above, embodiments of the present application provide a method, a system, an apparatus, a vehicle, an electronic device, and a storage medium for voice recognition, so as to improve the accuracy of voice recognition and the response speed of voice recognition.
In a first aspect, a method for speech recognition is provided, the method comprising:
Acquiring the audio to be recognized.
And determining an acoustic recognition result corresponding to the audio to be recognized according to a pre-trained acoustic recognition model, wherein the acoustic recognition result at least comprises a dialect type and an audio text, the acoustic recognition model at least comprises a decoder and a plurality of encoders, and each encoder is trained based on different dialect training sample sets.
And determining a voice recognition result corresponding to the acoustic recognition result according to a pre-trained language processing model.
In some embodiments, the acoustic recognition model is trained based on the steps of:
A target training set is obtained, wherein the target training set comprises a plurality of sub-training sets, and the sub-training sets at least comprise sample audio corresponding to dialects, labeled text and dialect labels.
And inputting sample audio in the sub-training set into each encoder in the acoustic recognition model, and training the encoder corresponding to the dialect label according to the output result of the encoder corresponding to the dialect label and the labeling text.
And training a decoder in the acoustic recognition model according to the output result of each encoder in the acoustic recognition model and the labeling text.
In some embodiments, the method further comprises:
One or more pieces of audio to be processed are acquired.
And generating a dialect audio dictionary according to the single-word audio in the audio to be processed and the labeling text corresponding to the single-word audio.
And generating a target training set according to the dialect audio dictionary.
In some embodiments, the generating the dialect audio dictionary corresponding to each dialect type according to the single-word audio in the audio to be processed and the labeling text corresponding to the single-word audio includes:
responding to the fact that the pronunciation modes corresponding to the single-word audios with different dialect types and the same labeling text are the same, and generating a dialect audio dictionary corresponding to the dialect types with the same pronunciation modes according to the single-word audios with the same pronunciation modes and the corresponding labeling text.
In some embodiments, the method further comprises:
In response to receiving the audio to be identified, the sample audio, or the audio to be processed, pre-processing the audio to be identified, the sample audio, or the audio to be processed, the pre-processing including at least echo cancellation, audio noise reduction, voice activity detection, and feature extraction.
In some embodiments, the acoustic recognition model further comprises a fully connected layer.
The determining the acoustic recognition result corresponding to the audio to be recognized according to the pre-trained acoustic recognition model comprises the following steps:
inputting the audio to be identified into each encoder in the acoustic identification model to determine the output result of each encoder.
And inputting the output result of each encoder into the full-connection layer to determine the fusion characteristic corresponding to the audio to be identified.
And inputting the fusion characteristic into a decoder in the acoustic recognition model to determine an acoustic recognition result corresponding to the audio to be recognized.
In some embodiments, the determining, according to a pre-trained language processing model, a speech recognition result corresponding to the acoustic recognition result includes:
and determining a target language processing model corresponding to the acoustic recognition result according to the dialect type in the acoustic recognition result.
And determining a voice recognition result corresponding to the acoustic recognition result according to the target language processing model.
In some embodiments, the language processing model is a multi-byte fragment n-gram language processing model.
In a second aspect, there is provided a speech recognition system, the system comprising:
And an audio input output unit configured to perform receiving or playing an audio signal.
A control unit configured to perform the method as described in the first aspect.
In a third aspect, there is provided a vehicle in which the speech recognition system according to the second aspect is provided.
In a fourth aspect, there is provided a speech recognition apparatus, the apparatus comprising:
and the audio to be identified acquisition module is configured to perform acquisition of the audio to be identified.
An acoustic recognition module configured to perform determining an acoustic recognition result corresponding to the audio to be recognized according to a pre-trained acoustic recognition model, the acoustic recognition result including at least a dialect type and an audio text, the acoustic recognition model including at least a decoder and a plurality of encoders, each of the encoders being trained based on a different set of dialect training samples, respectively.
And the language processing module is configured to determine a voice recognition result corresponding to the acoustic recognition result according to a pre-trained language processing model.
In some embodiments, the acoustic recognition model is trained based on the following modules:
The target training set acquisition module is configured to perform acquisition of a target training set, the target training set comprising a plurality of sub-training sets, the sub-training sets comprising at least sample audio of a corresponding dialect, a labeling text, and a dialect label.
And the first training module is configured to perform input of sample audio in the sub-training set into each encoder in the acoustic recognition model, and train the encoder corresponding to the dialect label according to the output result of the encoder corresponding to the dialect label and the labeling text.
A second training module configured to perform training of a decoder in the acoustic recognition model based on the output results of the respective encoders in the acoustic recognition model and the annotation text.
In some embodiments, the apparatus further comprises:
And the audio acquisition module to be processed is configured to acquire one or more pieces of audio to be processed.
And the dialect audio dictionary generating module is configured to execute the generation of the dialect audio dictionary according to the single-word audio in the audio to be processed and the labeling text corresponding to the single-word audio.
And the target training set generating module is configured to generate a target training set according to the dialect audio dictionary.
In some embodiments, the dialect audio dictionary generation module is specifically configured to:
responding to the fact that the pronunciation modes corresponding to the single-word audios with different dialect types and the same labeling text are the same, and generating a dialect audio dictionary corresponding to the dialect types with the same pronunciation modes according to the single-word audios with the same pronunciation modes and the corresponding labeling text.
In some embodiments, the apparatus further comprises:
A preprocessing module configured to perform preprocessing of the audio to be identified, the sample audio, or the audio to be processed in response to receiving the audio to be identified, the sample audio, or the audio to be processed, the preprocessing including at least echo cancellation, audio noise reduction, voice activity detection, and feature extraction.
In some embodiments, the acoustic recognition model further comprises a fully connected layer.
The acoustic identification module is specifically configured to:
inputting the audio to be identified into each encoder in the acoustic identification model to determine the output result of each encoder.
And inputting the output result of each encoder into the full-connection layer to determine the fusion characteristic corresponding to the audio to be identified.
And inputting the fusion characteristic into a decoder in the acoustic recognition model to determine an acoustic recognition result corresponding to the audio to be recognized.
In some embodiments, the language processing module is specifically configured to:
and determining a target language processing model corresponding to the acoustic recognition result according to the dialect type in the acoustic recognition result.
And determining a voice recognition result corresponding to the acoustic recognition result according to the target language processing model.
In some embodiments, the language processing model is a multi-byte fragment n-gram language processing model.
In a fifth aspect, an embodiment of the present application provides an electronic device comprising a memory and a processor, the memory storing one or more computer program instructions, wherein the one or more computer program instructions are executed by the processor to implement the method as described in the first aspect.
In a sixth aspect, embodiments of the present application provide a computer readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement a method according to the first aspect.
In a seventh aspect, embodiments of the present application provide a computer program product comprising a computer program/instruction which, when executed by a processor, implements a method as described in the first aspect.
According to the embodiments of the present application, after the audio to be recognized is acquired, it can be fed into each encoder in the acoustic recognition model, and the decoder outputs the acoustic recognition result. Because each encoder is trained on a different dialect training sample set, when the encoders process the same audio to be recognized, the output of the encoder matching that dialect carries a significantly larger share of the information than the outputs of the other encoders, so the acoustic recognition result corresponds to the audio to be recognized; that is, the dialect type and the audio text of the audio can be accurately recognized. The embodiments of the present application can then determine an accurate speech recognition result from the accurate dialect type and audio text. In this process, multiple dialects can be recognized without a separately configured classification algorithm, so the embodiments of the present application can improve both the accuracy of speech recognition and its response speed.
Drawings
The above and other objects, features and advantages of embodiments of the present application will become more apparent from the following description of embodiments of the present application with reference to the accompanying drawings, in which:
FIG. 1 is a schematic diagram of a speech recognition system according to an embodiment of the present application;
FIG. 2 is a flow chart of a speech recognition method according to an embodiment of the application;
FIG. 3 is a flow chart of determining an acoustic recognition result according to an embodiment of the present application;
FIG. 4 is a flowchart illustrating another method for determining an acoustic recognition result according to an embodiment of the present application;
FIG. 5 is a flow chart of training an acoustic recognition model in an embodiment of the present application;
FIG. 6 is a flow chart of generating a target training set in an embodiment of the application;
FIG. 7 is a flow chart of generating a dialect audio dictionary in an embodiment of the present application;
FIG. 8 is a schematic view of a vehicle according to an embodiment of the present application;
FIG. 9 is a schematic diagram of a voice recognition device according to an embodiment of the present application;
Fig. 10 is a schematic structural diagram of an electronic device according to an embodiment of the application.
Detailed Description
The present application is described below based on embodiments, but it is not limited to these embodiments. In the following detailed description of the present application, certain specific details are set forth; those skilled in the art will, however, fully understand the application even without these details. Well-known methods, procedures, flows, components and circuits are not described in detail so as not to obscure the essence of the application.
Moreover, those of ordinary skill in the art will appreciate that the drawings are provided herein for illustrative purposes and that the drawings are not necessarily drawn to scale.
Unless the context clearly requires otherwise, the words "comprise," "comprising," and the like in the description are to be construed in an inclusive sense, as opposed to an exclusive or exhaustive sense; that is, in the sense of "including but not limited to".
In the description of the present application, it should be understood that the terms "first," "second," and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance. Furthermore, in the description of the present application, unless otherwise indicated, "a plurality" means two or more. In addition, in the present application, data are acquired, collected, stored, used, processed, transmitted, provided and presented only with authorization, in compliance with the relevant laws and regulations, without infringing the privacy of others and without violating public order and good morals.
With the development of computer technology, the degree of equipment intelligence is higher and higher. For example, in the context of intelligent vehicles, the vehicle center control may act as a human-machine interaction device. After the vehicle is started (i.e., the vehicle center control begins to run), the user can interact with the vehicle center control through voice. Specifically, after the vehicle central controller acquires the user voice, the text corresponding to the user voice can be determined, and actions such as semantic understanding, control instruction generation and the like are executed based on the text.
For another example, in a multi-device linkage scenario (e.g., smart home scenario), a central control device (e.g., smart speaker, etc.) for controlling each sub-device may be used as a human-computer interaction device. After the central control device is started, a user can interact with the central control device through voice. Specifically, after the central control device acquires the user voice, the text corresponding to the user voice can be determined, and actions such as semantic understanding, control instruction generation and the like are executed based on the text.
In the above scenario and similar scenarios, the accuracy of speech recognition and the speed of speech recognition at the time of speech interaction are important considerations for the degree of device intelligentization.
In the related art, the man-machine interaction device has higher recognition accuracy for mandarin, but the man-machine interaction device cannot recognize the dialect accurately and efficiently.
In one case, the man-machine interaction device in the related art is provided with only a voice recognition system for mandarin chinese. That is, when a user interacts with a man-machine interaction device using dialects, the man-machine interaction device may have a problem in that the user's voice cannot be recognized or the user's voice is erroneously recognized.
In another case, a classification algorithm is separately provided in the human-machine interaction device of the related art: when the device obtains the user's voice, it first classifies the voice with the classification algorithm and then selects a corresponding model according to the classification result to recognize the voice. That is, in this case, networks, models, or algorithms serving various functions are all provided in the human-machine interaction device, which affects the speed of speech recognition.
Therefore, in the related art, when a user interacts with a man-machine interaction device using dialects, problems such as low accuracy of voice recognition, low response speed of voice recognition, and the like may occur.
In order to solve the above-mentioned problems, an embodiment of the present application provides a voice recognition method for recognizing audio to be recognized through a single acoustic recognition model to determine an acoustic recognition result, thereby determining a voice recognition result according to the acoustic recognition result. The voice recognition method can be applied to electronic equipment and a voice recognition system, wherein the electronic equipment can be a terminal or a server, the terminal can be a smart phone, a tablet personal computer or a personal computer (Personal Computer, PC) and the like, and the server can be a single server, a server cluster configured in a distributed mode or a cloud server.
When the electronic device is a terminal, the electronic device can be used as a man-machine interaction device, the audio to be recognized is collected through the audio collection unit, and the voice recognition method is executed based on the audio to be recognized. When the electronic equipment is a server, the electronic equipment can receive the audio to be identified, which is acquired and sent by the man-machine interaction equipment, through the network, further execute the voice identification method based on the audio to be identified, and issue the control instruction through the network, so that the corresponding man-machine interaction equipment executes corresponding actions.
The voice recognition system may be disposed in an electronic device or a vehicle, and includes an audio input/output unit and a control unit, as shown in fig. 1, fig. 1 is a schematic diagram of the voice recognition system according to an embodiment of the present application.
The audio input output unit 111 in the speech recognition system 11 may be configured to receive or play audio signals. That is, the audio input output unit 111 may collect an audio signal from the environment, convert it from an analog signal to a digital signal through an analog-to-digital converter (A/D converter), and then use the converted audio as the audio to be recognized. On the other hand, the audio input output unit 111 may also play audio, thereby enabling human-machine interaction between the user 12 and the speech recognition system 11. Taking a vehicle as an example, the audio input output unit 111 may comprise one or more microphones in the vehicle for capturing audio and one or more speakers for playing audio.
The control unit 112 in the speech recognition system 11 may be configured to perform the speech recognition method described above. As shown in fig. 2, the speech recognition method may include the following steps:
In step S21, the audio to be recognized is acquired.
The audio to be identified can be acquired through an audio acquisition device which is arranged independently, and can also be acquired through an audio acquisition unit which is integrated in the electronic device or the vehicle.
Since environmental noise is inevitably collected when the audio to be identified is collected, in an alternative implementation manner, the embodiment of the present application may perform preprocessing on the audio to be identified in response to receiving the audio to be identified.
The preprocessing may include at least echo cancellation (Acoustic Echo Cancellation, AEC), audio noise reduction, voice activity detection (Voice Activity Detection, VAD), and feature extraction, among others.
Specifically, in one implementation, the embodiment of the present application may perform echo cancellation on the audio to be recognized using a normalized least mean squares (NLMS) algorithm or another suitable method, so as to enhance the effective speech signal, and may perform audio noise reduction using Wiener filtering (WF) or another suitable method, so as to reduce noise interference. The embodiment of the present application may then detect and extract the speech in the audio to be recognized with a VAD algorithm. Furthermore, features (such as Fbank features) can be extracted from the speech using a preset frame length and a preset frame shift.
For example, the preset frame length may be 25 milliseconds (ms) and the preset frame shift may be 10 ms. With this frame length and frame shift, the embodiment of the present application may split the speech into frames and extract 40-dimensional Fbank features, so that the Fbank feature F_i of the i-th utterance is a T × 40 feature matrix, where T is the number of time frames obtained after framing the speech.
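As an illustration only, the following sketch shows how a front end matching the parameters above (25 ms frames, 10 ms shift, 40-dimensional Fbank features) could be implemented. The patent does not name any toolkit; torchaudio and the file name are assumptions made here, and echo cancellation, noise reduction and VAD are assumed to have already been applied upstream.

```python
# Illustrative front-end sketch: 25 ms frame length, 10 ms frame shift, 40-dim Fbank.
import torch
import torchaudio
import torchaudio.compliance.kaldi as kaldi

def extract_fbank(wav_path: str) -> torch.Tensor:
    waveform, sample_rate = torchaudio.load(wav_path)   # (channels, samples)
    waveform = waveform.mean(dim=0, keepdim=True)        # mix down to a single channel
    feats = kaldi.fbank(
        waveform,
        sample_frequency=sample_rate,
        frame_length=25.0,   # preset frame length in ms
        frame_shift=10.0,    # preset frame shift in ms
        num_mel_bins=40,     # 40-dimensional Fbank features
    )
    return feats             # shape (T, 40): T time frames after framing

if __name__ == "__main__":
    fbank = extract_fbank("utterance.wav")   # hypothetical file name
    print(fbank.shape)
```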
In step S22, an acoustic recognition result corresponding to the audio to be recognized is determined according to the pre-trained acoustic recognition model.
The acoustic recognition result at least comprises a dialect type and an audio text, the acoustic recognition model at least comprises a decoder (decoder) and a plurality of encoders (encoders), and each encoder is trained based on different dialect training sample sets.
That is, each encoder in the embodiment of the present application may be used to recognize a different dialect (such as Mandarin, Sichuanese, or Cantonese). For example, when every encoder receives the same audio to be recognized (say, Cantonese audio), the encoder corresponding to Cantonese can recognize it effectively and produce an output carrying a large amount of information, whereas the other encoders, which cannot recognize the audio effectively, produce outputs carrying little information. The embodiment of the present application can then feed all encoder outputs into the decoder, which processes them and outputs the acoustic recognition result. Because the output of the Cantonese encoder accounts for the largest share, it has a much greater influence on the acoustic recognition result than the outputs of the other encoders, so the acoustic recognition result contains the Cantonese dialect type and the Cantonese audio text.
In particular, the model structure of the acoustic recognition model may be an encoder-decoder structure based on an attention mechanism (attention). In this structure, the attention mechanism can be used to calculate the correlation degree of the corresponding data, so that the embodiment of the application can enable the acoustic recognition model to give different weights to different audio data through the attention mechanism, thereby realizing more accurate acoustic recognition.
In addition, the embodiment of the present application may insert a fully connected layer between the encoders and the decoder. The fully connected layer can perform data fusion on the individual encoder outputs to obtain a fusion feature. In this way, the encoder outputs are fused before they are fed into the decoder, so the decoder does not need to fuse them itself, which reduces its computational burden.
Fig. 3 is a schematic flow chart of determining an acoustic recognition result according to an embodiment of the present application.
In determining the acoustic recognition result, the embodiment of the present application may input the audio 31 to be recognized into each encoder (the encoders 321 to 32n, where n is a natural number of 1 or more). Each encoder may perform processing such as feature extraction on the audio 31 to be recognized and determine an output result.
Further, after the encoders 321 to 32n determine the output results, the decoder 33 may determine the dialect type 341 and the audio text 342 from the respective output results.
According to the embodiment of the application, after the audio to be identified is acquired, the audio to be identified can be input into each encoder in the acoustic identification model, and the decoder outputs the acoustic identification result. Because each encoder trains based on different dialect training sample sets, when each encoder processes the same audio to be recognized, the output results of the encoders with the same dialect type can occupy a proportion obviously larger than that of other encoders, so that the acoustic recognition results correspond to the audio to be recognized, namely, the dialect type and the audio text of the audio to be recognized can be accurately recognized.
In an alternative embodiment, the acoustic recognition model may further include a full connection layer, and further, the step S22 may include the steps of:
In step S221, the audio to be recognized is input into each encoder in the acoustic recognition model to determine the output result of each encoder.
Fig. 4 is a schematic flow chart of another method for determining an acoustic recognition result according to an embodiment of the present application.
In combination with step S221, after the audio 41 to be identified is acquired, the audio 41 to be identified may be input to each encoder (encoder 421 to encoder 42n, where n is a natural number greater than or equal to 1). Each encoder may perform processing such as feature extraction on the audio 41 to be recognized and determine an output result.
In step S222, the output result of each encoder is input to the full-connection layer to determine the fusion feature corresponding to the audio to be identified.
The full connection layer can be used for carrying out data fusion processing on each output result so as to obtain fusion characteristics. Therefore, through the full connection layer in the acoustic recognition model, data fusion processing can be performed on each output result in advance before each output result is input into the decoder, so that the decoder does not need to fuse each output result, and the operation pressure of the decoder is reduced.
As shown in fig. 4, after the encoders 421 to 42n determine the output results, the embodiment of the present application may input each output result to the full connection layer 43, so that the full connection layer 43 performs data fusion processing on each output result to determine the fusion characteristics.
In step S223, the fusion feature is input into a decoder in the acoustic recognition model to determine an acoustic recognition result corresponding to the audio to be recognized.
After the full connection layer 43 determines the fusion characteristics, as shown in fig. 4, embodiments of the present application may input the fusion characteristics to the decoder 44, and the decoder 44 may determine the dialect type 451 and the audio text 452 according to the fusion characteristics.
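The following is a minimal, illustrative PyTorch sketch of the structure walked through in steps S221 to S223: one encoder per dialect, a fully connected layer that fuses the encoder outputs into the fusion feature, and an attention-based decoder that yields the dialect type and the audio text. The specific layer types, sizes and output heads are assumptions made for this sketch and are not taken from the patent.

```python
# Illustrative only: a minimal PyTorch sketch of the structure in steps S221-S223.
import torch
import torch.nn as nn

class MultiDialectASR(nn.Module):
    def __init__(self, num_dialects: int, feat_dim: int = 40,
                 d_model: int = 256, vocab_size: int = 5000):
        super().__init__()
        self.proj = nn.Linear(feat_dim, d_model)
        # One encoder per dialect, each trained on its own dialect training sample set.
        self.encoders = nn.ModuleList([
            nn.TransformerEncoder(
                nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True),
                num_layers=2)
            for _ in range(num_dialects)
        ])
        # Fully connected layer that fuses the concatenated encoder outputs (step S222).
        self.fusion = nn.Linear(num_dialects * d_model, d_model)
        # Attention-based decoder (step S223).
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True),
            num_layers=2)
        self.text_head = nn.Linear(d_model, vocab_size)        # audio text
        self.dialect_head = nn.Linear(d_model, num_dialects)   # dialect type

    def forward(self, feats: torch.Tensor, tgt_embeddings: torch.Tensor):
        # feats: (batch, T, feat_dim) Fbank features of the audio to be recognized
        # tgt_embeddings: (batch, tgt_len, d_model) embedded decoder inputs
        x = self.proj(feats)
        enc_outs = [enc(x) for enc in self.encoders]          # step S221: per-encoder outputs
        fused = self.fusion(torch.cat(enc_outs, dim=-1))       # step S222: fusion feature
        dec_out = self.decoder(tgt_embeddings, fused)          # step S223: decode
        return self.dialect_head(fused.mean(dim=1)), self.text_head(dec_out)
```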
According to the embodiment of the application, after the audio to be identified is acquired, the audio to be identified can be input into each encoder in the acoustic identification model, and the decoder outputs the acoustic identification result. Because each encoder trains based on different dialect training sample sets, when each encoder processes the same audio to be recognized, the output results of the encoders with the same dialect type can occupy a proportion obviously larger than that of other encoders, so that the acoustic recognition results correspond to the audio to be recognized, namely, the dialect type and the audio text of the audio to be recognized can be accurately recognized.
In step S23, a speech recognition result corresponding to the acoustic recognition result is determined according to the pre-trained language processing model.
The language processing model may be used for natural language processing, which may include, but is not limited to, natural language understanding and natural language generation. For example, the language processing model may perform natural language understanding on the dialect type and the audio text in the acoustic recognition result; based on the result of that semantic understanding, it may then generate a narrative that describes, summarizes, or interprets the input (structured data) in a human-like manner, and feed this result back.
According to the embodiment of the application, after the audio to be identified is acquired, the audio to be identified can be input into each encoder in the acoustic identification model, and the decoder outputs the acoustic identification result. Because each encoder trains based on different dialect training sample sets, when each encoder processes the same audio to be recognized, the output results of the encoders with the same dialect type can occupy a proportion obviously larger than that of other encoders, so that the acoustic recognition results correspond to the audio to be recognized, namely, the dialect type and the audio text of the audio to be recognized can be accurately recognized.
Furthermore, the embodiment of the application can determine the accurate voice recognition result according to the accurate dialect type and the audio text. In the process, the embodiment of the application can realize the recognition of various dialects without independently setting a classification algorithm, so that the embodiment of the application can improve the accuracy of voice recognition and the response speed of voice recognition.
In an alternative embodiment, the language processing model may be a multi-byte fragment (n-gram) language processing model.
An n-gram is a language model used for large-vocabulary continuous speech recognition; it performs natural language processing by exploiting the collocation information between adjacent words in context. The model is based on the assumption that the occurrence of the n-th word depends only on the preceding n-1 words and on no other word, so the probability of a whole sentence is the product of the conditional probabilities of its words. These probabilities can be obtained by directly counting how often sequences of n words co-occur in a corpus. For example, a 4-gram is a 4-tuple fragment language processing model. In addition, the n-gram language processing model may be trained with the KenLM language modeling toolkit.
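By way of illustration, the snippet below scores a candidate transcription with a 4-gram model through the KenLM Python bindings mentioned above. The model is assumed to have been trained offline (for example with `lmplz -o 4 < corpus.txt > dialect_4gram.arpa`); the file name and the example sentence are placeholders.

```python
# Illustrative only: scoring a candidate transcription with a 4-gram KenLM model.
import kenlm

lm = kenlm.Model("dialect_4gram.arpa")          # assumed, pre-trained ARPA file

candidate = "打开 车窗"                          # hypothetical recognized text, whitespace-tokenized
print(lm.score(candidate, bos=True, eos=True))  # log10 probability of the whole sentence

# Per-n-gram breakdown: which word collocations drive the score, and which words are OOV.
for log_prob, ngram_length, oov in lm.full_scores(candidate):
    print(log_prob, ngram_length, oov)
```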
In an alternative embodiment, the step S23 may include the steps of:
in step S231, a target language processing model corresponding to the acoustic recognition result is determined according to the dialect type in the acoustic recognition result.
In step S232, a speech recognition result corresponding to the acoustic recognition result is determined according to the target language processing model.
According to the embodiment of the application, different language processing models can be set according to different dialect types, and after the acoustic recognition result is determined, a corresponding target language processing model can be determined according to the dialect types in the acoustic recognition result. Thus, the language processing process can be more targeted.
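A minimal sketch of this dispatch (steps S231–S232) might look as follows; the dialect names, model paths and result fields are illustrative assumptions rather than anything fixed by the patent.

```python
# Illustrative only: selecting the target language processing model by dialect type.
import kenlm

lm_by_dialect = {
    "mandarin": kenlm.Model("mandarin_4gram.arpa"),
    "sichuan": kenlm.Model("sichuan_4gram.arpa"),
    "cantonese": kenlm.Model("cantonese_4gram.arpa"),
}

def rescore(acoustic_result: dict) -> float:
    # Step S231: pick the target language processing model for the recognized dialect.
    target_lm = lm_by_dialect[acoustic_result["dialect_type"]]
    # Step S232: use it to score / post-process the recognized audio text.
    return target_lm.score(acoustic_result["audio_text"], bos=True, eos=True)
```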
In an alternative embodiment, as shown in fig. 5, the process of training the acoustic recognition model may include the steps of:
In step S51, a target training set is acquired.
The target training set comprises a plurality of sub-training sets, and the sub-training sets at least comprise sample audio corresponding to dialects, labeled text and dialect labels. Specifically, the labeling text is a text corresponding to the voice in the sample audio, that is, the labeling text is voice content corresponding to the sample audio. The dialect labels are used to characterize the types of dialects (e.g., mandarin, sichuan, cantonese, etc.) to which the sample audio corresponds.
In addition, after the sample audio in the target training set is obtained, the embodiment of the application can preprocess the sample audio. The preprocessing may include at least echo cancellation, audio noise reduction, voice activity detection, and feature extraction, among others.
In an alternative implementation, as shown in fig. 6, the process of generating the target training set according to the embodiment of the present application may include the following steps:
In step S61, one or more pieces of audio to be processed are acquired.
After the audio to be processed is acquired, the embodiment of the application can preprocess the audio to be processed. The preprocessing may include at least echo cancellation, audio noise reduction, voice activity detection, and feature extraction, among others.
In step S62, a dialect audio dictionary is generated according to the single-word audio in the audio to be processed and the labeling text corresponding to the single-word audio.
The embodiment of the application can generate a corresponding dialect audio dictionary for each dialect type, and can also generate 1 dialect audio dictionary based on each dialect type.
Specifically, according to the word frequency of the single-word audio in the audio to be processed, the embodiment of the present application may remove words whose word frequency is below a preset threshold and retain words whose word frequency is greater than or equal to the preset threshold, thereby generating the dialect audio dictionary. The preset threshold may be a suitable value such as 40, 50, or 60.
In an alternative embodiment, the step S62 may include the following steps:
In step S621, in response to the same pronunciation mode corresponding to each single word audio with different dialect types and the same labeling text, a dialect audio dictionary corresponding to the dialect type with the same pronunciation mode is generated according to each single word audio with the same pronunciation mode and the corresponding labeling text.
That is, in the process of generating the dialect audio dictionary, the embodiment of the application can combine single word audios with the same pronunciation mode to generate the dialect audio dictionary.
For example, as shown in fig. 7, the process of generating a dialect audio dictionary according to an embodiment of the present application may include the following steps:
In step S71, single-word audio of mandarin chinese in each of the audio to be processed is determined, and the set M is generated according to the single-word audio with a word frequency of 50 or more.
In practical applications, because Mandarin is what most people speak, the embodiment of the present application may first generate the set M based on Mandarin and then add the other dialects to the set M to generate the dialect audio dictionary.
In step S72, it is determined whether the pronunciation of the dialect is the same as that of mandarin chinese, if the pronunciation of the dialect is the same as that of mandarin chinese, step S74 is executed, and if the pronunciation of the dialect is different from that of mandarin chinese, step S73 is executed.
That is, the embodiment of the application can combine the single word audios with the same pronunciation mode and add the single word audios into the set M.
In step S73, the single-word audio under the dialect whose word frequency is greater than or equal to 50 is added to the set M.
In step S74, the single-word audio with the dialect word frequency greater than or equal to 50 is compared with the set M, and the single-word audio not existing in the set M is added into the set M.
That is, in addition to combining word audio similar to the mandarin pronunciation mode with mandarin, the embodiment of the present application may also add word audio with unique pronunciation under the dialect to the set M, so as to expand the number of word audio in the dialect audio dictionary.
In step S75, a dialect audio dictionary is generated.
The dialect audio dictionary generated in step S75 includes both single-word audio corresponding to mandarin, single-word audio similar to mandarin, and single-word audio different from mandarin. Therefore, through the embodiment of the application, the number and the variety of single-word audios of the dialect audio dictionary can be expanded.
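For illustration, the sketch below reproduces the flow of steps S71 to S75 on a hypothetical list of single-word audio entries. The frequency threshold of 50 and the merge rules are taken from the description above; the data layout, field names and function name are assumptions made only for this sketch.

```python
# Illustrative only: building the dialect audio dictionary (steps S71-S75).
from collections import Counter

FREQ_THRESHOLD = 50

def build_dialect_audio_dictionary(entries):
    # entries: list of dicts such as
    # {"word": "你", "dialect": "mandarin", "pronunciation": "ni3", "audio": ...}
    freq = Counter((e["dialect"], e["word"]) for e in entries)

    # Step S71: seed the set M with Mandarin single-word audio of frequency >= 50.
    dictionary = {}   # keyed by (word, pronunciation)
    for e in entries:
        if e["dialect"] == "mandarin" and freq[("mandarin", e["word"])] >= FREQ_THRESHOLD:
            dictionary[(e["word"], e["pronunciation"])] = e

    mandarin_pron = {e["word"]: e["pronunciation"]
                     for e in entries if e["dialect"] == "mandarin"}

    for e in entries:
        if e["dialect"] == "mandarin" or freq[(e["dialect"], e["word"])] < FREQ_THRESHOLD:
            continue
        key = (e["word"], e["pronunciation"])
        if e["pronunciation"] == mandarin_pron.get(e["word"]):
            # Steps S72/S74: same pronunciation as Mandarin -> add only if not already in M.
            dictionary.setdefault(key, e)
        else:
            # Step S73: dialect-specific pronunciation -> add to M to widen coverage.
            dictionary[key] = e

    # Step S75: the merged set M is the dialect audio dictionary.
    return dictionary
```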
In step S63, a target training set is generated from the dialect audio dictionary.
In the process of generating the target training set, the embodiment of the application can take single-word audio or a combination of single-word audio in the dialect audio dictionary as sample audio in the sub-training set. And taking the marked text corresponding to the single-word audio in the dialect audio dictionary as the marked text in the sub-training set. And taking the dialect type corresponding to the single-word audio in the dialect audio dictionary as the dialect label in the sub-training set.
In practical application, the embodiment of the application can generate the target training set based on the dialect audio dictionary without temporarily collecting sample data, thereby improving the model training efficiency.
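Continuing the same illustrative assumptions, a sketch of step S63 could turn the dictionary entries into per-dialect sub-training sets of (sample audio, labeled text, dialect label) triples:

```python
# Illustrative only: generating the target training set from the dialect audio dictionary.
from collections import defaultdict

def build_target_training_set(dictionary):
    sub_training_sets = defaultdict(list)
    for (word, _pron), entry in dictionary.items():
        sub_training_sets[entry["dialect"]].append({
            "sample_audio": entry["audio"],   # single-word audio (or a combination of them)
            "labeling_text": word,            # text corresponding to the speech
            "dialect_label": entry["dialect"],
        })
    return dict(sub_training_sets)
```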
In step S52, the sample audio in the sub-training set is input to each encoder in the acoustic recognition model, and the encoder corresponding to the dialect label is trained according to the output result of the encoder corresponding to the dialect label and the labeling text.
In step S53, the decoder in the acoustic recognition model is trained based on the output results of the respective encoders in the acoustic recognition model and the labeling text.
The encoders and the decoder in the acoustic recognition model can be trained by stochastic gradient descent (SGD), and in particular by mini-batch stochastic gradient descent (mini-batch SGD).
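A hedged sketch of the mini-batch SGD training in steps S52 and S53 is given below, reusing the illustrative MultiDialectASR model from earlier. The cross-entropy losses, the joint optimization of all encoders and the decoder, and the hyperparameters are simplifications and assumptions; the patent only specifies that the encoders and the decoder are trained with (mini-batch) stochastic gradient descent against the labeled text.

```python
# Illustrative only: mini-batch SGD training loop for the MultiDialectASR sketch above.
import torch
import torch.nn as nn
import torch.nn.functional as F

def train_one_epoch(model, data_loader, pad_id: int = 0, lr: float = 0.01):
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)   # mini-batch SGD
    text_criterion = nn.CrossEntropyLoss(ignore_index=pad_id)
    model.train()
    for feats, tgt_in_emb, tgt_out_ids, dialect_ids in data_loader:
        optimizer.zero_grad()
        dialect_logits, text_logits = model(feats, tgt_in_emb)
        # Loss against the labeling text (trains the decoder and, through it, the encoders).
        loss = text_criterion(text_logits.reshape(-1, text_logits.size(-1)),
                              tgt_out_ids.reshape(-1))
        # Loss against the dialect label, so the matching encoder's contribution dominates.
        loss = loss + F.cross_entropy(dialect_logits, dialect_ids)
        loss.backward()
        optimizer.step()
```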
After the acoustic recognition model and the language processing model are trained, the embodiment of the application can determine the acoustic recognition result corresponding to the audio to be recognized based on the trained acoustic recognition model, and determine the voice recognition result corresponding to the acoustic recognition result based on the trained language processing model.
Therefore, according to the embodiment of the application, after the audio to be identified is acquired, the audio to be identified can be input into each encoder in the acoustic identification model, and the decoder outputs the acoustic identification result. Because each encoder trains based on different dialect training sample sets, when each encoder processes the same audio to be recognized, the output results of the encoders with the same dialect type can occupy a proportion obviously larger than that of other encoders, so that the acoustic recognition results correspond to the audio to be recognized, namely, the dialect type and the audio text of the audio to be recognized can be accurately recognized.
Furthermore, the embodiment of the application can determine the accurate voice recognition result according to the accurate dialect type and the audio text. In the process, the embodiment of the application can realize the recognition of various dialects without independently setting a classification algorithm, so that the embodiment of the application can improve the accuracy of voice recognition and the response speed of voice recognition.
Based on the same technical concept, the embodiment of the application also provides a vehicle, as shown in fig. 8, in which the voice recognition system shown in fig. 1 is arranged.
In practical applications, the driver may interact with the voice recognition system 11 during driving of the vehicle 81, and the voice recognition system 11 may recognize the voice of the driver through the above-mentioned voice recognition method.
In this process, the voice recognition system 11 may acquire the voice of the driver and take the voice of the driver as the audio to be recognized, and further, the voice recognition system 11 may input the audio to be recognized into each encoder in the acoustic recognition model and output the acoustic recognition result by the decoder. Because each encoder trains based on different dialect training sample sets, when each encoder processes the same audio to be recognized, the output results of the encoders with the same dialect type can occupy a proportion obviously larger than that of other encoders, so that the acoustic recognition results correspond to the audio to be recognized, namely, the dialect type and the audio text of the audio to be recognized can be accurately recognized.
Further, the speech recognition system 11 may determine accurate speech recognition results based on the accurate dialect type and the audio text. In this process, since the speech recognition system 11 can recognize a plurality of dialects without separately setting a classification algorithm, the speech recognition system 11 can improve the accuracy of speech recognition and the response speed of speech recognition.
Based on the same technical concept, the embodiment of the application also provides a voice recognition device, as shown in fig. 9, which comprises: an audio acquisition module 91 to be recognized, an acoustic recognition module 92, and a language processing module 93.
The audio to be recognized acquisition module 91 is configured to perform acquisition of audio to be recognized.
An acoustic recognition module 92 configured to perform determining an acoustic recognition result corresponding to the audio to be recognized according to a pre-trained acoustic recognition model, the acoustic recognition result comprising at least a dialect type and an audio text, the acoustic recognition model comprising at least a decoder and a plurality of encoders, each of the encoders being trained based on a different set of dialect training samples, respectively.
The language processing module 93 is configured to determine a speech recognition result corresponding to the acoustic recognition result according to a pre-trained language processing model.
In some embodiments, the acoustic recognition model is trained based on the following modules:
The target training set acquisition module is configured to perform acquisition of a target training set, the target training set comprising a plurality of sub-training sets, the sub-training sets comprising at least sample audio of a corresponding dialect, a labeling text, and a dialect label.
And the first training module is configured to perform input of sample audio in the sub-training set into each encoder in the acoustic recognition model, and train the encoder corresponding to the dialect label according to the output result of the encoder corresponding to the dialect label and the labeling text.
A second training module configured to perform training of a decoder in the acoustic recognition model based on the output results of the respective encoders in the acoustic recognition model and the annotation text.
In some embodiments, the apparatus further comprises:
And the audio acquisition module to be processed is configured to acquire one or more pieces of audio to be processed.
And the dialect audio dictionary generating module is configured to execute the generation of the dialect audio dictionary according to the single-word audio in the audio to be processed and the labeling text corresponding to the single-word audio.
And the target training set generating module is configured to generate a target training set according to the dialect audio dictionary.
In some embodiments, the dialect audio dictionary generation module is specifically configured to:
responding to the fact that the pronunciation modes corresponding to the single-word audios with different dialect types and the same labeling text are the same, and generating a dialect audio dictionary corresponding to the dialect types with the same pronunciation modes according to the single-word audios with the same pronunciation modes and the corresponding labeling text.
In some embodiments, the apparatus further comprises:
A preprocessing module configured to perform preprocessing of the audio to be identified, the sample audio, or the audio to be processed in response to receiving the audio to be identified, the sample audio, or the audio to be processed, the preprocessing including at least echo cancellation, audio noise reduction, voice activity detection, and feature extraction.
In some embodiments, the acoustic recognition model further comprises a fully connected layer.
The acoustic recognition module 92 is specifically configured to:
inputting the audio to be identified into each encoder in the acoustic identification model to determine the output result of each encoder.
And inputting the output result of each encoder into the full-connection layer to determine the fusion characteristic corresponding to the audio to be identified.
And inputting the fusion characteristic into a decoder in the acoustic recognition model to determine an acoustic recognition result corresponding to the audio to be recognized.
In some embodiments, the language processing module 93 is specifically configured to:
and determining a target language processing model corresponding to the acoustic recognition result according to the dialect type in the acoustic recognition result.
And determining a voice recognition result corresponding to the acoustic recognition result according to the target language processing model.
In some embodiments, the language processing model is a multi-byte fragment n-gram language processing model.
According to the embodiment of the application, after the audio to be identified is acquired, the audio to be identified can be input into each encoder in the acoustic identification model, and the decoder outputs the acoustic identification result. Because each encoder trains based on different dialect training sample sets, when each encoder processes the same audio to be recognized, the output results of the encoders with the same dialect type can occupy a proportion obviously larger than that of other encoders, so that the acoustic recognition results correspond to the audio to be recognized, namely, the dialect type and the audio text of the audio to be recognized can be accurately recognized. Furthermore, the embodiment of the application can determine the accurate voice recognition result according to the accurate dialect type and the audio text. In the process, the embodiment of the application can realize the recognition of various dialects without independently setting a classification algorithm, so that the embodiment of the application can improve the accuracy of voice recognition and the response speed of voice recognition.
Fig. 10 is a schematic diagram of an electronic device according to an embodiment of the application. As shown in fig. 10, the electronic device has a general computer hardware structure that includes at least a processor 101 and a memory 102. The processor 101 and the memory 102 are connected by a bus 103. The memory 102 is adapted to store instructions or programs executable by the processor 101. The processor 101 may be a single microprocessor or a collection of one or more microprocessors. Thus, by executing the instructions stored in the memory 102, the processor 101 processes data and controls other devices so as to perform the method flows of the embodiments of the application described above. The bus 103 connects the above components together and connects them to a display controller 104, a display device, and input/output (I/O) devices 105. The input/output (I/O) devices 105 may be a mouse, keyboard, modem, network interface, touch input device, somatosensory input device, printer, or other devices known in the art. Typically, the input/output devices 105 are connected to the system through input/output (I/O) controllers 106.
It will be apparent to those skilled in the art that embodiments of the present application may be provided as a method, apparatus (device) or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may employ a computer program product embodied on one or more computer-readable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations of methods, apparatus (devices) and computer program products according to embodiments of the application. It will be understood that each of the flows in the flowchart may be implemented by computer program instructions.
These computer program instructions may be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows.
These computer program instructions may also be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows.
Another embodiment of the present application is directed to a non-volatile storage medium storing a computer readable program for causing a computer to perform some or all of the method embodiments described above.
That is, those skilled in the art will understand that all or part of the steps in the methods of the embodiments described above may be implemented by a program instructing relevant hardware, where the program is stored in a storage medium and includes several instructions for causing a device (which may be a single-chip microcomputer, a chip, or the like) or a processor to perform all or part of the steps of the methods of the embodiments of the application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
Another embodiment of the application relates to a computer program product comprising a computer program/instruction which, when executed by a processor, can implement some or all of the above-described method embodiments.
That is, those skilled in the art will appreciate that the embodiments of the application may be implemented by a processor executing a computer program product (a computer program or instructions) that instructs associated hardware, including the processor itself, to carry out all or part of the steps of the methods of the embodiments described above.
The above description is only of the preferred embodiments of the present application and is not intended to limit the present application, and various modifications and variations may be made to the present application by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the protection scope of the present application.
Claims (14)
1. A method of speech recognition, the method comprising:
acquiring audio to be recognized;
determining an acoustic recognition result corresponding to the audio to be recognized according to a pre-trained acoustic recognition model, wherein the acoustic recognition result at least comprises a dialect type and an audio text, the acoustic recognition model at least comprises a decoder and a plurality of encoders, and each encoder is trained based on different dialect training sample sets; and
determining a voice recognition result corresponding to the acoustic recognition result according to a pre-trained language processing model.
2. The method of claim 1, wherein the acoustic recognition model is trained based on the steps of:
obtaining a target training set, wherein the target training set comprises a plurality of sub-training sets, and each sub-training set at least comprises sample audio corresponding to a dialect, a labeling text, and a dialect label;
inputting the sample audio in the sub-training set into each encoder in the acoustic recognition model, and training the encoder corresponding to the dialect label according to the output result of the encoder corresponding to the dialect label and the labeling text; and
training a decoder in the acoustic recognition model according to the output result of each encoder in the acoustic recognition model and the labeling text.
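Illustrative sketch (not part of the claims) of the two-stage training recited in claim 2: each sample updates only the encoder matching its dialect label, while the decoder is trained on the outputs of all encoders. The toy linear encoders and decoder, the optimizer, and the loss function are assumptions made for brevity.

```python
import torch
import torch.nn as nn

VOCAB, FEAT, N_DIALECTS = 500, 80, 3   # placeholder sizes, not the disclosed ones

# Tiny stand-ins: one encoder per dialect plus a shared decoder.
encoders = nn.ModuleList(nn.Linear(FEAT, VOCAB) for _ in range(N_DIALECTS))
decoder = nn.Linear(N_DIALECTS * VOCAB, VOCAB)
opt = torch.optim.SGD(list(encoders.parameters()) + list(decoder.parameters()), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

def train_step(sample_audio, label_token, dialect_label):
    """sample_audio: (batch, FEAT); label_token: (batch,); dialect_label: int."""
    opt.zero_grad()
    # Stage 1: only the encoder matching the dialect label is scored
    # against the labeling text.
    enc_loss = loss_fn(encoders[dialect_label](sample_audio), label_token)
    # Stage 2: the decoder is trained on the outputs of *all* encoders;
    # detach() keeps this loss from updating the other encoders.
    all_out = torch.cat([e(sample_audio).detach() for e in encoders], dim=-1)
    dec_loss = loss_fn(decoder(all_out), label_token)
    (enc_loss + dec_loss).backward()
    opt.step()

train_step(torch.randn(4, FEAT), torch.randint(0, VOCAB, (4,)), dialect_label=1)
```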
3. The method according to claim 2, wherein the method further comprises:
acquiring one or more sections of audio to be processed;
generating a dialect audio dictionary according to single-word audio in the audio to be processed and a labeling text corresponding to the single-word audio; and
generating a target training set according to the dialect audio dictionary.
4. The method of claim 3, wherein generating a dialect audio dictionary according to single-word audio in the audio to be processed and the labeling text corresponding to the single-word audio comprises:
in response to single-word audios that have different dialect types but the same labeling text sharing the same pronunciation, generating a dialect audio dictionary corresponding to the dialect types sharing that pronunciation, according to the single-word audios with the same pronunciation and the corresponding labeling text.
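Illustrative sketch (not part of the claims) of the dictionary building described in claims 3-4: single-word audio entries that share a labeling text and a pronunciation are merged across dialect types instead of being stored once per dialect. The segment tuples and pronunciation keys are hypothetical.

```python
from collections import defaultdict

# Hypothetical single-word audio segments: (dialect, labeling text, pronunciation key).
# A real system would derive the pronunciation key from the audio itself.
segments = [
    ("mandarin", "车", "che1"),
    ("sichuanese", "车", "che1"),   # same character, same pronunciation
    ("cantonese", "车", "ce1"),     # same character, different pronunciation
]

# Dialect audio dictionary: labeling text -> pronunciation -> dialects sharing it.
dialect_dict = defaultdict(lambda: defaultdict(set))
for dialect, text, pron in segments:
    dialect_dict[text][pron].add(dialect)

# Entries whose pronunciation is shared across dialect types are stored once
# for all of those dialects rather than duplicated per dialect.
for text, prons in dialect_dict.items():
    for pron, dialects in prons.items():
        print(text, pron, sorted(dialects))
```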
5. The method according to claim 3 or 4, characterized in that the method further comprises:
in response to receiving the audio to be recognized, the sample audio, or the audio to be processed, preprocessing the audio to be recognized, the sample audio, or the audio to be processed, the preprocessing including at least echo cancellation, audio noise reduction, voice activity detection, and feature extraction.
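Illustrative sketch (not part of the claims) of a preprocessing chain in the order recited in claim 5. The echo cancellation and noise reduction steps are crude stand-ins, and the energy-threshold voice activity detection and log-spectrum features are assumptions rather than the disclosed algorithms.

```python
import numpy as np

def preprocess(pcm: np.ndarray, sr: int = 16000) -> np.ndarray:
    """Return frame-level features for the voiced portion of a PCM signal."""
    x = pcm.astype(np.float32)
    x = x - np.mean(x)                              # stand-in for echo cancellation
    x = np.clip(x, -3 * np.std(x), 3 * np.std(x))   # stand-in for noise reduction

    # Voice activity detection: keep 25 ms frames (10 ms frame shift) above an energy floor.
    frame, shift = int(0.025 * sr), int(0.010 * sr)
    frames = [x[i:i + frame] for i in range(0, len(x) - frame, shift)]
    voiced = [f for f in frames if np.mean(f ** 2) > 1e-4]

    # Feature extraction: log magnitude spectrum per voiced frame
    # (a stand-in for filter-bank or MFCC features).
    return np.array([np.log(np.abs(np.fft.rfft(f)) + 1e-8) for f in voiced])

feats = preprocess(np.random.randn(16000))   # one second of dummy audio
```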
6. The method of claim 1, wherein the acoustic recognition model further comprises a fully connected layer;
the determining the acoustic recognition result corresponding to the audio to be recognized according to the pre-trained acoustic recognition model comprises the following steps:
inputting the audio to be recognized into each encoder in the acoustic recognition model to determine an output result of each encoder;
inputting the output result of each encoder into the fully connected layer to determine a fusion feature corresponding to the audio to be recognized; and
inputting the fusion feature into a decoder in the acoustic recognition model to determine the acoustic recognition result corresponding to the audio to be recognized.
7. The method of claim 1, wherein determining the voice recognition result corresponding to the acoustic recognition result according to the pre-trained language processing model comprises:
determining a target language processing model corresponding to the acoustic recognition result according to the dialect type in the acoustic recognition result; and
determining a voice recognition result corresponding to the acoustic recognition result according to the target language processing model.
8. The method of claim 1, wherein the language processing model is a multi-byte fragment n-gram language processing model.
9. A speech recognition system, the system comprising:
an audio input/output unit configured to receive or play an audio signal; and
a control unit configured to perform the method of any one of claims 1-8.
10. A vehicle in which the speech recognition system of claim 9 is provided.
11. A speech recognition device, the device comprising:
an audio acquisition module configured to acquire audio to be recognized;
an acoustic recognition module configured to determine an acoustic recognition result corresponding to the audio to be recognized according to a pre-trained acoustic recognition model, the acoustic recognition result including at least a dialect type and an audio text, the acoustic recognition model including at least a decoder and a plurality of encoders, each encoder being trained based on a different dialect training sample set; and
a language processing module configured to determine a voice recognition result corresponding to the acoustic recognition result according to a pre-trained language processing model.
12. An electronic device comprising a memory and a processor, wherein the memory is configured to store one or more computer program instructions, wherein the one or more computer program instructions are executed by the processor to implement the method of any of claims 1-8.
13. A computer readable storage medium, characterized in that the computer readable storage medium has stored therein a computer program which, when executed by a processor, implements the method of any of claims 1-8.
14. A computer program product comprising computer programs/instructions which, when executed by a processor, implement the method of any of claims 1-8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211509971.XA CN118155604A (en) | 2022-11-29 | 2022-11-29 | Speech recognition method, system, device, vehicle, electronic equipment and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN118155604A true CN118155604A (en) | 2024-06-07 |
Family
ID=91300419
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211509971.XA Pending CN118155604A (en) | 2022-11-29 | 2022-11-29 | Speech recognition method, system, device, vehicle, electronic equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN118155604A (en) |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||