CN112509565A - Voice recognition method and device, electronic equipment and readable storage medium - Google Patents
- Publication number: CN112509565A
- Application number: CN202011268976.9A
- Authority: CN (China)
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/26—Speech to text systems
Abstract
The application provides a voice recognition method, a voice recognition device, an electronic device and a readable storage medium, applied to the technical field of voice recognition. The method comprises the following steps: performing voice recognition on the voice to be recognized based on the voice recognition models in a target voice recognition model set to obtain a candidate text set; performing character error detection on the candidate texts in the candidate text set based on a pre-trained character error detector to obtain a character error detection result for each candidate text; and determining the target text of the voice to be recognized based on the character error detection results of the candidate texts. Because the voice to be recognized is recognized by a plurality of voice recognition models to obtain a plurality of candidate texts, and the final target text is determined from the character error detector's results on those candidate texts, no context detection is needed, and an accurate recognition result can be obtained even when the context is ambiguous.
Description
Technical Field
The present application relates to the field of speech recognition technologies, and in particular, to a speech recognition method, an apparatus, an electronic device, and a readable storage medium.
Background
Speech recognition typically consists of two parts: a speech (acoustic) model and a language model. The speech model converts the audio into candidate character sequences with corresponding probabilities; for example, for the pronunciation "chi fan" it might output (eat 0.99, this 0.01) for the first syllable and (rice 0.8, double 0.1) for the second, the numbers representing the probability that each character matches the pronunciation. The language model then selects among the speech model's outputs. The outputs above yield four possible combinations: "eat rice", "eat double", "this rice" and "this double". The language model scores the four candidate combinations and, combining grammatical probability with pronunciation probability, finally selects "eat rice" (eating a meal) as the output. If "chi fan" is followed by further pronunciations, as in "chi fan qian lai", the language model will instead choose "this" as the output with high probability, so that the final sentence is rendered as "this coming". It can be seen that the language model helps solve the problem of text selection in speech recognition; especially when ambiguous pronunciations are encountered, the language model must participate in the final decision.
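Illustratively, the interplay between the two models can be sketched as follows. The characters, acoustic probabilities and language-model scores are hypothetical values chosen to mirror the "chi fan" example above, not figures from the application:

```python
import itertools

# Toy acoustic-model output for the two syllables "chi fan":
# each position maps candidate characters to acoustic probabilities
# (the probabilities follow the example above).
am_output = [
    {"eat": 0.99, "this": 0.01},     # candidates for "chi"
    {"rice": 0.80, "double": 0.10},  # candidates for "fan"
]

# A hypothetical language model: a score for each combination.
lm_score = {
    ("eat", "rice"): 0.9,    # "eat rice" is a common phrase
    ("eat", "double"): 0.01,
    ("this", "rice"): 0.05,
    ("this", "double"): 0.01,
}

def best_sentence(am_output, lm_score):
    """Enumerate all character combinations and rank them by the
    product of acoustic and language-model probabilities."""
    candidates = []
    for combo in itertools.product(*[d.items() for d in am_output]):
        words = tuple(w for w, _ in combo)
        am_p = 1.0
        for _, p in combo:
            am_p *= p
        candidates.append((words, am_p * lm_score.get(words, 1e-6)))
    return max(candidates, key=lambda c: c[1])

print(best_sentence(am_output, lm_score))  # the combination ('eat', 'rice') scores highest
```

The language-model score dominates precisely when the acoustic probabilities are close, which is the situation in which ambiguous pronunciations arise.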
In order to understand the speaker's intention more accurately and achieve accurate transcription, the prior art determines the user's context before transcription; in a hospital environment, for example, a medicine-related language model is used so that professional terms can be translated more accurately. For instance, Google's patent CN104508739B uses a larger general language model for basic transcription and then adds a context-detection step on the transcribed text; if a medical environment is detected, a medical language model is used. However, this approach must solve a context classification problem, and it is difficult to guarantee transcription accuracy when the context is unclear or ambiguous.
Disclosure of Invention
The application provides a speech recognition method, a speech recognition device, an electronic device and a readable storage medium, which avoid context detection: a character error detector (WED) directly judges whether an output contains errors, and the sentence with the fewest possible grammatical errors, or the smallest error probability value, is selected as the final output.
The technical scheme adopted by the application is as follows:
in a first aspect, a speech recognition method is provided, which includes:
performing voice recognition on the voice to be recognized based on the voice recognition model in the target voice recognition model set to obtain a candidate text set;
performing character error detection on candidate texts in the candidate text set based on a pre-trained character error detector to obtain a character error detection result of each candidate text;
and determining a target text for performing voice recognition on the voice to be recognized based on the character error detection result of each candidate text.
Optionally, the target set of speech recognition models comprises speech recognition models in at least two different contexts.
Optionally, performing speech recognition on the speech to be recognized based on the speech recognition model in the target speech recognition model set to obtain a candidate text set, including:
and carrying out voice recognition on the voice to be recognized based on any voice recognition model in the target voice recognition model set to obtain at least one candidate text.
Optionally, the character error detection of the candidate texts in the candidate text set based on the pre-trained character error detector includes:
inputting the candidate text into a pre-trained Transformer network to obtain a state vector of the candidate text;
and taking the state vector of the candidate text as the input of a recurrent neural network, passing it through a fully connected network, and inputting the result to a classifier to obtain the correct probability value of each character.
Optionally, determining a target text for performing speech recognition on a speech to be recognized based on the character error detection result of each candidate text includes:
determining the character error rate of each candidate text;
and taking the candidate text with the lowest character error rate as the target text for performing voice recognition on the voice to be recognized.
Optionally, the method further comprises:
acquiring user information of a voice to be recognized;
at least two speech recognition models are determined from the plurality of candidate speech recognition models based on the user information to obtain a target speech recognition model set.
Optionally, the label construction of the pre-trained character error detector training samples comprises:
acquiring a training text sample, and replacing characters in the training text sample with a certain probability;
the label corresponding to the position where the character is not changed is set to 1, and the label of the position where the character is changed is set to 0.
In a second aspect, a speech recognition apparatus is provided, including:
the voice recognition module is used for carrying out voice recognition on the voice to be recognized based on the voice recognition model in the target voice recognition model set to obtain a candidate text set;
the detection module is used for carrying out character error detection on the candidate texts in the candidate text set based on the pre-trained character error detector to obtain a character error detection result of each candidate text;
and the determining module is used for determining a target text for performing voice recognition on the voice to be recognized based on the character error detection result of each candidate text.
Optionally, the target set of speech recognition models comprises speech recognition models in at least two different contexts.
Optionally, the speech recognition module comprises:
and the voice recognition unit is used for carrying out voice recognition on the voice to be recognized based on any voice recognition model in the target voice recognition model set to obtain at least one candidate text.
Optionally, the detection module comprises:
the input unit is used for inputting the candidate text into a pre-trained Transformer network to obtain a state vector of the candidate text;
and the classification unit is used for taking the state vector of the candidate text as the input of a recurrent neural network, passing it through a fully connected network, and inputting the result to a classifier to obtain the correct probability value of each character.
Optionally, the determining module includes:
a determining unit, configured to determine a character error rate of each candidate text;
and the selection unit is used for taking the candidate text with the lowest character error rate as the target text for performing voice recognition on the voice to be recognized.
Optionally, the apparatus further comprises:
the first acquisition module is used for acquiring user information of the voice to be recognized;
and the screening and determining module is used for screening and determining at least two speech recognition models from the candidate speech recognition models based on the user information to obtain a target speech recognition model set.
Optionally, the apparatus further comprises:
the second acquisition module is used for acquiring the training text sample and replacing characters in the training text sample with a certain probability;
and the setting module is used for setting the label corresponding to the position where the character is not changed to be 1 and setting the label corresponding to the position where the character is changed to be 0.
In a third aspect, an electronic device is provided, which includes:
one or more processors;
a memory;
one or more application programs, wherein the one or more application programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs being configured to perform the speech recognition method of the first aspect.
In a fourth aspect, a computer-readable storage medium is provided for storing computer instructions which, when run on a computer, cause the computer to perform the speech recognition method shown in the first aspect.
Compared with the prior art, in which a language model is first determined through context detection and the voice recognition result is then determined by that language model, the application performs voice recognition on the voice to be recognized through the voice recognition models in a target voice recognition model set to obtain a candidate text set, performs character error detection on the candidate texts in the candidate text set based on a pre-trained character error detector to obtain a character error detection result for each candidate text, and then determines the target text of the voice to be recognized based on those results. Because the voice to be recognized is recognized by a plurality of voice recognition models to obtain a plurality of candidate texts, and the final target text is determined from the character error detector's results on those candidate texts, no context detection is needed, and an accurate recognition result can be obtained even when the context is ambiguous.
Additional aspects and advantages of the present application will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the present application.
Drawings
The foregoing and/or additional aspects and advantages of the present application will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
fig. 1 is a schematic flowchart of a speech recognition method according to an embodiment of the present application;
fig. 2 is a schematic structural diagram of a speech recognition apparatus according to an embodiment of the present application;
FIG. 3 is a diagram illustrating an exemplary speech recognition process according to an embodiment of the present application;
fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present application;
FIG. 5 is a diagram illustrating an exemplary character detection of a character error detector implemented in the present application.
Detailed Description
Reference will now be made in detail to the embodiments of the present application, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are exemplary only for the purpose of explaining the present application and are not to be construed as limiting the present application.
As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. As used herein, the term "and/or" includes all or any element and all combinations of one or more of the associated listed items.
To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
The following describes the technical solutions of the present application and how to solve the above technical problems with specific embodiments. The following several specific embodiments may be combined with each other, and details of the same or similar concepts or processes may not be repeated in some embodiments. Embodiments of the present application will be described below with reference to the accompanying drawings.
Example one
An embodiment of the present application provides a speech recognition method, as shown in fig. 1, the method may include the following steps:
step S101, performing voice recognition on a voice to be recognized based on a voice recognition model in a target voice recognition model set to obtain a candidate text set; wherein the target speech recognition model set comprises speech recognition models in at least two different contexts, such as speech recognition models in medical aspect, speech recognition models in financial field, and the like. One speech recognition model recognizes the speech to be recognized, so that a plurality of candidate texts can be obtained, and one candidate text can also be obtained (the candidate text with the highest probability is selected from the possible candidate texts).
Step S102, character error detection is carried out on candidate texts in a candidate text set based on a pre-trained character error detector, and character error detection results of all the candidate texts are obtained;
the pre-trained character error detector is used for performing character error detection on the candidate texts in the candidate text set and determining whether the characters in the candidate texts are correct.
Step S103, determining a target text for performing voice recognition on the voice to be recognized based on the character error detection result of each candidate text.
Specifically, the character error rate or the character correct rate of each candidate text may be counted, and the candidate text with the lowest error rate or the candidate text with the highest correct rate is determined as the target text for performing the speech recognition on the speech to be recognized.
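Illustratively, the selection in step S103 can be sketched as follows. The candidate sentences and per-character correctness probabilities are hypothetical, and the 0.5 decision threshold is an assumption rather than a value given by the application:

```python
def pick_target_text(candidates):
    """Select the candidate transcript whose per-character correctness
    probabilities imply the lowest character error rate.

    `candidates` maps each candidate sentence to the list of
    per-character "correct" probabilities produced by the detector.
    """
    def error_rate(probs):
        # A character counts as an error when the detector's
        # correctness probability falls below 0.5 (assumed threshold).
        errors = sum(1 for p in probs if p < 0.5)
        return errors / len(probs)

    return min(candidates, key=lambda text: error_rate(candidates[text]))

# Hypothetical detector outputs for two candidate transcripts.
detector_out = {
    "buy some medicine": [0.98, 0.95, 0.97],  # no likely errors
    "by sum medicine":   [0.30, 0.20, 0.97],  # two likely errors
}
print(pick_target_text(detector_out))
```

Selecting by highest character correct rate, as the paragraph above also permits, is equivalent here since the two rates sum to one.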
Fig. 3 shows an exemplary flow chart of speech recognition in the embodiment of the present application: the speech to be recognized passes through an acoustic model (AM), which outputs different candidate Chinese characters and their probability values; different LM models convert the AM output into a plurality of different sentences; and the detection model (WED) examines the sentences and finds the one with the fewest erroneous words or the smallest error probability.
Compared with the prior art, in which a language model is first determined through context detection and the voice recognition result is then determined by that language model, the embodiment performs voice recognition on the voice to be recognized based on the voice recognition models in the target voice recognition model set to obtain a candidate text set, performs character error detection on the candidate texts in the candidate text set based on a pre-trained character error detector to obtain a character error detection result for each candidate text, and then determines the target text of the voice to be recognized based on those results. Because the voice to be recognized is recognized by a plurality of voice recognition models to obtain a plurality of candidate texts, and the final target text is determined from the character error detector's results on those candidate texts, no context detection is needed, and an accurate recognition result can be obtained even when the context is ambiguous.
The embodiment of the present application provides a possible implementation manner, specifically, performing speech recognition on a to-be-recognized speech based on a speech recognition model in a target speech recognition model set to obtain a candidate text set, including:
and carrying out voice recognition on the voice to be recognized based on any voice recognition model in the target voice recognition model set to obtain at least one candidate text.
Specifically, any speech recognition model may perform speech recognition on the speech to be recognized to obtain a plurality of candidate texts together with a probability value for each; in this case, the candidate text set is the union of the candidate texts obtained by the respective speech recognition models. Alternatively, only the candidate text with the highest probability value may be taken as a model's recognition result, in which case each speech recognition model contributes one candidate text.
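Illustratively, the two ways of assembling the candidate text set described above can be sketched as follows; the model names, sentences and probability values are hypothetical:

```python
def build_candidate_set(model_outputs, top1_only=False):
    """Merge per-model recognition results into one candidate set.

    `model_outputs` maps a model name to a list of (text, probability)
    hypotheses. With `top1_only`, each model contributes only its
    highest-probability hypothesis; otherwise all hypotheses are kept.
    """
    candidates = []
    for hyps in model_outputs.values():
        if top1_only:
            candidates.append(max(hyps, key=lambda h: h[1])[0])
        else:
            candidates.extend(text for text, _ in hyps)
    return candidates

outputs = {
    "medical_lm": [("take the medicine", 0.7), ("make the medicine", 0.3)],
    "finance_lm": [("take the money", 0.6), ("make the money", 0.4)],
}
print(build_candidate_set(outputs))                   # all four hypotheses
print(build_candidate_set(outputs, top1_only=True))   # one per model
```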
The embodiment of the present application provides a possible implementation manner, and specifically, performing character error detection on candidate texts in a candidate text set based on a pre-trained character error detector includes:
inputting the candidate text into a pre-trained Transformer network to obtain a state vector of the candidate text;
and taking the state vector of the candidate text as the input of a recurrent neural network, passing it through a fully connected network, and inputting the result to a classifier to obtain the correct probability value of each character.
Illustratively, fig. 5 shows an exemplary diagram of the character detection flow of the character error detector. Character error detection employs a pre-trained Transformer in cooperation with an RNN model to implement character-by-character detection. The candidate text is input into the pre-trained Transformer (e.g., a BERT model) to obtain the corresponding state vectors, where each Chinese character corresponds to one state vector, usually of 768 dimensions. These vectors are then used as the input to an RNN (which may be an LSTM or a GRU), and finally the probability that each character is correct is obtained by passing through a fully connected layer (FC) and applying softmax.
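Illustratively, the detector head described above can be sketched in miniature as follows. The weights are random stand-ins rather than trained parameters, a plain tanh RNN cell stands in for the LSTM/GRU, and the 768-dimensional input vectors stand in for the output of a pre-trained BERT encoder:

```python
import numpy as np

rng = np.random.default_rng(0)
HIDDEN, ENC_DIM = 64, 768  # 768 matches BERT-base state vectors

# Stand-ins for trained parameters (random here, for illustration only).
W_xh = rng.normal(0, 0.02, (ENC_DIM, HIDDEN))
W_hh = rng.normal(0, 0.02, (HIDDEN, HIDDEN))
W_fc = rng.normal(0, 0.02, (HIDDEN, 2))  # two classes: wrong / correct

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def detect_errors(state_vectors):
    """Run the detector head: a simple tanh RNN over the encoder's
    per-character state vectors, then a fully connected layer and
    softmax giving P(correct) for every character."""
    h = np.zeros(HIDDEN)
    probs = []
    for v in state_vectors:  # one 768-dim vector per character
        h = np.tanh(v @ W_xh + h @ W_hh)
        probs.append(softmax(h @ W_fc)[1])  # index 1 = "correct"
    return probs

# Pretend the Transformer encoded a five-character candidate sentence.
encoded = rng.normal(size=(5, ENC_DIM))
print(detect_errors(encoded))  # five probabilities, each strictly in (0, 1)
```

The recurrence lets each character's verdict depend on the characters before it, which is what distinguishes this design from classifying each state vector independently.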
The embodiment of the present application provides a possible implementation manner, and specifically, determining a target text for performing speech recognition on a speech to be recognized based on a character error detection result of each candidate text includes:
determining the character error rate of each candidate text;
and taking the candidate text with the lowest character error rate as the target text for performing voice recognition on the voice to be recognized.
The embodiment of the present application provides a possible implementation manner, and further, the method further includes:
acquiring user information of a voice to be recognized;
at least two speech recognition models are determined from the plurality of candidate speech recognition models based on the user information to obtain a target speech recognition model set.
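Illustratively, the screening of candidate speech recognition models by user information can be sketched as follows; the context tags, model names and the fallback rule are hypothetical:

```python
# Hypothetical registry mapping context tags to language models.
MODEL_REGISTRY = {
    "medical": "medical_lm",
    "finance": "finance_lm",
    "general": "general_lm",
    "legal":   "legal_lm",
}

def select_models(user_info, registry=MODEL_REGISTRY, minimum=2):
    """Pick the candidate models matching the user's context tags,
    then pad with further models so at least `minimum` are returned,
    giving a target speech recognition model set of two or more."""
    chosen = [m for tag, m in registry.items()
              if tag in user_info.get("contexts", [])]
    for model in registry.values():
        if len(chosen) >= minimum:
            break
        if model not in chosen:
            chosen.append(model)
    return chosen

user = {"id": 42, "contexts": ["medical"]}
print(select_models(user))  # the medical model plus at least one fallback
```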
The embodiment of the present application provides a possible implementation manner, and specifically, the label construction of the training sample of the pre-trained character error detector includes:
acquiring a training text sample, and replacing characters in the training text sample with a certain probability;
the label corresponding to the position where the character is not changed is set to 1, and the label of the position where the character is changed is set to 0.
Specifically, training of the character error detector is achieved through data augmentation, i.e., the input sentences and their labels (0/1 sequences) are constructed by a program. The input sentences can be obtained by collecting a large number of web texts; for each sentence, the Chinese characters or phrases in it are replaced with a certain probability. For each unchanged position the corresponding output value is 1, and for each changed position it is 0, so that a large number of training corpora and their labels are constructed.
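Illustratively, the label construction described above can be sketched as follows; the replacement probability of 0.15 and the Latin-letter vocabulary are illustrative assumptions (the application works on Chinese characters):

```python
import random

def make_training_pair(sentence, replace_prob=0.15,
                       vocab="abcdefghijklmnopqrstuvwxyz"):
    """Corrupt a sentence by replacing each character with probability
    `replace_prob`; label 1 marks an unchanged position, 0 a changed one."""
    corrupted, labels = [], []
    for ch in sentence:
        if random.random() < replace_prob:
            # Pick a replacement guaranteed to differ from the original.
            repl = random.choice([c for c in vocab if c != ch])
            corrupted.append(repl)
            labels.append(0)
        else:
            corrupted.append(ch)
            labels.append(1)
    return "".join(corrupted), labels

random.seed(7)
text, labels = make_training_pair("speechsample")
print(text, labels)
```

Because corruption is programmatic, arbitrarily large labeled corpora can be generated from unlabeled text, which is the point of the data-augmentation scheme.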
Illustratively, consider an example that does not use the technical solution of the present application: according to the prior art, the same speech may be output as different sentences by two different language models, as shown in Table 1:
TABLE 1
Taking the first set of sentences as an example, it is difficult for an algorithm to determine which of sentences A and B the user really intends. The conventional method first obtains a basic sentence through a general-purpose base language model, then determines from that sentence the context in which the user is speaking (here, it would have to determine that the context is related to a pharmacy), and finally combines the pre-trained language model for that context with the output of the speech model (AM) to obtain the transcription result. The embodiment of the application adopts a different idea: it traverses all the language models and selects the most suitable sentence among their outputs as the final transcription result, thereby solving the ambiguity problem in speech recognition, namely that a mismatched language model causes vocabulary in the sentence to be misused or to violate usage habits.
Example two
Fig. 2 shows a speech recognition apparatus according to an embodiment of the present application. The apparatus 20 includes: a speech recognition module 201, a detection module 202 and a determination module 203, wherein,
the voice recognition module 201 is configured to perform voice recognition on a to-be-recognized voice based on a voice recognition model in the target voice recognition model set to obtain a candidate text set;
the detection module 202 is configured to perform character error detection on candidate texts in the candidate text set based on a pre-trained character error detector to obtain a character error detection result of each candidate text;
and the determining module 203 is configured to determine a target text for performing speech recognition on the speech to be recognized based on the character error detection result of each candidate text.
In particular, the target set of speech recognition models comprises speech recognition models in at least two different contexts.
Specifically, the speech recognition module includes:
and the voice recognition unit is used for carrying out voice recognition on the voice to be recognized based on any voice recognition model in the target voice recognition model set to obtain at least one candidate text.
Specifically, the detection module includes:
the input unit is used for inputting the candidate text into a pre-trained Transformer network to obtain a state vector of the candidate text;
and the classification unit is used for taking the state vector of the candidate text as the input of a recurrent neural network, passing it through a fully connected network, and inputting the result to a classifier to obtain the correct probability value of each character.
Specifically, the determining module includes:
a determining unit, configured to determine a character error rate of each candidate text;
and the selection unit is used for taking the candidate text with the lowest character error rate as the target text for performing voice recognition on the voice to be recognized.
Specifically, the apparatus further comprises:
the first acquisition module is used for acquiring user information of the voice to be recognized;
and the screening and determining module is used for screening and determining at least two speech recognition models from the candidate speech recognition models based on the user information to obtain a target speech recognition model set.
Specifically, the apparatus further comprises:
the second acquisition module is used for acquiring the training text sample and replacing characters in the training text sample with a certain probability;
and the setting module is used for setting the label corresponding to the position where the character is not changed to be 1 and setting the label corresponding to the position where the character is changed to be 0.
Compared with the prior art, in which a language model is first determined through context detection and the voice recognition result is then determined by that language model, the voice recognition device performs voice recognition on the voice to be recognized based on the voice recognition models in a target voice recognition model set to obtain a candidate text set, performs character error detection on the candidate texts in the candidate text set based on a pre-trained character error detector to obtain a character error detection result for each candidate text, and then determines the target text of the voice to be recognized based on those results. Because the voice to be recognized is recognized by a plurality of voice recognition models to obtain a plurality of candidate texts, and the final target text is determined from the character error detector's results on those candidate texts, no context detection is needed, and an accurate recognition result can be obtained even when the context is ambiguous.
The apparatus of the embodiment of the present application can execute the method shown in the first embodiment of the present application, and the implementation effect is similar, which is not described herein again.
EXAMPLE III
An embodiment of the present application provides an electronic device, as shown in fig. 4, an electronic device 40 shown in fig. 4 includes: a processor 401 and a memory 403. Wherein the processor 401 is coupled to the memory 403, such as via a bus 402. Further, the electronic device 40 may also include a transceiver 404. It should be noted that the transceiver 404 is not limited to one in practical applications, and the structure of the electronic device 40 is not limited to the embodiment of the present application. The processor 401 is applied in the embodiment of the present application, and is used to implement the functions of the modules shown in fig. 2. The transceiver 404 includes a receiver and a transmitter.
The processor 401 may be a CPU, general purpose processor, DSP, ASIC, FPGA or other programmable logic device, transistor logic device, hardware component, or any combination thereof. Which may implement or perform the various illustrative logical blocks, modules, and circuits described in connection with the disclosure. The processor 401 may also be a combination of computing functions, e.g., comprising one or more microprocessors, a combination of a DSP and a microprocessor, or the like.
The memory 403 may be, but is not limited to, a ROM or other static storage device that can store static information and instructions, a RAM or other dynamic storage device that can store information and instructions, an EEPROM, a CD-ROM or other optical disc storage (including compact discs, laser discs, digital versatile discs, Blu-ray discs, etc.), magnetic disk storage media or other magnetic storage devices, or any other medium that can carry or store desired program code in the form of instructions or data structures and can be accessed by a computer.
The memory 403 is used for storing application program codes for executing the scheme of the application, and the execution is controlled by the processor 401. The processor 401 is configured to execute application program code stored in the memory 403 to implement the functions of the apparatus provided by the embodiment shown in fig. 2.
The embodiment of the present application provides an electronic device suitable for the above method embodiment, and specific implementation manners and technical effects are not described herein again.
Example four
The embodiment of the present application provides a computer-readable storage medium, on which a computer program is stored, and the program, when executed by a processor, implements the speech recognition method shown in the above embodiment.
The embodiment of the present application provides a computer-readable storage medium suitable for the above method embodiment, and specific implementation manners and technical effects are not described herein again.
It should be understood that, although the steps in the flowcharts of the figures are shown in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated herein, the order of execution is not strictly limited, and the steps may be performed in other orders. Moreover, at least some of the steps in the flowcharts may include multiple sub-steps or stages, which are not necessarily completed at the same moment but may be executed at different times, and their order of execution is not necessarily sequential: they may be performed in turn or alternately with other steps, or with at least some of the sub-steps or stages of other steps.
The foregoing describes only some embodiments of the present application. It should be noted that those skilled in the art can make various improvements and refinements without departing from the principle of the present application, and such improvements and refinements shall also fall within the protection scope of the present application.
Claims (10)
1. A speech recognition method, comprising:
performing voice recognition on the voice to be recognized based on the voice recognition model in the target voice recognition model set to obtain a candidate text set;
performing character error detection on the candidate texts in the candidate text set based on a pre-trained character error detector to obtain a character error detection result of each candidate text;
and determining a target text for performing voice recognition on the voice to be recognized based on the character error detection result of each candidate text.
2. The method of claim 1, wherein the set of target speech recognition models comprises speech recognition models in at least two different contexts.
3. The method of claim 1, wherein performing speech recognition on the speech to be recognized based on the speech recognition model in the target speech recognition model set to obtain a candidate text set comprises:
and carrying out voice recognition on the voice to be recognized based on any voice recognition model in the target voice recognition model set to obtain at least one candidate text.
4. The method of claim 1, wherein the pre-training based character error detector performs character error detection on candidate texts in the candidate text set, comprising:
inputting the candidate text into a pre-trained Transformer network to obtain a state vector of the candidate text;
and taking the state vector of the candidate text as the input of a recurrent neural network, passing the result through a fully-connected network, and inputting it to a classifier to obtain the probability that each character is correct.
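The detector architecture of claim 4 (Transformer state vectors, then a recurrent layer, a fully-connected layer, and a classifier producing a per-character correct probability) can be illustrated with a toy forward pass. All weights below are random stand-ins, not trained parameters, and the simple tanh recurrent cell and sigmoid classifier are assumed choices; the patent only names the layer types.

```python
import math

def matvec(W, x):
    # Plain matrix-vector product over Python lists.
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def char_error_detector(state_vectors, Wx, Wh, w_fc):
    """Toy forward pass mirroring claim 4.

    state_vectors: one (pre-trained) Transformer state per character.
    Wx, Wh: weights of a simple recurrent cell (stand-in for the RNN).
    w_fc: weights of the fully-connected layer feeding the classifier.
    Returns one correct-probability per character.
    """
    h = [0.0] * len(Wh)          # recurrent hidden state
    probs = []
    for x in state_vectors:
        # Recurrent cell over the Transformer state vectors.
        pre = [a + b for a, b in zip(matvec(Wx, x), matvec(Wh, h))]
        h = [math.tanh(v) for v in pre]
        # Fully-connected layer, then a sigmoid classifier.
        logit = sum(w * v for w, v in zip(w_fc, h))
        probs.append(1.0 / (1.0 + math.exp(-logit)))
    return probs
```

A production detector would use trained Transformer and RNN modules; this sketch only makes the data flow of the claim concrete.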
5. The method of claim 1, wherein determining the target text for performing speech recognition on the speech to be recognized based on the character error detection result of each candidate text comprises:
determining the character error rate of each candidate text;
and taking the candidate text with the lowest character error rate as the target text of the voice recognition of the voice to be recognized.
6. The method of claim 1, further comprising:
acquiring user information of the voice to be recognized;
determining at least two speech recognition models from a plurality of candidate speech recognition models based on the user information to obtain a target speech recognition model set.
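Claim 6 leaves the selection rule open. A hypothetical rule (entirely an assumption, including the tag-based profiles) is to rank candidate models by how well they match the user's profile and keep at least two:

```python
def build_target_model_set(user_info, candidate_models, k=2):
    """Hypothetical selection of the target model set from user info.

    candidate_models: dicts with a "tags" list describing each model's
    context (e.g. domain, accent) -- an illustrative representation.
    Keeps the k best-matching models, with k >= 2 as claim 6 requires.
    """
    ranked = sorted(
        candidate_models,
        key=lambda m: len(set(m["tags"]) & set(user_info["tags"])),
        reverse=True,  # best match first
    )
    return ranked[:max(k, 2)]
```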
7. The method of any of claims 1-6, wherein label building of the pre-trained character error detector training samples comprises:
acquiring a training text sample, and replacing characters in the training text sample with a preset probability;
the label corresponding to the position where the character is not changed is set to 1, and the label of the position where the character is changed is set to 0.
8. A speech recognition apparatus, comprising:
the voice recognition module is used for carrying out voice recognition on the voice to be recognized based on the voice recognition model in the target voice recognition model set to obtain a candidate text set;
the detection module is used for carrying out character error detection on the candidate texts in the candidate text set based on a pre-trained character error detector to obtain a character error detection result of each candidate text;
and the determining module is used for determining a target text for performing voice recognition on the voice to be recognized based on the character error detection result of each candidate text.
9. An electronic device, comprising:
one or more processors;
a memory;
one or more applications, wherein the one or more applications are stored in the memory and configured to be executed by the one or more processors to perform the speech recognition method according to any of claims 1 to 7.
10. A computer-readable storage medium for storing computer instructions which, when executed on a computer, cause the computer to perform the speech recognition method of any of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011268976.9A CN112509565A (en) | 2020-11-13 | 2020-11-13 | Voice recognition method and device, electronic equipment and readable storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112509565A (en) | 2021-03-16
Family
ID=74957522
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011268976.9A Pending CN112509565A (en) | 2020-11-13 | 2020-11-13 | Voice recognition method and device, electronic equipment and readable storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112509565A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114898733A (en) * | 2022-05-06 | 2022-08-12 | 深圳妙月科技有限公司 | AI voice data analysis processing method and system |
Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2018120889A1 (en) * | 2016-12-28 | 2018-07-05 | 平安科技(深圳)有限公司 | Input sentence error correction method and device, electronic device, and medium |
CN108874174A (en) * | 2018-05-29 | 2018-11-23 | 腾讯科技(深圳)有限公司 | A kind of text error correction method, device and relevant device |
CN108984745A (en) * | 2018-07-16 | 2018-12-11 | 福州大学 | A kind of neural network file classification method merging more knowledge mappings |
CN110148416A (en) * | 2019-04-23 | 2019-08-20 | 腾讯科技(深圳)有限公司 | Audio recognition method, device, equipment and storage medium |
CN110457688A (en) * | 2019-07-23 | 2019-11-15 | 广州视源电子科技股份有限公司 | Correction processing method and device, storage medium and processor |
CN110491383A (en) * | 2019-09-25 | 2019-11-22 | 北京声智科技有限公司 | A kind of voice interactive method, device, system, storage medium and processor |
CN110765996A (en) * | 2019-10-21 | 2020-02-07 | 北京百度网讯科技有限公司 | Text information processing method and device |
CN111723791A (en) * | 2020-06-11 | 2020-09-29 | 腾讯科技(深圳)有限公司 | Character error correction method, device, equipment and storage medium |
CN111883110A (en) * | 2020-07-30 | 2020-11-03 | 上海携旅信息技术有限公司 | Acoustic model training method, system, device and medium for speech recognition |
CN111918136A (en) * | 2020-07-04 | 2020-11-10 | 中信银行股份有限公司 | Interest analysis method and device, storage medium and electronic equipment |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107729313B (en) | Deep neural network-based polyphone pronunciation distinguishing method and device | |
US7421387B2 (en) | Dynamic N-best algorithm to reduce recognition errors | |
Ferrer et al. | Classification of lexical stress using spectral and prosodic features for computer-assisted language learning systems | |
WO2020168752A1 (en) | Speech recognition and speech synthesis method and apparatus based on dual learning | |
CN112489626B (en) | Information identification method, device and storage medium | |
CN111145718A (en) | Chinese mandarin character-voice conversion method based on self-attention mechanism | |
JP2019159654A (en) | Time-series information learning system, method, and neural network model | |
CN117935785A (en) | Phoneme-based contextualization for cross-language speech recognition in an end-to-end model | |
CN111310441A (en) | Text correction method, device, terminal and medium based on BERT (binary offset transcription) voice recognition | |
CN110335608B (en) | Voiceprint verification method, voiceprint verification device, voiceprint verification equipment and storage medium | |
CN112818089B (en) | Text phonetic notation method, electronic equipment and storage medium | |
WO2023030105A1 (en) | Natural language processing model training method and natural language processing method, and electronic device | |
CN112669845B (en) | Speech recognition result correction method and device, electronic equipment and storage medium | |
US20230104228A1 (en) | Joint Unsupervised and Supervised Training for Multilingual ASR | |
WO2023071581A1 (en) | Method and apparatus for determining response sentence, device, and medium | |
US20050187767A1 (en) | Dynamic N-best algorithm to reduce speech recognition errors | |
US20050197838A1 (en) | Method for text-to-pronunciation conversion capable of increasing the accuracy by re-scoring graphemes likely to be tagged erroneously | |
EP4295358A1 (en) | Lookup-table recurrent language model | |
CN114386399A (en) | Text error correction method and device | |
CN112509565A (en) | Voice recognition method and device, electronic equipment and readable storage medium | |
CN112632956A (en) | Text matching method, device, terminal and storage medium | |
Rajendran et al. | A robust syllable centric pronunciation model for Tamil text to speech synthesizer | |
CN116187304A (en) | Automatic text error correction algorithm and system based on improved BERT | |
CN111340117A (en) | CTC model training method, data processing method, device and storage medium | |
CN114707518B (en) | Semantic fragment-oriented target emotion analysis method, device, equipment and medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
Application publication date: 20210316 |