CN112509565A - Voice recognition method and device, electronic equipment and readable storage medium - Google Patents
- Publication number: CN112509565A
- Application number: CN202011268976.9A
- Authority: CN (China)
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/26—Speech to text systems
Abstract
The application provides a voice recognition method, a voice recognition device, an electronic device and a readable storage medium, applied to the technical field of voice recognition. The method comprises the following steps: performing voice recognition on the voice to be recognized based on the voice recognition models in a target voice recognition model set to obtain a candidate text set; performing character error detection on the candidate texts in the candidate text set based on a pre-trained character error detector to obtain a character error detection result for each candidate text; and determining the target text of the voice to be recognized based on the character error detection results of the candidate texts. Because the voice to be recognized is recognized by a plurality of voice recognition models to obtain a plurality of candidate texts, and the final target text is determined from the character error detector's results on those candidate texts, no context detection is needed, and an accurate recognition result can be obtained even when the context is ambiguous.
Description
Technical Field
The present application relates to the field of speech recognition technologies, and in particular, to a speech recognition method, an apparatus, an electronic device, and a readable storage medium.
Background
Speech recognition typically consists of two parts: a speech (acoustic) model and a language model. The speech model converts the audio into candidate character sequences with corresponding probabilities; for example, for the pronunciation "chi fan" it might output (eat 0.99, this 0.01) for the first syllable and (rice 0.8, double 0.1) for the second, the numbers representing the probability that each character matches the pronunciation. The language model then selects among the speech model's outputs. The outputs above yield four possible combinations: "eat rice", "eat double", "this rice" and "this double". The language model scores the four candidate combinations and, combining grammatical probability with pronunciation probability, finally selects "eat rice" (eating a meal) as the output. If "chi fan" is followed by further pronunciations, as in "chi fan qian lai", the language model will instead choose "this" as the output with high probability, so that the final sentence is rendered as "this coming". It can be seen that the language model helps solve the problem of text selection in speech recognition; especially when ambiguous pronunciations are encountered, the language model must participate in the final decision.
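Illustratively, the interplay between the two models can be sketched as follows. The characters, acoustic probabilities and language-model scores are hypothetical values chosen to mirror the "chi fan" example above, not figures from the application:

```python
import itertools

# Toy acoustic-model output for the two syllables "chi fan":
# each position maps candidate characters to acoustic probabilities
# (the probabilities follow the example above).
am_output = [
    {"eat": 0.99, "this": 0.01},     # candidates for "chi"
    {"rice": 0.80, "double": 0.10},  # candidates for "fan"
]

# A hypothetical language model: a score for each combination.
lm_score = {
    ("eat", "rice"): 0.9,    # "eat rice" is a common phrase
    ("eat", "double"): 0.01,
    ("this", "rice"): 0.05,
    ("this", "double"): 0.01,
}

def best_sentence(am_output, lm_score):
    """Enumerate all character combinations and rank them by the
    product of acoustic and language-model probabilities."""
    candidates = []
    for combo in itertools.product(*[d.items() for d in am_output]):
        words = tuple(w for w, _ in combo)
        am_p = 1.0
        for _, p in combo:
            am_p *= p
        candidates.append((words, am_p * lm_score.get(words, 1e-6)))
    return max(candidates, key=lambda c: c[1])

print(best_sentence(am_output, lm_score))  # the combination ('eat', 'rice') scores highest
```

The language-model score dominates precisely when the acoustic probabilities are close, which is the situation in which ambiguous pronunciations arise.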
In order to understand the speaker's intention more accurately and achieve accurate transcription, the prior art determines the user's context before transcription; in a hospital environment, for example, a medicine-related language model is used so that professional terms can be translated more accurately. For instance, Google's patent CN104508739B uses a larger general language model for basic transcription and then adds a context-detection step on the transcribed text; if a medical environment is detected, a medical language model is used. However, this approach must solve a context classification problem, and it is difficult to guarantee transcription accuracy when the context is unclear or ambiguous.
Disclosure of Invention
The application provides a speech recognition method, a speech recognition device, an electronic device and a readable storage medium, which avoid context detection: a character error detector (WED) directly judges whether an output contains errors, and the sentence with the fewest possible grammatical errors, or the smallest error probability value, is selected as the final output.
The technical scheme adopted by the application is as follows:
in a first aspect, a speech recognition method is provided, which includes:
performing voice recognition on the voice to be recognized based on the voice recognition model in the target voice recognition model set to obtain a candidate text set;
performing character error detection on candidate texts in the candidate text set based on a pre-trained character error detector to obtain a character error detection result of each candidate text;
and determining a target text for performing voice recognition on the voice to be recognized based on the character error detection result of each candidate text.
Optionally, the target set of speech recognition models comprises speech recognition models in at least two different contexts.
Optionally, performing speech recognition on the speech to be recognized based on the speech recognition model in the target speech recognition model set to obtain a candidate text set, including:
and carrying out voice recognition on the voice to be recognized based on any voice recognition model in the target voice recognition model set to obtain at least one candidate text.
Optionally, the character error detection of the candidate texts in the candidate text set based on the pre-trained character error detector includes:
inputting the candidate text into a pre-trained Transformer network to obtain a state vector of the candidate text;
and taking the state vector of the candidate text as the input of a recurrent neural network, passing it through a fully connected network, and inputting the result to a classifier to obtain the correct probability value of each character.
Optionally, determining a target text for performing speech recognition on a speech to be recognized based on the character error detection result of each candidate text includes:
determining the character error rate of each candidate text;
and taking the candidate text with the lowest character error rate as the target text for performing voice recognition on the voice to be recognized.
Optionally, the method further comprises:
acquiring user information of a voice to be recognized;
at least two speech recognition models are determined from the plurality of candidate speech recognition models based on the user information to obtain a target speech recognition model set.
Optionally, the label construction of the pre-trained character error detector training samples comprises:
acquiring a training text sample, and replacing characters in the training text sample with a certain probability;
the label corresponding to the position where the character is not changed is set to 1, and the label of the position where the character is changed is set to 0.
In a second aspect, a speech recognition apparatus is provided, including:
the voice recognition module is used for carrying out voice recognition on the voice to be recognized based on the voice recognition model in the target voice recognition model set to obtain a candidate text set;
the detection module is used for carrying out character error detection on the candidate texts in the candidate text set based on the pre-trained character error detector to obtain a character error detection result of each candidate text;
and the determining module is used for determining a target text for performing voice recognition on the voice to be recognized based on the character error detection result of each candidate text.
Optionally, the target set of speech recognition models comprises speech recognition models in at least two different contexts.
Optionally, the speech recognition module comprises:
and the voice recognition unit is used for carrying out voice recognition on the voice to be recognized based on any voice recognition model in the target voice recognition model set to obtain at least one candidate text.
Optionally, the detection module comprises:
the input unit is used for inputting the candidate text into a pre-trained Transformer network to obtain a state vector of the candidate text;
and the classification unit is used for taking the state vector of the candidate text as the input of a recurrent neural network, passing it through a fully connected network, and inputting the result to a classifier to obtain the correct probability value of each character.
Optionally, the determining module includes:
a determining unit, configured to determine a character error rate of each candidate text;
and the selection unit is used for taking the candidate text with the lowest character error rate as the target text for performing voice recognition on the voice to be recognized.
Optionally, the apparatus further comprises:
the first acquisition module is used for acquiring user information of the voice to be recognized;
and the screening and determining module is used for screening and determining at least two speech recognition models from the candidate speech recognition models based on the user information to obtain a target speech recognition model set.
Optionally, the apparatus further comprises:
the second acquisition module is used for acquiring the training text sample and replacing characters in the training text sample with a certain probability;
and the setting module is used for setting the label corresponding to the position where the character is not changed to be 1 and setting the label corresponding to the position where the character is changed to be 0.
In a third aspect, an electronic device is provided, which includes:
one or more processors;
a memory;
one or more application programs, wherein the one or more application programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs being configured to perform the speech recognition method of the first aspect.
In a fourth aspect, a computer-readable storage medium is provided for storing computer instructions which, when run on a computer, cause the computer to perform the speech recognition method shown in the first aspect.
Compared with the prior art, in which a language model is first determined through context detection and the voice recognition result is then determined by that language model, the application performs voice recognition on the voice to be recognized through the voice recognition models in a target voice recognition model set to obtain a candidate text set, performs character error detection on the candidate texts in the candidate text set based on a pre-trained character error detector to obtain a character error detection result for each candidate text, and then determines the target text of the voice to be recognized based on those results. Because the voice to be recognized is recognized by a plurality of voice recognition models to obtain a plurality of candidate texts, and the final target text is determined from the character error detector's results on those candidate texts, no context detection is needed, and an accurate recognition result can be obtained even when the context is ambiguous.
Additional aspects and advantages of the present application will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the present application.
Drawings
The foregoing and/or additional aspects and advantages of the present application will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
fig. 1 is a schematic flowchart of a speech recognition method according to an embodiment of the present application;
fig. 2 is a schematic structural diagram of a speech recognition apparatus according to an embodiment of the present application;
FIG. 3 is a diagram illustrating an exemplary speech recognition process according to an embodiment of the present application;
fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present application;
FIG. 5 is a diagram illustrating an exemplary character detection of a character error detector implemented in the present application.
Detailed Description
Reference will now be made in detail to the embodiments of the present application, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are exemplary only for the purpose of explaining the present application and are not to be construed as limiting the present application.
As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. As used herein, the term "and/or" includes all or any element and all combinations of one or more of the associated listed items.
To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
The following describes the technical solutions of the present application and how to solve the above technical problems with specific embodiments. The following several specific embodiments may be combined with each other, and details of the same or similar concepts or processes may not be repeated in some embodiments. Embodiments of the present application will be described below with reference to the accompanying drawings.
Example one
An embodiment of the present application provides a speech recognition method, as shown in fig. 1, the method may include the following steps:
step S101, performing voice recognition on a voice to be recognized based on a voice recognition model in a target voice recognition model set to obtain a candidate text set; wherein the target speech recognition model set comprises speech recognition models in at least two different contexts, such as speech recognition models in medical aspect, speech recognition models in financial field, and the like. One speech recognition model recognizes the speech to be recognized, so that a plurality of candidate texts can be obtained, and one candidate text can also be obtained (the candidate text with the highest probability is selected from the possible candidate texts).
Step S102, character error detection is carried out on candidate texts in a candidate text set based on a pre-trained character error detector, and character error detection results of all the candidate texts are obtained;
the pre-trained character error detector is used for performing character error detection on the candidate texts in the candidate text set and determining whether the characters in the candidate texts are correct.
Step S103, determining a target text for performing voice recognition on the voice to be recognized based on the character error detection result of each candidate text.
Specifically, the character error rate or the character correct rate of each candidate text may be counted, and the candidate text with the lowest error rate or the candidate text with the highest correct rate is determined as the target text for performing the speech recognition on the speech to be recognized.
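Illustratively, the selection in step S103 can be sketched as follows. The candidate sentences and per-character correctness probabilities are hypothetical, and the 0.5 decision threshold is an assumption rather than a value given by the application:

```python
def pick_target_text(candidates):
    """Select the candidate transcript whose per-character correctness
    probabilities imply the lowest character error rate.

    `candidates` maps each candidate sentence to the list of
    per-character "correct" probabilities produced by the detector.
    """
    def error_rate(probs):
        # A character counts as an error when the detector's
        # correctness probability falls below 0.5 (assumed threshold).
        errors = sum(1 for p in probs if p < 0.5)
        return errors / len(probs)

    return min(candidates, key=lambda text: error_rate(candidates[text]))

# Hypothetical detector outputs for two candidate transcripts.
detector_out = {
    "buy some medicine": [0.98, 0.95, 0.97],  # no likely errors
    "by sum medicine":   [0.30, 0.20, 0.97],  # two likely errors
}
print(pick_target_text(detector_out))
```

Selecting by highest character correct rate, as the paragraph above also permits, is equivalent here since the two rates sum to one.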
Fig. 3 shows an exemplary flow chart of speech recognition in the embodiment of the present application: the speech to be recognized passes through an acoustic model (AM), which outputs different candidate Chinese characters and their probability values; different LM models convert the AM output into a plurality of different sentences; and the detection model (WED) examines the sentences and finds the one with the fewest erroneous words or the smallest error probability.
Compared with the prior art, in which a language model is first determined through context detection and the voice recognition result is then determined by that language model, the embodiment performs voice recognition on the voice to be recognized based on the voice recognition models in the target voice recognition model set to obtain a candidate text set, performs character error detection on the candidate texts in the candidate text set based on a pre-trained character error detector to obtain a character error detection result for each candidate text, and then determines the target text of the voice to be recognized based on those results. Because the voice to be recognized is recognized by a plurality of voice recognition models to obtain a plurality of candidate texts, and the final target text is determined from the character error detector's results on those candidate texts, no context detection is needed, and an accurate recognition result can be obtained even when the context is ambiguous.
The embodiment of the present application provides a possible implementation manner, specifically, performing speech recognition on a to-be-recognized speech based on a speech recognition model in a target speech recognition model set to obtain a candidate text set, including:
and carrying out voice recognition on the voice to be recognized based on any voice recognition model in the target voice recognition model set to obtain at least one candidate text.
Specifically, any speech recognition model may perform speech recognition on the speech to be recognized to obtain a plurality of candidate texts together with a probability value for each; in this case, the candidate text set is the union of the candidate texts obtained by the respective speech recognition models. Alternatively, only the candidate text with the highest probability value may be taken as a model's recognition result, in which case each speech recognition model contributes one candidate text.
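Illustratively, the two ways of assembling the candidate text set described above can be sketched as follows; the model names, sentences and probability values are hypothetical:

```python
def build_candidate_set(model_outputs, top1_only=False):
    """Merge per-model recognition results into one candidate set.

    `model_outputs` maps a model name to a list of (text, probability)
    hypotheses. With `top1_only`, each model contributes only its
    highest-probability hypothesis; otherwise all hypotheses are kept.
    """
    candidates = []
    for hyps in model_outputs.values():
        if top1_only:
            candidates.append(max(hyps, key=lambda h: h[1])[0])
        else:
            candidates.extend(text for text, _ in hyps)
    return candidates

outputs = {
    "medical_lm": [("take the medicine", 0.7), ("make the medicine", 0.3)],
    "finance_lm": [("take the money", 0.6), ("make the money", 0.4)],
}
print(build_candidate_set(outputs))                   # all four hypotheses
print(build_candidate_set(outputs, top1_only=True))   # one per model
```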
The embodiment of the present application provides a possible implementation manner, and specifically, performing character error detection on candidate texts in a candidate text set based on a pre-trained character error detector includes:
inputting the candidate text into a pre-trained Transformer network to obtain a state vector of the candidate text;
and taking the state vector of the candidate text as the input of a recurrent neural network, passing it through a fully connected network, and inputting the result to a classifier to obtain the correct probability value of each character.
Illustratively, fig. 5 shows an exemplary diagram of the character detection flow of the character error detector. Character error detection employs a pre-trained Transformer in cooperation with an RNN model to implement character-by-character detection. The candidate text is input into the pre-trained Transformer (e.g., a BERT model) to obtain the corresponding state vectors, where each Chinese character corresponds to one state vector, usually of 768 dimensions. These vectors are then used as the input to an RNN (which may be an LSTM or a GRU), and finally the probability that each character is correct is obtained by passing through a fully connected layer (FC) and applying softmax.
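Illustratively, the detector head described above can be sketched in miniature as follows. The weights are random stand-ins rather than trained parameters, a plain tanh RNN cell stands in for the LSTM/GRU, and the 768-dimensional input vectors stand in for the output of a pre-trained BERT encoder:

```python
import numpy as np

rng = np.random.default_rng(0)
HIDDEN, ENC_DIM = 64, 768  # 768 matches BERT-base state vectors

# Stand-ins for trained parameters (random here, for illustration only).
W_xh = rng.normal(0, 0.02, (ENC_DIM, HIDDEN))
W_hh = rng.normal(0, 0.02, (HIDDEN, HIDDEN))
W_fc = rng.normal(0, 0.02, (HIDDEN, 2))  # two classes: wrong / correct

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def detect_errors(state_vectors):
    """Run the detector head: a simple tanh RNN over the encoder's
    per-character state vectors, then a fully connected layer and
    softmax giving P(correct) for every character."""
    h = np.zeros(HIDDEN)
    probs = []
    for v in state_vectors:  # one 768-dim vector per character
        h = np.tanh(v @ W_xh + h @ W_hh)
        probs.append(softmax(h @ W_fc)[1])  # index 1 = "correct"
    return probs

# Pretend the Transformer encoded a five-character candidate sentence.
encoded = rng.normal(size=(5, ENC_DIM))
print(detect_errors(encoded))  # five probabilities, each strictly in (0, 1)
```

The recurrence lets each character's verdict depend on the characters before it, which is what distinguishes this design from classifying each state vector independently.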
The embodiment of the present application provides a possible implementation manner, and specifically, determining a target text for performing speech recognition on a speech to be recognized based on a character error detection result of each candidate text includes:
determining the character error rate of each candidate text;
and taking the candidate text with the lowest character error rate as the target text for performing voice recognition on the voice to be recognized.
The embodiment of the present application provides a possible implementation manner, and further, the method further includes:
acquiring user information of a voice to be recognized;
at least two speech recognition models are determined from the plurality of candidate speech recognition models based on the user information to obtain a target speech recognition model set.
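Illustratively, the screening of candidate speech recognition models by user information can be sketched as follows; the context tags, model names and the fallback rule are hypothetical:

```python
# Hypothetical registry mapping context tags to language models.
MODEL_REGISTRY = {
    "medical": "medical_lm",
    "finance": "finance_lm",
    "general": "general_lm",
    "legal":   "legal_lm",
}

def select_models(user_info, registry=MODEL_REGISTRY, minimum=2):
    """Pick the candidate models matching the user's context tags,
    then pad with further models so at least `minimum` are returned,
    giving a target speech recognition model set of two or more."""
    chosen = [m for tag, m in registry.items()
              if tag in user_info.get("contexts", [])]
    for model in registry.values():
        if len(chosen) >= minimum:
            break
        if model not in chosen:
            chosen.append(model)
    return chosen

user = {"id": 42, "contexts": ["medical"]}
print(select_models(user))  # the medical model plus at least one fallback
```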
The embodiment of the present application provides a possible implementation manner, and specifically, the label construction of the training sample of the pre-trained character error detector includes:
acquiring a training text sample, and replacing characters in the training text sample with a certain probability;
the label corresponding to the position where the character is not changed is set to 1, and the label of the position where the character is changed is set to 0.
Specifically, training of the character error detector is achieved through data augmentation, i.e., the input sentences and their labels (0/1 sequences) are constructed by a program. The input sentences can be obtained by collecting a large number of web texts; for each sentence, the Chinese characters or phrases in it are replaced with a certain probability. For each unchanged position the corresponding output value is 1, and for each changed position it is 0, so that a large number of training corpora and their labels are constructed.
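Illustratively, the label construction described above can be sketched as follows; the replacement probability of 0.15 and the Latin-letter vocabulary are illustrative assumptions (the application works on Chinese characters):

```python
import random

def make_training_pair(sentence, replace_prob=0.15,
                       vocab="abcdefghijklmnopqrstuvwxyz"):
    """Corrupt a sentence by replacing each character with probability
    `replace_prob`; label 1 marks an unchanged position, 0 a changed one."""
    corrupted, labels = [], []
    for ch in sentence:
        if random.random() < replace_prob:
            # Pick a replacement guaranteed to differ from the original.
            repl = random.choice([c for c in vocab if c != ch])
            corrupted.append(repl)
            labels.append(0)
        else:
            corrupted.append(ch)
            labels.append(1)
    return "".join(corrupted), labels

random.seed(7)
text, labels = make_training_pair("speechsample")
print(text, labels)
```

Because corruption is programmatic, arbitrarily large labeled corpora can be generated from unlabeled text, which is the point of the data-augmentation scheme.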
Illustratively, consider an example that does not use the technical solution of the present application: according to the prior art, the same speech may be output as different sentences by two different language models, as shown in Table 1:
TABLE 1
Taking the first set of sentences as an example, it is difficult for an algorithm to determine which of sentences A and B the user really intends. The conventional method first obtains a basic sentence through a general-purpose base language model, then determines from that sentence the context in which the user is speaking (here, it would have to determine that the context is related to a pharmacy), and finally combines the pre-trained language model for that context with the output of the speech model (AM) to obtain the transcription result. The embodiment of the application adopts a different idea: it traverses all the language models and selects the most suitable sentence among their outputs as the final transcription result, thereby solving the ambiguity problem in speech recognition, namely that a mismatched language model causes vocabulary in the sentence to be misused or to violate usage habits.
Example two
Fig. 2 shows a speech recognition apparatus according to an embodiment of the present application. The apparatus 20 includes: a speech recognition module 201, a detection module 202 and a determination module 203, wherein,
the voice recognition module 201 is configured to perform voice recognition on a to-be-recognized voice based on a voice recognition model in the target voice recognition model set to obtain a candidate text set;
the detection module 202 is configured to perform character error detection on candidate texts in the candidate text set based on a pre-trained character error detector to obtain a character error detection result of each candidate text;
and the determining module 203 is configured to determine a target text for performing speech recognition on the speech to be recognized based on the character error detection result of each candidate text.
In particular, the target set of speech recognition models comprises speech recognition models in at least two different contexts.
Specifically, the speech recognition module includes:
and the voice recognition unit is used for carrying out voice recognition on the voice to be recognized based on any voice recognition model in the target voice recognition model set to obtain at least one candidate text.
Specifically, the detection module includes:
the input unit is used for inputting the candidate text into a pre-trained Transformer network to obtain a state vector of the candidate text;
and the classification unit is used for taking the state vector of the candidate text as the input of a recurrent neural network, passing it through a fully connected network, and inputting the result to a classifier to obtain the correct probability value of each character.
Specifically, the determining module includes:
a determining unit, configured to determine a character error rate of each candidate text;
and the selection unit is used for taking the candidate text with the lowest character error rate as the target text for performing voice recognition on the voice to be recognized.
Specifically, the apparatus further comprises:
the first acquisition module is used for acquiring user information of the voice to be recognized;
and the screening and determining module is used for screening and determining at least two speech recognition models from the candidate speech recognition models based on the user information to obtain a target speech recognition model set.
Specifically, the apparatus further comprises:
the second acquisition module is used for acquiring the training text sample and replacing characters in the training text sample with a certain probability;
and the setting module is used for setting the label corresponding to the position where the character is not changed to be 1 and setting the label corresponding to the position where the character is changed to be 0.
Compared with the prior art, in which a language model is first determined through context detection and the voice recognition result is then determined by that language model, the voice recognition device performs voice recognition on the voice to be recognized based on the voice recognition models in a target voice recognition model set to obtain a candidate text set, performs character error detection on the candidate texts in the candidate text set based on a pre-trained character error detector to obtain a character error detection result for each candidate text, and then determines the target text of the voice to be recognized based on those results. Because the voice to be recognized is recognized by a plurality of voice recognition models to obtain a plurality of candidate texts, and the final target text is determined from the character error detector's results on those candidate texts, no context detection is needed, and an accurate recognition result can be obtained even when the context is ambiguous.
The apparatus of the embodiment of the present application can execute the method shown in the first embodiment of the present application, and the implementation effect is similar, which is not described herein again.
EXAMPLE III
An embodiment of the present application provides an electronic device, as shown in fig. 4, an electronic device 40 shown in fig. 4 includes: a processor 401 and a memory 403. Wherein the processor 401 is coupled to the memory 403, such as via a bus 402. Further, the electronic device 40 may also include a transceiver 404. It should be noted that the transceiver 404 is not limited to one in practical applications, and the structure of the electronic device 40 is not limited to the embodiment of the present application. The processor 401 is applied in the embodiment of the present application, and is used to implement the functions of the modules shown in fig. 2. The transceiver 404 includes a receiver and a transmitter.
The processor 401 may be a CPU, general purpose processor, DSP, ASIC, FPGA or other programmable logic device, transistor logic device, hardware component, or any combination thereof. Which may implement or perform the various illustrative logical blocks, modules, and circuits described in connection with the disclosure. The processor 401 may also be a combination of computing functions, e.g., comprising one or more microprocessors, a combination of a DSP and a microprocessor, or the like.
The memory 403 may be, but is not limited to, a ROM or other static storage device that can store static information and instructions, a RAM or other dynamic storage device that can store information and instructions, an EEPROM, a CD-ROM or other optical disc storage (including compact discs, laser discs, digital versatile discs, Blu-ray discs, etc.), magnetic disk storage media or other magnetic storage devices, or any other medium that can carry or store desired program code in the form of instructions or data structures and can be accessed by a computer.
The memory 403 is used for storing application program codes for executing the scheme of the application, and the execution is controlled by the processor 401. The processor 401 is configured to execute application program code stored in the memory 403 to implement the functions of the apparatus provided by the embodiment shown in fig. 2.
The embodiment of the present application provides an electronic device suitable for the above method embodiment, and specific implementation manners and technical effects are not described herein again.
Example four
The embodiment of the present application provides a computer-readable storage medium, on which a computer program is stored, and the program, when executed by a processor, implements the speech recognition method shown in the above embodiment.
The embodiment of the present application provides a computer-readable storage medium suitable for the above method embodiment, and specific implementation manners and technical effects are not described herein again.
It should be understood that, although the steps in the flowcharts of the figures are shown in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated herein, the order of execution is not strictly limited, and the steps may be performed in other orders. Moreover, at least some of the steps in the flowcharts may include multiple sub-steps or stages, which are not necessarily completed at the same moment but may be executed at different times, and their order of execution is not necessarily sequential: they may be performed in turn or alternately with other steps, or with at least some of the sub-steps or stages of other steps.
The foregoing describes only some embodiments of the present application. It should be noted that those skilled in the art can make various improvements and refinements without departing from the principle of the present application, and such improvements and refinements shall also fall within the protection scope of the present application.
Claims (10)
1. A speech recognition method, comprising:
performing voice recognition on the voice to be recognized based on the voice recognition model in the target voice recognition model set to obtain a candidate text set;
performing character error detection on the candidate texts in the candidate text set based on a pre-trained character error detector to obtain a character error detection result of each candidate text;
and determining a target text for performing voice recognition on the voice to be recognized based on the character error detection result of each candidate text.
2. The method of claim 1, wherein the set of target speech recognition models comprises speech recognition models in at least two different contexts.
3. The method of claim 1, wherein performing speech recognition on the speech to be recognized based on the speech recognition model in the target speech recognition model set to obtain a candidate text set comprises:
and carrying out voice recognition on the voice to be recognized based on any voice recognition model in the target voice recognition model set to obtain at least one candidate text.
4. The method of claim 1, wherein the pre-training based character error detector performs character error detection on candidate texts in the candidate text set, comprising:
inputting the candidate text into a pre-trained Transformer network to obtain a state vector of the candidate text;
and taking the state vector of the candidate text as the input of a recurrent neural network, passing the result through a fully-connected network, and inputting it to a classifier to obtain the probability that each character is correct.
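The detector architecture of claim 4 (Transformer state vectors, then a recurrent layer, a fully-connected layer, and a classifier producing a per-character correct probability) can be illustrated with a toy forward pass. All weights below are random stand-ins, not trained parameters, and the simple tanh recurrent cell and sigmoid classifier are assumed choices; the patent only names the layer types.

```python
import math

def matvec(W, x):
    # Plain matrix-vector product over Python lists.
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def char_error_detector(state_vectors, Wx, Wh, w_fc):
    """Toy forward pass mirroring claim 4.

    state_vectors: one (pre-trained) Transformer state per character.
    Wx, Wh: weights of a simple recurrent cell (stand-in for the RNN).
    w_fc: weights of the fully-connected layer feeding the classifier.
    Returns one correct-probability per character.
    """
    h = [0.0] * len(Wh)          # recurrent hidden state
    probs = []
    for x in state_vectors:
        # Recurrent cell over the Transformer state vectors.
        pre = [a + b for a, b in zip(matvec(Wx, x), matvec(Wh, h))]
        h = [math.tanh(v) for v in pre]
        # Fully-connected layer, then a sigmoid classifier.
        logit = sum(w * v for w, v in zip(w_fc, h))
        probs.append(1.0 / (1.0 + math.exp(-logit)))
    return probs
```

A production detector would use trained Transformer and RNN modules; this sketch only makes the data flow of the claim concrete.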
5. The method of claim 1, wherein determining the target text for performing speech recognition on the speech to be recognized based on the character error detection result of each candidate text comprises:
determining the character error rate of each candidate text;
and taking the candidate text with the lowest character error rate as the target text of the voice recognition of the voice to be recognized.
6. The method of claim 1, further comprising:
acquiring user information of the voice to be recognized;
determining at least two speech recognition models from a plurality of candidate speech recognition models based on the user information to obtain a target speech recognition model set.
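Claim 6 leaves the selection rule open. A hypothetical rule (entirely an assumption, including the tag-based profiles) is to rank candidate models by how well they match the user's profile and keep at least two:

```python
def build_target_model_set(user_info, candidate_models, k=2):
    """Hypothetical selection of the target model set from user info.

    candidate_models: dicts with a "tags" list describing each model's
    context (e.g. domain, accent) -- an illustrative representation.
    Keeps the k best-matching models, with k >= 2 as claim 6 requires.
    """
    ranked = sorted(
        candidate_models,
        key=lambda m: len(set(m["tags"]) & set(user_info["tags"])),
        reverse=True,  # best match first
    )
    return ranked[:max(k, 2)]
```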
7. The method of any of claims 1-6, wherein label building of the pre-trained character error detector training samples comprises:
acquiring a training text sample, and replacing characters in the training text sample with a preset probability;
the label corresponding to the position where the character is not changed is set to 1, and the label of the position where the character is changed is set to 0.
8. A speech recognition apparatus, comprising:
the voice recognition module is used for carrying out voice recognition on the voice to be recognized based on the voice recognition model in the target voice recognition model set to obtain a candidate text set;
the detection module is used for carrying out character error detection on the candidate texts in the candidate text set based on a pre-trained character error detector to obtain a character error detection result of each candidate text;
and the determining module is used for determining a target text for performing voice recognition on the voice to be recognized based on the character error detection result of each candidate text.
9. An electronic device, comprising:
one or more processors;
a memory;
one or more applications, wherein the one or more applications are stored in the memory and configured to be executed by the one or more processors to perform the speech recognition method according to any of claims 1 to 7.
10. A computer-readable storage medium for storing computer instructions which, when executed on a computer, cause the computer to perform the speech recognition method of any of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011268976.9A CN112509565A (en) | 2020-11-13 | 2020-11-13 | Voice recognition method and device, electronic equipment and readable storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112509565A (en) | 2021-03-16
Family
ID=74957522
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011268976.9A Pending CN112509565A (en) | 2020-11-13 | 2020-11-13 | Voice recognition method and device, electronic equipment and readable storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112509565A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114898733A (en) * | 2022-05-06 | 2022-08-12 | 深圳妙月科技有限公司 | AI voice data analysis processing method and system |
Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2018120889A1 (en) * | 2016-12-28 | 2018-07-05 | 平安科技(深圳)有限公司 | Input sentence error correction method and device, electronic device, and medium |
CN108874174A (en) * | 2018-05-29 | 2018-11-23 | 腾讯科技(深圳)有限公司 | A kind of text error correction method, device and relevant device |
CN108984745A (en) * | 2018-07-16 | 2018-12-11 | 福州大学 | A kind of neural network file classification method merging more knowledge mappings |
CN110148416A (en) * | 2019-04-23 | 2019-08-20 | 腾讯科技(深圳)有限公司 | Audio recognition method, device, equipment and storage medium |
CN110457688A (en) * | 2019-07-23 | 2019-11-15 | 广州视源电子科技股份有限公司 | Correction processing method and device, storage medium and processor |
CN110491383A (en) * | 2019-09-25 | 2019-11-22 | 北京声智科技有限公司 | A kind of voice interactive method, device, system, storage medium and processor |
CN110765996A (en) * | 2019-10-21 | 2020-02-07 | 北京百度网讯科技有限公司 | Text information processing method and device |
CN111723791A (en) * | 2020-06-11 | 2020-09-29 | 腾讯科技(深圳)有限公司 | Character error correction method, device, equipment and storage medium |
CN111883110A (en) * | 2020-07-30 | 2020-11-03 | 上海携旅信息技术有限公司 | Acoustic model training method, system, device and medium for speech recognition |
CN111918136A (en) * | 2020-07-04 | 2020-11-10 | 中信银行股份有限公司 | Interest analysis method and device, storage medium and electronic equipment |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107729313B (en) | Deep neural network-based polyphone pronunciation distinguishing method and device | |
US7421387B2 (en) | Dynamic N-best algorithm to reduce recognition errors | |
Ferrer et al. | Classification of lexical stress using spectral and prosodic features for computer-assisted language learning systems | |
WO2020168752A1 (en) | Speech recognition and speech synthesis method and apparatus based on dual learning | |
CN112489626B (en) | Information identification method, device and storage medium | |
CN111145718A (en) | Chinese mandarin character-voice conversion method based on self-attention mechanism | |
JP2019159654A (en) | Time-series information learning system, method, and neural network model | |
CN117935785A (en) | Phoneme-based contextualization for cross-language speech recognition in an end-to-end model | |
CN111310441A (en) | Text correction method, device, terminal and medium based on BERT (binary offset transcription) voice recognition | |
CN110335608B (en) | Voiceprint verification method, voiceprint verification device, voiceprint verification equipment and storage medium | |
CN112818089B (en) | Text phonetic notation method, electronic equipment and storage medium | |
WO2023030105A1 (en) | Natural language processing model training method and natural language processing method, and electronic device | |
CN112669845B (en) | Speech recognition result correction method and device, electronic equipment and storage medium | |
US20230104228A1 (en) | Joint Unsupervised and Supervised Training for Multilingual ASR | |
WO2023071581A1 (en) | Method and apparatus for determining response sentence, device, and medium | |
US20050187767A1 (en) | Dynamic N-best algorithm to reduce speech recognition errors | |
US20050197838A1 (en) | Method for text-to-pronunciation conversion capable of increasing the accuracy by re-scoring graphemes likely to be tagged erroneously | |
EP4295358A1 (en) | Lookup-table recurrent language model | |
CN114386399A (en) | Text error correction method and device | |
CN112509565A (en) | Voice recognition method and device, electronic equipment and readable storage medium | |
CN112632956A (en) | Text matching method, device, terminal and storage medium | |
Rajendran et al. | A robust syllable centric pronunciation model for Tamil text to speech synthesizer | |
CN116187304A (en) | Automatic text error correction algorithm and system based on improved BERT | |
CN111340117A (en) | CTC model training method, data processing method, device and storage medium | |
CN114707518B (en) | Semantic fragment-oriented target emotion analysis method, device, equipment and medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
Application publication date: 20210316 |