CN112509565A - Voice recognition method and device, electronic equipment and readable storage medium


Info

Publication number
CN112509565A
Authority
CN
China
Prior art keywords
candidate
candidate text
text
voice
voice recognition
Prior art date
Legal status
Pending
Application number
CN202011268976.9A
Other languages
Chinese (zh)
Inventor
赖勇铨
Current Assignee
China Citic Bank Corp Ltd
Original Assignee
China Citic Bank Corp Ltd
Priority date
Filing date
Publication date
Application filed by China Citic Bank Corp Ltd
Priority to CN202011268976.9A
Publication of CN112509565A
Status: Pending

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/26 Speech to text systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Machine Translation (AREA)

Abstract

The application provides a voice recognition method, a voice recognition device, an electronic device and a readable storage medium, applied in the technical field of voice recognition. The method comprises the following steps: performing voice recognition on the voice to be recognized based on the voice recognition models in a target voice recognition model set to obtain a candidate text set; performing character error detection on the candidate texts in the candidate text set based on a pre-trained character error detector to obtain a character error detection result for each candidate text; and determining the target text of the voice to be recognized based on the character error detection results of the candidate texts. By recognizing the voice to be recognized with a plurality of voice recognition models to obtain a plurality of candidate texts, and determining the final target text from the character error detector's results on those candidates, context detection is not needed, and an accurate recognition result can be obtained even when the context is ambiguous.

Description

Voice recognition method and device, electronic equipment and readable storage medium
Technical Field
The present application relates to the field of speech recognition technologies, and in particular, to a speech recognition method, an apparatus, an electronic device, and a readable storage medium.
Background
Speech recognition typically consists of two parts: a speech (acoustic) model and a language model. The speech model converts the audio into candidate character sequences and outputs corresponding probabilities; for example, the pronunciation "chi fan" might be output as (eat: 0.99, this: 0.01) for the first syllable and (rice: 0.8, double: 0.1) for the second, where the numbers represent the probability that the character matches the pronunciation. The language model is responsible for selecting a path through the speech model's output; the example above yields four possible combinations. The language model scores these four candidate combinations and, combining the probability under the language's grammar with the pronunciation probabilities, finally selects "eat rice" (have a meal) as the output. If "chi fan" is followed by further pronunciation, as in "chi fan qian lai", the language model may instead select "this" with high probability, so that the final sentence changes. The language model thus helps solve the text-selection problem in speech recognition; in particular, when ambiguous pronunciations are encountered, the language model is needed for the final decision.
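As a rough, purely illustrative sketch of this rescoring idea (the candidate words, probabilities and the toy lm_score function below are invented for illustration and do not come from the patent):

```python
from itertools import product
from math import prod

# Hypothetical speech-model output: per syllable, candidate words with the
# probability that each matches the pronunciation.
am_candidates = [
    {"eat": 0.99, "this": 0.01},
    {"rice": 0.80, "double": 0.10},
]

def lm_score(words):
    """Toy stand-in for a language model: favors the bigram ('eat', 'rice')."""
    table = {("eat", "rice"): 0.9, ("eat", "double"): 0.05}
    return table.get(tuple(words), 0.01)

def rescore():
    best_score, best_words = 0.0, None
    # Enumerate the four possible combinations of candidate words.
    for combo in product(*(d.items() for d in am_candidates)):
        words = [w for w, _ in combo]
        am_prob = prod(p for _, p in combo)   # pronunciation probability of the path
        score = lm_score(words) * am_prob     # combine LM and AM evidence
        if score > best_score:
            best_score, best_words = score, words
    return best_words

print(rescore())  # -> ['eat', 'rice']
```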
To understand the speaker's intention more accurately and achieve accurate transcription, the prior art determines the user's context before transcription; for example, in a hospital environment a medicine-related language model is used, so that professional terms are transcribed more accurately. For instance, Google's patent CN104508739B uses a larger general-purpose language model for basic transcription and then applies a context-detection step to the transcribed text; if a medical environment is detected, a medical language model is used. However, this method has to solve a context-classification problem, and it is difficult to guarantee the accuracy of the transcription when the context is unclear or ambiguous.
Disclosure of Invention
The application provides a speech recognition method, a speech recognition device, an electronic device and a readable storage medium, which avoid context detection by directly judging whether an output contains errors through a character error detector (WED) and selecting the sentence with the fewest errors, or the smallest error probability value, as the final output according to the number and probability of likely grammatical errors.
The technical scheme adopted by the application is as follows:
in a first aspect, a speech recognition method is provided, which includes:
performing voice recognition on the voice to be recognized based on the voice recognition model in the target voice recognition model set to obtain a candidate text set;
performing character error detection on candidate texts in the candidate text set based on a pre-trained character error detector to obtain a character error detection result of each candidate text;
and determining a target text for performing voice recognition on the voice to be recognized based on the character error detection result of each candidate text.
Optionally, the target set of speech recognition models comprises speech recognition models in at least two different contexts.
Optionally, performing speech recognition on the speech to be recognized based on the speech recognition model in the target speech recognition model set to obtain a candidate text set, including:
and carrying out voice recognition on the voice to be recognized based on any voice recognition model in the target voice recognition model set to obtain at least one candidate text.
Optionally, the character error detection of the candidate texts in the candidate text set based on the pre-trained character error detector includes:
inputting the candidate text into a pre-trained Transformer network to obtain a state vector of the candidate text;
and taking the state vector of the candidate text as the input of a recurrent neural network, whose output is passed through a fully connected network to a classifier to obtain the probability value that each character is correct.
Optionally, determining a target text for performing speech recognition on a speech to be recognized based on the character error detection result of each candidate text includes:
determining the character error rate of each candidate text;
and taking the candidate text with the lowest character error rate as the target text for performing voice recognition on the voice to be recognized.
Optionally, the method further comprises:
acquiring user information of a voice to be recognized;
at least two speech recognition models are determined from the plurality of candidate speech recognition models based on the user information to obtain a target speech recognition model set.
Optionally, the label construction of the pre-trained character error detector training samples comprises:
acquiring a training text sample, and replacing characters in the training text sample with a certain probability;
the label corresponding to a position where the character is not changed is set to 1, and the label of a position where the character is changed is set to 0.
In a second aspect, a speech recognition apparatus is provided, including:
the voice recognition module is used for carrying out voice recognition on the voice to be recognized based on the voice recognition model in the target voice recognition model set to obtain a candidate text set;
the detection module is used for carrying out character error detection on the candidate texts in the candidate text set based on the pre-trained character error detector to obtain a character error detection result of each candidate text;
and the determining module is used for determining a target text for performing voice recognition on the voice to be recognized based on the character error detection result of each candidate text.
Optionally, the target set of speech recognition models comprises speech recognition models in at least two different contexts.
Optionally, the speech recognition module comprises:
and the voice recognition unit is used for carrying out voice recognition on the voice to be recognized based on any voice recognition model in the target voice recognition model set to obtain at least one candidate text.
Optionally, the detection module comprises:
the input unit is used for inputting the candidate text into a pre-trained Transformer network to obtain a state vector of the candidate text;
and the classification unit is used for taking the state vector of the candidate text as the input of the recurrent neural network, whose output is passed through a fully connected network to the classifier to obtain the probability value that each character is correct.
Optionally, the determining module includes:
a determining unit, configured to determine a character error rate of each candidate text;
and the unit is used for taking the candidate text with the lowest character error rate as the target text for performing voice recognition on the voice to be recognized.
Optionally, the apparatus further comprises:
the first acquisition module is used for acquiring user information of the voice to be recognized;
and the screening and determining module is used for screening and determining at least two speech recognition models from the candidate speech recognition models based on the user information to obtain a target speech recognition model set.
Optionally, the apparatus further comprises:
the second acquisition module is used for acquiring the training text sample and replacing characters in the training text sample with a certain probability;
and the setting module is used for setting the label corresponding to the position where the character is not changed to be 1 and setting the label corresponding to the position where the character is changed to be 0.
In a third aspect, an electronic device is provided, which includes:
one or more processors;
a memory;
one or more application programs, wherein the one or more application programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs being configured to perform the speech recognition method shown in the first aspect.
In a fourth aspect, a computer-readable storage medium is provided for storing computer instructions which, when run on a computer, cause the computer to perform the speech recognition method shown in the first aspect.
Compared with the prior art, in which a language model is determined through context detection and the voice recognition result is then determined according to that language model, the present application performs voice recognition on the voice to be recognized through the voice recognition models in a target voice recognition model set to obtain a candidate text set, performs character error detection on the candidate texts in the candidate text set based on a pre-trained character error detector to obtain the character error detection result of each candidate text, and then determines the target text of the voice to be recognized based on those detection results. By recognizing the voice to be recognized with a plurality of voice recognition models to obtain a plurality of candidate texts and determining the final target text from the character error detector's results on those candidates, context detection is not needed, and an accurate recognition result can be obtained even when the context is ambiguous.
Additional aspects and advantages of the present application will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the present application.
Drawings
The foregoing and/or additional aspects and advantages of the present application will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
fig. 1 is a schematic flowchart of a speech recognition method according to an embodiment of the present application;
fig. 2 is a schematic structural diagram of a speech recognition apparatus according to an embodiment of the present application;
FIG. 3 is a diagram illustrating an exemplary speech recognition process according to an embodiment of the present application;
fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present application;
FIG. 5 is a diagram illustrating an exemplary character detection of a character error detector implemented in the present application.
Detailed Description
Reference will now be made in detail to the embodiments of the present application, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are exemplary only for the purpose of explaining the present application and are not to be construed as limiting the present application.
As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. As used herein, the term "and/or" includes all or any element and all combinations of one or more of the associated listed items.
To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
The following describes the technical solutions of the present application and how to solve the above technical problems with specific embodiments. The following several specific embodiments may be combined with each other, and details of the same or similar concepts or processes may not be repeated in some embodiments. Embodiments of the present application will be described below with reference to the accompanying drawings.
Example one
An embodiment of the present application provides a speech recognition method, as shown in fig. 1, the method may include the following steps:
step S101, performing voice recognition on a voice to be recognized based on a voice recognition model in a target voice recognition model set to obtain a candidate text set; wherein the target speech recognition model set comprises speech recognition models in at least two different contexts, such as speech recognition models in medical aspect, speech recognition models in financial field, and the like. One speech recognition model recognizes the speech to be recognized, so that a plurality of candidate texts can be obtained, and one candidate text can also be obtained (the candidate text with the highest probability is selected from the possible candidate texts).
Step S102, character error detection is carried out on candidate texts in a candidate text set based on a pre-trained character error detector, and character error detection results of all the candidate texts are obtained;
the pre-trained character error detector is used for performing character error detection on the candidate texts in the candidate text set and determining whether the characters in the candidate texts are correct.
Step S103, determining a target text for performing voice recognition on the voice to be recognized based on the character error detection result of each candidate text.
Specifically, the character error rate or character accuracy of each candidate text may be computed, and the candidate text with the lowest error rate (or the highest accuracy) is determined as the target text for performing voice recognition on the voice to be recognized.
Fig. 3 shows an exemplary flow of speech recognition in the embodiment of the present application: the speech to be recognized passes through an Acoustic Model (AM), which outputs candidate Chinese characters and their probability values; different LM models convert the AM output into several different sentences; and the detection model (WED) examines these sentences and selects the one with the fewest erroneous characters or the smallest error probability.
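A minimal sketch of this flow, assuming hypothetical model interfaces (acoustic_model, lm.decode and wed_error_rate are illustrative names, not the patent's API):

```python
def best_transcription(audio, acoustic_model, language_models, wed_error_rate):
    """AM output -> one candidate sentence per language model -> the character
    error detector (WED) picks the sentence with the lowest detected error rate."""
    am_output = acoustic_model(audio)       # candidate characters + probabilities
    candidates = [lm.decode(am_output) for lm in language_models]
    return min(candidates, key=wed_error_rate)
```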
Compared with the prior art, in which a language model is determined through context detection and the voice recognition result is then determined according to that language model, the embodiment of the present application performs voice recognition on the voice to be recognized based on the voice recognition models in the target voice recognition model set to obtain a candidate text set, performs character error detection on the candidate texts in the candidate text set based on a pre-trained character error detector to obtain the character error detection result of each candidate text, and then determines the target text of the voice to be recognized based on those detection results. By recognizing the voice to be recognized with a plurality of voice recognition models to obtain a plurality of candidate texts and determining the final target text from the character error detector's results on those candidates, context detection is not needed, and an accurate recognition result can be obtained even when the context is ambiguous.
The embodiment of the present application provides a possible implementation manner, specifically, performing speech recognition on a to-be-recognized speech based on a speech recognition model in a target speech recognition model set to obtain a candidate text set, including:
and carrying out voice recognition on the voice to be recognized based on any voice recognition model in the target voice recognition model set to obtain at least one candidate text.
Specifically, any voice recognition model may perform voice recognition on the voice to be recognized and obtain a plurality of candidate texts together with a probability value for each; in this case, the candidate text set is the union of the candidate texts obtained by each voice recognition model. Alternatively, only the candidate text with the highest probability value may be kept as the recognition result of each voice recognition model; in this case, one voice recognition model yields one candidate text.
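A hedged sketch of these two candidate-generation modes (decode_n_best is an assumed interface returning (sentence, probability) pairs, not an API defined by the patent):

```python
def collect_candidates(am_output, language_models, keep_all=True):
    """Build the candidate text set: either every n-best sentence from every
    language model, or only each model's single highest-probability sentence."""
    candidates = []
    for lm in language_models:
        scored = lm.decode_n_best(am_output)   # [(sentence, probability), ...]
        if keep_all:
            candidates.extend(sentence for sentence, _ in scored)
        else:
            candidates.append(max(scored, key=lambda pair: pair[1])[0])
    return candidates
```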
The embodiment of the present application provides a possible implementation manner, and specifically, performing character error detection on candidate texts in a candidate text set based on a pre-trained character error detector includes:
inputting the candidate text into a pre-trained Transformer network to obtain a state vector of the candidate text;
and taking the state vector of the candidate text as the input of the recurrent neural network, whose output is passed through a fully connected network to the classifier to obtain the probability value that each character is correct.
Illustratively, fig. 5 shows an exemplary character detection flow of the character error detector, in which character error detection uses a pre-trained Transformer and works together with an RNN model to detect the text character by character. The candidate text is input into a pre-trained Transformer (e.g., a BERT model) to obtain the corresponding state vectors, where each Chinese character corresponds to one state vector, usually of 768 dimensions. These vectors are then used as the input of an RNN (which may be an LSTM or a GRU), and finally the probability that each character is correct (trained against 0/1 labels) is obtained by passing the RNN output through a fully connected layer (FC) and applying softmax.
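A minimal PyTorch sketch of this architecture, assuming bert-base-chinese as the pre-trained Transformer and an LSTM as the RNN (the hidden size, bidirectionality and checkpoint name are assumptions; the patent only fixes the Transformer -> RNN -> fully connected -> softmax order):

```python
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

class CharErrorDetector(nn.Module):
    def __init__(self, bert_name="bert-base-chinese", rnn_hidden=256):
        super().__init__()
        self.bert = BertModel.from_pretrained(bert_name)  # 768-dim state per character
        self.rnn = nn.LSTM(768, rnn_hidden, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * rnn_hidden, 2)            # two classes: wrong / correct

    def forward(self, input_ids, attention_mask):
        states = self.bert(input_ids=input_ids,
                           attention_mask=attention_mask).last_hidden_state
        rnn_out, _ = self.rnn(states)                     # character-by-character states
        logits = self.fc(rnn_out)
        return torch.softmax(logits, dim=-1)[..., 1]      # P(correct) per character

# Usage on a single candidate text:
tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
detector = CharErrorDetector()
enc = tokenizer("一个候选文本", return_tensors="pt")
probs = detector(enc["input_ids"], enc["attention_mask"])  # one probability per token
```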
The embodiment of the present application provides a possible implementation manner, and specifically, determining a target text for performing speech recognition on a speech to be recognized based on a character error detection result of each candidate text includes:
determining the character error rate of each candidate text;
and taking the candidate text with the lowest character error rate as the target text for performing voice recognition on the voice to be recognized.
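A short sketch of this selection step, assuming the detector returns a per-character probability of correctness as above (the 0.5 threshold is an assumption):

```python
def character_error_rate(correct_probs, threshold=0.5):
    """Fraction of characters whose correctness probability falls below the threshold."""
    wrong = sum(1 for p in correct_probs if p < threshold)
    return wrong / max(len(correct_probs), 1)

def pick_target_text(candidates, detector):
    # candidates: list of sentences; detector(sentence) -> per-character P(correct)
    return min(candidates, key=lambda s: character_error_rate(detector(s)))
```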
The embodiment of the present application provides a possible implementation manner, and further, the method further includes:
acquiring user information of a voice to be recognized;
at least two speech recognition models are determined from the plurality of candidate speech recognition models based on the user information to obtain a target speech recognition model set.
The embodiment of the present application provides a possible implementation manner, and specifically, the label construction of the training sample of the pre-trained character error detector includes:
acquiring a training text sample, and replacing characters in the training text sample with a certain probability;
the label corresponding to a position where the character is not changed is set to 1, and the label of a position where the character is changed is set to 0.
Specifically, training of the character error detector is achieved through data augmentation, i.e., input sentences and their labels (0/1 sequences) are constructed programmatically. The input sentences can be obtained by collecting a large amount of web text; for each sentence, the Chinese characters or phrases in it are replaced with a certain probability. For an unchanged position the corresponding output value is 1, otherwise it is 0. In this way a large training corpus and its labels are constructed.
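A minimal sketch of this label construction (the replacement probability and the substitute vocabulary are illustrative assumptions):

```python
import random

def make_training_pair(sentence, vocab, replace_prob=0.15):
    """Corrupt a clean sentence and emit its 0/1 label sequence:
    1 = character left unchanged, 0 = character replaced."""
    corrupted, labels = [], []
    for ch in sentence:
        if random.random() < replace_prob:
            corrupted.append(random.choice(vocab))  # substitute a random character
            labels.append(0)
        else:
            corrupted.append(ch)
            labels.append(1)
    return "".join(corrupted), labels

# Example (vocabulary and probability are illustrative only):
vocab = list("的一是不了人我在有他")
print(make_training_pair("今天天气很好", vocab))
```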
Illustratively, as an example that does not use the technical solution of the present application: according to the prior art, the same speech may be output as different sentences by two different language models, as shown in Table 1:
TABLE 1 (reproduced as an image, Figure BDA0002777086510000101, in the original publication; content not recoverable)
Taking the first group of sentences as an example, it is difficult for an algorithm to determine which of sentences A and B the user really intends. The conventional method is to obtain a basic sentence through a general basic language model, then determine from that sentence the context the user is in (here, it would need to determine that the user's context relates to a pharmacy), and finally combine the pre-trained language model for that context with the output of the speech model (AM) to obtain the transcription result. The embodiment of the present application takes a different approach: it traverses all the language models and selects the most suitable sentence among their outputs as the final transcription result, thereby resolving the ambiguity problem in speech recognition. A mismatched language model tends to produce sentences with misused vocabulary or phrasing that does not match common usage, which the character error detector can identify.
Example two
Fig. 2 shows a speech recognition apparatus according to an embodiment of the present application. The apparatus 20 includes: a speech recognition module 201, a detection module 202, and a determining module 203, wherein,
the voice recognition module 201 is configured to perform voice recognition on a to-be-recognized voice based on a voice recognition model in the target voice recognition model set to obtain a candidate text set;
the detection module 202 is configured to perform character error detection on candidate texts in the candidate text set based on a pre-trained character error detector to obtain a character error detection result of each candidate text;
and the determining module 203 is configured to determine a target text for performing speech recognition on the speech to be recognized based on the character error detection result of each candidate text.
In particular, the target set of speech recognition models comprises speech recognition models in at least two different contexts.
Specifically, the speech recognition module includes:
and the voice recognition unit is used for carrying out voice recognition on the voice to be recognized based on any voice recognition model in the target voice recognition model set to obtain at least one candidate text.
Specifically, the detection module includes:
the input unit is used for inputting the candidate text into a pre-trained Transformer network to obtain a state vector of the candidate text;
and the classification unit is used for taking the state vector of the candidate text as the input of the recurrent neural network, whose output is passed through a fully connected network to the classifier to obtain the probability value that each character is correct.
Specifically, the determining module includes:
a determining unit, configured to determine a character error rate of each candidate text;
and the unit is used for taking the candidate text with the lowest character error rate as the target text for performing voice recognition on the voice to be recognized.
Specifically, the apparatus further comprises:
the first acquisition module is used for acquiring user information of the voice to be recognized;
and the screening and determining module is used for screening and determining at least two speech recognition models from the candidate speech recognition models based on the user information to obtain a target speech recognition model set.
Specifically, the apparatus further comprises:
the second acquisition module is used for acquiring the training text sample and replacing characters in the training text sample with a certain probability;
and the setting module is used for setting the label corresponding to the position where the character is not changed to be 1 and setting the label corresponding to the position where the character is changed to be 0.
Compared with the prior art, in which a language model is determined through context detection and the voice recognition result is then determined according to that language model, the voice recognition device of the present application performs voice recognition on the voice to be recognized based on the voice recognition models in a target voice recognition model set to obtain a candidate text set, performs character error detection on the candidate texts in the candidate text set based on a pre-trained character error detector to obtain the character error detection result of each candidate text, and then determines the target text of the voice to be recognized based on those detection results. By recognizing the voice to be recognized with a plurality of voice recognition models to obtain a plurality of candidate texts and determining the final target text from the character error detector's results on those candidates, context detection is not needed, and an accurate recognition result can be obtained even when the context is ambiguous.
The apparatus of the embodiment of the present application can execute the method shown in the first embodiment of the present application, and the implementation effect is similar, which is not described herein again.
Example three
An embodiment of the present application provides an electronic device. As shown in fig. 4, the electronic device 40 includes: a processor 401 and a memory 403, wherein the processor 401 is coupled to the memory 403, for example via a bus 402. Further, the electronic device 40 may also include a transceiver 404. It should be noted that, in practical applications, the number of transceivers 404 is not limited to one, and the structure of the electronic device 40 does not limit the embodiments of the present application. The processor 401 is used in the embodiment of the present application to implement the functions of the modules shown in fig. 2. The transceiver 404 includes a receiver and a transmitter.
The processor 401 may be a CPU, general purpose processor, DSP, ASIC, FPGA or other programmable logic device, transistor logic device, hardware component, or any combination thereof. Which may implement or perform the various illustrative logical blocks, modules, and circuits described in connection with the disclosure. The processor 401 may also be a combination of computing functions, e.g., comprising one or more microprocessors, a combination of a DSP and a microprocessor, or the like.
Bus 402 may include a path that transfers information between the above components. The bus 402 may be a PCI bus or an EISA bus, etc. The bus 402 may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one thick line is shown in FIG. 4, but this does not indicate only one bus or one type of bus.
The memory 403 may be, but is not limited to, a ROM or other type of static storage device that can store static information and instructions, a RAM or other type of dynamic storage device that can store information and instructions, an EEPROM, a CD-ROM or other optical disk storage, optical disk storage (including compact disk, laser disk, optical disk, digital versatile disk, blu-ray disk, etc.), magnetic disk storage media or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer.
The memory 403 is used for storing application program codes for executing the scheme of the application, and the execution is controlled by the processor 401. The processor 401 is configured to execute application program code stored in the memory 403 to implement the functions of the apparatus provided by the embodiment shown in fig. 2.
The embodiment of the present application provides an electronic device suitable for the above method embodiment, and specific implementation manners and technical effects are not described herein again.
Example four
The embodiment of the present application provides a computer-readable storage medium, on which a computer program is stored, and the program, when executed by a processor, implements the speech recognition method shown in the above embodiment.
The embodiment of the present application provides a computer-readable storage medium suitable for the above method embodiment, and specific implementation manners and technical effects are not described herein again.
It should be understood that, although the steps in the flowcharts of the figures are shown in an order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated herein, their execution is not strictly ordered and they may be performed in other orders. Moreover, at least a portion of the steps in the flowcharts may include multiple sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times, and need not be performed in sequence but may be performed in turn or alternately with other steps or with at least a portion of the sub-steps or stages of other steps.
The foregoing describes only some embodiments of the present application. It should be noted that those skilled in the art can make various improvements and refinements without departing from the principle of the present application, and these improvements and refinements shall also fall within the protection scope of the present application.

Claims (10)

1. A speech recognition method, comprising:
performing voice recognition on the voice to be recognized based on the voice recognition model in the target voice recognition model set to obtain a candidate text set;
performing character error detection on the candidate texts in the candidate text set based on a pre-trained character error detector to obtain a character error detection result of each candidate text;
and determining a target text for performing voice recognition on the voice to be recognized based on the character error detection result of each candidate text.
2. The method of claim 1, wherein the set of target speech recognition models comprises speech recognition models in at least two different contexts.
3. The method of claim 1, wherein performing speech recognition on the speech to be recognized based on the speech recognition model in the target speech recognition model set to obtain a candidate text set comprises:
and carrying out voice recognition on the voice to be recognized based on any voice recognition model in the target voice recognition model set to obtain at least one candidate text.
4. The method of claim 1, wherein the pre-training based character error detector performs character error detection on candidate texts in the candidate text set, comprising:
inputting the candidate text into a pre-trained Transformer network to obtain a state vector of the candidate text;
and taking the state vector of the candidate text as the input of a recurrent neural network, whose output is passed through a fully connected network to a classifier to obtain the probability value that each character is correct.
5. The method of claim 1, wherein determining the target text for performing speech recognition on the speech to be recognized based on the character error detection result of each candidate text comprises:
determining the character error rate of each candidate text;
and taking the candidate text with the lowest character error rate as a target text for performing voice recognition on the voice to be recognized.
6. The method of claim 1, further comprising:
acquiring user information of the voice to be recognized;
determining at least two speech recognition models from a plurality of candidate speech recognition models based on the user information to obtain a target speech recognition model set.
7. The method of any of claims 1-6, wherein label building of the pre-trained character error detector training samples comprises:
acquiring a training text sample, and replacing characters in the training text sample with a certain probability;
the label corresponding to a position where the character is not changed is set to 1, and the label of a position where the character is changed is set to 0.
8. A speech recognition apparatus, comprising:
the voice recognition module is used for carrying out voice recognition on the voice to be recognized based on the voice recognition model in the target voice recognition model set to obtain a candidate text set;
the detection module is used for carrying out character error detection on the candidate texts in the candidate text set based on a pre-trained character error detector to obtain a character error detection result of each candidate text;
and the determining module is used for determining a target text for performing voice recognition on the voice to be recognized based on the character error detection result of each candidate text.
9. An electronic device, comprising:
one or more processors;
a memory;
one or more applications, wherein the one or more applications are stored in the memory and configured to be executed by the one or more processors, the one or more programs being configured to perform the speech recognition method according to any one of claims 1 to 7.
10. A computer-readable storage medium for storing computer instructions which, when executed on a computer, cause the computer to perform the speech recognition method of any of claims 1 to 7.
CN202011268976.9A (priority date: 2020-11-13; filing date: 2020-11-13) Voice recognition method and device, electronic equipment and readable storage medium. Status: Pending. Publication: CN112509565A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011268976.9A CN112509565A (en) 2020-11-13 2020-11-13 Voice recognition method and device, electronic equipment and readable storage medium


Publications (1)

Publication Number Publication Date
CN112509565A (en) 2021-03-16

Family

ID: 74957522

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011268976.9A Pending CN112509565A (en) 2020-11-13 2020-11-13 Voice recognition method and device, electronic equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN112509565A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114898733A (en) * 2022-05-06 2022-08-12 深圳妙月科技有限公司 AI voice data analysis processing method and system

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018120889A1 (en) * 2016-12-28 2018-07-05 平安科技(深圳)有限公司 Input sentence error correction method and device, electronic device, and medium
CN108874174A (en) * 2018-05-29 2018-11-23 腾讯科技(深圳)有限公司 A kind of text error correction method, device and relevant device
CN108984745A (en) * 2018-07-16 2018-12-11 福州大学 A kind of neural network file classification method merging more knowledge mappings
CN110148416A (en) * 2019-04-23 2019-08-20 腾讯科技(深圳)有限公司 Audio recognition method, device, equipment and storage medium
CN110457688A (en) * 2019-07-23 2019-11-15 广州视源电子科技股份有限公司 Correction processing method and device, storage medium and processor
CN110491383A (en) * 2019-09-25 2019-11-22 北京声智科技有限公司 A kind of voice interactive method, device, system, storage medium and processor
CN110765996A (en) * 2019-10-21 2020-02-07 北京百度网讯科技有限公司 Text information processing method and device
CN111723791A (en) * 2020-06-11 2020-09-29 腾讯科技(深圳)有限公司 Character error correction method, device, equipment and storage medium
CN111883110A (en) * 2020-07-30 2020-11-03 上海携旅信息技术有限公司 Acoustic model training method, system, device and medium for speech recognition
CN111918136A (en) * 2020-07-04 2020-11-10 中信银行股份有限公司 Interest analysis method and device, storage medium and electronic equipment



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210316