CN112542162B - Speech recognition method, device, electronic equipment and readable storage medium - Google Patents

Speech recognition method, device, electronic equipment and readable storage medium

Info

Publication number
CN112542162B
CN112542162B (application number CN202011402934.XA)
Authority
CN
China
Prior art keywords
candidate sentence
probability
candidate
determining
list
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011402934.XA
Other languages
Chinese (zh)
Other versions
CN112542162A (en)
Inventor
赖勇铨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Citic Bank Corp Ltd
Original Assignee
China Citic Bank Corp Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Citic Bank Corp Ltd
Priority to CN202011402934.XA
Publication of CN112542162A
Application granted
Publication of CN112542162B
Legal status: Active

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/18 Speech classification or search using natural language modelling
    • G10L15/183 Speech classification or search using natural language modelling using context dependencies, e.g. language models
    • G10L15/19 Grammatical context, e.g. disambiguation of the recognition hypotheses based on word sequence rules
    • G10L15/197 Probabilistic grammars, e.g. word n-grams
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/16 Speech classification or search using artificial neural networks
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L2015/081 Search algorithms, e.g. Baum-Welch or Viterbi
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Probability & Statistics with Applications (AREA)
  • Evolutionary Computation (AREA)
  • Machine Translation (AREA)

Abstract

The application provides a speech recognition method, a speech recognition device, an electronic device and a readable storage medium, applied to the technical field of speech recognition. A pre-trained mask-based neural network model breaks through the limitation of the n-gram model and can exploit the context information of the whole sentence, so that the category and probability of the character at each position in a candidate sentence are obtained more accurately; from these, the probability of each candidate sentence produced by beam search is determined and the candidate sentences are reordered, making the speech recognition result more accurate.

Description

Speech recognition method, device, electronic equipment and readable storage medium
Technical Field
The present application relates to the field of speech recognition technologies, and in particular, to a speech recognition method, a device, an electronic apparatus, and a readable storage medium.
Background
Beam search is a breadth-first heuristic search algorithm used in path searching. Suppose there are three nodes, each of which can take one of the values a, b and c; the possible paths then include aaa, aab, aac and so on, 27 in total. To save computation and storage, the beam search algorithm keeps, at each expansion, a candidate list with a maximum capacity of w, commonly referred to as the beam width.
In the example above, let w = 2, i.e. after each search step the list retains only the two most probable paths. A complete beam search then proceeds as follows. First, the candidates a, b and c are ranked and the two with the highest probability are selected, say b and c, arranged from high to low and written into the list. The second step considers the 6 extensions ba, bb, bc, ca, cb, cc, from which the two with the highest probability are selected and arranged from high to low, say bc and ca, and the list is updated. The third step considers the 6 extensions bca, bcb, bcc, caa, cab, cac, from which the two with the highest probability are selected and arranged from high to low, say caa and cac. The search then ends, and caa and cac are output as the final beam search result.
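A minimal sketch of the procedure just described (the scoring function score is a hypothetical stand-in for whatever model supplies the step probabilities):

```python
def beam_search(symbols, score, n_steps, w=2):
    """Breadth-first search that, after each expansion step, keeps only
    the w most probable partial paths (w is the beam width)."""
    beam = [("", 1.0)]  # (path, probability); start from the empty path
    for _ in range(n_steps):
        # Expand every surviving path by every possible next symbol.
        candidates = [(path + s, p * score(path, s))
                      for path, p in beam for s in symbols]
        # Keep the w highest-probability candidates, ordered high to low.
        beam = sorted(candidates, key=lambda c: c[1], reverse=True)[:w]
    return beam

# Example: with symbols "abc", 3 steps and w=2, the search ends with a
# two-element list such as [("caa", ...), ("cac", ...)], as above.
```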
The probabilities of the combinations involved in the above calculation can be obtained from an n-gram language model. Taking a 2-gram as an example, the frequency of each 2-character combination is usually counted over a large corpus to estimate the probability of that combination. Assuming a vocabulary of three characters a, b and c, the 2-gram model obtains, by counting a large text corpus, probability values for the following combinations:
a, b, c, aa, ab, ac, ba, bb, bc, ca, cb, cc. Probability calculation during the search is then performed by table lookup; for example, the probability of the combination abc is decomposed into the probabilities of ab and bc, which are looked up and multiplied.
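A sketch of this table-lookup computation (the probability values below are made up for illustration):

```python
# Hypothetical 2-gram table, counted in advance from a large corpus.
bigram_prob = {"aa": 0.1, "ab": 0.4, "ac": 0.1,
               "ba": 0.2, "bb": 0.1, "bc": 0.5,
               "ca": 0.6, "cb": 0.1, "cc": 0.1}

def sequence_prob(seq):
    """Probability of a sequence under the 2-gram model: decompose it
    into adjacent pairs, look each pair up, and multiply."""
    p = 1.0
    for first, second in zip(seq, seq[1:]):
        p *= bigram_prob.get(first + second, 1e-9)  # floor unseen pairs
    return p

print(sequence_prob("abc"))  # P(ab) * P(bc) = 0.4 * 0.5 = 0.2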
Beam search improves the speech recognition result by means of an n-gram language model, and the n-gram is implemented by table lookup. In practical applications, for an audio input, beam search outputs a list of sentences, each sentence in the list representing a possible transcription result. The sentences in the list are ordered by probability from high to low, where each probability is a weighted combination of the acoustic model score and the n-gram language model score. The n-gram language model is a local model: its advantage is efficiency, but it cannot capture long-range context. It is difficult for an n-gram model to model long sentences, the context information of the whole sentence cannot be exploited, and its understanding of context is often not accurate enough.
Disclosure of Invention
The application provides a speech recognition method, a speech recognition device, an electronic device and a readable storage medium, which break through the limitation of the n-gram model by exploiting the context information of the whole sentence, so that the category and probability of the character at each position in a candidate sentence can be obtained more accurately, the probability of each candidate sentence produced by beam search can be determined, and the candidate sentences can be reordered, making the speech recognition result more accurate.
The technical scheme adopted by the application is as follows:
In a first aspect, a speech recognition method is provided, comprising:
acquiring a candidate sentence list obtained by performing speech recognition on target audio with a beam search method, wherein the candidate sentence list comprises a plurality of candidate sentences;
determining the probability of each candidate sentence in the candidate sentence list, which comprises: determining the probability of each character in a candidate sentence based on a pre-trained mask-based neural network model, and determining the probability of the candidate sentence based on the probabilities of its characters;
and reordering the candidate sentences in the candidate sentence list based on the determined probabilities, to obtain a reordered target candidate sentence list.
Optionally, determining the probability of each character in a candidate sentence based on the pre-trained mask-based neural network model comprises:
determining the character category and probability at each position based on the pre-trained mask-based neural network model;
determining the probability of each character in the candidate sentence based on the character category and probability at each position.
Determining the probability of the candidate sentence based on the probabilities of its characters comprises:
taking the product of the probability values of the characters at all positions in the candidate sentence as the probability of the candidate sentence.
Optionally, determining the character category and probability at each position based on the pre-trained mask-based neural network model comprises:
erasing the character at a position by masking, to obtain the candidate sentence with the character at that position erased;
and inputting the candidate sentence with the character erased into the pre-trained mask-based neural network model, to obtain the character category and probability at that position.
Optionally, the last layer of the pre-trained mask-based neural network model is a softmax activation function used to classify the character corresponding to the masked position.
Optionally, the pre-trained mask-based neural network model is a time-series-based neural network model.
Optionally, the method further comprises:
and taking the target candidate sentence with the highest probability value in the reordered target candidate sentence list as the speech recognition result of the target audio.
In a second aspect, there is provided a speech recognition apparatus comprising:
an acquisition module, configured to acquire a candidate sentence list obtained by performing speech recognition on target audio with a beam search method, wherein the candidate sentence list comprises a plurality of candidate sentences;
a determining module, configured to determine the probability of each candidate sentence in the candidate sentence list; the determining module is specifically configured to determine the probability of each character in a candidate sentence based on the pre-trained mask-based neural network model, and to determine the probability of the candidate sentence based on the probabilities of its characters;
and a reordering module, configured to reorder the candidate sentences in the candidate sentence list based on the determined probabilities, to obtain a reordered target candidate sentence list.
Optionally, the determining module includes:
a first determining unit, configured to determine the character category and probability at each position based on the pre-trained mask-based neural network model;
a second determining unit, configured to determine the probability of each character in a candidate sentence based on the character category and probability at each position;
and a third unit, configured to take the product of the probability values of the characters at all positions in the candidate sentence as the probability of the candidate sentence.
Optionally, the first determining unit is specifically configured to erase the character at a position by masking, to obtain the candidate sentence with the character at that position erased, and to input that candidate sentence into the pre-trained mask-based neural network model to obtain the character category and probability at that position.
Optionally, the last layer of the pre-trained mask-based neural network model is a softmax activation function used to classify the character corresponding to the masked position.
Optionally, the pre-trained mask-based neural network model is a time-series-based neural network model.
Optionally, the apparatus further comprises a module configured to take the target candidate sentence with the highest probability value in the reordered target candidate sentence list as the speech recognition result of the target audio.
In a third aspect, an electronic device is provided, the electronic device comprising:
one or more processors;
a memory;
one or more applications, wherein the one or more applications are stored in the memory and configured to be executed by the one or more processors to perform the speech recognition method of the first aspect.
In a fourth aspect, there is provided a computer readable storage medium storing computer instructions that, when run on a computer, cause the computer to perform the speech recognition method of the first aspect.
Compared with the prior art, in which beam search performs speech recognition with an n-gram language model that has difficulty modeling long sentences, cannot exploit the context information of the whole sentence, and often understands context inaccurately, the method, device, electronic device and readable storage medium of the present application acquire a candidate sentence list obtained by performing speech recognition on target audio with a beam search method, the candidate sentence list comprising a plurality of candidate sentences; determine the probability of each candidate sentence in the list, namely by determining the probability of each character in a candidate sentence with a pre-trained mask-based neural network model and determining the probability of the candidate sentence from the probabilities of its characters; and reorder the candidate sentences in the list based on the determined probabilities, to obtain a reordered target candidate sentence list. The pre-trained mask-based neural network model breaks through the limitation of the n-gram model and can exploit the context information of the whole sentence, so that the category and probability of the character at each position in a candidate sentence are obtained more accurately, the probability of each candidate sentence can be determined, the candidate sentences can be reordered, and the speech recognition result is more accurate.
Additional aspects and advantages of the application will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the application.
Drawings
The foregoing and/or additional aspects and advantages of the present application will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings, in which:
Fig. 1 is a schematic flowchart of a speech recognition method according to an embodiment of the present application;
Fig. 2 is a schematic structural diagram of a speech recognition device according to an embodiment of the present application;
Fig. 3 is a diagram illustrating an example of probability determination according to an embodiment of the present application;
Fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
Embodiments of the present application are described in detail below, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar elements or elements having like or similar functions throughout. The embodiments described below by referring to the drawings are exemplary only for the purpose of illustrating the present application and are not to be construed as limiting the present application.
As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless expressly stated otherwise, as understood by those skilled in the art. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. The term "and/or" as used herein includes all or any element and all combination of one or more of the associated listed items.
For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
The following describes the technical solutions of the present application and how the technical solutions of the present application solve the above technical problems in detail with specific embodiments. The following embodiments may be combined with each other, and the same or similar concepts or processes may not be described in detail in some embodiments. Embodiments of the present application will be described below with reference to the accompanying drawings.
Example 1
The embodiment of the application provides a speech recognition method which, as shown in Fig. 1, may include the following steps:
Step S101, acquiring a candidate sentence list obtained by performing speech recognition on target audio with a beam search method, wherein the candidate sentence list comprises a plurality of candidate sentences;
specifically, in practical applications, for a target audio input, the bundle search outputs a list of sentences, each sentence in the list representing a possible transcription result. The sentences of the list are ordered from high to low according to probabilities, and the values of the probabilities are obtained through probability weighting of the acoustic model and the n-gram language model respectively. The n-gram language model belongs to a local model, and has the advantages of high efficiency and the disadvantage of being incapable of realizing longer context understanding. The disadvantage is that the n-gram model is difficult to model long sentences, context information of the whole sentence cannot be utilized, and the understanding of the context is often not accurate enough.
For example, consider the spoken sentence "It suddenly rained today, but I forgot to take an umbrella when I went out." If the last characters of the recording are unclear because of noise or other reasons, the beam search algorithm may produce the following output list:
"It suddenly rained today, but I have no three"
"It suddenly rained today, but I have no shed"
"It suddenly rained today, but I have no umbrella"
The output sentences differ only in their final characters, which in the original Chinese are near-homophones of the word for "umbrella" (伞). For an n-gram language model to disambiguate them, it would need the context of the word "rain" near the beginning of the sentence, roughly 10 characters away from "umbrella"; that is, a language model of order at least 10 would be required, which is clearly impractical (ordinary n-gram language models reach order 5 at most). How to break through this limitation of the n-gram model thus becomes the problem. The present application uses a pre-trained mask-based neural network language model to reorder, and where necessary correct, the list of beam search outputs described above.
Step S102, determining the probability of each candidate sentence in the candidate sentence list, which comprises: determining the probability of each character in a candidate sentence based on a pre-trained mask-based neural network model, and determining the probability of the candidate sentence based on the probabilities of its characters;
That is, the probability of each character in a candidate sentence is determined by the pre-trained mask-based neural network model, which breaks through the limitation of the n-gram model, namely that it cannot exploit the context information of the whole sentence and that its understanding of context is often inaccurate.
Specifically, the pre-trained mask-based neural network model of the present application may be a masked language model (Mask LM), or another model that implements the functions of the present application. The training of a Mask LM resembles a fill-in-the-blank exercise over text: an input sentence I containing blanks is processed by the deep network, which outputs a prediction P, and the goal is for P to be as close as possible to the true sentence T. As an example (the original is a Chinese cloze; blanks are shown as [ ]):
I: If [ ], you only need to right-[ ] each drive letter and select "[ ]"
P: If so, you only need to right-click each drive letter and select "Format"
T: If so, you only need to right-click each drive letter and select "Format"
Through training, the Mask LM learns to predict the character at any position. The Mask LM network is not limited to time-series models such as RNN/GRU/LSTM or attention-modified versions of them; it also includes Transformer network structures such as BERT and GPT.
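For illustration, such a masked prediction can be obtained with an off-the-shelf masked language model; the library and checkpoint below (the transformers fill-mask pipeline with bert-base-chinese) are assumptions for this sketch, since the patent does not name a specific implementation:

```python
from transformers import pipeline

# Assumed setup: a Chinese BERT checkpoint exposed through the
# fill-mask pipeline; any trained Mask LM would serve the same role.
fill_mask = pipeline("fill-mask", model="bert-base-chinese")

# Erase one position with the model's mask token and read back the
# candidate characters for that position together with probabilities.
for prediction in fill_mask("我爱[MASK]国"):
    print(prediction["token_str"], prediction["score"])
# Expected: plausible characters for the blank, e.g. 中 with a
# probability close to 1 and alternatives with much lower scores.
```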
Step S103, based on the probability of each candidate sentence in the determined candidate sentence list, reordering each candidate sentence in the candidate sentence list to obtain a reordered target candidate sentence list.
Continuing the example, the probabilities of the three candidates above ("...but I have no three", "...but I have no shed", "...but I have no umbrella") are computed in this way, and the sentences are then reordered by probability value, so that the resulting target candidate sentence list is more accurate.
The embodiment of the application provides a possible implementation. Specifically, determining the probability of each character in a candidate sentence based on the pre-trained mask-based neural network model comprises the following steps:
determining the character category and probability at each position based on the pre-trained mask-based neural network model. Specifically, the character at a position is erased by masking, giving the candidate sentence with the character at that position erased; this sentence is input into the pre-trained mask-based neural network model, which outputs the character category and probability at that position. The last layer of the model is a softmax activation function that classifies the character corresponding to the masked position. Illustratively, as in Fig. 3, for the input "I love - country" (我爱-国), the "-" position may correspond to several candidate characters, such as 中 (giving "I love China"), each with its own probability;
determining the probability of each character in the candidate sentence based on the character category and probability at each position.
Determining the probability of the candidate sentence based on the probabilities of its characters comprises:
taking the product of the probability values of the characters at all positions in the candidate sentence as the probability of the candidate sentence.
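Putting these steps together, the per-sentence scoring can be sketched as follows (score_position is a hypothetical helper wrapping the mask LM's softmax output; it is not named in the patent):

```python
def sentence_probability(sentence, score_position):
    """Mask each position in turn, take the model's probability for the
    character actually at that position, and multiply the scores."""
    p = 1.0
    for i, ch in enumerate(sentence):
        masked = sentence[:i] + "-" + sentence[i + 1:]  # erase position i
        p *= score_position(masked, i, ch)  # P_i for the actual character
    return p
```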
The embodiment of the application provides a possible implementation. Specifically, the pre-trained mask-based neural network model is a time-series-based neural network model.
Specifically, each character in the sentence is erased in turn by masking, i.e., replaced with "-" (or another invisible symbol). The Mask LM then predicts all possible characters at that position (including the actual character) and their probability values; the probability assigned to the actual character at that position is taken as the score for that position in the sentence. The final total score is the product of the scores (probability values) over all positions.
Illustratively, to calculate the probability of the sentence 穿上还是很洋气的 (roughly, "it still looks quite stylish when worn"), the following decomposition is performed, masking one position at a time (the masked character is shown in brackets):
[穿]上还是很洋气的: 0.1
穿[上]还是很洋气的: 0.2
穿上[还]是很洋气的: 0.3
穿上还[是]很洋气的: 0.6
穿上还是[很]洋气的: 0.7
穿上还是很[洋]气的: 0.2
穿上还是很洋[气]的: 0.3
穿上还是很洋气[的]: 0.5
For each position of the sentence, the probability value of the character there is calculated (the number after the colon), and finally all the values are multiplied to obtain the overall probability of the sentence. Writing P(S) for the probability of sentence S:
P(S) = P1 × P2 × … × Pn
where n is the length of the sentence, i is a position, and Pi is the probability value of the character at position i, predicted by the language model from the characters at the other positions. In the example above, P(S) = 0.1 × 0.2 × 0.3 × 0.6 × 0.7 × 0.2 × 0.3 × 0.5 = 7.56 × 10^-5.
Fig. 3 shows how Pi is calculated for the sentence 我爱中国 ("I love China"): the probability of the character 中 at the masked position is 0.99, i.e. the probability value output by the softmax of the last layer. If the candidate sentence instead had 你 ("you") at that position, its probability value would be 0.0001.
Sentence reordering: assuming the sentence list obtained by beam search is S1, S2, S3, …, Sm, the probabilities P(S1), P(S2), …, P(Sm) are calculated as above.
Finally, the sentences are sorted by probability value from largest to smallest, to obtain the reordered sentence list.
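The reordering itself is then a single sort over the computed scores (a sketch reusing the hypothetical sentence_probability above):

```python
def rerank(candidates, probability):
    # Sort beam-search candidates by mask-LM probability, largest first;
    # the head of the reordered list is taken as the recognition result.
    return sorted(candidates, key=probability, reverse=True)
```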
The embodiment of the application provides a possible implementation manner, and specifically, the method further comprises the following steps:
and taking the target candidate sentence with the highest probability value in the reordered target candidate sentence list as the speech recognition result of the target audio.
Compared with the prior art, in which beam search performs speech recognition with an n-gram language model that has difficulty modeling long sentences, cannot exploit the context information of the whole sentence, and often understands context inaccurately, the method acquires a candidate sentence list obtained by performing speech recognition on target audio with a beam search method, the candidate sentence list comprising a plurality of candidate sentences; determines the probability of each candidate sentence in the list by determining, with a pre-trained mask-based neural network model, the probability of each character in a candidate sentence and then the probability of the sentence itself; and reorders the candidate sentences in the list based on the determined probabilities, to obtain a reordered target candidate sentence list. The pre-trained mask-based neural network model breaks through the limitation of the n-gram model and can exploit the context information of the whole sentence, so that the category and probability of the character at each position in a candidate sentence are obtained more accurately, the probability of each candidate sentence can be determined, the candidate sentences can be reordered, and the speech recognition result is more accurate.
Example two
Fig. 2 is a voice recognition device according to an embodiment of the present application, where the device 20 includes: an acquisition module 201, a determination module 202, a reordering module 203, wherein,
an acquisition module 201, configured to acquire a candidate sentence list obtained by performing speech recognition on target audio with a beam search method, where the candidate sentence list includes a plurality of candidate sentences;
a determining module 202, configured to determine the probability of each candidate sentence in the candidate sentence list; the determining module is specifically configured to determine the probability of each character in a candidate sentence based on the pre-trained mask-based neural network model, and to determine the probability of the candidate sentence based on the probabilities of its characters;
and a reordering module 203, configured to reorder the candidate sentences in the candidate sentence list based on the determined probabilities, to obtain a reordered target candidate sentence list.
Optionally, the determining module includes:
a first determining unit, configured to determine the character category and probability at each position based on the pre-trained mask-based neural network model;
a second determining unit, configured to determine the probability of each character in a candidate sentence based on the character category and probability at each position;
and a third unit, configured to take the product of the probability values of the characters at all positions in the candidate sentence as the probability of the candidate sentence.
Optionally, the first determining unit is specifically configured to erase the character at a position by masking, to obtain the candidate sentence with the character at that position erased, and to input that candidate sentence into the pre-trained mask-based neural network model to obtain the character category and probability at that position.
Optionally, the last layer of the pre-trained mask-based neural network model is a softmax activation function used to classify the character corresponding to the masked position.
Optionally, the pre-trained mask-based neural network model is a time-series-based neural network model.
Optionally, the apparatus 20 further includes a module configured to take the target candidate sentence with the highest probability value in the reordered target candidate sentence list as the speech recognition result of the target audio.
Compared with the prior art, in which beam search performs speech recognition with an n-gram language model that has difficulty modeling long sentences, cannot exploit the context information of the whole sentence, and often understands context inaccurately, the speech recognition device acquires a candidate sentence list obtained by performing speech recognition on target audio with a beam search method, the candidate sentence list comprising a plurality of candidate sentences; determines the probability of each candidate sentence in the list by determining, with a pre-trained mask-based neural network model, the probability of each character in a candidate sentence and then the probability of the sentence itself; and reorders the candidate sentences in the list based on the determined probabilities, to obtain a reordered target candidate sentence list. The pre-trained mask-based neural network model breaks through the limitation of the n-gram model and can exploit the context information of the whole sentence, so that the category and probability of the character at each position in a candidate sentence are obtained more accurately, the probability of each candidate sentence can be determined, the candidate sentences can be reordered, and the speech recognition result is more accurate.
The apparatus of the embodiment of the present application may perform the method shown in the first embodiment of the present application, and the effect achieved by the method is similar, and will not be described herein.
Example III
The embodiment of the application provides an electronic device. As shown in Fig. 4, the electronic device 40 includes a processor 401 and a memory 403. The processor 401 is connected to the memory 403, for example via a bus 402. Further, the electronic device 40 may also include a transceiver 404. It should be noted that in practical applications the transceiver 404 is not limited to one, and the structure of the electronic device 40 does not limit the embodiments of the present application. The processor 401 is used in this embodiment to implement the functions of the modules shown in Fig. 2. The transceiver 404 includes a receiver and a transmitter.
The processor 401 may be a CPU, general purpose processor, DSP, ASIC, FPGA or other programmable logic device, transistor logic device, hardware components, or any combination thereof. Which may implement or perform the various exemplary logic blocks, modules, and circuits described in connection with this disclosure. Processor 401 may also be a combination that implements computing functionality, such as a combination comprising one or more microprocessors, a combination of a DSP and a microprocessor, or the like.
The bus 402 may include a path for transferring information between the components. The bus 402 may be a PCI bus, an EISA bus, or the like, and may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, only one thick line is shown in Fig. 4, but this does not mean that there is only one bus or only one type of bus.
The memory 403 may be, but is not limited to, a ROM or other type of static storage device that can store static information and instructions, a RAM or other type of dynamic storage device that can store information and instructions, an EEPROM, a CD-ROM or other optical disk storage, optical disk storage (including compact disks, laser disks, optical disks, digital versatile disks, blu-ray disks, etc.), magnetic disk storage media or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer.
The memory 403 is used for storing application program codes for executing the present application and is controlled to be executed by the processor 401. The processor 401 is arranged to execute application code stored in the memory 403 to implement the functions of the apparatus provided by the embodiment shown in fig. 2.
The embodiment of the application provides an electronic device suitable for the embodiment of the method, and specific implementation manner and technical effects are not described herein.
Example IV
The present embodiment provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the speech recognition method shown in the above embodiment.
The embodiment of the present application provides a computer readable storage medium suitable for the above method embodiment, and specific implementation manner and technical effects are not described herein.
It should be understood that, although the steps in the flowcharts of the figures are shown in order as indicated by the arrows, these steps are not necessarily performed in order as indicated by the arrows. The steps are not strictly limited in order and may be performed in other orders, unless explicitly stated herein. Moreover, at least some of the steps in the flowcharts of the figures may include a plurality of sub-steps or stages that are not necessarily performed at the same time, but may be performed at different times, the order of their execution not necessarily being sequential, but may be performed in turn or alternately with other steps or at least a portion of the other steps or stages.
The foregoing is only a partial embodiment of the present application and it should be noted that, for a person skilled in the art, several improvements and modifications can be made without departing from the principle of the present application, and these improvements and modifications should also be considered as the protection scope of the present application.

Claims (7)

1. A speech recognition method, comprising:
acquiring a candidate sentence list obtained by performing speech recognition on target audio with a beam search method, wherein the candidate sentence list comprises a plurality of candidate sentences;
determining the probability of each candidate sentence in the candidate sentence list, wherein the determining comprises: determining the probability of each character in a candidate sentence based on a pre-trained mask-based neural network model, and determining the probability of the candidate sentence based on the probabilities of its characters;
reordering the candidate sentences in the candidate sentence list based on the determined probabilities, to obtain a reordered target candidate sentence list;
wherein determining the probability of each character in a candidate sentence with the pre-trained mask-based neural network model comprises:
determining the character category and probability at each position based on the pre-trained mask-based neural network model, comprising: erasing the character at a position by masking, to obtain the candidate sentence with the character at that position erased; and inputting that candidate sentence into the pre-trained mask-based neural network model, to obtain the character category and probability at that position;
determining the probability of each character in the candidate sentence based on the character category and probability at each position;
and wherein determining the probability of the candidate sentence based on the probabilities of its characters comprises:
taking the product of the probability values of the characters at all positions in the candidate sentence as the probability of the candidate sentence.
2. The method of claim 1, wherein the last layer of the pre-trained mask-based neural network model is a softmax activation function used to classify the character corresponding to the masked position.
3. The method of any one of claims 1-2, wherein the pre-trained mask-based neural network model is a time-series-based neural network model.
4. A method according to claim 3, characterized in that the method further comprises:
and taking the target candidate sentence with the highest probability value in the reordered target candidate sentence list as a voice recognition result of the target audio.
5. A speech recognition device, comprising:
an acquisition module, configured to acquire a candidate sentence list obtained by performing speech recognition on target audio with a beam search method, wherein the candidate sentence list comprises a plurality of candidate sentences;
a determining module, configured to determine the probability of each candidate sentence in the candidate sentence list; the determining module is specifically configured to determine the probability of each character in a candidate sentence based on a pre-trained mask-based neural network model, and to determine the probability of the candidate sentence based on the probabilities of its characters; the determining module comprises: a first determining unit, configured to determine the character category and probability at each position based on the pre-trained mask-based neural network model; a second determining unit, configured to determine the probability of each character in the candidate sentence based on the character category and probability at each position; and a unit configured to take the product of the probability values of the characters at all positions in the candidate sentence as the probability of the candidate sentence;
and a reordering module, configured to reorder the candidate sentences in the candidate sentence list based on the determined probabilities, to obtain a reordered target candidate sentence list.
6. An electronic device, comprising:
one or more processors;
a memory;
one or more applications, wherein the one or more applications are stored in the memory and configured to be executed by the one or more processors to perform the speech recognition method of any one of claims 1 to 4.
7. A computer readable storage medium storing computer instructions which, when run on a computer, cause the computer to perform the speech recognition method of any one of claims 1 to 4.
CN202011402934.XA 2020-12-04 2020-12-04 Speech recognition method, device, electronic equipment and readable storage medium Active CN112542162B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011402934.XA CN112542162B (en) 2020-12-04 2020-12-04 Speech recognition method, device, electronic equipment and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011402934.XA CN112542162B (en) 2020-12-04 2020-12-04 Speech recognition method, device, electronic equipment and readable storage medium

Publications (2)

Publication Number Publication Date
CN112542162A (en) 2021-03-23
CN112542162B (en) 2023-07-21

Family

ID=75015789

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011402934.XA Active CN112542162B (en) 2020-12-04 2020-12-04 Speech recognition method, device, electronic equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN112542162B (en)

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2011075602A (en) * 2009-09-29 2011-04-14 Brother Industries Ltd Device, method and program for speech recognition
JP6727607B2 (en) * 2016-06-09 2020-07-22 国立研究開発法人情報通信研究機構 Speech recognition device and computer program
CN110517693B (en) * 2019-08-01 2022-03-04 出门问问(苏州)信息科技有限公司 Speech recognition method, speech recognition device, electronic equipment and computer-readable storage medium
CN111145729B (en) * 2019-12-23 2022-10-28 厦门快商通科技股份有限公司 Speech recognition model training method, system, mobile terminal and storage medium
CN112017645B (en) * 2020-08-31 2024-04-26 广州市百果园信息技术有限公司 Voice recognition method and device
CN111933129B (en) * 2020-09-11 2021-01-05 腾讯科技(深圳)有限公司 Audio processing method, language model training method and device and computer equipment

Also Published As

Publication number Publication date
CN112542162A (en) 2021-03-23

Similar Documents

Publication Publication Date Title
JP4594551B2 (en) Document image decoding method using integrated probabilistic language model
CN101661462B (en) Four-layer structure Chinese text regularized system and realization thereof
EP1619620A1 (en) Adaptation of Exponential Models
CN110580335A (en) user intention determination method and device
JP2011503638A (en) Improvement of free conversation command classification for car navigation system
EP1178466B1 (en) Recognition system using lexical trees
CN103854643A (en) Method and apparatus for speech synthesis
CN111414757B (en) Text recognition method and device
US8019593B2 (en) Method and apparatus for generating features through logical and functional operations
CN113642316B (en) Chinese text error correction method and device, electronic equipment and storage medium
US6507815B1 (en) Speech recognition apparatus and method
CN111444719A (en) Entity identification method and device and computing equipment
CN112185361A (en) Speech recognition model training method and device, electronic equipment and storage medium
CN111428487B (en) Model training method, lyric generation method, device, electronic equipment and medium
CN112542162B (en) Speech recognition method, device, electronic equipment and readable storage medium
CN117112916A (en) Financial information query method, device and storage medium based on Internet of vehicles
JP6220762B2 (en) Next utterance candidate scoring device, method, and program
CN113268571A (en) Method, device, equipment and medium for determining correct answer position in paragraph
CN110708619B (en) Word vector training method and device for intelligent equipment
CN112927695A (en) Voice recognition method, device, equipment and storage medium
CN115035890B (en) Training method and device of voice recognition model, electronic equipment and storage medium
CN116187304A (en) Automatic text error correction algorithm and system based on improved BERT
JP2017167938A (en) Learning device, learning method, and program
CN113450805B (en) Automatic speech recognition method and device based on neural network and readable storage medium
JP6558856B2 (en) Morphological analyzer, model learning device, and program

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant