CN112542162B - Speech recognition method, device, electronic equipment and readable storage medium - Google Patents
- Publication number
- CN112542162B CN112542162B CN202011402934.XA CN202011402934A CN112542162B CN 112542162 B CN112542162 B CN 112542162B CN 202011402934 A CN202011402934 A CN 202011402934A CN 112542162 B CN112542162 B CN 112542162B
- Authority
- CN
- China
- Prior art keywords
- candidate sentence
- probability
- candidate
- determining
- list
- Prior art date
- Legal status: Active (assumed, not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/18—Speech classification or search using natural language modelling
- G10L15/183—Speech classification or search using natural language modelling using context dependencies, e.g. language models
- G10L15/19—Grammatical context, e.g. disambiguation of the recognition hypotheses based on word sequence rules
- G10L15/197—Probabilistic grammars, e.g. word n-grams
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/16—Speech classification or search using artificial neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L2015/081—Search algorithms, e.g. Baum-Welch or Viterbi
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The application provides a speech recognition method, apparatus, electronic device and readable storage medium, applied to the technical field of speech recognition. In the method, a pre-trained mask-based neural network model breaks through the limitation of the n-gram model and can use the context information of the whole sentence, so that the category and probability of the character at each position in a candidate sentence are obtained more accurately; the probability of each candidate sentence produced by beam search is then determined, and the candidate sentences are reordered, making the speech recognition result more accurate.
Description
Technical Field
The present application relates to the field of speech recognition technologies, and in particular, to a speech recognition method, a device, an electronic apparatus, and a readable storage medium.
Background
Beam search is a breadth-first heuristic search algorithm used in path searching. Suppose three nodes are available, each of which may take one of the values a, b or c; then the possible paths include aaa, aab, aac and so on, 27 in total. For efficiency and storage space, the beam search algorithm expands breadth-first while maintaining a candidate list with a maximum capacity w, commonly referred to as the beam width.
For the above problem, let w=2, i.e. after each search step the list retains only the two most probable paths. A complete beam search then proceeds as follows. First, rank a, b and c, keep the two single-node hypotheses with the highest probability, assumed to be b and c, arrange them from high to low, and store them in the list. Second, consider the following 6 expansions: ba, bb, bc, ca, cb, cc; select the two with the highest probability, assumed to be bc and ca, arrange them from high to low, and update the list. Third, consider the following 6 expansions: bca, bcb, bcc, caa, cab, cac; select the two with the highest probability, assumed to be caa and cac. The search then ends, and caa and cac are output as the final beam search result.
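The pruning loop above can be sketched in a few lines of Python. The per-step probability tables below are illustrative numbers, not the (unstated) probabilities of the patent's walkthrough, so the surviving paths differ from the example above.

```python
import math

def beam_search(step_probs, beam_width=2):
    """Keep only the beam_width most probable partial paths after each step."""
    beams = [("", 0.0)]  # (partial path, cumulative log-probability)
    for probs in step_probs:
        # Expand every surviving path by every possible next token ...
        candidates = [
            (path + tok, logp + math.log(p))
            for path, logp in beams
            for tok, p in probs.items()
        ]
        # ... then prune back down to the beam width.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams

# Three steps over the alphabet {a, b, c} with made-up per-step probabilities.
steps = [
    {"a": 0.1, "b": 0.5, "c": 0.4},
    {"a": 0.3, "b": 0.1, "c": 0.6},
    {"a": 0.6, "b": 0.1, "c": 0.3},
]
result = beam_search(steps, beam_width=2)
paths = [path for path, _ in result]
```

Only 2 of the 9 (and later 27) possible expansions survive each pruning step, which is exactly the storage saving the beam width buys.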
The probabilities of the combinations involved in the above calculation can be obtained from an n-gram language model. Taking a 2-gram as an example, the frequencies of 2-word combinations are usually counted over a large corpus to represent the probability of each combination. Assuming a vocabulary of three words a, b and c, the 2-gram model obtains probability values for the following combinations by counting a large text corpus:
a, b, c, aa, ab, ac, ba, bb, bc, ca, cb, cc. Probability calculation during the search is then performed by table lookup; for example, the probability of the combination abc is decomposed into the probabilities of ab and bc, which are multiplied together.
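As a sketch of that table lookup, with a tiny made-up character corpus standing in for the large corpora the text mentions, and the chain decomposed as P(a)·P(b|a)·P(c|b):

```python
from collections import Counter

# A toy character corpus over {a, b, c}; a real system counts a large text corpus.
corpus = list("abcabbcbcab")

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def cond_prob(x, y):
    # P(y | x) estimated as count(xy) / count(x); no smoothing in this sketch.
    return bigrams[(x, y)] / unigrams[x]

def sequence_prob(seq):
    # P(abc) decomposed as P(a) * P(b|a) * P(c|b), each factor a table lookup.
    p = unigrams[seq[0]] / len(corpus)
    for x, y in zip(seq, seq[1:]):
        p *= cond_prob(x, y)
    return p

p_abc = sequence_prob("abc")
```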
Beam search improves the speech recognition result through the n-gram language model, and the n-gram is implemented by table lookup. In practical applications, for an audio input, beam search outputs a list of sentences, each sentence in the list representing a possible transcription result. The sentences in the list are ordered from high to low probability, where each probability value is obtained by weighting the probabilities of the acoustic model and the n-gram language model. The n-gram language model is a local model: its advantage is efficiency, and its disadvantage is that it cannot achieve longer context understanding. In particular, the n-gram model has difficulty modeling long sentences, cannot use the context information of the whole sentence, and its understanding of the context is often not accurate enough.
Disclosure of Invention
The application provides a speech recognition method, apparatus, electronic device and readable storage medium, which break through the limitation of the n-gram model and can use the context information of the whole sentence, so as to more accurately obtain the category and probability of the character at each position in the candidate sentences, further determine the probability of each candidate sentence produced by beam search and reorder the candidate sentences, making the speech recognition result more accurate.
The technical scheme adopted by the application is as follows:
In a first aspect, a speech recognition method is provided, comprising:
acquiring a candidate sentence list obtained by performing speech recognition on target audio based on a beam search method, wherein the candidate sentence list comprises a plurality of candidate sentences;
determining the probability of each candidate sentence in the candidate sentence list; determining the probability of each candidate sentence in the candidate sentence list comprises the following steps: determining the probability of each word in any candidate sentence based on a pre-trained mask-based neural network model, and determining the probability of any candidate sentence based on the probability of each word in any candidate sentence;
and reordering each candidate sentence in the candidate sentence list based on the determined probability of each candidate sentence in the candidate sentence list to obtain a reordered target candidate sentence list.
Optionally, determining the probability of each word occurrence in any candidate sentence based on the pre-trained mask-based neural network model includes:
determining word categories and probabilities of all positions based on a pre-trained mask-based neural network model;
determining the probability of each word in any candidate sentence based on the word category and the probability of each position;
determining the probability of any candidate sentence based on the probability of each word occurrence in the candidate sentence, comprising:
and taking the product of the probability values of the characters at all positions in the candidate sentence as the probability of the candidate sentence.
Optionally, calculating the text category and probability for each location based on the pre-trained mask-based neural network model includes:
erasing the character at any position by masking, to obtain the candidate sentence with the character at that position erased;
and inputting the candidate sentence with the character at that position erased into the pre-trained mask-based neural network model, to obtain the character category and probability at that position.
Optionally, the last layer of the pre-trained mask-based neural network model is a softmax activation function for classifying the character corresponding to the masked position.
Optionally, the pre-trained mask-based neural network model is a time series based neural network model.
Optionally, the method further comprises:
and taking the target candidate sentence with the highest probability value in the reordered target candidate sentence list as the speech recognition result of the target audio.
In a second aspect, there is provided a speech recognition apparatus comprising:
the acquisition module is used for acquiring a candidate sentence list obtained by performing speech recognition on the target audio based on the beam search method, wherein the candidate sentence list comprises a plurality of candidate sentences;
the determining module is used for determining the probability of each candidate sentence in the candidate sentence list; the determining module is specifically used for determining the probability of each word in any candidate sentence based on the pre-trained mask-based neural network model, and determining the probability of any candidate sentence based on the probability of each word in any candidate sentence;
and the reordering module is used for reordering each candidate sentence in the candidate sentence list based on the determined probability of each candidate sentence in the candidate sentence list, so as to obtain a reordered target candidate sentence list.
Optionally, the determining module includes:
a first determining unit, configured to determine a text category and a probability of each location based on a pre-trained mask-based neural network model;
the second determining unit is used for determining the probability of each character in any candidate sentence based on the character category and the probability of each position;
and the unit is used for taking the probability value products of the characters at all the positions in the candidate sentences as the probability of any candidate sentence.
Optionally, the first determining unit is specifically configured to erase the character at any position by masking, to obtain the candidate sentence with the character at that position erased; and to input the candidate sentence with the character at that position erased into the pre-trained mask-based neural network model, to obtain the character category and probability at that position.
Optionally, the last layer of the pre-trained mask-based neural network model is a softmax activation function for classifying the character corresponding to the masked position.
Optionally, the pre-trained mask-based neural network model is a time series based neural network model.
Optionally, the apparatus further comprises: a module for taking the target candidate sentence with the highest probability value in the reordered target candidate sentence list as the speech recognition result of the target audio.
In a third aspect, an electronic device is provided, the electronic device comprising:
one or more processors;
a memory;
one or more applications, wherein the one or more applications are stored in the memory and configured to be executed by the one or more processors, the one or more applications being configured to perform the speech recognition method of the first aspect.
In a fourth aspect, there is provided a computer readable storage medium storing computer instructions that, when run on a computer, cause the computer to perform the speech recognition method of the first aspect.
Compared with the prior art, in which beam search performs speech recognition through an n-gram language model that has difficulty modeling long sentences, cannot use the context information of the whole sentence, and often understands the context inaccurately, the present application acquires a candidate sentence list obtained by performing speech recognition on target audio based on a beam search method, wherein the candidate sentence list comprises a plurality of candidate sentences; determines the probability of each candidate sentence in the candidate sentence list, which comprises determining the probability of each character in a candidate sentence based on a pre-trained mask-based neural network model and determining the probability of the candidate sentence based on those character probabilities; and reorders the candidate sentences in the list based on the determined probabilities to obtain a reordered target candidate sentence list. The pre-trained mask-based neural network model breaks through the limitation of the n-gram model and can use the context information of the whole sentence, so the category and probability of the character at each position in a candidate sentence are obtained more accurately, the probability of each candidate sentence is determined, the candidate sentences are reordered, and the speech recognition result is more accurate.
Additional aspects and advantages of the application will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the application.
Drawings
The foregoing and/or additional aspects and advantages of the present application will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings, in which:
fig. 1 is a schematic flow chart of a speech recognition method according to an embodiment of the present application;
fig. 2 is a schematic structural diagram of a speech recognition apparatus according to an embodiment of the present application;
fig. 3 is a diagram illustrating an example of probability determination according to an embodiment of the present application;
fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
Embodiments of the present application are described in detail below, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar elements or elements having like or similar functions throughout. The embodiments described below by referring to the drawings are exemplary only for the purpose of illustrating the present application and are not to be construed as limiting the present application.
As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless expressly stated otherwise, as understood by those skilled in the art. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. The term "and/or" as used herein includes any and all combinations of one or more of the associated listed items.
For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
The following describes the technical solutions of the present application and how the technical solutions of the present application solve the above technical problems in detail with specific embodiments. The following embodiments may be combined with each other, and the same or similar concepts or processes may not be described in detail in some embodiments. Embodiments of the present application will be described below with reference to the accompanying drawings.
Example 1
The embodiment of the application provides a speech recognition method, as shown in fig. 1, which may include the following steps:
step S101, acquiring a candidate sentence list obtained by performing speech recognition on target audio based on a beam search method, wherein the candidate sentence list comprises a plurality of candidate sentences;
specifically, in practical applications, for a target audio input, the bundle search outputs a list of sentences, each sentence in the list representing a possible transcription result. The sentences of the list are ordered from high to low according to probabilities, and the values of the probabilities are obtained through probability weighting of the acoustic model and the n-gram language model respectively. The n-gram language model belongs to a local model, and has the advantages of high efficiency and the disadvantage of being incapable of realizing longer context understanding. The disadvantage is that the n-gram model is difficult to model long sentences, context information of the whole sentence cannot be utilized, and the understanding of the context is often not accurate enough.
For example, for the spoken sentence "it suddenly rained today, but I forgot to bring an umbrella", and assuming the last two characters of the recording are not clear enough due to noise or other reasons, the beam search algorithm may produce the following output list (in the original Chinese, the final characters are near-homophones of "umbrella"):
"it suddenly rained today, but I didn't bring 'three'"
"it suddenly rained today, but I didn't bring 'scatter'"
"it suddenly rained today, but I didn't bring an umbrella"
…
The output sentences in the list differ only in the last two characters. Because about 6 characters separate "umbrella" from "rain", an n-gram language model that wanted to exploit this contextual relation would need a context length of at least 10 (covering both "rain" and "umbrella"), that is, a 10th-order language model, which is clearly unrealistic (a normal n-gram language model reaches at most 5th order). How to break through this limitation of the n-gram model thus becomes the problem. The application therefore uses a pre-trained mask-based neural network language model to reorder, and where necessary correct, the beam search output list described above.
Step S102, determining the probability of each candidate sentence in the candidate sentence list; determining the probability of each candidate sentence in the candidate sentence list comprises the following steps: determining the probability of each word in any candidate sentence based on a pre-trained mask-based neural network model, and determining the probability of any candidate sentence based on the probability of each word in any candidate sentence;
that is, the probability of each word in any candidate sentence is determined through the pre-trained mask-based neural network model, so that the limitation that the n-gram model cannot utilize the context information of the whole sentence and the understanding of the context is usually inaccurate is broken through.
Specifically, the pre-trained mask-based neural network model of the present application may be a masked language model (Mask LM), or another model that implements the functions of the present application. The training of a Mask LM is similar to a fill-in-the-blank (cloze) exercise: an input sentence I containing blanks is processed through the deep network, which outputs a prediction P, and the goal is for P to be as close as possible to the true sentence T. As an example (rendered approximately from the original Chinese, with masked characters shown as [ ]):
I: if [ ], you only need to right-[ ]click each drive letter and select "[ ]ormat[ ]"
P: if yes, you only need to right-click each drive letter and select "format"
T: if yes, you only need to right-click each drive letter and select "format"
Through training, the Mask LM can predict the character at any position. The implementation of Mask LM networks is not limited to time-series models such as RNN/GRU/LSTM or their attention-augmented variants; it also includes Transformer network structures (BERT, GPT).
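A toy stand-in illustrates the interface such a model exposes: given a sentence with one position masked, return a distribution over the words that could fill it. The lookup-by-matching "model" and its training sentences below are invented for illustration; a real Mask LM is a trained RNN/LSTM or Transformer as described above.

```python
from collections import Counter

# Invented miniature "training set" for the toy model.
train = [
    "i love new york",
    "i love new york",
    "i love new cars",
]

def mask_lm(tokens, masked):
    """Return P(word at position `masked` | all other positions).

    The toy "model" matches the unmasked context against the training
    sentences and normalizes the counts of what it sees at the masked
    slot, standing in for a neural network's softmax output.
    """
    fillers = [
        words[masked]
        for words in (s.split() for s in train)
        if len(words) == len(tokens)
        and all(words[j] == tokens[j] for j in range(len(tokens)) if j != masked)
    ]
    counts = Counter(fillers)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()} if total else {}

# Mask the last word of "i love new york".
dist = mask_lm(["i", "love", "new", "york"], masked=3)
```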
Step S103, based on the probability of each candidate sentence in the determined candidate sentence list, reordering each candidate sentence in the candidate sentence list to obtain a reordered target candidate sentence list.
For example, the probabilities of the candidate sentences above ("…but I didn't bring 'three'", "…but I didn't bring 'scatter'", "…but I didn't bring an umbrella") are obtained respectively, and the sentences can then be reordered according to the size of the probability values, so that the resulting target candidate sentence list is more accurate.
The embodiment of the application provides a possible implementation manner, specifically, determining the probability of each word occurrence in any candidate sentence based on a pre-trained mask-based neural network model, which comprises the following steps:
determining word categories and probabilities of all positions based on a pre-trained mask-based neural network model; specifically, the characters at any position are wiped off in a mask mode, and any candidate sentence with the characters at any position wiped off is obtained; and inputting any candidate sentence which is erased by the characters at any position into a pre-trained mask-based neural network model to obtain the character category and probability at any position. Specifically, the last layer of the pre-trained mask-based neural network model is a softmax activation function for classifying text corresponding to the location of mask smear. Illustratively, as in fig. 3, i love-state, "-" location correspondence words may be "middle, united, large.
Determining the probability of each word in any candidate sentence based on the word category and the probability of each position;
determining the probability of any candidate sentence based on the probability of each word occurrence in the candidate sentence, comprising:
and taking the probability value products of the characters at all the positions in the candidate sentences as the probability of any candidate sentence.
The embodiment of the application provides a possible implementation manner, and particularly, the pre-trained mask-based neural network model is a neural network model based on a time sequence.
Specifically, each character in the sentence is erased in turn by masking, i.e., replaced with "-" (or another placeholder symbol). The Mask LM predicts all possible characters for that position (including the actual character) and their probability values, and finally the probability corresponding to the actual character at that position is taken as the score of that position in the sentence. The final total score is the product of the score values (probability values) over all positions.
Illustratively, to calculate the probability of the example sentence (8 characters in the original Chinese, roughly "wearing it still looks very stylish"), each position is masked in turn and the probability of the true character at that position is recorded:
position 1 masked: 0.1
position 2 masked: 0.2
position 3 masked: 0.3
position 4 masked: 0.6
position 5 masked: 0.7
position 6 masked: 0.2
position 7 masked: 0.3
position 8 masked: 0.5
For each position of the sentence, the probability value of the character is calculated (the number after the colon above), and finally all values are multiplied to obtain the final probability value of the sentence. Denoting the probability value of a sentence s by P(s):
P(s) = P1 × P2 × … × Pn
wherein n is the length of the sentence, i is a position, the probability value of the character corresponding to position i is Pi, and the value of Pi is obtained by the language model predicting from the characters at the other positions.
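The product above shrinks quickly as sentences grow, so an implementation typically sums log-probabilities rather than multiplying raw values. A sketch, using the per-position scores listed above and a hypothetical `prob_at` hook standing in for the Mask LM:

```python
import math

def sentence_score(tokens, prob_at):
    """P(s) as the product of per-position Mask-LM probabilities.

    prob_at(tokens, i) returns the probability the model assigns to the
    actual character at position i given all the other positions.
    Log-probabilities are summed to avoid floating-point underflow.
    """
    log_p = sum(math.log(prob_at(tokens, i)) for i in range(len(tokens)))
    return math.exp(log_p)

# The eight per-position probabilities from the example above.
per_position = [0.1, 0.2, 0.3, 0.6, 0.7, 0.2, 0.3, 0.5]
score = sentence_score(list(range(8)), lambda toks, i: per_position[i])
```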
Fig. 3 shows how Pi is calculated for the sentence "I love China": the probability of the character at the masked position is 0.99, i.e. the probability value output by the softmax of the last layer. If the sentence were "I love you" instead, the probability value of "you" at that position would be 0.0001.
Sentence reordering: for the sentence list S1, S2, S3, … Sm obtained by beam search, the probabilities P(S1), P(S2), … P(Sm) are calculated respectively.
And finally, sorting the sentences from large to small according to the probability value to obtain the reordered sentences.
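Steps S102 and S103 together amount to scoring each beam search candidate and sorting. A sketch with hypothetical candidate strings and scores standing in for P(S1) … P(Sm):

```python
def rerank(candidates, score):
    """Sort candidate sentences by Mask-LM probability, highest first.

    The top entry of the returned list is taken as the final
    speech recognition result.
    """
    return sorted(candidates, key=score, reverse=True)

# Hypothetical candidate transcriptions and their sentence probabilities.
scores = {
    "...but I didn't bring 'three'": 0.02,
    "...but I didn't bring 'scatter'": 0.01,
    "...but I didn't bring an umbrella": 0.35,
}
ranked = rerank(list(scores), scores.get)
best = ranked[0]
```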
The embodiment of the application provides a possible implementation manner, and specifically, the method further comprises the following steps:
and taking the target candidate sentence with the highest probability value in the reordered target candidate sentence list as a voice recognition result of the target audio.
Compared with the prior art, in which beam search performs speech recognition through an n-gram language model that has difficulty modeling long sentences, cannot use the context information of the whole sentence, and often understands the context inaccurately, the method acquires a candidate sentence list obtained by performing speech recognition on target audio based on a beam search method, wherein the candidate sentence list comprises a plurality of candidate sentences; determines the probability of each candidate sentence in the candidate sentence list, which comprises determining the probability of each character in a candidate sentence based on a pre-trained mask-based neural network model and determining the probability of the candidate sentence based on those character probabilities; and reorders the candidate sentences in the list based on the determined probabilities to obtain a reordered target candidate sentence list. The pre-trained mask-based neural network model breaks through the limitation of the n-gram model and can use the context information of the whole sentence, so the category and probability of the character at each position in a candidate sentence are obtained more accurately, the probability of each candidate sentence is determined, the candidate sentences are reordered, and the speech recognition result is more accurate.
Example two
Fig. 2 is a voice recognition device according to an embodiment of the present application, where the device 20 includes: an acquisition module 201, a determination module 202, a reordering module 203, wherein,
an obtaining module 201, configured to obtain a candidate sentence list obtained by performing speech recognition on a target audio based on a beam search method, where the candidate sentence list includes a plurality of candidate sentences;
a determining module 202, configured to determine a probability of each candidate sentence in the candidate sentence list; the determining module is specifically used for determining the probability of each word in any candidate sentence based on the pre-trained mask-based neural network model, and determining the probability of any candidate sentence based on the probability of each word in any candidate sentence;
and the reordering module 203 is configured to reorder each candidate sentence in the candidate sentence list based on the determined probability of each candidate sentence in the candidate sentence list, so as to obtain a reordered target candidate sentence list.
Optionally, the determining module includes:
a first determining unit, configured to determine a text category and a probability of each location based on a pre-trained mask-based neural network model;
the second determining unit is used for determining the probability of each character in any candidate sentence based on the character category and the probability of each position;
and the unit is used for taking the probability value products of the characters at all the positions in the candidate sentences as the probability of any candidate sentence.
Optionally, the first determining unit is specifically configured to erase the word at any position by masking, so as to obtain the candidate sentence with the word at that position erased; and to input the candidate sentence with the erased word into the pre-trained mask-based neural network model to obtain the word category and probability at that position.
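The mask-erase step can be illustrated as below. The `[MASK]` placeholder token is an assumption (BERT-style models use this literal token; the embodiment does not name one):

```python
MASK = "[MASK]"  # assumed placeholder token for the erased position

def mask_position(tokens, i):
    """Return a copy of the candidate sentence with the word at
    position i erased, i.e. the model input described above."""
    masked = list(tokens)
    masked[i] = MASK
    return masked

def all_masked_variants(tokens):
    """One masked copy per position: each copy is fed to the model to
    obtain the word category and probability at that position."""
    return [mask_position(tokens, i) for i in range(len(tokens))]
```

Each candidate sentence of n words thus produces n masked inputs, one per position to be scored.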
Optionally, the last layer of the pre-trained mask-based neural network model is a softmax activation function, used to classify the word corresponding to the position erased by the mask.
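Such a softmax output layer can be written as follows (a generic sketch of the activation, not the specific network of the embodiment):

```python
import math

def softmax(logits):
    """Map the final-layer scores to a probability distribution over the
    vocabulary for the masked position."""
    m = max(logits)                          # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]
```

The index of the largest output probability gives the predicted word category at the erased position, and the probability assigned to the word that actually appears there is the per-position score used in the rescoring.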
Optionally, the pre-trained mask-based neural network model is a time-series-based neural network model.
Optionally, the apparatus 20 further includes a module configured to take the target candidate sentence with the highest probability value in the reordered target candidate sentence list as the speech recognition result of the target audio.
Compared with the prior art, in which beam search performs speech recognition through an n-gram language model, long sentences are difficult to model, the context information of the whole sentence cannot be utilized, and the understanding of the context is generally inaccurate. The speech recognition apparatus of this embodiment obtains a candidate sentence list produced by performing speech recognition on target audio based on a beam search method, where the candidate sentence list includes a plurality of candidate sentences; determines the probability of each candidate sentence in the candidate sentence list, which includes determining the probability of each word in any candidate sentence based on a pre-trained mask-based neural network model and determining the probability of that candidate sentence based on the probabilities of its words; and reorders the candidate sentences in the candidate sentence list based on the determined probabilities to obtain a reordered target candidate sentence list. By means of the pre-trained mask-based neural network model, the apparatus breaks through the limitations of the n-gram model and can utilize the context information of the whole sentence, so that the category and probability of the word at each position in a candidate sentence can be obtained more accurately, the probability of each candidate sentence can be determined, the candidate sentences can be reordered, and the speech recognition result is more accurate.
The apparatus of this embodiment of the present application may perform the method shown in the first embodiment; the achieved effects are similar and will not be repeated here.
Embodiment Three
The embodiment of the application provides an electronic device. As shown in fig. 4, the electronic device 40 includes: a processor 401 and a memory 403, where the processor 401 is connected to the memory 403, for example via a bus 402. Further, the electronic device 40 may also include a transceiver 404. It should be noted that in practical applications the transceiver 404 is not limited to one, and the structure of the electronic device 40 does not limit the embodiments of the present application. The processor 401 is used in this embodiment to implement the functions of the modules shown in fig. 2. The transceiver 404 includes a receiver and a transmitter.
The processor 401 may be a CPU, a general-purpose processor, a DSP, an ASIC, an FPGA or other programmable logic device, a transistor logic device, a hardware component, or any combination thereof. It may implement or perform the various exemplary logic blocks, modules, and circuits described in connection with this disclosure. The processor 401 may also be a combination that implements computing functionality, such as a combination of one or more microprocessors, or a combination of a DSP and a microprocessor.
The bus 402 may include a path to transfer information between the components. The bus 402 may be a PCI bus, an EISA bus, or the like, and may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, only one thick line is shown in fig. 4, but this does not mean there is only one bus or one type of bus.
The memory 403 may be, but is not limited to, a ROM or other type of static storage device that can store static information and instructions, a RAM or other type of dynamic storage device that can store information and instructions, an EEPROM, a CD-ROM or other optical disc storage (including compact discs, laser discs, digital versatile discs, Blu-ray discs, etc.), magnetic disk storage media or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer.
The memory 403 is used for storing the application program code for executing the present application, and its execution is controlled by the processor 401. The processor 401 is configured to execute the application code stored in the memory 403 to implement the functions of the apparatus provided by the embodiment shown in fig. 2.
This embodiment of the application provides an electronic device applicable to the above method embodiment; the specific implementation and technical effects are similar and will not be repeated here.
Embodiment Four
The present embodiment provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the speech recognition method shown in the above embodiment.
This embodiment of the present application provides a computer-readable storage medium applicable to the above method embodiment; the specific implementation and technical effects will not be repeated here.
It should be understood that, although the steps in the flowcharts of the figures are shown in the order indicated by the arrows, these steps are not necessarily performed in that order. Unless explicitly stated herein, the order of the steps is not strictly limited, and they may be performed in other orders. Moreover, at least some of the steps in the flowcharts may include a plurality of sub-steps or stages that are not necessarily performed at the same time but may be performed at different times, and their order of execution is not necessarily sequential; they may be performed in turn or alternately with other steps or with at least a portion of the sub-steps or stages of other steps.
The foregoing describes only some embodiments of the present application. It should be noted that a person skilled in the art can make several improvements and modifications without departing from the principle of the present application, and these improvements and modifications should also be considered within the protection scope of the present application.
Claims (7)
1. A method of speech recognition, comprising:
acquiring a candidate sentence list obtained by performing speech recognition on target audio based on a beam search method, wherein the candidate sentence list comprises a plurality of candidate sentences;
determining the probability of each candidate sentence in the candidate sentence list; the determining the probability of each candidate sentence in the candidate sentence list comprises the following steps: determining the probability of each word in any candidate sentence based on a pre-trained mask-based neural network model, and determining the probability of any candidate sentence based on the probability of each word in the any candidate sentence;
based on the determined probability of each candidate sentence in the candidate sentence list, reordering each candidate sentence in the candidate sentence list to obtain a reordered target candidate sentence list;
wherein the determining, based on the pre-trained mask-based neural network model, the probability of each word in any candidate sentence comprises the following steps:
determining the word category and probability of each position based on the pre-trained mask-based neural network model, comprising: erasing the word at any position by masking to obtain the candidate sentence with the word at that position erased; and inputting the candidate sentence with the erased word into the pre-trained mask-based neural network model to obtain the word category and probability at that position;
determining the occurrence probability of each word in the any candidate sentence based on the word category and probability of each position;
wherein the determining the probability of the any candidate sentence based on the probability of each word in the any candidate sentence comprises:
taking the product of the probability values of the words at all positions in the any candidate sentence as the probability of the any candidate sentence.
2. The method of claim 1, wherein the last layer of the pre-trained mask-based neural network model is a softmax activation function for classifying the word corresponding to the position erased by the mask.
3. The method of any of claims 1-2, wherein the pre-trained mask-based neural network model is a time series-based neural network model.
4. A method according to claim 3, characterized in that the method further comprises:
and taking the target candidate sentence with the highest probability value in the reordered target candidate sentence list as a voice recognition result of the target audio.
5. A speech recognition apparatus, comprising:
an acquisition module, configured to acquire a candidate sentence list obtained by performing speech recognition on target audio based on a beam search method, wherein the candidate sentence list comprises a plurality of candidate sentences;
a determining module, configured to determine the probability of each candidate sentence in the candidate sentence list; the determining module is specifically configured to determine, based on a pre-trained mask-based neural network model, the occurrence probability of each word in any candidate sentence, and to determine the probability of the any candidate sentence based on the occurrence probabilities of its words; the determining module includes: a first determining unit, configured to determine a word category and probability for each position based on the pre-trained mask-based neural network model; a second determining unit, configured to determine the occurrence probability of each word in any candidate sentence based on the word category and probability of each position; and a third determining unit, configured to take the product of the probability values of the words at all positions in the candidate sentence as the probability of the any candidate sentence;
and the reordering module is used for reordering each candidate sentence in the candidate sentence list based on the determined probability of each candidate sentence in the candidate sentence list, so as to obtain a reordered target candidate sentence list.
6. An electronic device, comprising:
one or more processors;
a memory;
one or more applications, wherein the one or more applications are stored in the memory and configured to be executed by the one or more processors to perform the speech recognition method according to any one of claims 1 to 4.
7. A computer readable storage medium storing computer instructions which, when run on a computer, cause the computer to perform the speech recognition method of any one of claims 1 to 4.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011402934.XA CN112542162B (en) | 2020-12-04 | 2020-12-04 | Speech recognition method, device, electronic equipment and readable storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112542162A CN112542162A (en) | 2021-03-23 |
CN112542162B true CN112542162B (en) | 2023-07-21 |
Family
ID=75015789
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011402934.XA Active CN112542162B (en) | 2020-12-04 | 2020-12-04 | Speech recognition method, device, electronic equipment and readable storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112542162B (en) |
Family Cites Families (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2011075602A (en) * | 2009-09-29 | 2011-04-14 | Brother Industries Ltd | Device, method and program for speech recognition |
JP6727607B2 (en) * | 2016-06-09 | 2020-07-22 | 国立研究開発法人情報通信研究機構 | Speech recognition device and computer program |
CN110517693B (en) * | 2019-08-01 | 2022-03-04 | 出门问问(苏州)信息科技有限公司 | Speech recognition method, speech recognition device, electronic equipment and computer-readable storage medium |
CN111145729B (en) * | 2019-12-23 | 2022-10-28 | 厦门快商通科技股份有限公司 | Speech recognition model training method, system, mobile terminal and storage medium |
CN112017645B (en) * | 2020-08-31 | 2024-04-26 | 广州市百果园信息技术有限公司 | Voice recognition method and device |
CN111933129B (en) * | 2020-09-11 | 2021-01-05 | 腾讯科技(深圳)有限公司 | Audio processing method, language model training method and device and computer equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||