CN114360510A - Voice recognition method and related device - Google Patents
- Publication number
- CN114360510A (application number CN202210042387.1A)
- Authority
- CN
- China
- Prior art keywords
- voice
- speech
- frame
- data
- sample
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Abstract
The embodiments of the present application disclose a voice recognition method and a related apparatus, relating at least to speech recognition technology within artificial intelligence. Voice data to be recognized is used as input data of a time-delay neural network in an acoustic model. When syllables are identified through the output layer, the syllable information before and after a speech frame can be combined, and the syllable of the speech frame can be judged with the assistance of pronunciation rules, so that a more accurate syllable probability distribution is output. Moreover, because a syllable is generally composed of one or more phonemes, the method has higher fault tolerance: it not only obtains a more accurate speech recognition result based on the syllable probability distribution, but also places low requirements on the quality of the voice data to be recognized, effectively expanding the application scenarios of speech recognition technology.
Description
Technical Field
The present application relates to the field of speech recognition, and in particular, to a speech recognition method and related apparatus.
Background
A voice content recognition service can be provided to users through speech recognition technology, and the technology can be applied to various scenarios, such as speech-to-text, voice wake-up, human-computer interaction, and the like. In a specific implementation, the acoustic features of the speech data to be recognized may be extracted through an acoustic model, and the corresponding speech recognition result may be determined based on the acoustic features.
The related art mainly uses the phoneme (phone) as the modeling unit of the acoustic model. A phoneme is the smallest speech unit divided according to the natural attributes of speech; it is analyzed according to the pronunciation actions within a syllable, and one action constitutes one phoneme.
However, the modeling granularity of phonemes is fine. This fine-grained speech recognition approach places high requirements on the quality of the speech data to be recognized, and even slight pronunciation errors may directly affect the recognition result, making it difficult for speech recognition technology to adapt to some speech recognition scenarios.
Disclosure of Invention
In order to solve the above technical problems, the present application provides a speech recognition method and a related apparatus, which are used to improve the accuracy of a speech recognition result and expand the use scenario of a speech recognition technology.
In one aspect, an embodiment of the present application provides a speech recognition method, where the method includes:
acquiring an acoustic model and voice data to be recognized, wherein the acoustic model comprises a time delay neural network, and an output layer of the time delay neural network comprises acoustic modeling units respectively corresponding to a plurality of syllables;
using the voice data as input data of the time delay neural network, and determining syllable probability distribution corresponding to the voice frame included in the voice data through the time delay neural network, wherein the syllable probability distribution is used for identifying the probability that the voice frame corresponds to each of the plurality of syllables;
and determining a voice recognition result corresponding to the voice data according to the syllable probability distribution.
On the other hand, the embodiment of the application provides a voice recognition device, which comprises an acquisition unit, a syllable probability distribution determination unit and a voice recognition result determination unit;
the acquiring unit is used for acquiring an acoustic model and voice data to be recognized, the acoustic model comprises a time delay neural network, and an output layer of the time delay neural network comprises acoustic modeling units respectively corresponding to a plurality of syllables;
the syllable probability distribution determining unit is configured to use the speech data as input data of the time-delay neural network, and determine, by the time-delay neural network, syllable probability distributions corresponding to speech frames included in the speech data, where the syllable probability distributions are used to identify probabilities that the speech frames correspond to the plurality of syllables, respectively;
and the voice recognition result determining unit is used for determining the voice recognition result corresponding to the voice data according to the syllable probability distribution.
In another aspect, an embodiment of the present application provides a computer device, where the computer device includes a processor and a memory:
the memory is used for storing program codes and transmitting the program codes to the processor;
the processor is configured to perform the method of the above aspect according to instructions in the program code.
In another aspect, the present application provides a computer-readable storage medium for storing a computer program for executing the method of the above aspect.
In another aspect, embodiments of the present application provide a computer program product or a computer program, which includes computer instructions stored in a computer-readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to perform the method of the above aspect.
According to the above technical scheme, an acoustic model including a time-delay neural network is obtained for the voice data to be recognized. The voice data is used as input data of the time-delay neural network, and the output layer of the time-delay neural network includes acoustic modeling units respectively corresponding to a plurality of syllables, so the time-delay neural network can use syllables as the recognition granularity and obtain, through the acoustic modeling units of the output layer, the syllable probability distributions respectively corresponding to the speech frames included in the voice data. During speech feature extraction and transmission, the time-delay neural network carries the rich context information of a speech frame in the voice data all the way to the output layer, and this context information reflects the information of the syllables before and after the speech frame in the voice data. Therefore, when the output layer performs recognition at the syllable level, it can combine the information of the syllables before and after the speech frame and use pronunciation rules to assist in judging the syllable of the speech frame, so as to output a more accurate syllable probability distribution. Moreover, because a syllable is generally composed of one or more phonemes, the fault tolerance is higher: even if individual phonemes within a syllable in the voice data are mispronounced, the influence on the recognition result of the whole syllable is smaller than in the phoneme case. Therefore, even if the quality of the voice data to be recognized is not high, a more accurate speech recognition result can be determined based on the syllable probability distribution, which effectively expands the application scenarios of speech recognition technology.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present application; for those skilled in the art, other drawings can be obtained from these drawings without creative effort.
Fig. 1 is a schematic view of an application scenario of a speech recognition method according to an embodiment of the present application;
fig. 2 is a flowchart of a speech recognition method according to an embodiment of the present application;
fig. 3 is a schematic diagram of a TDNN provided in an embodiment of the present application;
FIG. 4 is a schematic diagram of an acoustic model provided by an embodiment of the present application;
fig. 5 is a schematic diagram of an embodiment of an application scenario of a speech recognition system according to an embodiment of the present application;
FIG. 6 is a schematic diagram of an acoustic model provided by an embodiment of the present application;
fig. 7 is a schematic diagram of a speech recognition apparatus according to an embodiment of the present application;
fig. 8 is a structural diagram of a terminal device according to an embodiment of the present application;
fig. 9 is a block diagram of a server according to an embodiment of the present application.
Detailed Description
Embodiments of the present application are described below with reference to the accompanying drawings.
In recognizing speech data, the related art uses the phoneme as the modeling unit of the acoustic model. For example, the phoneme sequence corresponding to the keyword "hello" is "n iy3 hh aw3", and the keyword "hello" can be recognized only if the phoneme sequence "n iy3 hh aw3" is recognized in the speech data. Since the modeling granularity of a phoneme is too fine, it places high requirements on the quality of the speech data. If even one phoneme in the keyword is pronounced in a non-standard way, keyword recognition fails, so the accuracy of the speech recognition result is low. Moreover, in some complex speech recognition scenarios, such as noisy background sound or non-standard user speech, the robustness of speech recognition is low, so the applicable scenarios of speech recognition technology are few.
Based on this, the embodiment of the application provides a voice recognition method, which not only improves the accuracy of the voice recognition result, but also has low requirement on the quality of voice data, and effectively expands the application scenes of the voice recognition technology.
The speech recognition method provided by the embodiments of the present application is implemented based on Artificial Intelligence (AI). Artificial intelligence is a theory, method, technology, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new kind of intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning, and decision making.
Artificial intelligence technology is a comprehensive discipline involving a wide range of fields, including both hardware-level and software-level technologies. The basic artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, and mechatronics. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, machine learning/deep learning, automatic driving, intelligent transportation, and the like.
In the embodiment of the present application, the artificial intelligence technology mainly involved includes the above-mentioned voice processing technology and the like.
The voice recognition method provided by the present application can be applied to voice recognition devices with data processing capability, such as terminal devices and servers. The terminal device involved in the present application may specifically be a smart phone, a tablet computer, a notebook computer, a palmtop computer, a personal computer, a smart watch, a smart speaker, a vehicle-mounted device, a wearable device, an intelligent voice interaction device, a smart home appliance, a vehicle-mounted terminal, and the like, but is not limited thereto. The server involved in the present application may be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, Content Delivery Network (CDN), big data, and artificial intelligence platforms. The terminal device and the server may be directly or indirectly connected through wired or wireless communication, which is not limited in the present application. The numbers of servers and terminal devices are likewise not limited.
The aforementioned speech recognition device may be provided with speech processing technology. The key technologies of Speech Technology are automatic speech recognition, speech synthesis, and voiceprint recognition. Enabling computers to listen, see, speak, and feel is the development direction of future human-computer interaction, and voice is expected to become one of the most promising human-computer interaction modes in the future.
In the speech recognition method provided by the embodiment of the application, the adopted artificial intelligence model mainly relates to the application of the speech recognition technology, and the speech recognition result corresponding to the speech data to be recognized is determined through the acoustic model.
In order to facilitate understanding of the technical solution of the present application, the following describes a speech recognition method provided in the embodiments of the present application with a terminal device as a speech recognition device in combination with an actual application scenario.
Referring to fig. 1, the figure is a schematic view of an application scenario of a speech recognition method according to an embodiment of the present application. In the application scenario shown in fig. 1, the terminal device is a smart speaker 100, and is configured to identify a keyword "hello" of voice data to be identified.
After the smart speaker 100 acquires the voice data to be recognized, the voice data is input into the time-delay neural network. The time-delay neural network includes multiple layers, and each layer has a strong ability to abstract voice features. The time-delay neural network has a wide context field of view and can capture wider context information, and this context information reflects the information of the syllables before and after a speech frame in the voice data. Therefore, when the voice data is recognized at the syllable level by the output layer, the syllable information before and after the speech frame can be combined, and the syllable of the speech frame can be judged with the assistance of pronunciation rules, so the output probability distribution for the syllable sequence "ni3 hao3" is more accurate.
The smart speaker 100 then determines a more accurate speech recognition result corresponding to the voice data according to this more accurate syllable probability distribution. Therefore, the accuracy of the speech recognition result is improved, the requirement on the quality of the voice data is reduced, and the application scenarios of speech recognition technology are effectively expanded.
A speech recognition method provided in an embodiment of the present application is described below with reference to the drawings, where a server is used as a speech recognition device.
Referring to fig. 2, the figure is a flowchart of a speech recognition method according to an embodiment of the present application. As shown in fig. 2, the speech recognition method includes the steps of:
S201: An acoustic model and speech data to be recognized are obtained.
In the related art, a Gaussian Mixture Model (GMM)-Hidden Markov Model (HMM) using phonemes as the modeling unit, i.e., the GMM-HMM model, became the mainstream system for the Automatic Speech Recognition (ASR) task. However, with the accumulation of labeled corpora and the rapid increase of computing power, the effects of many newly proposed deep neural network models have greatly surpassed those of the GMM-HMM model and are still being further improved. Due to the powerful modeling capabilities of these deep neural networks, the output labels of the acoustic model no longer need to be subdivided into phonemes. Much research has found that acoustic models can also use a coarser modeling granularity, in most cases with better effect.
Based on this, the embodiments of the present application no longer use the phoneme as the acoustic modeling unit, but use the syllable. A syllable mainly consists of an initial, a final, and a tone, and each syllable can be decomposed into one or more phonemes. Referring to Table 1, the differences between syllables and phonemes are illustrated by the keywords "hello" and "world".
Table 1. Differences between syllables and phonemes

Keyword | Syllable sequence | Phoneme sequence
hello (你好) | ni3 hao3 | n iy3 hh aw3
world (世界) | shi4 jie4 | sh iy4 j iy4 eh4
Compared with phonemes, syllables can carry more context information, so that the context learning capability of the acoustic model is improved, and the coverage rate of the speech keywords is improved. In addition, syllables are more consistent with the human speech process and are easier to understand. Moreover, since syllables are generally composed of one or more phonemes, the fault tolerance is higher, even if pronunciation errors of individual phonemes occur in one syllable in the voice data, the influence on the recognition result of the whole syllable is smaller compared with the phonemes, so that the requirement on the quality of the voice data is reduced, and the application scene of the voice recognition technology is effectively expanded.
The Acoustic Model (AM) includes a Time Delay Neural Network (TDNN), which will be described later. The voice data is also called a sound file, and is data recorded by voice, for example, the sound for waking up the smart speaker may be called voice data.
It is understood that in the specific implementation of the present application, the user-related data such as voice data is referred to, when the above embodiments of the present application are applied to specific products or technologies, user permission or consent needs to be obtained, and the collection, use and processing of the related data need to comply with relevant laws and regulations and standards of relevant countries and regions.
It should be noted that the acoustic model and the voice data to be recognized may be obtained simultaneously, the voice data to be recognized may be obtained after the acoustic model is obtained in advance, or the acoustic model may be obtained after the voice data to be recognized is obtained, which is not specifically limited in the present application.
S202: and determining syllable probability distribution corresponding to the voice frames included in the voice data through the time delay neural network by taking the voice data as input data of the time delay neural network.
The acoustic model may convert the speech input into an acoustically represented syllable output, and in the embodiment of the present application, for each speech frame, a probability distribution of the syllable to which it belongs is calculated. The acoustic model includes a TDNN, as shown in fig. 3, which is a schematic diagram of a TDNN provided in an embodiment of the present application.
As shown in fig. 3, the TDNN includes multiple layers of networks, each layer of which has a strong abstraction capability for voice features. And the TDNN has a wider context view, can capture wider context information and has stronger modeling capability on the voice time sequence dependent information.
In the embodiment of the present application, the output layer of the TDNN includes acoustic modeling units corresponding to a plurality of syllables, and after the speech data is input to the TDNN, the TDNN may determine the probability distribution of the syllables corresponding to the speech frames included in the speech data by using the syllables as the recognition granularity. Wherein the syllable probability distribution is used for identifying the probability that the voice frame corresponds to each of the plurality of syllables. As shown in fig. 3, in the output layer of TDNN, the probability corresponding to each syllable can be output. Fig. 3 shows only 4 syllables as an example, but is not limited to 4 syllables. For example, all syllables required in chinese may be counted and represented by an output layer, thereby determining a recognition result corresponding to the speech data to be recognized input by the chinese user.
During speech feature extraction and transmission, the time-delay neural network carries the rich context information of a speech frame in the voice data all the way to the output layer, and this context information reflects the information of the syllables before and after the speech frame in the voice data. Therefore, when the output layer performs recognition at the syllable level, it can combine the information of the syllables before and after the speech frame and use pronunciation rules to assist in judging the syllable of the speech frame, so as to output a more accurate syllable probability distribution.
S203: and determining a voice recognition result corresponding to the voice data according to the syllable probability distribution.
The syllable probability distribution is used to identify the probability that the speech frame corresponds to each of the plurality of syllables. For example, if the output layer of the TDNN outputs probabilities corresponding to 100 syllables, the syllable probability distribution obtained by the TDNN is a 1 × 100 row vector, each element in the row vector represents the probability of the corresponding syllable, and the probabilities of the 100 elements sum to 1. It can be understood that the greater the probability of a syllable, the greater the likelihood that the actual pronunciation of the speech frame is the same as the pronunciation of that syllable. Continuing with the above example, if the syllable probability distribution obtained for speech frame A is [0.1, 0.8, 0.1, 0, …, 0] (95 zeros omitted), the actual pronunciation of speech frame A is likely to be among the first three syllables, and in particular is most likely the pronunciation of the second syllable. The present application does not specifically limit this, and those skilled in the art can set it according to the actual application scenario.
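For illustration, the following is a minimal sketch (not taken from the patent) of how such a frame-level syllable distribution can be produced and read, assuming a softmax over output-layer logits with one entry per syllable unit; the array sizes and function names are illustrative.

```python
import numpy as np

def syllable_distribution(logits):
    """Softmax over the output-layer logits: one probability per syllable unit.

    `logits` is assumed to be a 1-D array with one entry per acoustic
    modeling unit (syllable) in the TDNN output layer.
    """
    e = np.exp(logits - logits.max())   # shift for numerical stability
    return e / e.sum()                  # probabilities sum to 1

# Toy example with 100 syllable units, mirroring the 1x100 row vector above.
rng = np.random.default_rng(0)
probs = syllable_distribution(rng.normal(size=100))
assert abs(probs.sum() - 1.0) < 1e-9
best = int(probs.argmax())              # most likely syllable index for this frame
print(best, probs[best])
```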
After obtaining the syllable probability distribution corresponding to each voice frame in the voice data, the voice recognition result corresponding to the voice data can be further determined according to the syllable probability distribution corresponding to each of the plurality of voice frames. If a user inputs a section of voice, the voice is used as voice data to be recognized, syllable probability distribution corresponding to voice frames included in the voice data can be obtained through TDNN, and then a voice recognition result is determined.
According to the above technical scheme, an acoustic model including a time-delay neural network is obtained for the voice data to be recognized. The voice data is used as input data of the time-delay neural network, and the output layer of the time-delay neural network includes acoustic modeling units respectively corresponding to a plurality of syllables, so the time-delay neural network can use syllables as the recognition granularity and obtain, through the acoustic modeling units of the output layer, the syllable probability distributions respectively corresponding to the speech frames included in the voice data. During speech feature extraction and transmission, the time-delay neural network carries the rich context information of a speech frame in the voice data all the way to the output layer, and this context information reflects the information of the syllables before and after the speech frame in the voice data, so when the output layer performs recognition at the syllable level, it can combine the information of the syllables before and after the speech frame and use pronunciation rules to assist in judging the syllable of the speech frame, outputting a more accurate syllable probability distribution. Moreover, because a syllable is generally composed of one or more phonemes, the fault tolerance is higher: even if individual phonemes within a syllable in the voice data are mispronounced, the influence on the recognition result of the whole syllable is smaller than in the phoneme case. Therefore, even if the quality of the voice data to be recognized is not high, a more accurate speech recognition result can be determined based on the syllable probability distribution, which effectively expands the application scenarios of speech recognition technology.
It should be noted that the method and apparatus can be applied to voice keyword detection (Keyword Spotting, KWS) tasks such as wake-up interaction in smart devices and speech keyword detection in audio and video files. Due to the limitations of labeled corpora and computing power, acoustic models such as the Gaussian Mixture Model (GMM), Deep Neural Networks (DNN), Convolutional Neural Networks (CNN), and Recurrent Neural Networks (RNN) are not effective enough, which results in poor performance on smart devices. In order to meet the limited computing resources of smart devices and the processing requirements of massive audio/video files, keyword detection needs to be fast enough, and the detection result is expected to be accurate enough.
Based on this, refer to fig. 4, which is a schematic diagram of an acoustic model provided in the embodiments of the present application. As shown in fig. 4, the acoustic model includes not only TDNN but also a decoder. Because TDNN has a wider context field of view and can capture wider context information, the method can further increase the modeling capability of an acoustic model on the voice time sequence information, namely, TDNN can process the time sequence information more efficiently, thereby improving the accuracy of keyword detection. The decoder is used for determining whether the voice data comprises the voice recognition result of the keyword according to the syllable probability distribution so as to determine whether to awaken the corresponding terminal equipment according to the voice recognition result.
For convenience of description, the following description takes an example of a wake-up interaction scenario of the smart device, see S2031 to S2033:
s2031: and determining a keyword corresponding to the awakening scene according to the awakening scene corresponding to the voice data.
It can be understood that, since the wake-up scenario is different, the corresponding keywords may be different, and those skilled in the art can set the keywords according to actual needs. For example, a keyword corresponding to the smart home wake-up scene is "open XX (home device)", a keyword corresponding to the virtual assistant wake-up scene is "XXX (virtual assistant name)", and the like, and according to the setting of the keyword by a person skilled in the art, after acquiring voice data to be recognized, the keyword corresponding to the wake-up scene is determined according to the corresponding wake-up scene.
It should be noted that S2031 may be executed after obtaining the syllable probability distribution, or may be executed after acquiring the speech data to be recognized, which is not specifically limited in this application.
S2032: a speech recognition result for identifying whether the keyword is included in the speech data is determined by the decoder based on the syllable probability distribution.
The decoder is used to decode the syllable probability distribution into the speech recognition result. Therefore, after the syllable probability distribution is obtained through the TDNN, the speech recognition result corresponding to the voice data can be determined by the decoder according to the syllable probability distribution; a speech recognition result identifying whether the keyword is included in the voice data can also be determined by the decoder.
As a possible implementation manner, for a given keyword, after decoding by a decoder based on Weighted Finite State Transducer (WFST), a speech recognition result of whether the keyword is included or not may be directly output. The WFST provides a unified framework, information from an acoustic model, a pronunciation dictionary and a language model can be fused in a unified mode, an optimal path is searched in a huge search space, and therefore the search speed of the model can be improved.
As a possible implementation manner, in the decoding process, a Lattice static decoding strategy based on WFST can be further adopted to search an optimal path in a search space, so that the search speed of the model is improved.
As a possible implementation manner, in the decoding process, the decoding speed can also be increased by methods such as frame pruning and the like. For example, the generated syllable probability distribution is a vector of 1 × 100, the first three probabilities are 0.1, 0.8 and 0.1, respectively, and the last 97 probabilities are all 0, the last 97 are pruned by means of frame pruning, and the vector of 1 × 100 is changed into a vector of 1 × 3, so that only the vector of 1 × 3 needs to be decoded, and the decoding speed is further improved.
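A minimal sketch of the frame-pruning idea described above, assuming a simple probability threshold; the threshold value and function names are illustrative rather than from the patent.

```python
import numpy as np

def prune_frame(probs, threshold=0.05):
    """Keep only syllables whose probability exceeds `threshold`.

    Returns (indices, pruned_probs): e.g. a 1x100 distribution whose mass sits
    on three syllables collapses to a length-3 vector plus its indices, so the
    decoder only needs to expand those candidates.
    """
    keep = np.flatnonzero(probs > threshold)
    return keep, probs[keep]

probs = np.zeros(100)
probs[:3] = [0.1, 0.8, 0.1]          # the example distribution from the text
idx, pruned = prune_frame(probs)
print(idx, pruned)                   # -> [0 1 2] [0.1 0.8 0.1]
```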
S2033: and if the voice recognition result indicates that the voice data comprises the keyword, awakening the terminal equipment corresponding to the awakening scene.
According to the above technical scheme, syllables are used instead of phonemes as the modeling unit of the acoustic model; the granularity is coarser, the occupied memory is smaller, and fewer computing resources are required. For speech keywords, syllables provide higher coverage, higher fault tolerance, and higher accuracy. Meanwhile, for speech data to be recognized of the same length, syllable-based keyword detection is faster. Under limited computing resources and the processing requirements of massive audio and video files, because the acoustic model includes the TDNN and the decoder, the TDNN further increases the modeling capability of the acoustic model for speech timing information, and the decoder determines the speech recognition result of whether the keyword is included in the voice data, thereby improving the accuracy of keyword detection. Therefore, the acoustic model provided by the embodiments of the present application not only detects keywords faster but also produces more accurate detection results; performance and speed are well balanced, and the effect on smart devices is better.
With respect to the foregoing S2032, the embodiment of the present application does not specifically limit the manner in which the speech recognition result is determined by the decoder, and for example, the speech recognition result used for identifying whether the keyword is included in the speech data may be determined by the decoder according to the syllable probability distribution and the matching word list.
For example, in the virtual assistant wake-up scenario, the corresponding keyword is "hello world", and a corresponding keyword table is established according to the keyword. After a user inputs a piece of speech, the speech is used as the voice data to be recognized, the syllable probability distribution is determined through the acoustic model provided by the embodiments of the present application, and the decoder looks up the keyword table corresponding to "hello world" and determines, in combination with the syllable probability distribution, whether the voice data input by the user includes "hello world". If the voice data includes "hello world", the decoder may output the result "keyword included"; further, the virtual assistant may be woken up according to the speech recognition result to provide services to the user, as shown in S2033. The terminal device corresponding to the wake-up scene may be a terminal device such as a smart speaker, a virtual assistant, or a smart home appliance, which is not specifically limited in this application.
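As a simplified illustration only (the actual system decodes with a WFST as described above), deciding keyword presence can be thought of as checking whether any accepted pronunciation in the keyword table appears as a contiguous run in the decoded syllable sequence; the function and example values below are hypothetical.

```python
def contains_keyword(decoded_syllables, keyword_table):
    """Return True if any accepted pronunciation appears in the decoded sequence.

    `decoded_syllables` is the frame-collapsed syllable sequence; the substring
    check is a simplification and could be tightened with explicit boundary
    padding in a real system.
    """
    text = " ".join(decoded_syllables)
    return any(pron in text for pron in keyword_table)

decoded = ["jin1", "tian1", "ni1", "hao3", "ma5"]
table = {"ni3 hao3", "ni1 hao3", "ni3 hao4", "ni1 hao4"}
print(contains_keyword(decoded, table))   # -> True
```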
The keyword list is established by taking the keywords as a core and comprises possibly corresponding pronunciations of the keywords. The embodiment of the present application does not specifically limit the determining manner of the keyword list, for example, a matching word list for the keyword may be constructed by using syllables as the division granularity according to similar syllables and/or polyphone syllables and syllables corresponding to the keyword. Through abundant matching word lists of similar syllables and/or polyphone syllables, the coverage degree of the keywords is effectively improved, missing awakening is further avoided, and the use feeling of a user is improved. Similar syllables and polyphonic syllables are described below.
Similar syllables: similar syllables are syllables that are close in pronunciation to syllables included in the keyword. For example, the syllable sequence of the keyword "hello" is "ni3 hao3"; when the keyword is drawn out in pronunciation, "ni" may be read with the first tone instead of the third tone, so the syllable sequence "ni1 hao3" can be added to the keyword matching table as a similar syllable.
Polyphone syllables: a polyphonic character is a character with more than one pronunciation. If the keyword contains a polyphonic character, the polyphone syllables corresponding to it are determined, and all pronunciations of the polyphonic character should be included in the keyword list to prevent missed wake-ups caused by the user's accent. Continuing with the keyword "hello" as an example, the syllable sequence "ni3 hao4" may be added to the keyword matching table as a polyphone syllable.
As a possible implementation, some error-prone pronunciations may also be added to the keyword list. For example, the keyword "elegant" is easily read with the syllable "dang" in the third tone, so the syllable "dang3" is included in the keyword list corresponding to the keyword "elegant".
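A hedged sketch of how such a matching word list might be assembled from the keyword's canonical syllables plus similar-syllable and polyphone variants; the variant tables and helper names below are illustrative assumptions, not the patent's actual data.

```python
from itertools import product

# Hand-written variant tables; in practice these would come from pronunciation
# statistics and a dictionary of polyphonic characters (values are illustrative).
SIMILAR = {"ni3": ["ni1"]}          # tone drift when the word is drawn out
POLYPHONE = {"hao3": ["hao4"]}      # alternative readings of a polyphonic character

def build_keyword_table(canonical):
    """Expand a keyword's canonical syllable sequence into a matching table."""
    options = []
    for syl in canonical:
        options.append([syl] + SIMILAR.get(syl, []) + POLYPHONE.get(syl, []))
    # Cartesian product of per-syllable variants -> all accepted pronunciations.
    return {" ".join(seq) for seq in product(*options)}

table = build_keyword_table(["ni3", "hao3"])
print(table)   # {'ni3 hao3', 'ni1 hao3', 'ni3 hao4', 'ni1 hao4'}
```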
As a possible implementation manner, the TDNN may not only combine the front and back syllable information of the speech frame, but also combine the speech frame characteristics corresponding to at least one frame before and after the i-th speech frame in the speech data, so as to further improve the accuracy of syllable probability distribution. Taking the TDNN including N layers of feature extraction layers as an example, S202 may be implemented by S2021 to S2023 for an i-th frame speech frame in speech frames included in speech data, which is specifically as follows:
s2021: and determining the speech frame characteristics of the ith frame speech frame through the jth layer characteristic extraction layer according to the output characteristics of the jth-1 layer characteristic extraction layer aiming at the ith frame speech frame.
Each layer of the network included in the TDNN may be referred to as a feature extraction layer, and each feature extraction layer performs feature extraction on each speech frame included in the speech data to be recognized. In the following, for the i-th speech frame, two feature extraction layers, namely the (j-1)-th feature extraction layer and the j-th feature extraction layer, are arbitrarily selected from the N feature extraction layers, where 1 < j ≤ N.
The (j-1)-th feature extraction layer performs feature extraction for the i-th speech frame and outputs an output feature for the i-th speech frame; this output feature is input to the j-th feature extraction layer, which performs feature extraction and determines the speech frame feature of the i-th speech frame.
S2022: and determining the output characteristic of the ith frame of voice frame at the jth layer characteristic extraction layer according to the voice frame characteristic and the voice frame characteristic corresponding to at least one frame before and after the ith frame of voice frame in the voice data.
In order to further improve the context learning capability of the acoustic model, after the speech frame feature of the i-th speech frame is determined, the output feature of the i-th speech frame at the j-th feature extraction layer can be determined by combining not only the syllable information before and after the speech frame, but also the speech frame features corresponding to at least one frame before and after the i-th speech frame in the voice data, so that the context information of the speech frame carries even more information.
Continuing to refer to Fig. 3, each upper-layer node in the figure is connected to three lower-layer nodes, and each node corresponds to one frame of speech frame features. In S2022, the output feature of the i-th speech frame at the j-th feature extraction layer may be determined according to the speech frame feature, the speech frame feature corresponding to the previous frame of the i-th speech frame in the voice data, and the speech frame feature corresponding to the next frame. It can be understood that the speech frame features corresponding to the previous three frames, the next frame, and so on can also be considered; those skilled in the art can set this according to actual needs, and this application does not specifically limit it.
As a possible implementation manner, the N layers of feature extraction layers included in the TDNN may all implement the above S2021-S2022, until all feature extraction layers perform feature extraction on the i-th frame speech frame, and further determine the syllable probability distribution corresponding to the i-th frame speech frame.
S2023: and determining syllable probability distribution corresponding to the ith frame of voice frame according to the output characteristics of the ith frame of voice frame in the jth layer of characteristic extraction layer.
Therefore, after the voice data to be recognized is input into the TDNN, the syllable probability distribution corresponding to the voice frame of the ith frame can be obtained, so that the syllable probability distribution corresponding to the voice frame included in the voice data is obtained, and the voice recognition result corresponding to the voice data is further determined.
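As an illustration of S2021-S2022, the following sketch shows one feature extraction layer that splices the features of the previous, current, and next frame before an affine transform; the layer sizes, ReLU activation, and padding-by-repetition at the edges are assumptions rather than the patent's exact configuration.

```python
import numpy as np

def tdnn_layer(prev_layer_out, W, b, context=(-1, 0, 1)):
    """One feature-extraction layer of a TDNN (sketch).

    `prev_layer_out` is a (T, D) matrix: the (j-1)-th layer's output feature
    for each of the T speech frames. For frame i, the features of frames
    i-1, i, i+1 are concatenated (edges padded by repetition) and passed
    through an affine transform plus ReLU, giving the layer's output feature
    for frame i. Shapes of W (len(context)*D, H) and b (H,) are assumptions.
    """
    T, D = prev_layer_out.shape
    outputs = []
    for i in range(T):
        spliced = np.concatenate(
            [prev_layer_out[min(max(i + c, 0), T - 1)] for c in context]
        )
        outputs.append(np.maximum(spliced @ W + b, 0.0))   # ReLU
    return np.stack(outputs)                                # (T, H)

rng = np.random.default_rng(0)
frames = rng.normal(size=(6, 40))                           # 6 frames, 40-dim features
out = tdnn_layer(frames, rng.normal(size=(120, 64)) * 0.1, np.zeros(64))
print(out.shape)                                            # (6, 64)
```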
As a possible implementation manner, a training process of the acoustic model provided in the embodiment of the present application is described below.
S301: and acquiring a voice sample belonging to the same language as the voice data.
The pronunciation rules of the voice data belonging to the same language are basically the same, and the corresponding syllables are basically the same, for example, the syllables corresponding to the Chinese language are different from the syllables corresponding to the English language, so that the voice sample can be obtained based on the language to which the voice data belongs, and the acoustic model can be trained.
S302: and obtaining a prediction result through an initial time delay neural network included by the initial acoustic model according to the voice sample as the input data of the initial acoustic model.
The voice sample is input into the initial time-delay neural network included in the initial acoustic model to obtain a prediction result. Compared against the sample labels of the voice samples, the prediction results may include two types: correct prediction results and incorrect prediction results. It should be noted that a correct prediction result is one whose error relative to the corresponding sample label is smaller than a threshold, and an incorrect prediction result is one whose error is greater than or equal to the threshold. The threshold is not specifically limited in the embodiments of the present application and can be determined by those skilled in the art according to actual needs.
S303: and determining a loss function according to the prediction result and the sample label of the voice sample.
The loss function comprises a guidance weight, and the guidance weight is used for improving the influence of the identification path of the correct prediction result in the initial delay neural network and reducing the influence of the identification path of the wrong prediction result in the initial delay neural network.
As a possible implementation, multiple speech samples may be input into the initial TDNN. For example, if there are 100 speech samples, 20 may be input into the initial TDNN as a batch, over 5 batches, to determine the loss function. Suppose that when 20 speech samples are input into the initial TDNN, 5 correct prediction results and 15 incorrect prediction results are obtained. When the loss function is determined, the influence of the recognition paths where the 5 correct prediction results are located is increased through the guidance weight, and the influence of the recognition paths where the 15 incorrect prediction results are located is reduced, so that subsequent training trusts the recognition paths with higher influence more; the influence of the recognition paths corresponding to correct prediction results becomes higher and higher, the influence of the recognition paths corresponding to incorrect prediction results becomes lower and lower, and falling into a local optimum is avoided.
The initial TDNN comprises multiple layers, each layer comprises multiple nodes, and each node obtains different output features from the previous layer, so the contents learned are different, finally forming multiple recognition paths. As shown in Fig. 3, if the TDNN model is an initial TDNN model, the first node of the second layer may obtain different output features from the first, second, and third nodes of the first layer (counting from the left), thereby forming three paths; this continues upward until the last layer, so that multiple recognition paths from the first layer to the nodes of the last layer are formed. Different recognition paths may correspond to different final prediction results, and further, according to the prediction results and the sample labels, the influence corresponding to different recognition paths differs; for example, different recognition paths correspond to different weights.
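A minimal sketch of one possible reading of the guidance weight, in which samples whose prediction is already correct are up-weighted and mispredicted samples are down-weighted in a frame-level cross-entropy. A real system would weight whole recognition paths (for example in an LF-MMI lattice, as described next), and the weight values here are illustrative assumptions only.

```python
import numpy as np

def guided_cross_entropy(probs, labels, up=1.5, down=0.5):
    """Cross-entropy with a simple 'guidance weight' (one possible reading).

    `probs` is (B, S): predicted syllable distributions for B samples;
    `labels` is (B,): ground-truth syllable indices. Samples whose argmax
    already matches the label are up-weighted (their path is trusted more),
    mispredicted samples are down-weighted. `up`/`down` are illustrative.
    """
    picked = probs[np.arange(len(labels)), labels]
    correct = probs.argmax(axis=1) == labels
    weights = np.where(correct, up, down)
    return float(np.mean(-weights * np.log(picked + 1e-12)))

probs = np.array([[0.8, 0.1, 0.1],
                  [0.3, 0.6, 0.1]])
labels = np.array([0, 2])            # first prediction correct, second wrong
print(guided_cross_entropy(probs, labels))
```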
As a possible implementation manner, a lattice-free Maximum Mutual Information (LF-MMI) loss function can be added on the basis of a frame-level cross-entropy (CE) loss function, i.e., a mono-phone-based LF-MMI model is adopted instead of a mono-phone-based nnet3 model to guide the error back-propagation of the initial TDNN model. The LF-MMI model introduces a blank symbol (blank) to absorb uncertain boundaries, i.e., to absorb repeated or meaningless outputs, such as "ni3 hao3 hao3" or "ni3 ni3 hao3".
S304: and adjusting model parameters of the initial time delay neural network in the initial acoustic model according to the loss function to obtain the acoustic model.
And adjusting model parameters of the initial delay neural network in the initial acoustic model according to the loss function, thereby obtaining the acoustic model comprising the adjusted delay neural network.
As a possible implementation manner, if the initial acoustic model includes the initial delay neural network and the initial decoder, model parameters of the initial delay neural network and the initial decoder in the initial acoustic model may be adjusted according to a loss function, so as to obtain the acoustic model including the adjusted delay neural network and the adjusted decoder.
Therefore, through the guidance weight included in the loss function, the influence of the recognition paths of correct prediction results in the initial time-delay neural network is increased and the influence of the recognition paths of incorrect prediction results is reduced, so that through discriminative training the recognition probability of the correct prediction result becomes higher and other probabilities become as small as possible, thereby improving the recognition rate of the trained acoustic model and making the training of the acoustic model faster and more stable.
As a possible implementation manner, if the number of voice samples used for training is small, the accuracy of the acoustic model may be affected. Therefore, for scenarios with an insufficient number of voice samples, such as minority languages, the number of voice samples may be increased based on data augmentation. The details are as follows:
s3031: and acquiring a voice sample to be processed which belongs to the same language as the voice data.
S3032: and carrying out data amplification according to the voice sample to be processed to obtain an amplified sample.
And determining the sample label of the augmented sample based on the sample label of the corresponding voice sample to be processed. The present embodiment is not particularly limited to the manner of data expansion, and three types of embodiments are described below as examples.
The first method: by means of audio re-encoding, augmented samples with different sampling rates, different channels, and different encoding formats are generated for the voice sample to be processed.
The second method: a multiplication factor is randomly generated within a preset interval by means of speech speed and volume perturbation to change the speed and volume of the voice sample to be processed, thereby obtaining an augmented sample. The preset interval may be between 0.9 and 1.1; those skilled in the art may set it according to actual needs, and this application does not specifically limit it.
The third method: background music and noise are added, that is, various background music and various scene noises are added to a relatively clean voice sample to obtain an augmented sample.
It should be noted that the augmented sample may be generated by combining one or more of the above three ways.
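A sketch of the second and third augmentation methods, assuming a perturbation factor drawn from [0.9, 1.1] as described above; the naive interpolation-based resampling and the SNR value are simplifying assumptions rather than the patent's exact procedure.

```python
import numpy as np

def augment(waveform, noise=None, lo=0.9, hi=1.1, snr_db=15.0, rng=None):
    """Speed/volume perturbation and noise mixing (sketch).

    `waveform` and `noise` are 1-D float arrays at the same sampling rate;
    the perturbation interval [0.9, 1.1] follows the text, while the SNR value
    and the simple resampling-by-interpolation are assumptions.
    """
    rng = rng or np.random.default_rng()
    speed = rng.uniform(lo, hi)                                 # speed factor
    idx = np.arange(0, len(waveform), speed)                    # naive resample
    out = np.interp(idx, np.arange(len(waveform)), waveform)
    out = out * rng.uniform(lo, hi)                             # volume factor
    if noise is not None:                                       # add background noise
        n = np.resize(noise, out.shape)
        gain = np.sqrt((out**2).mean() / ((n**2).mean() * 10 ** (snr_db / 10) + 1e-12))
        out = out + gain * n
    return out

clean = np.sin(np.linspace(0, 200, 16000))                      # toy 1-second signal
augmented = augment(clean, noise=np.random.default_rng(1).normal(size=8000))
print(augmented.shape)
```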
S3033: and obtaining the voice sample according to the voice sample to be processed and the augmentation sample.
According to the above technical scheme, because there are great differences among voices, such as differences in the sampling rate, encoding rate, and channels of audio and video, as well as the speaker's speed, tone, environmental interference, and the like, the difficulty of voice keyword detection is greatly increased. Collecting and labeling a large amount of voice data for a certain scenario in order to obtain a large number of voice samples is very costly, and in some low-resource or minority-language scenarios the number of voice samples is even smaller. Therefore, in scenarios with few voice samples, expanding the number of voice samples in the manner of S3031-S3033 can improve the robustness of the acoustic model in different scenarios under the condition of limited voice samples.
When training an acoustic model, each frame needs an output as its label. However, after data augmentation, the speech frames of the augmented samples no longer correspond to the sample labels. Based on this, the speech frames of the augmented samples and the sample labels can be put into one-to-one correspondence by means of alignment, that is, a model is used to make the output sequence correspond one-to-one with the input features, as follows:
taking a sample label of the voice sample to be processed corresponding to the augmented sample as an undetermined label of the augmented sample;
and carrying out label alignment treatment according to the voice frame of the augmented sample and the undetermined label to obtain a sample label of the augmented sample.
In the related art, this is usually implemented by a GMM, which calculates the score of each frame on potential labels and outputs the final alignment labels in combination with dynamic programming. However, as labeled corpora and computing power increase, the effect of deep neural network models has gradually surpassed the GMM model. For a better effect, a chain-model-based acoustic model can be used instead of the GMM model, so that the accuracy of the output label sequence is higher.
Next, with reference to fig. 5 and fig. 6, a speech recognition method provided in an embodiment of the present application will be described with respect to a keyword recognition scenario, taking an example in which an acoustic model includes a TDNN and a decoder.
Referring to fig. 5, the figure is a schematic view of an embodiment of an application scenario of a speech recognition system according to an embodiment of the present application. The speech recognition system includes a feature extraction module 501, a TDNN module 502, a decoding module 503, and a vocabulary generation module 504, which are described below.
The feature extraction module 501 converts continuous speech data to be recognized into discrete vectors through a digital signal processing algorithm, and the vectors can effectively represent speech features corresponding to the speech data to be recognized, thereby facilitating subsequent speech tasks.
As a possible implementation manner, in the speech feature extraction process, fast fixed-point processing may also be performed, that is, floating-point numbers are simulated by integers. Specifically, floating-point values are mapped to integer values, such as int8, int16, or int32, by a linear mapping according to the dynamic range of the data; testing shows this is nearly twice as fast as the Kaldi toolkit (an open-source speech recognition tool).
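A sketch of such a linear float-to-integer mapping driven by the data's dynamic range; the exact quantization scheme used in the described system is not specified beyond "linear mapping", so the details below are assumptions.

```python
import numpy as np

def quantize(x, dtype=np.int16):
    """Linearly map a float feature array to integers by its dynamic range.

    Returns the integer tensor plus the scale needed to recover approximate
    float values.
    """
    qmax = np.iinfo(dtype).max
    scale = np.abs(x).max() / qmax if np.abs(x).max() > 0 else 1.0
    q = np.clip(np.round(x / scale), -qmax, qmax).astype(dtype)
    return q, scale

features = np.random.default_rng(0).normal(size=(100, 40)).astype(np.float32)
q, scale = quantize(features, np.int8)
recovered = q.astype(np.float32) * scale
print(np.abs(recovered - features).max())   # quantization error stays small
```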
The TDNN module 502 is configured to determine syllable probability distributions corresponding to voice frames included in the voice data according to the voice characteristics.
The decoding module 503 determines a voice recognition result for identifying whether a keyword is included in the voice data.
The decoder and the TDNN may form an acoustic model as shown in fig. 6: the extracted speech features are input into the TDNN to obtain the syllable probability distribution, which is then input into the decoder. Meanwhile, the speech features are data-aligned and input to the decoder, and the decoder determines, according to the syllable probability distribution and the matching word list, a speech recognition result identifying whether the speech data includes the keyword.
The vocabulary generating module 504 is configured to generate a syllable sequence corresponding to the keyword according to the dictionary.
Therefore, by cascading the four modules, the voice recognition system can efficiently detect all keywords and start and stop positions thereof in the voice data to be recognized.
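The sketch below illustrates, in a deliberately simplified greedy form, how frame-level syllable posteriors and one syllable sequence from the matching word list can yield a keyword decision together with start and stop frames; the real decoder described above would search the word list properly (e.g. with dynamic programming), and the threshold and gap values here are assumptions.

```python
import numpy as np

def spot_keyword(syllable_probs, keyword_syllables, threshold=0.5, max_gap=20):
    """Greedy sketch of keyword detection over per-frame syllable posteriors.

    syllable_probs: (num_frames, num_syllables) distributions from the TDNN.
    keyword_syllables: list of syllable indices from the matching word list.
    Returns (found, start_frame, end_frame).
    """
    target = 0
    start = None
    for t, frame in enumerate(syllable_probs):
        if frame[keyword_syllables[target]] >= threshold:
            if target == 0:
                start = t
            target += 1
            if target == len(keyword_syllables):
                return True, start, t
        elif start is not None and t - start > max_gap:
            target, start = 0, None   # give up on this partial match
    return False, None, None

# Toy posteriors where syllables 5, 9 and 3 fire in order around frames 40-60.
probs = np.full((100, 16), 0.01)
probs[40, 5] = probs[50, 9] = probs[60, 3] = 0.9
print(spot_keyword(probs, [5, 9, 3]))  # (True, 40, 60)
```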
In this keyword recognition scenario, speech data to be recognized with a total duration of 27.5 hours was used as a test set, and the effects of the two keyword detection methods are listed in Table 2. From the results, compared with the traditional keyword/filler-based keyword detection system, the keyword detection provided by the embodiment of the application improves both accuracy and recall markedly, and the overall F1 rises from 65.99% to 75.57%. This result demonstrates that the speech recognition method provided by the embodiment of the application can effectively increase the number of keyword recalls while more effectively reducing the number of keyword false alarms.
Table 2 results of performance comparison experiments
Method | Accuracy | Recall | F1 |
---|---|---|---|
Keyword/filler-based keyword detection | 86.76% | 53.24% | 65.99% |
Keyword detection of the embodiment of the application | 92.00% | 64.12% | 75.57% |
Improvement | +5.23% | +10.89% | +9.58% |
Here F1 is an index used in statistics to measure the accuracy of a binary classification model. It takes both the precision and the recall of the model into account, and can be viewed as the harmonic mean of the model's precision and recall, with a maximum of 1 and a minimum of 0.
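In terms of precision P and recall R, the F1 score is

$$F_1 = \frac{2PR}{P + R},$$

and plugging in the embodiment's row of Table 2 gives $\frac{2 \times 0.9200 \times 0.6412}{0.9200 + 0.6412} \approx 0.7557$, i.e. the 75.57% reported above.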
Aiming at the voice recognition method provided by the embodiment, the embodiment of the application also provides a voice recognition device.
Referring to fig. 7, the figure is a schematic diagram of a speech recognition apparatus according to an embodiment of the present application. As shown in fig. 7, the speech recognition apparatus 700 includes: an acquisition unit 701, a syllable probability distribution determination unit 702, and a voice recognition result determination unit 703;
the obtaining unit 701 is configured to obtain an acoustic model and voice data to be recognized, where the acoustic model includes a time-delay neural network, and an output layer of the time-delay neural network includes acoustic modeling units corresponding to a plurality of syllables, respectively;
the syllable probability distribution determining unit 702 is configured to use the speech data as input data of the time-delay neural network, and determine, by the time-delay neural network, syllable probability distributions corresponding to speech frames included in the speech data, where the syllable probability distributions are used to identify probabilities that the speech frames correspond to the plurality of syllables, respectively;
the voice recognition result determining unit 703 is configured to determine a voice recognition result corresponding to the voice data according to the syllable probability distribution.
As a possible implementation manner, the acoustic model further includes a decoder, and the speech recognition result determining unit 703 is configured to:
determining a keyword corresponding to the awakening scene according to the awakening scene corresponding to the voice data;
determining, by the decoder, a speech recognition result for identifying whether the keyword is included in the speech data according to the syllable probability distribution;
the apparatus is further configured to perform the following step:
and if the voice recognition result indicates that the voice data comprises the keyword, awakening the terminal equipment corresponding to the awakening scene.
As a possible implementation manner, the apparatus further includes a matching word list constructing unit, configured to:
establishing a matching word list aiming at the key words by taking syllables as division granularity;
the speech recognition result determining unit 703 is configured to:
and determining a voice recognition result for identifying whether the keyword is included in the voice data through the decoder according to the syllable probability distribution and the matching word list.
As a possible implementation manner, the time-delay neural network includes N layers of feature extraction layers, j ∈ N, and for an ith speech frame in the speech frames included in the speech data, the syllable probability distribution determining unit 702 is configured to:
determining the speech frame characteristics of the ith frame speech frame through a jth layer characteristic extraction layer according to the output characteristics of the (j-1)th layer characteristic extraction layer for the ith frame speech frame;
determining the output characteristics of the ith frame of voice frame at the jth layer characteristic extraction layer according to the voice frame characteristics and the voice frame characteristics corresponding to at least one frame before and after the ith frame of voice frame in the voice data;
and determining syllable probability distribution corresponding to the ith frame of voice frame according to the output characteristics of the ith frame of voice frame on the jth layer of characteristic extraction layer.
As a possible implementation manner, the matching word list constructing unit is configured to:
determining similar syllables of syllables included in the keyword according to the pronunciation similarity;
if the keyword has polyphone, determining polyphone syllables corresponding to the polyphone;
and according to at least one of the similar syllables and the polyphone syllables and the syllables corresponding to the keyword, constructing a matching word list aiming at the keyword by taking the syllables as the division granularity.
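A toy sketch of this construction is shown below: each character of the keyword is expanded into its dictionary syllables, its polyphonic readings and pronunciation-similar syllables, and the Cartesian product gives the matching word list at syllable granularity. The tiny lexicon and the n/l confusion pair are invented purely for illustration.

```python
from itertools import product

# Toy pronunciation dictionary and similar-syllable pairs (assumed; a real
# system would derive these from a lexicon and pronunciation similarity).
LEXICON = {"你": ["ni3"], "好": ["hao3", "hao4"]}   # 好 is polyphonic
SIMILAR = {"ni3": ["li3"]}                           # common n/l confusion

def build_matching_word_list(keyword):
    """Enumerate syllable sequences for a keyword, covering polyphone readings
    and pronunciation-similar syllables, with the syllable as the unit."""
    per_char = []
    for ch in keyword:
        readings = set(LEXICON.get(ch, []))
        for syl in list(readings):
            readings.update(SIMILAR.get(syl, []))
        per_char.append(sorted(readings))
    return [list(seq) for seq in product(*per_char)]

print(build_matching_word_list("你好"))
# [['li3', 'hao3'], ['li3', 'hao4'], ['ni3', 'hao3'], ['ni3', 'hao4']]
```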
As a possible implementation manner, the apparatus further includes a training unit configured to:
acquiring a voice sample belonging to the same language as the voice data;
obtaining a prediction result through an initial time delay neural network included in an initial acoustic model according to the voice sample as input data of the initial acoustic model;
determining a loss function according to the prediction result and the sample label of the voice sample, wherein the loss function comprises a guidance weight, and the guidance weight is used for improving the influence of the identification path of the correct prediction result in the initial time delay neural network and reducing the influence of the identification path of the wrong prediction result in the initial time delay neural network;
and adjusting model parameters of an initial time delay neural network in the initial acoustic model according to the loss function to obtain the acoustic model.
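One simple frame-level reading of such a guidance weight is sketched below as a weighted cross entropy in which correctly predicted frames contribute more and wrongly predicted frames contribute less; this is an assumption about how the weight could be applied, not the embodiment's exact loss function, which may operate on recognition paths in a lattice.

```python
import torch
import torch.nn.functional as F

def guided_ce_loss(logits, labels, guide_weight=0.5):
    """Frame-level cross entropy with a guidance weight: frames whose
    prediction is already correct are up-weighted and wrongly predicted
    frames are down-weighted.

    logits: (num_frames, num_syllables); labels: (num_frames,)
    """
    per_frame = F.cross_entropy(logits, labels, reduction="none")
    correct = (logits.argmax(dim=-1) == labels).float()
    weights = 1.0 + guide_weight * (2.0 * correct - 1.0)  # 1.5 if correct, 0.5 if wrong
    return (weights * per_frame).mean()

print(float(guided_ce_loss(torch.randn(200, 1300), torch.randint(0, 1300, (200,)))))
```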
As a possible implementation, the training unit is configured to:
acquiring a voice sample to be processed which belongs to the same language as the voice data;
performing data amplification according to the voice sample to be processed to obtain an amplification sample, wherein a sample label of the amplification sample is determined based on a sample label of the corresponding voice sample to be processed;
and obtaining the voice sample according to the voice sample to be processed and the augmentation sample.
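The data amplification step could, for example, look like the sketch below, which derives one augmented waveform from a to-be-processed sample via speed perturbation, volume scaling and additive noise; these particular perturbations and parameter values are assumed illustrations rather than the specific operations of S3031-S3033.

```python
import numpy as np

def augment_waveform(wave, speed=1.1, gain_db=-3.0, snr_db=20.0):
    """Produce one augmented sample from a to-be-processed speech sample.
    Speed perturbation, volume scaling and additive noise are common choices
    used here purely to illustrate the data amplification step."""
    # Speed perturbation by resampling the time axis.
    n_out = int(round(len(wave) / speed))
    src_pos = np.linspace(0, len(wave) - 1, n_out)
    out = np.interp(src_pos, np.arange(len(wave)), wave)
    # Volume scaling.
    out = out * (10.0 ** (gain_db / 20.0))
    # Additive white noise at the requested signal-to-noise ratio.
    signal_power = np.mean(out ** 2) + 1e-12
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    out = out + np.random.randn(len(out)) * np.sqrt(noise_power)
    return out.astype(np.float32)

augmented = augment_waveform(np.random.randn(16000).astype(np.float32))
print(len(augmented))  # ~14545 samples after a 1.1x speed-up
```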
As a possible implementation manner, the training unit is further configured to:
taking a sample label of the voice sample to be processed corresponding to the augmented sample as an undetermined label of the augmented sample;
and carrying out label alignment treatment according to the voice frame of the augmented sample and the undetermined label to obtain a sample label of the augmented sample.
According to the technical scheme, an acoustic model including a time-delay neural network is obtained for the voice data to be recognized. The voice data is used as input data of the time-delay neural network, and because the output layer of the time-delay neural network includes acoustic modeling units respectively corresponding to a plurality of syllables, the network can take the syllable as the recognition granularity and obtain, through each acoustic modeling unit of the output layer, the syllable probability distribution corresponding to each voice frame included in the voice data. During feature extraction and transmission, the time-delay neural network carries the rich context information of the voice frame in the voice data through to the output layer; this context information reflects the syllables before and after the voice frame, so when the output layer performs syllable recognition on the voice frame, the syllables of the voice frame can be judged with the aid of pronunciation rules in combination with the preceding and following syllable information, and a more accurate syllable probability distribution can be output. Moreover, because a syllable is generally composed of one or more phonemes, its fault tolerance is higher: even if individual phonemes of a syllable in the voice data are mispronounced, the influence on the recognition result of the whole syllable is smaller than with phoneme-level modeling. Therefore, even if the quality of the voice data to be recognized is not high, a relatively accurate voice recognition result can still be determined based on the syllable probability distribution, effectively expanding the application scenes of the voice recognition technology.
The voice recognition apparatus may be a computer device; the computer device may be a server or a terminal device, and the voice recognition apparatus may be embedded in the server or the terminal device. The computer device provided in the embodiment of the present application is described below from the perspective of hardware implementation: fig. 8 is a schematic structural diagram of a server, and fig. 9 is a schematic structural diagram of a terminal device.
Referring to fig. 8, fig. 8 is a schematic diagram of a server 1400 provided by an embodiment of the present application. The server 1400 may vary considerably in configuration or performance, and may include one or more Central Processing Units (CPUs) 1422 (e.g., one or more processors), memory 1432, and one or more storage media 1430 (e.g., one or more mass storage devices) storing applications 1442 or data 1444. The memory 1432 and the storage media 1430 may be transient or persistent storage. The program stored on the storage medium 1430 may include one or more modules (not shown), each of which may include a series of instruction operations on the server. Further, the CPU 1422 may be configured to communicate with the storage medium 1430 to perform, on the server 1400, the series of instruction operations in the storage medium 1430.
The server 1400 may also include one or more power supplies 1426, one or more wired or wireless network interfaces 1450, one or more input-output interfaces 1458, and/or one or more operating systems 1441, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, and so on.
The steps performed by the server in the above embodiments may be based on the server structure shown in fig. 8.
The CPU 1422 is configured to perform the following steps:
acquiring an acoustic model and voice data to be recognized, wherein the acoustic model comprises a time delay neural network, and an output layer of the time delay neural network comprises acoustic modeling units respectively corresponding to a plurality of syllables;
using the voice data as input data of the time delay neural network, and determining syllable probability distribution corresponding to the voice frame included in the voice data through the time delay neural network, wherein the syllable probability distribution is used for identifying the probability that the voice frame corresponds to each of the plurality of syllables;
and determining a voice recognition result corresponding to the voice data according to the syllable probability distribution.
Optionally, the CPU 1422 may further perform method steps of any specific implementation manner of the speech recognition method in the embodiment of the present application.
Referring to fig. 9, fig. 9 is a schematic structural diagram of a terminal device according to an embodiment of the present application. Fig. 9 is a block diagram illustrating a partial structure of a smartphone related to a terminal device provided in an embodiment of the present application, where the smartphone includes: a Radio Frequency (RF) circuit 1510, a memory 1520, an input unit 1530, a display unit 1540, a sensor 1550, an audio circuit 1560, a Wireless Fidelity (WiFi) module 1570, a processor 1580, and a power supply 1590. Those skilled in the art will appreciate that the smartphone configuration shown in fig. 9 is not limiting and may include more or fewer components than shown, or some components in combination, or a different arrangement of components.
The following specifically describes each component of the smartphone with reference to fig. 9:
the RF circuit 1510 may be configured to receive and transmit signals during information transmission and reception or during a call, and in particular, receive downlink information of a base station and then process the received downlink information to the processor 1580; in addition, the data for designing uplink is transmitted to the base station. In general, RF circuit 1510 includes, but is not limited to, an antenna, at least one Amplifier, a transceiver, a coupler, a Low Noise Amplifier (LNA), a duplexer, and the like. In addition, RF circuit 1510 may also communicate with networks and other devices via wireless communication. The wireless communication may use any communication standard or protocol, including but not limited to Global System for Mobile communication (GSM), General Packet Radio Service (GPRS), Code Division Multiple Access (CDMA), Wideband Code Division Multiple Access (WCDMA), Long Term Evolution (LTE), email, Short Message Service (SMS), and the like.
The memory 1520 may be used to store software programs and modules, and the processor 1580 implements various functional applications and data processing of the smart phone by operating the software programs and modules stored in the memory 1520. The memory 1520 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function (such as a sound playing function, an image playing function, etc.), and the like; the storage data area may store data (such as audio data, a phonebook, etc.) created according to the use of the smartphone, and the like. Further, the memory 1520 may include high-speed random access memory and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid-state storage device.
The input unit 1530 may be used to receive input numeric or character information and generate key signal inputs related to user settings and function control of the smartphone. Specifically, the input unit 1530 may include a touch panel 1531 and other input devices 1532. The touch panel 1531, also referred to as a touch screen, can collect touch operations of a user (e.g., operations of the user on or near the touch panel 1531 using any suitable object or accessory such as a finger or a stylus) and drive corresponding connection devices according to a preset program. Alternatively, the touch panel 1531 may include two parts, a touch detection device and a touch controller. The touch detection device detects the touch direction of a user, detects a signal brought by touch operation and transmits the signal to the touch controller; the touch controller receives touch information from the touch sensing device, converts the touch information into touch point coordinates, and sends the touch point coordinates to the processor 1580, and can receive and execute commands sent by the processor 1580. In addition, the touch panel 1531 may be implemented by various types such as a resistive type, a capacitive type, an infrared ray, and a surface acoustic wave. The input unit 1530 may include other input devices 1532 in addition to the touch panel 1531. In particular, other input devices 1532 may include, but are not limited to, one or more of a physical keyboard, function keys (such as volume control keys, switch keys, etc.), a trackball, a mouse, a joystick, and the like.
The display unit 1540 may be used to display information input by the user or information provided to the user and various menus of the smartphone. The Display unit 1540 may include a Display panel 1541, and optionally, the Display panel 1541 may be configured in the form of a Liquid Crystal Display (LCD), an Organic Light-Emitting Diode (OLED), or the like. Further, the touch panel 1531 may cover the display panel 1541, and when the touch panel 1531 detects a touch operation on or near the touch panel 1531, the touch operation is transmitted to the processor 1580 to determine the type of the touch event, and then the processor 1580 provides a corresponding visual output on the display panel 1541 according to the type of the touch event. Although in fig. 9, the touch panel 1531 and the display panel 1541 are two separate components to implement the input and output functions of the smartphone, in some embodiments, the touch panel 1531 and the display panel 1541 may be integrated to implement the input and output functions of the smartphone.
The smartphone may also include at least one sensor 1550, such as light sensors, motion sensors, and other sensors. Specifically, the light sensor may include an ambient light sensor that may adjust the brightness of the display panel 1541 according to the brightness of ambient light and a proximity sensor that may turn off the display panel 1541 and/or backlight when the smartphone is moved to the ear. As one of the motion sensors, the accelerometer sensor can detect the magnitude of acceleration in each direction (generally, three axes), can detect the magnitude and direction of gravity when stationary, and can be used for applications (such as horizontal and vertical screen switching, related games, magnetometer attitude calibration) for recognizing the attitude of the smartphone, and related functions (such as pedometer and tapping) for vibration recognition; as for other sensors such as a gyroscope, a barometer, a hygrometer, a thermometer, and an infrared sensor, which can be configured on the smart phone, further description is omitted here.
WiFi belongs to short-distance wireless transmission technology, and the smart phone can help a user to receive and send e-mails, browse webpages, access streaming media and the like through a WiFi module 1570, and provides wireless broadband internet access for the user. Although fig. 9 shows WiFi module 1570, it is understood that it does not belong to the essential components of the smartphone and may be omitted entirely as needed within the scope not changing the essence of the invention.
The processor 1580 is a control center of the smartphone, connects various parts of the entire smartphone using various interfaces and lines, and performs various functions of the smartphone and processes data by operating or executing software programs and/or modules stored in the memory 1520 and calling data stored in the memory 1520. Optionally, the processor 1580 may include one or more processing units; preferably, the processor 1580 may integrate an application processor, which mainly handles operating systems, user interfaces, application programs, and the like, and a modem processor, which mainly handles wireless communications. It is to be appreciated that the modem processor may not be integrated into the processor 1580.
The smartphone also includes a power supply 1590 (e.g., a battery) for powering the various components, which may preferably be logically connected to the processor 1580 via a power management system, so as to manage charging, discharging, and power consumption management functions via the power management system.
Although not shown, the smart phone may further include a camera, a bluetooth module, and the like, which are not described herein.
In an embodiment of the application, the smartphone includes a memory 1520 that can store program code and transmit the program code to the processor.
The processor 1580 included in the smart phone may execute the voice recognition method provided in the foregoing embodiment according to an instruction in the program code.
The embodiment of the present application further provides a computer-readable storage medium for storing a computer program, where the computer program is used to execute the speech recognition method provided by the foregoing embodiment.
Embodiments of the present application also provide a computer program product or computer program comprising computer instructions stored in a computer-readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to perform the speech recognition method provided in the various alternative implementations of the above aspects.
Those of ordinary skill in the art will understand that all or part of the steps for implementing the above method embodiments can be completed by hardware related to program instructions. The program can be stored in a computer-readable storage medium, and when executed, it performs the steps of the above method embodiments; and the aforementioned storage medium may be at least one of the following media: various media that can store program code, such as Read-Only Memory (ROM), RAM, magnetic disk, or optical disk.
It should be noted that, in the present specification, all the embodiments are described in a progressive manner, and the same and similar parts among the embodiments may be referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the apparatus and system embodiments, since they are substantially similar to the method embodiments, they are described in a relatively simple manner, and reference may be made to some of the descriptions of the method embodiments for related points. The above-described embodiments of the apparatus and system are merely illustrative, and the units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
The above description is only one specific embodiment of the present application, but the scope of the present application is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present application should be covered by the scope of the present application. Moreover, the present application can be further combined to provide more implementations on the basis of the implementations provided by the above aspects. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.
Claims (12)
1. A method of speech recognition, the method comprising:
acquiring an acoustic model and voice data to be recognized, wherein the acoustic model comprises a time delay neural network, and an output layer of the time delay neural network comprises acoustic modeling units respectively corresponding to a plurality of syllables;
using the voice data as input data of the time delay neural network, and determining syllable probability distribution corresponding to the voice frame included in the voice data through the time delay neural network, wherein the syllable probability distribution is used for identifying the probability that the voice frame corresponds to each of the plurality of syllables;
and determining a voice recognition result corresponding to the voice data according to the syllable probability distribution.
2. The method of claim 1, wherein the acoustic model further comprises a decoder, and wherein determining the speech recognition result corresponding to the speech data according to the syllable probability distribution comprises:
determining a keyword corresponding to the awakening scene according to the awakening scene corresponding to the voice data;
determining, by the decoder, a speech recognition result for identifying whether the keyword is included in the speech data according to the syllable probability distribution;
the method further comprises the following steps:
and if the voice recognition result indicates that the voice data comprises the keyword, awakening the terminal equipment corresponding to the awakening scene.
3. The method of claim 2, further comprising:
establishing a matching word list aiming at the key words by taking syllables as division granularity;
determining, by the decoder, a speech recognition result for identifying whether the keyword is included in the speech data according to the syllable probability distribution, including:
and determining a voice recognition result for identifying whether the keyword is included in the voice data through the decoder according to the syllable probability distribution and the matching word list.
4. The method of claim 1, wherein the time-delay neural network comprises N layers of feature extraction layers, j ∈ N, and for an ith speech frame in the speech frames included in the speech data, the determining, by using the speech data as input data of the time-delay neural network, syllable probability distributions respectively corresponding to the speech frames included in the speech data through the time-delay neural network comprises:
determining the speech frame characteristics of the ith frame speech frame through a jth layer characteristic extraction layer according to the output characteristics of the (j-1)th layer characteristic extraction layer for the ith frame speech frame;
determining the output characteristics of the ith frame of voice frame at the jth layer characteristic extraction layer according to the voice frame characteristics and the voice frame characteristics corresponding to at least one frame before and after the ith frame of voice frame in the voice data;
and determining syllable probability distribution corresponding to the ith frame of voice frame according to the output characteristics of the ith frame of voice frame on the jth layer of characteristic extraction layer.
5. The method of claim 3, further comprising:
determining similar syllables of syllables included in the keyword according to the pronunciation similarity;
if the keyword has polyphone, determining polyphone syllables corresponding to the polyphone;
the establishing of the matching word list aiming at the key words by taking the syllables as the division granularity comprises the following steps:
and according to at least one of the similar syllables and the polyphone syllables and the syllables corresponding to the keyword, constructing a matching word list aiming at the keyword by taking the syllables as the division granularity.
6. The method of claim 1, further comprising:
acquiring a voice sample belonging to the same language as the voice data;
obtaining a prediction result through an initial time delay neural network included in an initial acoustic model according to the voice sample as input data of the initial acoustic model;
determining a loss function according to the prediction result and the sample label of the voice sample, wherein the loss function comprises a guidance weight, and the guidance weight is used for improving the influence of the identification path of the correct prediction result in the initial time delay neural network and reducing the influence of the identification path of the wrong prediction result in the initial time delay neural network;
and adjusting model parameters of an initial time delay neural network in the initial acoustic model according to the loss function to obtain the acoustic model.
7. The method of claim 6, wherein obtaining the speech sample in the same language as the speech data comprises:
acquiring a voice sample to be processed which belongs to the same language as the voice data;
performing data amplification according to the voice sample to be processed to obtain an amplification sample, wherein a sample label of the amplification sample is determined based on a sample label of the corresponding voice sample to be processed;
and obtaining the voice sample according to the voice sample to be processed and the augmentation sample.
8. The method of claim 7, further comprising:
taking a sample label of the voice sample to be processed corresponding to the augmented sample as an undetermined label of the augmented sample;
and carrying out label alignment treatment according to the voice frame of the augmented sample and the undetermined label to obtain a sample label of the augmented sample.
9. A speech recognition apparatus is characterized by comprising an acquisition unit, a syllable probability distribution determination unit and a speech recognition result determination unit;
the acquiring unit is used for acquiring an acoustic model and voice data to be recognized, the acoustic model comprises a time delay neural network, and an output layer of the time delay neural network comprises acoustic modeling units respectively corresponding to a plurality of syllables;
the syllable probability distribution determining unit is configured to use the speech data as input data of the time-delay neural network, and determine, by the time-delay neural network, syllable probability distributions corresponding to speech frames included in the speech data, where the syllable probability distributions are used to identify probabilities that the speech frames correspond to the plurality of syllables, respectively;
and the voice recognition result determining unit is used for determining the voice recognition result corresponding to the voice data according to the syllable probability distribution.
10. A computer device, the computer device comprising a processor and a memory:
the memory is used for storing program codes and transmitting the program codes to the processor;
the processor is configured to perform the speech recognition method of any one of claims 1-8 according to instructions in the program code.
11. A computer-readable storage medium for storing a computer program for executing the speech recognition method of any one of claims 1-8.
12. A computer program product comprising instructions which, when run on a computer, cause the computer to perform the speech recognition method of any one of claims 1 to 8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210042387.1A CN114360510A (en) | 2022-01-14 | 2022-01-14 | Voice recognition method and related device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114360510A true CN114360510A (en) | 2022-04-15 |
Family
ID=81091399
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115798465A (en) * | 2023-02-07 | 2023-03-14 | 天创光电工程有限公司 | Voice input method, system and readable storage medium |
CN116612783A (en) * | 2023-07-17 | 2023-08-18 | 联想新视界(北京)科技有限公司 | Voice recognition method and device, electronic equipment and storage medium |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110211588A (en) * | 2019-06-03 | 2019-09-06 | 北京达佳互联信息技术有限公司 | Audio recognition method, device and electronic equipment |
CN110838289A (en) * | 2019-11-14 | 2020-02-25 | 腾讯科技(深圳)有限公司 | Awakening word detection method, device, equipment and medium based on artificial intelligence |
CN110970036A (en) * | 2019-12-24 | 2020-04-07 | 网易(杭州)网络有限公司 | Voiceprint recognition method and device, computer storage medium and electronic equipment |
CN111402893A (en) * | 2020-03-23 | 2020-07-10 | 北京达佳互联信息技术有限公司 | Voice recognition model determining method, voice recognition method and device and electronic equipment |
CN111681661A (en) * | 2020-06-08 | 2020-09-18 | 北京有竹居网络技术有限公司 | Method, device, electronic equipment and computer readable medium for voice recognition |
CN111916058A (en) * | 2020-06-24 | 2020-11-10 | 西安交通大学 | Voice recognition method and system based on incremental word graph re-scoring |
CN112863485A (en) * | 2020-12-31 | 2021-05-28 | 平安科技(深圳)有限公司 | Accent voice recognition method, apparatus, device and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||