WO2022068233A1 - Speech recognition method, apparatus, and computer-readable storage medium - Google Patents

Speech recognition method, apparatus, and computer-readable storage medium

Info

Publication number
WO2022068233A1
Authority
WO
WIPO (PCT)
Prior art keywords
spectrogram
frames
preset
model
network part
Prior art date
Application number
PCT/CN2021/096848
Other languages
English (en)
French (fr)
Inventor
李健
韩雨
武卫东
陈明
Original Assignee
北京捷通华声科技股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京捷通华声科技股份有限公司
Publication of WO2022068233A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/08 Speech classification or search
    • G10L 15/26 Speech to text systems
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00
    • G10L 25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00 characterised by the type of extracted parameters

Definitions

  • the present disclosure relates to speech recognition technology and deep learning technology, and in particular, to a speech recognition method, apparatus, electronic device, and computer-readable storage medium.
  • in speech recognition, traditional speech features are still mostly used. These traditional speech features include MFCC features, FBANK features, and other artificially designed features, which cause information loss in the frequency domain, especially in the high-frequency region, resulting in low speech recognition accuracy.
  • meanwhile, a traditional single-task network model easily overfits the training data, reducing the recognition rate on the test set.
  • embodiments of the present disclosure are proposed to provide a speech recognition method, apparatus, electronic device, and computer-readable storage medium that overcome, or at least partially solve, the above problems.
  • an embodiment of the present disclosure discloses a method for speech recognition, and the method includes:
  • if the number of frames of the spectrogram is not the preset number of frames, the spectrogram is zero-padded so that the to-be-recognized spectrogram obtained after zero-padding has the preset number of frames;
  • the method further includes:
  • the preset model includes a main network part and a branch network part; wherein the main network part is used for outputting texts corresponding to a plurality of the spectrogram samples, and the branch network part is used for outputting reconstructed images corresponding to the plurality of the spectrogram samples;
  • the preset model at the end of training is used as the acoustic model.
  • the step of training the preset model includes:
  • a plurality of the spectrogram samples are respectively input into the branch network part, the reconstructed image corresponding to each spectrogram sample is obtained, and the loss function of the branch network part is obtained according to the plurality of spectrogram samples and their corresponding reconstructed images;
  • a plurality of the spectrogram samples are input into the preset model for training until the loss function of the preset model converges.
  • the step of obtaining the recognized text output by the acoustic model includes:
  • the method also includes:
  • final scores respectively corresponding to a plurality of the texts to be recognized are determined according to the first score and the second score;
  • the step of obtaining multiple spectrogram samples includes:
  • an embodiment of the present disclosure discloses an apparatus for speech recognition, and the apparatus includes:
  • An audio conversion module for converting the acquired audio data into a corresponding spectrogram
  • a frame number judgment module used for judging whether the frame number of the spectrogram is a preset frame number
  • a zero-filling module used for zero-filling the spectrogram if the number of frames of the spectrogram is not the preset number of frames, so that the number of frames of the spectrogram to be recognized obtained after zero-filling is the preset number of frames;
  • a spectrogram input module for inputting the to-be-recognized spectrogram into an acoustic model
  • a recognized text obtaining module is used to obtain the recognized text output by the acoustic model.
  • the device further includes:
  • the sample acquisition module is used to acquire multiple spectrogram samples
  • a model training module for inputting a plurality of the spectrogram samples into a preset model to train the preset model, the preset model including a main network part and a branch network part; wherein the main network part is used for outputting the texts corresponding to a plurality of the spectrogram samples, and the branch network part is used for outputting the reconstructed images corresponding to the plurality of the spectrogram samples; the preset model at the end of training is used as the acoustic model.
  • model training module includes:
  • the CTC loss function acquisition sub-module is configured to acquire the CTC loss function of the main network part according to the main network part, the text label and a plurality of the spectrogram samples;
  • the loss function acquisition sub-module of the branch network part is used for inputting a plurality of the spectrogram samples into the branch network part respectively, obtaining the reconstructed image corresponding to each spectrogram sample, and obtaining the loss function of the branch network part according to the plurality of spectrogram samples and the reconstructed image corresponding to each spectrogram sample;
  • a loss function determination submodule of the preset model configured to determine the loss function of the preset model according to the CTC loss function, the loss function of the branch network part and the preset coefficient;
  • a model training sub-module configured to input a plurality of the spectrogram samples into the preset model for training until the loss function of the preset model converges.
  • an embodiment of the present disclosure further discloses an electronic device, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the method for speech recognition according to the first aspect when executing the program.
  • an embodiment of the present disclosure further discloses a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the steps of the speech recognition method of the first aspect are implemented.
  • by converting the acquired audio data into a corresponding spectrogram, it is judged whether the number of frames of the spectrogram is the preset number of frames; if not, the spectrogram is zero-padded so that the to-be-recognized spectrogram obtained after zero-padding has the preset number of frames; the to-be-recognized spectrogram is then input into the acoustic model.
  • compared with the information loss in the frequency domain caused by computing MFCC features in the prior art, the present disclosure reduces the loss of the input features and increases the recognizability of the audio data.
  • the present disclosure performs a zero-padding operation on spectrograms whose frame number is not the preset number of frames, so that the padded spectrogram is smoother and more recognizable, making it easier for the acoustic model to extract feature information from the spectrogram.
  • FIG. 1 is a flow chart of the application steps of a speech recognition method of the present disclosure;
  • FIG. 2 is a flow chart of the steps of a speech recognition method of the present disclosure;
  • FIG. 3 is a structural block diagram of an embodiment of an apparatus for speech recognition according to the present disclosure.
  • the core idea of the present disclosure is to determine the loss function of the preset model according to the text label of the spectrogram and the reconstructed image, directly input the spectrogram to the acoustic model obtained by training, and the acoustic model outputs the recognized text.
  • the present disclosure reduces the loss of the input feature and increases the recognition degree of audio data.
  • the loss function of the present disclosure not only considers the text label, but also considers the reconstructed image, which reduces the overfitting of the acoustic model and improves the speech recognition rate.
  • FIG. 1 shows a flowchart of the application steps of a speech recognition method of the present disclosure, which may specifically include the following steps:
  • Step 101 Convert the acquired audio data into a corresponding spectrogram.
  • a spectrogram is a three-dimensional spectrum, a graph representing the change of the speech spectrum over time; the vertical axis is frequency, the horizontal axis is time, and the value at each coordinate point is the speech energy.
  • the strength of any given frequency component at a given moment is represented by the grayscale or shade of the corresponding point.
  • Fourier transform is performed on the acquired audio data to obtain the corresponding frequency, and then a time-frequency spectrogram is generated.
  • Step 102 Determine whether the number of frames of the spectrogram is a preset number of frames.
  • an Acoustic Model is a knowledge representation of differences in acoustics, phonetics, environmental variables, speaker gender, accent, and the like.
  • the acoustic model in the embodiment of the present disclosure may be an HMM acoustic model, a DNN-HMM acoustic model, a FFDNN acoustic model, a CNN acoustic model, a CTC acoustic model, etc.
  • the embodiment of the present disclosure does not limit the specific acoustic model, and an appropriate acoustic model can be selected according to the actual situation.
  • the acoustic model has a size requirement on the input spectrogram, so the height and the number of frames of the spectrogram need to be limited.
  • a corresponding preset number of frames is set according to the size requirements of the acoustic model, for example, the preset number of frames is 700 frames.
  • after the audio data is converted into the corresponding spectrogram, it is necessary to determine whether the spectrogram meets the size requirements of the acoustic model, that is, whether its number of frames equals the preset number of frames; it may also be determined whether its height meets a preset height.
  • Step 103 If the number of frames of the spectrogram is not the preset number of frames, the spectrogram is zero-padded so that the to-be-recognized spectrogram obtained after zero-padding has the preset number of frames.
  • when the number of frames of the spectrogram is less than the preset number of frames, the spectrogram is zero-padded. The padded spectrogram is not only smoother and more recognizable, which makes it easier for the acoustic model to extract feature information from the spectrogram, but also meets the acoustic model's requirement on the number of frames. Meanwhile, if the height of the spectrogram does not meet the height requirement of the acoustic model, the spectrogram is enlarged or reduced so that the modified spectrogram meets that requirement.
  • zero-padding here means adding sampling points to each frame of the spectrogram.
  • Step 104 Input the above-mentioned spectrogram to be recognized into the acoustic model.
  • the spectrogram that meets the input requirements of the acoustic model is input into the acoustic model.
  • Step 105 Obtain the recognized text output by the above acoustic model.
  • the acoustic model extracts the frames in the spectrogram to be recognized in chronological order, sequentially outputs a plurality of texts matching the corresponding frames, and scores each text.
  • for example, a spectrogram includes 30 frames; the spectrogram is input into a suitable acoustic model, which outputs "你", "您", "另", "例" for the first 15 frames with scores of 0.5, 0.3, 0.1, and 0.1 respectively, and then outputs "号", "好", "豪" for the last 15 frames with scores of 0.2, 0.6, and 0.2 respectively.
  • it is judged whether the frame number of the spectrogram is the preset number of frames; if not, the spectrogram is zero-padded so that the to-be-recognized spectrogram obtained after zero-padding has the preset number of frames, and the to-be-recognized spectrogram is input into the acoustic model.
  • compared with the information loss in the frequency domain caused by computing MFCC features in the prior art, the present disclosure reduces the loss of the input features and increases the recognizability of the audio data.
  • the present disclosure performs a zero-padding operation on spectrograms whose frame number is not the preset number of frames, so that the padded spectrogram is smoother and more recognizable, making it easier for the acoustic model to extract feature information from the spectrogram.
  • FIG. 2 shows a flowchart of the steps of a speech recognition method of the present disclosure, which may specifically include the following steps:
  • Step 201 acquiring multiple spectrogram samples.
  • multiple pieces of audio data are acquired and converted into corresponding spectrograms; a zero-padding operation is performed on those spectrograms whose number of frames is less than the preset number of frames, so that the padded spectrograms have the preset number of frames; spectrograms whose number of frames is greater than the preset number of frames are deleted, and data augmentation is performed on the remaining spectrograms to obtain a plurality of the spectrogram samples.
  • specifically, a plurality of pieces of audio data are obtained first, each piece is Fourier transformed into frequencies, and a corresponding time-frequency spectrogram is generated; then the number of frames of each spectrogram is compared with the preset number of frames, which is the input frame size of the acoustic model; spectrograms with fewer frames than the preset number are zero-padded so that they reach the preset number of frames, while spectrograms with more frames than the preset number are discarded.
  • the heights of these spectrograms must be consistent and meet the input requirements of the acoustic model; finally, the number of spectrograms meeting the input requirements is expanded by data augmentation, which modifies the spectrograms by distorting the time-domain signal, masking frequency-domain channels, and masking time-domain channels.
  • This enhancement method can increase the robustness of the network and improve the recognition rate, and the increased number can also be adjusted according to the actual effect.
  • SpecAugment (a data augmentation method) is used to augment the spectrogram samples. Specifically, the roughly 100,000 spectrograms are copied, and the copies are modified by distorting the time-domain signal, masking frequency-domain channels, and masking time-domain channels.
  • the sample set is thus doubled, yielding about 200,000 spectrograms.
  • Step 202 Input a plurality of the above-mentioned spectrogram samples into a preset model to train the above-mentioned preset model.
  • the preset model includes a main network part and a branch network part; the main network part is used to output the text corresponding to each sample, and the branch network part is used to reconstruct the input spectrogram.
  • the specific training process is as follows:
  • a plurality of spectrogram samples are respectively input into the branch network part, the reconstructed image corresponding to each spectrogram sample is obtained, and the loss function of the branch network part is obtained according to the spectrogram samples and their corresponding reconstructed images;
  • a plurality of spectrogram samples are input into the preset model for training until the loss function of the preset model converges.
  • the audio data corresponding to each spectrogram in all spectrogram samples is identified by manual recognition, and the corresponding text label of each spectrogram is obtained.
  • each text label is the correct text represented by the corresponding spectrogram.
  • All spectrograms and corresponding text labels are input to the main network part, and the output text of the main network is compared with the corresponding text labels to determine the difference between the two.
  • the final difference between all the spectrograms and the corresponding text labels can be determined by averaging all the differences, and the CTC loss function of the main network part is determined according to this final difference.
  • All spectrogram samples are input into the branch network part one by one, and a reconstructed image corresponding to each spectrogram sample is obtained, wherein the reconstructed image is a re-restored image of the input spectrogram.
  • the loss function of the branch network is determined according to the mean square error between each spectrogram and its corresponding reconstructed image, that is, the average sum of squares of the distances of all the output reconstructed images deviating from the corresponding input spectrogram.
  • the function of the branch network is to regularize, avoid overfitting of the preset model, and improve the recognition rate of the model.
  • finally, the loss function of the preset model is the CTC loss function plus the loss function of the branch network part multiplied by the preset coefficient.
  • the preset coefficient is a value between 0 and 1 and can be adjusted according to the training results of the preset model. After adjustment, the preset model is retrained until its loss function fully converges, and the preset model at the end of training is used as the acoustic model.
  • the function of the branch network part is to provide a loss function corresponding to the reconstructed image.
  • Step 203 Convert the acquired audio data into a corresponding spectrogram.
  • a piece of audio data to be recognized is acquired, a Fourier transform is performed on it to obtain the corresponding frequencies, and a time-frequency spectrogram is then generated in chronological order.
  • Step 204 judging whether the number of frames of the spectrogram is a preset number of frames.
  • since the acoustic model has requirements on the size of the input spectrogram, it is necessary to judge whether the width of the spectrogram to be input meets the requirements, that is, whether the number of frames of the spectrogram meets the size requirement of the acoustic model.
  • to improve recognition efficiency, it may also be judged whether the height of the spectrogram meets the optimal input requirements of the acoustic model, so that the height can be adjusted accordingly.
  • Step 205 if the number of frames of the spectrogram is not the preset number of frames, the spectrogram is zero-padded, so that the number of frames of the spectrogram to be recognized obtained after zero-filling is the preset number of frames.
  • the graph of the spectrogram after zero-filling is smoother, which increases the recognition degree.
  • if the number of frames exceeds the preset number, the spectrogram can be cropped so that each cropped spectrogram has at most the preset number of frames, and the cropped spectrograms are fed into the acoustic model one by one for recognition.
  • image processing can also be performed on the spectrogram whose height exceeds or is less than the height required by the acoustic model, so that the height of the processed spectrogram meets the height requirement of the acoustic model.
  • for specific image processing techniques, refer to the prior art.
  • Step 206 Input the above-mentioned spectrogram to be recognized into the acoustic model to obtain the recognized text output by the above-mentioned acoustic model.
  • the spectrogram to be recognized is input into the acoustic model, and the acoustic model outputs a plurality of texts represented by the acoustic features of each frame in the spectrogram, and the score of each text.
  • Step 207 obtaining the final recognized text through the language model.
  • a language model (Language Model, LM for short) is a knowledge representation of sequences of characters (words); its purpose is to make the output text as grammatical and fluent as possible.
  • the language model may be a TF-IDF language model, an N-gram language model, a Word2vec language model, a CBOW language model, a Glove language model, etc.; the embodiments of the present disclosure impose no specific limitation, and which language model to use can be determined according to the specific situation.
  • the language model can be obtained by training on a large amount of plain-text corpus.
  • the plain-text corpus can be the text label information corresponding to the spectrogram samples, or other text information, such as news text obtained by crawler technology.
  • since the acoustic model determines the text corresponding to each frame of acoustic features at the physical level, that text may not meet people's actual needs and must be adjusted through the language model.
  • the output text of the acoustic model is fed into the language model, which determines the best recognized text based on a dictionary. Specifically, the acoustic model outputs multiple texts, which are single characters, together with their scores.
  • the language model receives these single characters and the score corresponding to each character in order.
  • the language model regroups and corrects these characters according to the dictionary, outputs multiple texts, and scores them; finally, the score of the acoustic model and the score of the language model are combined to determine the best text.
  • for example, a spectrogram includes 30 frames; the spectrogram is input into a suitable acoustic model, which outputs "你", "您", "另" for the first 15 frames with scores of 0.5, 0.3, and 0.2 respectively, and then outputs "好" for the last 15 frames with a score of 1. These texts are input into the language model, which outputs "你好" with a score of 0.4, "您好" with a score of 0.4, and "另好" with a score of 0.2. Combining the scores, the total for "你好" is 0.5 + 1 + 0.4 = 1.9, for "您好" 0.3 + 1 + 0.4 = 1.7, and for "另好" 0.2 + 1 + 0.2 = 1.4, so the final text is determined to be "你好".
  • the scoring of the acoustic model and the scoring of the language model may also be combined by assigning weights to the two scores respectively and determining the final recognized text from the weighted combination; the present disclosure does not limit the manner of combination.
  • audio data is directly converted into a spectrogram and then used for text recognition in an acoustic model, which reduces and compensates for the loss of feature information in the frequency domain caused by traditional calculation of MFCC features.
  • the loss function of the embodiment of the present disclosure not only considers the text label, but also uses the reconstructed image as a regular term, which improves the recognition rate of the acoustic model and is more beneficial for the acoustic model to extract feature information on the spectrogram.
  • the spectrogram in the embodiment of the present disclosure is directly input into the acoustic model, which can be used for speech recognition and for services involving speech recognition, such as speech-recognition-based voice navigation and voice quality inspection, with wide applicability and high accuracy.
  • FIG. 3 shows a structural block diagram of an embodiment of an apparatus for speech recognition of the present disclosure.
  • the specific modules are as follows:
  • Audio conversion module 301 for converting the acquired audio data into corresponding spectrogram
  • the frame number judgment module 302 is used for judging whether the frame number of the above-mentioned spectrogram is a preset frame number
  • the zero-padding module 303 is used to zero-pad the spectrogram if the number of frames of the spectrogram is not the preset number of frames, so that the to-be-recognized spectrogram obtained after zero-padding has the preset number of frames;
  • the spectrogram input module 304 is used to input the above-mentioned spectrogram to be recognized into the acoustic model
  • the recognized text obtaining module 305 is configured to obtain the recognized text output by the above acoustic model.
  • the above-mentioned device also includes:
  • the sample acquisition module is used to acquire multiple spectrogram samples
  • a model training module is used to input a plurality of the above-mentioned spectrogram samples into a preset model to train the above-mentioned preset model, and the above-mentioned preset model includes a main network part and a branch network part; wherein, the above-mentioned main network part uses For outputting texts corresponding to the plurality of spectrogram samples, the branch network part is used for outputting reconstructed images corresponding to the plurality of spectrogram samples; the preset model at the end of the training is used as the acoustic model.
  • a first score obtaining module configured to obtain a plurality of texts to be recognized output by the acoustic model and a first score corresponding to the plurality of texts to be recognized;
  • a recognized text input module used for inputting a plurality of the above-mentioned texts to be recognized into the above-mentioned language model respectively;
  • a second score obtaining module configured to obtain the second scores obtained by the above-mentioned language model for recognizing a plurality of the above-mentioned texts to be recognized respectively;
  • a final score module configured to determine the final scores corresponding to the plurality of above-mentioned texts to be recognized respectively according to the above-mentioned first score and the above-mentioned second score;
  • the final recognized text determination module is configured to compare the final scores corresponding to the recognized texts, and determine the text to be recognized corresponding to the highest final score as the final recognized text.
  • the above-mentioned sample acquisition module specifically includes the following sub-modules:
  • Audio data conversion submodule for obtaining multiple pieces of audio data, and converting multiple pieces of above-mentioned audio data into corresponding multiple spectrograms
  • the zero-padding sub-module is used to perform zero-padding operations on those spectrograms whose number of frames is less than the preset number of frames, so that the spectrograms obtained after zero-padding have the preset number of frames;
  • the data enhancement sub-module is used for deleting a plurality of spectrograms whose frame numbers are greater than the preset number of frames in the above-mentioned spectrograms, and performing data enhancement on the remaining spectrograms to obtain a plurality of the above-mentioned spectrogram samples.
  • the above model training module includes:
  • the CTC loss function acquisition sub-module is used to obtain the CTC loss function of the above-mentioned main network part according to the above-mentioned main network part, the text label and a plurality of the above-mentioned spectrogram samples;
  • the loss function acquisition sub-module of the branch network part is used to input a plurality of the spectrogram samples into the branch network part respectively, obtain the reconstructed image corresponding to each spectrogram sample, and obtain the loss function of the branch network part according to the plurality of spectrogram samples and the reconstructed image corresponding to each spectrogram sample;
  • the loss function determination sub-module of the preset model is used to determine the loss function of the preset model according to the CTC loss function, the loss function of the branch network part and the preset coefficient;
  • the model training sub-module is used for inputting a plurality of the above-mentioned spectrogram samples into the above-mentioned preset model for training until the loss function of the above-mentioned preset model converges.
  • another embodiment of the present disclosure provides an electronic device, including a memory, a processor, and a computer program stored in the memory and runnable on the processor, the processor implementing the steps of the method of any of the above embodiments of the present disclosure when executing the program.
  • another embodiment of the present disclosure provides a computer-readable storage medium on which a computer program is stored, and when the program is executed by a processor, implements the steps in the above-mentioned method according to any of the above-mentioned embodiments of the present disclosure.
  • embodiments of the embodiments of the present disclosure may be provided as a method, apparatus, or computer program product. Accordingly, embodiments of the present disclosure may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, embodiments of the present disclosure may take the form of a computer program product implemented on one or more computer-usable storage media having computer-usable program code embodied therein, including but not limited to disk storage, CD-ROM, optical storage, and the like.
  • Embodiments of the present disclosure are described with reference to flowcharts and/or block diagrams of methods, terminal devices (systems), and computer program products according to embodiments of the present disclosure. It will be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks therein, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing terminal equipment to produce a machine, such that the instructions executed by the processor create means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
  • These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing terminal equipment to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture comprising instruction means that implement the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Signal Processing (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A speech recognition method and apparatus, and a computer-readable storage medium. The method includes: converting acquired audio data into a corresponding spectrogram (101); judging whether the number of frames of the spectrogram is a preset number of frames (102); if the number of frames of the spectrogram is not the preset number of frames, zero-padding the spectrogram so that the to-be-recognized spectrogram obtained after zero-padding has the preset number of frames (103); inputting the to-be-recognized spectrogram into an acoustic model (104); and obtaining the recognized text output by the acoustic model (105). Compared with the information loss in the frequency domain caused by computing MFCC features in the prior art, this solution reduces the loss of input features, increases the recognizability of the audio data, and makes it easier for the acoustic model to extract feature information.

Description

Speech recognition method, apparatus, and computer-readable storage medium
The present disclosure claims priority to the patent document filed on September 29, 2020, with application number 202011046734.5 and entitled "Speech recognition method, apparatus, device, and medium"; the entire contents of that document are incorporated into the present disclosure by reference.
Technical Field
The present disclosure relates to speech recognition technology and deep learning technology, and in particular to a speech recognition method, apparatus, electronic device, and computer-readable storage medium.
Background
With the popularization of intelligent products, speech recognition, as a means of human-computer interaction, is becoming increasingly important.
In current speech recognition, traditional speech features are mostly used. These traditional speech features include MFCC features, FBANK features, and other artificially designed features, which cause information loss in the frequency domain, especially in the high-frequency region, resulting in low speech recognition accuracy. Meanwhile, a traditional single-task network model easily overfits the training data, reducing the recognition rate on the test set.
Summary
In view of the above problems, embodiments of the present disclosure are proposed to provide a speech recognition method, apparatus, electronic device, and computer-readable storage medium that overcome, or at least partially solve, the above problems.
In a first aspect, to solve the above problems, an embodiment of the present disclosure discloses a speech recognition method, the method comprising:
converting acquired audio data into a corresponding spectrogram;
judging whether the number of frames of the spectrogram is a preset number of frames;
if the number of frames of the spectrogram is not the preset number of frames, zero-padding the spectrogram so that the to-be-recognized spectrogram obtained after zero-padding has the preset number of frames;
inputting the to-be-recognized spectrogram into an acoustic model;
obtaining the recognized text output by the acoustic model.
Optionally, the method further comprises:
acquiring multiple spectrogram samples;
inputting the multiple spectrogram samples into a preset model to train the preset model, the preset model comprising a main network part and a branch network part, wherein the main network part is used to output texts corresponding to the multiple spectrogram samples, and the branch network part is used to output reconstructed images corresponding to the multiple spectrogram samples;
using the preset model at the end of training as the acoustic model.
Optionally, the step of training the preset model comprises:
obtaining a CTC loss function of the main network part according to the main network part, text labels, and the multiple spectrogram samples;
inputting the multiple spectrogram samples into the branch network part respectively, obtaining the reconstructed image corresponding to each spectrogram sample, and obtaining a loss function of the branch network part according to the multiple spectrogram samples and their corresponding reconstructed images;
determining a loss function of the preset model according to the CTC loss function, the loss function of the branch network part, and a preset coefficient;
inputting the multiple spectrogram samples into the preset model for training until the loss function of the preset model converges.
Optionally, the step of obtaining the recognized text output by the acoustic model comprises:
obtaining multiple to-be-recognized texts output by the acoustic model and first scores respectively corresponding to the multiple to-be-recognized texts;
the method further comprises:
inputting the multiple to-be-recognized texts into a language model respectively;
obtaining second scores from the language model for recognizing the multiple to-be-recognized texts respectively;
determining final scores respectively corresponding to the multiple to-be-recognized texts according to the first scores and the second scores;
comparing the final scores corresponding to the to-be-recognized texts, and determining the to-be-recognized text corresponding to the highest final score as the final recognized text.
Optionally, the step of acquiring multiple spectrogram samples comprises:
acquiring multiple pieces of audio data, and converting the multiple pieces of audio data into multiple corresponding spectrograms;
performing a zero-padding operation on those spectrograms whose number of frames is less than the preset number of frames, so that the spectrograms obtained after zero-padding have the preset number of frames;
deleting those spectrograms whose number of frames is greater than the preset number of frames, and performing data augmentation on the remaining spectrograms to obtain the multiple spectrogram samples.
In a second aspect, to solve the above problems, an embodiment of the present disclosure discloses a speech recognition apparatus, the apparatus comprising:
an audio conversion module, configured to convert acquired audio data into a corresponding spectrogram;
a frame number judgment module, configured to judge whether the number of frames of the spectrogram is a preset number of frames;
a zero-padding module, configured to zero-pad the spectrogram if the number of frames of the spectrogram is not the preset number of frames, so that the to-be-recognized spectrogram obtained after zero-padding has the preset number of frames;
a spectrogram input module, configured to input the to-be-recognized spectrogram into an acoustic model;
a recognized text obtaining module, configured to obtain the recognized text output by the acoustic model.
Optionally, the apparatus further comprises:
a sample acquisition module, configured to acquire multiple spectrogram samples;
a model training module, configured to input the multiple spectrogram samples into a preset model to train the preset model, the preset model comprising a main network part and a branch network part, wherein the main network part is used to output texts corresponding to the multiple spectrogram samples and the branch network part is used to output reconstructed images corresponding to the multiple spectrogram samples, and to use the preset model at the end of training as the acoustic model.
Optionally, the model training module comprises:
a CTC loss function acquisition submodule, configured to obtain a CTC loss function of the main network part according to the main network part, text labels, and the multiple spectrogram samples;
a branch network loss function acquisition submodule, configured to input the multiple spectrogram samples into the branch network part respectively, obtain the reconstructed image corresponding to each spectrogram sample, and obtain a loss function of the branch network part according to the multiple spectrogram samples and their corresponding reconstructed images;
a preset model loss function determination submodule, configured to determine a loss function of the preset model according to the CTC loss function, the loss function of the branch network part, and a preset coefficient;
a model training submodule, configured to input the multiple spectrogram samples into the preset model for training until the loss function of the preset model converges.
In a third aspect, to solve the above problems, an embodiment of the present disclosure further discloses an electronic device including a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the speech recognition method of the first aspect when executing the program.
In a fourth aspect, to solve the above problems, an embodiment of the present disclosure further discloses a computer-readable storage medium having a computer program stored thereon, the computer program, when executed by a processor, implementing the steps of the speech recognition method of the first aspect.
Embodiments of the present disclosure have the following advantages:
In the embodiments of the present disclosure, acquired audio data is converted into a corresponding spectrogram; whether the number of frames of the spectrogram is a preset number of frames is judged; if not, the spectrogram is zero-padded so that the to-be-recognized spectrogram obtained after zero-padding has the preset number of frames; and the to-be-recognized spectrogram is input into an acoustic model. Text recognition of the audio data is thus achieved. Moreover, since a spectrogram conforming to the preset number of frames is input directly into the acoustic model for recognition, compared with the information loss in the frequency domain caused by computing MFCC features in the prior art, the present disclosure reduces the loss of input features and increases the recognizability of the audio data. In addition, the present disclosure performs a zero-padding operation on spectrograms whose number of frames is not the preset number, making the padded spectrogram smoother, increasing recognizability, and making it easier for the acoustic model to extract feature information from the spectrogram.
Brief Description of the Drawings
FIG. 1 is a flowchart of the application steps of a speech recognition method of the present disclosure;
FIG. 2 is a flowchart of the steps of a speech recognition method of the present disclosure;
FIG. 3 is a structural block diagram of an embodiment of a speech recognition apparatus of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. Although the drawings show exemplary embodiments of the present disclosure, it should be understood that the present disclosure may be implemented in various forms and should not be limited by the embodiments set forth herein. Rather, these embodiments are provided so that the present disclosure will be understood more thoroughly and its scope conveyed fully to those skilled in the art.
The core idea of the present disclosure is: determining the loss function of a preset model according to the text labels of spectrograms and the reconstructed images, inputting a spectrogram directly into the acoustic model obtained by training, and having the acoustic model output the recognized text. Compared with the information loss in the frequency domain caused by computing MFCC features in the prior art, the present disclosure reduces the loss of input features and increases the recognizability of the audio data. Meanwhile, the loss function of the present disclosure considers not only the text labels but also the reconstructed images, which mitigates overfitting of the acoustic model and improves the speech recognition rate.
Referring to FIG. 1, which shows a flowchart of the application steps of a speech recognition method of the present disclosure, the method may specifically include the following steps:
Step 101: converting acquired audio data into a corresponding spectrogram.
In the present disclosure, a spectrogram is a three-dimensional spectrum, a graph representing how the speech spectrum changes over time; its vertical axis is frequency, its horizontal axis is time, and the value at each coordinate point is the speech energy. The strength of any given frequency component at a given moment is represented by the grayscale or shade of the corresponding point.
In the embodiments of the present disclosure, a Fourier transform is performed on the acquired audio data to obtain the corresponding frequencies, and a time-frequency spectrogram is then generated.
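To make this step concrete, the following is a minimal sketch of converting audio into a spectrogram via the short-time Fourier transform, assuming 8 kHz mono input and conventional 25 ms windows with a 10 ms hop; the disclosure does not fix these parameters, and the function name is illustrative.

    import numpy as np

    def audio_to_spectrogram(samples: np.ndarray,
                             win_len: int = 200,   # 25 ms at 8 kHz (assumed)
                             hop_len: int = 80):   # 10 ms at 8 kHz (assumed)
        # One spectrogram column per frame: windowed slice -> |rfft|.
        assert len(samples) >= win_len, "need at least one full window"
        window = np.hanning(win_len)
        n_frames = 1 + (len(samples) - win_len) // hop_len
        frames = np.stack([samples[i * hop_len : i * hop_len + win_len] * window
                           for i in range(n_frames)])
        spec = np.abs(np.fft.rfft(frames, axis=1))
        # Log compression corresponds to the grayscale "shade" of each point.
        return np.log1p(spec).T   # shape: (freq_bins, n_frames)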
Step 102: judging whether the number of frames of the spectrogram is a preset number of frames.
In the present disclosure, an acoustic model (Acoustic Model, AM for short) is a knowledge representation of differences in acoustics, phonetics, environmental variables, speaker gender, accent, and the like. In the embodiments of the present disclosure, the acoustic model may be an HMM acoustic model, a DNN-HMM acoustic model, an FFDNN acoustic model, a CNN acoustic model, a CTC acoustic model, etc.; the embodiments do not limit the specific acoustic model, and a suitable one can be selected according to the actual situation.
In the embodiments of the present disclosure, the acoustic model has size requirements on the input spectrogram, so the height and the number of frames of the spectrogram need to be limited.
A corresponding preset number of frames is generally set according to the size requirements of the acoustic model, for example, 700 frames. After the audio data is converted into the corresponding spectrogram, it is necessary to judge whether the spectrogram meets the size requirements of the acoustic model, that is, whether its number of frames equals the preset number of frames; whether its height meets a preset height may also be judged.
Step 103: if the number of frames of the spectrogram is not the preset number of frames, zero-padding the spectrogram so that the to-be-recognized spectrogram obtained after zero-padding has the preset number of frames.
In the embodiments of the present disclosure, when the number of frames of the spectrogram is less than the preset number of frames, the spectrogram is zero-padded. The padded spectrogram is not only smoother and more recognizable, which makes it easier for the acoustic model to extract feature information from the spectrogram, but also meets the acoustic model's requirement on the number of frames. Meanwhile, if the height of the spectrogram does not meet the height requirement of the acoustic model, the spectrogram is enlarged or reduced so that the modified spectrogram meets that requirement.
In the present disclosure, zero-padding means adding sampling points to each frame of the spectrogram.
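A minimal sketch of steps 102 and 103, assuming the spectrogram is a (freq_bins, n_frames) array and using the 700-frame preset from the example above; right-padding with zero-valued columns is one plausible reading of the zero-padding operation.

    import numpy as np

    PRESET_FRAMES = 700  # example value from the disclosure

    def fit_to_preset(spec: np.ndarray, preset: int = PRESET_FRAMES) -> np.ndarray:
        n_frames = spec.shape[1]
        if n_frames == preset:        # step 102: already the preset width
            return spec
        if n_frames < preset:         # step 103: zero-pad up to the preset width
            pad = np.zeros((spec.shape[0], preset - n_frames), dtype=spec.dtype)
            return np.concatenate([spec, pad], axis=1)
        return spec[:, :preset]       # over-long inputs may instead be cropped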
Step 104: inputting the to-be-recognized spectrogram into the acoustic model.
In the embodiments of the present disclosure, the spectrogram that meets the input requirements of the acoustic model is input into the acoustic model.
Step 105: obtaining the recognized text output by the acoustic model.
In the embodiments of the present disclosure, the acoustic model extracts the frames in the to-be-recognized spectrogram in chronological order, sequentially outputs multiple texts matching the corresponding frames, and scores each text.
For example, if a spectrogram includes 30 frames, it is input into a suitable acoustic model; for the first 15 frames the acoustic model outputs "你", "您", "另", "例" with scores of 0.5, 0.3, 0.1, and 0.1 respectively; for the last 15 frames it outputs "号", "好", "豪" with scores of 0.2, 0.6, and 0.2 respectively.
In the embodiments of the present disclosure, acquired audio data is converted into a corresponding spectrogram; whether the number of frames of the spectrogram is the preset number of frames is judged; if not, the spectrogram is zero-padded so that the to-be-recognized spectrogram obtained after zero-padding has the preset number of frames; and the to-be-recognized spectrogram is input into the acoustic model. Text recognition of the audio data is thus achieved. Moreover, since a spectrogram conforming to the preset number of frames is input directly into the acoustic model for recognition, compared with the information loss in the frequency domain caused by computing MFCC features in the prior art, the present disclosure reduces the loss of input features and increases the recognizability of the audio data. In addition, the present disclosure performs a zero-padding operation on spectrograms whose number of frames is not the preset number, making the padded spectrogram smoother, increasing recognizability, and making it easier for the acoustic model to extract feature information from the spectrogram.
Referring to FIG. 2, which shows a flowchart of the steps of a speech recognition method of the present disclosure, the method may specifically include the following steps:
Step 201: acquiring multiple spectrogram samples.
In the embodiments of the present disclosure, multiple pieces of audio data are acquired and converted into corresponding spectrograms; a zero-padding operation is performed on those spectrograms whose number of frames is less than the preset number of frames, so that the padded spectrograms have the preset number of frames; spectrograms whose number of frames is greater than the preset number of frames are deleted, and data augmentation is performed on the remaining spectrograms to obtain the multiple spectrogram samples.
Specifically, multiple pieces of audio data are obtained first, each piece is Fourier transformed into frequencies, and the corresponding time-frequency spectrograms are generated in order; then the number of frames of each spectrogram is compared with the preset number of frames, which is the input frame size of the acoustic model; spectrograms with fewer frames than the preset number are zero-padded so that they reach the preset number of frames, while spectrograms with more frames are discarded. Meanwhile, the heights of these spectrograms must be consistent and meet the input requirements of the acoustic model. Finally, the spectrograms meeting the input requirements of the acoustic model are expanded in number by data augmentation, modifying the spectrograms by distorting the time-domain signal, masking frequency-domain channels, and masking time-domain channels. This kind of augmentation can increase the robustness of the network and improve the recognition rate, and the amount of expansion can be adjusted according to the actual effect.
For example, given 100,000 sentences of 8 kHz speech data of varying lengths, a Fourier transform is performed on each sentence to obtain frequencies, and time-frequency spectrograms are generated, where all spectrograms have a height of 8000 and a width equal to the number of frames of the sentence. Since the acoustic model requires input spectrograms of a uniform size, a threshold such as 700 frames is set; all spectrograms with fewer than 700 frames are zero-padded to 700 frames, and the few exceeding 700 frames are discarded, yielding about 100,000 spectrograms of length 700 and height 8000. SpecAugment (a data augmentation method) is then applied to augment the spectrogram samples. Specifically, the roughly 100,000 spectrograms are copied, and the copies are modified by distorting the time-domain signal, masking frequency-domain channels, and masking time-domain channels, doubling the sample set to about 200,000 spectrograms.
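The masking operations above can be sketched as follows; this is a hedged illustration of SpecAugment-style augmentation on a copied spectrogram, with the time-warping step omitted for brevity and mask sizes chosen arbitrarily rather than taken from the disclosure.

    import numpy as np

    def augment(spec: np.ndarray, freq_mask: int = 30, time_mask: int = 40,
                rng=None) -> np.ndarray:
        rng = rng or np.random.default_rng()
        out = spec.copy()                     # modify a copy, keep the original
        n_freq, n_time = out.shape
        f0 = rng.integers(0, max(1, n_freq - freq_mask))
        out[f0 : f0 + freq_mask, :] = 0.0     # mask a band of frequency channels
        t0 = rng.integers(0, max(1, n_time - time_mask))
        out[:, t0 : t0 + time_mask] = 0.0     # mask a span of time steps
        return out

    # Doubling the sample set as in the example: originals plus augmented copies.
    # samples = samples + [augment(s) for s in samples]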
Step 202: inputting the multiple spectrogram samples into a preset model to train the preset model.
Specifically, the preset model includes a main network part and a branch network part; the main network part is used to output the text corresponding to each sample, and the branch network part is used to reconstruct the input spectrogram. The specific training process is as follows:
obtaining the CTC loss function of the main network part according to the main network part, the text labels, and the multiple spectrogram samples;
inputting the multiple spectrogram samples into the branch network part respectively, obtaining the reconstructed image corresponding to each spectrogram sample, and obtaining the loss function of the branch network part according to the spectrogram samples and their corresponding reconstructed images;
determining the loss function of the preset model according to the CTC loss function, the loss function of the branch network part, and a preset coefficient;
inputting the multiple spectrogram samples into the preset model for training until the loss function of the preset model converges.
In practice, the audio data corresponding to each spectrogram in the sample set is recognized manually to obtain the text label of each spectrogram, where each text label is the correct text represented by the corresponding spectrogram. All spectrograms and their text labels are input to the main network part, the output text of the main network is compared with the corresponding text labels to determine the difference between the two, the final difference over all spectrograms can be determined by averaging all the differences, and the CTC loss function of the main network part is determined according to this final difference.
All spectrogram samples are input into the branch network part one by one to obtain the reconstructed image corresponding to each spectrogram sample, where a reconstructed image is a re-restored image of the input spectrogram. The loss function of the branch network is determined from the mean squared error between each spectrogram and its corresponding reconstructed image, that is, the average of the squared distances by which all output reconstructed images deviate from their corresponding input spectrograms. The role of this branch network is regularization: it avoids overfitting of the preset model and improves the recognition rate of the model.
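A minimal sketch of such a dual-branch preset model: a shared encoder, a main head emitting per-frame token log-probabilities for the CTC loss, and a branch head reconstructing the input spectrogram. The encoder type and layer sizes are assumptions for illustration; the disclosure does not specify the architecture.

    import torch
    import torch.nn as nn

    class PresetModel(nn.Module):
        def __init__(self, n_freq: int, n_tokens: int, hidden: int = 256):
            super().__init__()
            self.encoder = nn.GRU(n_freq, hidden, num_layers=2, batch_first=True)
            self.ctc_head = nn.Linear(hidden, n_tokens + 1)  # index 0 = CTC blank
            self.recon_head = nn.Linear(hidden, n_freq)      # branch network part

        def forward(self, spec: torch.Tensor):
            # spec: (batch, n_frames, n_freq)
            h, _ = self.encoder(spec)
            log_probs = self.ctc_head(h).log_softmax(dim=-1)  # main network output
            recon = self.recon_head(h)                        # reconstructed image
            return log_probs, recon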
Finally, the loss function of the preset model is the CTC loss function plus the loss function of the branch network part multiplied by a preset coefficient. The preset coefficient is a value between 0 and 1, and it can be adjusted according to the training results of the preset model; after adjustment, the preset model is retrained until its loss function fully converges, and the preset model at the end of training is used as the acoustic model.
In the embodiments of the present disclosure, the role of the branch network part is precisely to provide the loss function corresponding to the reconstructed image.
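One training step under these definitions might then look as follows; the value 0.3 stands in for the preset coefficient in (0, 1) and is not a value given in the disclosure.

    import torch
    import torch.nn.functional as F

    def training_step(model, spec, targets, target_lengths, coeff: float = 0.3):
        # spec: (batch, n_frames, n_freq)
        # targets: concatenated label indices (starting at 1; 0 is the blank)
        log_probs, recon = model(spec)
        batch, n_frames, _ = spec.shape
        input_lengths = torch.full((batch,), n_frames, dtype=torch.long)
        ctc = F.ctc_loss(log_probs.transpose(0, 1),      # CTC wants (T, batch, C)
                         targets, input_lengths, target_lengths, blank=0)
        mse = F.mse_loss(recon, spec)                    # branch regularization term
        return ctc + coeff * mse                         # loss of the preset model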
Step 203: converting acquired audio data into a corresponding spectrogram.
In the embodiments of the present disclosure, a piece of audio data to be recognized is acquired, a Fourier transform is performed on it to obtain the corresponding frequencies, and a time-frequency spectrogram is generated in chronological order.
Step 204: judging whether the number of frames of the spectrogram is the preset number of frames.
Since the acoustic model has requirements on the size of the input spectrogram, it is necessary to judge whether the width of the spectrogram to be input meets the requirements, that is, whether its number of frames meets the size requirement of the acoustic model.
To improve the recognition efficiency of the acoustic model, it may also be judged whether the height of the spectrogram meets the optimal input requirements of the acoustic model, so that the height can be adjusted.
Step 205: if the number of frames of the spectrogram is not the preset number of frames, zero-padding the spectrogram so that the to-be-recognized spectrogram obtained after zero-padding has the preset number of frames.
In the embodiments of the present disclosure, if the number of frames of the spectrogram is less than the preset number of frames of the acoustic model, a zero-padding operation is performed on the spectrogram so that the padded spectrogram has the preset number of frames; the padded spectrogram is smoother, which increases recognizability.
If the number of frames of the spectrogram is greater than the preset number of frames, the spectrogram can be cropped so that each cropped spectrogram has at most the preset number of frames, and the cropped spectrograms are fed into the acoustic model one by one for recognition (see the sketch below).
Meanwhile, image processing can also be performed on spectrograms whose height exceeds or falls short of the height required by the acoustic model, so that the processed spectrogram meets the height requirement. For specific image processing techniques, refer to the prior art.
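A small sketch of this optional cropping path, assuming a (freq_bins, n_frames) spectrogram split into preset-width segments with the final remainder zero-padded.

    import numpy as np

    def split_into_segments(spec: np.ndarray, preset: int = 700):
        segments = []
        for start in range(0, spec.shape[1], preset):
            seg = spec[:, start : start + preset]
            if seg.shape[1] < preset:              # pad the final remainder
                pad = np.zeros((spec.shape[0], preset - seg.shape[1]), seg.dtype)
                seg = np.concatenate([seg, pad], axis=1)
            segments.append(seg)                   # recognized one by one
        return segments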
Step 206: inputting the to-be-recognized spectrogram into the acoustic model, and obtaining the recognized text output by the acoustic model.
In the embodiments of the present disclosure, the to-be-recognized spectrogram is input into the acoustic model, and the acoustic model outputs the multiple texts represented by the acoustic features of each frame in the spectrogram, together with the score of each text.
Step 207: obtaining the final recognized text through a language model.
In the present disclosure, a language model (Language Model, LM for short) is a knowledge representation of sequences of characters (words); its purpose is to make the output text as grammatical and fluent as possible. In the embodiments of the present disclosure, the language model may be a TF-IDF language model, an N-gram language model, a Word2vec language model, a CBOW language model, a Glove language model, etc.; the embodiments impose no specific limitation, and which language model to use can be determined according to the specific situation.
In the embodiments of the present disclosure, the language model can be obtained by training on a large amount of plain-text corpus; the plain-text corpus can be the text label information corresponding to the spectrogram samples, or other text information, such as news text obtained by crawler technology.
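As a toy illustration of training a language model from a plain-text corpus (not the disclosure's method; any of the model families above could serve), here is a count-based character bigram model with add-one smoothing.

    from collections import Counter

    class BigramLM:
        def __init__(self, corpus):
            self.uni, self.bi = Counter(), Counter()
            for line in corpus:                    # e.g. text labels or news text
                chars = ["<s>"] + list(line)
                self.uni.update(chars)
                self.bi.update(zip(chars, chars[1:]))

        def score(self, text: str) -> float:
            # add-one smoothed product of bigram probabilities
            chars = ["<s>"] + list(text)
            p, vocab = 1.0, len(self.uni)
            for a, b in zip(chars, chars[1:]):
                p *= (self.bi[(a, b)] + 1) / (self.uni[a] + vocab)
            return p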
Specifically, multiple to-be-recognized texts output by the acoustic model and the first scores respectively corresponding to them are obtained; the multiple to-be-recognized texts are respectively input into the language model; second scores from the language model for recognizing the multiple to-be-recognized texts respectively are obtained; the final scores corresponding to the multiple to-be-recognized texts are determined according to the first scores and the second scores; the final scores corresponding to the recognized texts are compared, and the to-be-recognized text corresponding to the highest final score is determined as the final recognized text.
In practice, since the acoustic model determines the text corresponding to each frame of acoustic features at the physical level, that text may not meet people's actual needs and must be adjusted by the language model. The output text of the acoustic model is input into the language model, which determines the best recognized text based on a dictionary. Specifically, the acoustic model outputs multiple texts, which are single characters, together with their scores; the language model receives these single characters and their scores in order, regroups and corrects them according to the dictionary, outputs multiple texts, and scores these texts; finally, the scores of the acoustic model and the language model are combined to determine the best text.
For example, if a spectrogram includes 30 frames, it is input into a suitable acoustic model; for the first 15 frames the acoustic model outputs "你", "您", "另" with scores of 0.5, 0.3, and 0.2 respectively; for the last 15 frames it outputs "好" with a score of 1. These texts are input into the language model, which outputs "你好" with a score of 0.4, "您好" with a score of 0.4, and "另好" with a score of 0.2. Combining the scores of the acoustic model and the language model, the total score of "你好" is 0.5 + 1 + 0.4 = 1.9, that of "您好" is 0.3 + 1 + 0.4 = 1.7, and that of "另好" is 0.2 + 1 + 0.2 = 1.4, so the final text of the spectrogram is determined to be "你好". In the embodiments of the present disclosure, the scores of the acoustic model and the language model may also be combined by assigning weights to the two scores respectively and determining the final recognized text from the weighted combination; the present disclosure does not limit the manner of combination.
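A minimal sketch of this final selection, assuming the additive combination of the example (summed acoustic scores per candidate plus the language-model score), with optional weights as mentioned.

    def pick_final_text(candidates, lm_score, w_am: float = 1.0, w_lm: float = 1.0):
        # candidates: list of (text, summed_acoustic_score) pairs
        best_text, best_score = None, float("-inf")
        for text, am_score in candidates:
            total = w_am * am_score + w_lm * lm_score(text)  # weighted combination
            if total > best_score:
                best_text, best_score = text, total
        return best_text

    # The example above: "你好" wins with 1.5 + 0.4 = 1.9.
    # pick_final_text([("你好", 1.5), ("您好", 1.3), ("另好", 1.2)],
    #                 lm_score=lambda t: {"你好": 0.4, "您好": 0.4}.get(t, 0.2))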
The beneficial effects of the embodiments of the present disclosure are:
1. The embodiments of the present disclosure convert audio data directly into spectrograms for text recognition in the acoustic model, which reduces and compensates for the loss of feature information in the frequency domain caused by traditional MFCC feature computation.
2. The loss function of the embodiments of the present disclosure considers not only the text labels but also uses the reconstructed images as a regularization term, which improves the recognition rate of the acoustic model and makes it easier for the acoustic model to extract feature information from the spectrogram.
3. The embodiments of the present disclosure input spectrograms directly into the acoustic model, which can be used for speech recognition and for services involving speech recognition, such as speech-recognition-based voice navigation and voice quality inspection, with wide applicability and high accuracy.
Referring to FIG. 3, which shows a structural block diagram of an embodiment of a speech recognition apparatus of the present disclosure, the specific modules are as follows:
an audio conversion module 301, configured to convert acquired audio data into a corresponding spectrogram;
a frame number judgment module 302, configured to judge whether the number of frames of the spectrogram is a preset number of frames;
a zero-padding module 303, configured to zero-pad the spectrogram if the number of frames of the spectrogram is not the preset number of frames, so that the to-be-recognized spectrogram obtained after zero-padding has the preset number of frames;
a spectrogram input module 304, configured to input the to-be-recognized spectrogram into an acoustic model;
a recognized text obtaining module 305, configured to obtain the recognized text output by the acoustic model.
Preferably, the apparatus further includes:
a sample acquisition module, configured to acquire multiple spectrogram samples;
a model training module, configured to input the multiple spectrogram samples into a preset model to train the preset model, the preset model including a main network part and a branch network part, wherein the main network part is used to output texts corresponding to the multiple spectrogram samples and the branch network part is used to output reconstructed images corresponding to the multiple spectrogram samples, and to use the preset model at the end of training as the acoustic model;
a first score obtaining module, configured to obtain multiple to-be-recognized texts output by the acoustic model and first scores respectively corresponding to the multiple to-be-recognized texts;
a recognized text input module, configured to input the multiple to-be-recognized texts into the language model respectively;
a second score obtaining module, configured to obtain second scores from the language model for recognizing the multiple to-be-recognized texts respectively;
a final score module, configured to determine the final scores respectively corresponding to the multiple to-be-recognized texts according to the first scores and the second scores;
a final recognized text determination module, configured to compare the final scores corresponding to the recognized texts and determine the to-be-recognized text corresponding to the highest final score as the final recognized text.
Preferably, the sample acquisition module specifically includes the following submodules:
an audio data conversion submodule, configured to acquire multiple pieces of audio data and convert them into multiple corresponding spectrograms;
a zero-padding submodule, configured to perform a zero-padding operation on those spectrograms whose number of frames is less than the preset number of frames, so that the spectrograms obtained after zero-padding have the preset number of frames;
a data augmentation submodule, configured to delete those spectrograms whose number of frames is greater than the preset number of frames and perform data augmentation on the remaining spectrograms to obtain the multiple spectrogram samples.
The model training module includes:
a CTC loss function acquisition submodule, configured to obtain the CTC loss function of the main network part according to the main network part, the text labels, and the multiple spectrogram samples;
a branch network loss function acquisition submodule, configured to input the multiple spectrogram samples into the branch network part respectively, obtain the reconstructed image corresponding to each spectrogram sample, and obtain the loss function of the branch network part according to the multiple spectrogram samples and their corresponding reconstructed images;
a preset model loss function determination submodule, configured to determine the loss function of the preset model according to the CTC loss function, the loss function of the branch network part, and the preset coefficient;
a model training submodule, configured to input the multiple spectrogram samples into the preset model for training until the loss function of the preset model converges.
Based on the same inventive concept, another embodiment of the present disclosure provides an electronic device including a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the method of any of the above embodiments of the present disclosure when executing the program.
Based on the same inventive concept, another embodiment of the present disclosure provides a computer-readable storage medium having a computer program stored thereon, the program, when executed by a processor, implementing the steps of the method of any of the above embodiments of the present disclosure.
As for the apparatus embodiment, since it is basically similar to the method embodiment, its description is relatively simple; for relevant points, refer to the description of the method embodiment.
The embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and the same or similar parts of the embodiments can be referred to one another.
Those skilled in the art should understand that embodiments of the present disclosure may be provided as a method, an apparatus, or a computer program product. Therefore, embodiments of the present disclosure may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, embodiments of the present disclosure may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, and the like) containing computer-usable program code.
Embodiments of the present disclosure are described with reference to flowcharts and/or block diagrams of methods, terminal devices (systems), and computer program products according to embodiments of the present disclosure. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks therein, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing terminal equipment to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing terminal equipment create means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing terminal equipment to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture comprising instruction means that implement the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing terminal equipment, so that a series of operation steps are executed on the computer or other programmable terminal equipment to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable terminal equipment provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
Although preferred embodiments of the embodiments of the present disclosure have been described, those skilled in the art, once they learn of the basic inventive concept, may make additional changes and modifications to these embodiments. Therefore, the appended claims are intended to be construed as including the preferred embodiments and all changes and modifications falling within the scope of the embodiments of the present disclosure.
Finally, it should also be noted that, herein, relational terms such as first and second are used only to distinguish one entity or operation from another and do not necessarily require or imply any such actual relationship or order between these entities or operations. Moreover, the terms "comprise", "include", or any other variant thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or terminal device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article, or terminal device. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article, or terminal device that includes the element.
The speech recognition method, apparatus, electronic device, and computer-readable storage medium provided by the present disclosure have been introduced in detail above. Specific examples are used herein to illustrate the principles and implementations of the present disclosure, and the description of the above embodiments is only intended to help understand the method of the present disclosure and its core idea. Meanwhile, for those of ordinary skill in the art, there will be changes in the specific implementation and scope of application based on the idea of the present disclosure. In summary, the content of this specification should not be construed as a limitation of the present disclosure.

Claims (10)

  1. A speech recognition method, characterized in that the method comprises:
    converting acquired audio data into a corresponding spectrogram;
    judging whether the number of frames of the spectrogram is a preset number of frames;
    if the number of frames of the spectrogram is not the preset number of frames, zero-padding the spectrogram so that the to-be-recognized spectrogram obtained after zero-padding has the preset number of frames;
    inputting the to-be-recognized spectrogram into an acoustic model;
    obtaining the recognized text output by the acoustic model.
  2. The method according to claim 1, characterized in that the method further comprises:
    acquiring multiple spectrogram samples;
    inputting the multiple spectrogram samples into a preset model to train the preset model, the preset model comprising a main network part and a branch network part, wherein the main network part is used to output texts corresponding to the multiple spectrogram samples, and the branch network part is used to output reconstructed images corresponding to the multiple spectrogram samples;
    using the preset model at the end of training as the acoustic model.
  3. The method according to claim 2, characterized in that the step of training the preset model comprises:
    obtaining a CTC loss function of the main network part according to the main network part, text labels, and the multiple spectrogram samples;
    inputting the multiple spectrogram samples into the branch network part respectively, obtaining the reconstructed image corresponding to each spectrogram sample, and obtaining a loss function of the branch network part according to the multiple spectrogram samples and their corresponding reconstructed images;
    determining a loss function of the preset model according to the CTC loss function, the loss function of the branch network part, and a preset coefficient;
    inputting the multiple spectrogram samples into the preset model for training until the loss function of the preset model converges.
  4. The method according to claim 1, characterized in that the step of obtaining the recognized text output by the acoustic model comprises:
    obtaining multiple to-be-recognized texts output by the acoustic model and first scores respectively corresponding to the multiple to-be-recognized texts;
    the method further comprises:
    inputting the multiple to-be-recognized texts into a language model respectively;
    obtaining second scores from the language model for recognizing the multiple to-be-recognized texts respectively;
    determining final scores respectively corresponding to the multiple to-be-recognized texts according to the first scores and the second scores;
    comparing the final scores corresponding to the to-be-recognized texts, and determining the to-be-recognized text corresponding to the highest final score as the final recognized text.
  5. The method according to claim 2, characterized in that the step of acquiring multiple spectrogram samples comprises:
    acquiring multiple pieces of audio data and converting them into multiple corresponding spectrograms;
    performing a zero-padding operation on those spectrograms whose number of frames is less than the preset number of frames, so that the spectrograms obtained after zero-padding have the preset number of frames;
    deleting those spectrograms whose number of frames is greater than the preset number of frames, and performing data augmentation on the remaining spectrograms to obtain the multiple spectrogram samples.
  6. A speech recognition apparatus, characterized in that the apparatus comprises:
    an audio conversion module, configured to convert acquired audio data into a corresponding spectrogram;
    a frame number judgment module, configured to judge whether the number of frames of the spectrogram is a preset number of frames;
    a zero-padding module, configured to zero-pad the spectrogram if the number of frames of the spectrogram is not the preset number of frames, so that the to-be-recognized spectrogram obtained after zero-padding has the preset number of frames;
    an acoustic model module, configured to establish the mapping between the to-be-recognized spectrogram and the corresponding text;
    a decoder module, configured to recognize and obtain the text output by the acoustic model.
  7. The apparatus according to claim 6, characterized in that the apparatus further comprises:
    a sample acquisition module, configured to acquire multiple spectrogram samples;
    a model training module, configured to input the multiple spectrogram samples into a preset model to train the preset model, the preset model comprising a main network part and a branch network part, wherein the main network part is used to output texts corresponding to the multiple spectrogram samples and the branch network part is used to output reconstructed images corresponding to the multiple spectrogram samples, and to use the preset model at the end of training as the acoustic model.
  8. The apparatus according to claim 7, characterized in that the model training module comprises:
    a CTC loss function acquisition submodule, configured to obtain a CTC loss function of the main network part according to the main network part, text labels, and the multiple spectrogram samples;
    a branch network loss function acquisition submodule, configured to input the multiple spectrogram samples into the branch network part respectively, obtain the reconstructed image corresponding to each spectrogram sample, and obtain a loss function of the branch network part according to the multiple spectrogram samples and their corresponding reconstructed images;
    a preset model loss function determination submodule, configured to determine a loss function of the preset model according to the CTC loss function, the loss function of the branch network part, and a preset coefficient;
    a model training submodule, configured to input the multiple spectrogram samples into the preset model for training until the loss function of the preset model converges.
  9. An electronic device, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the steps of the speech recognition method according to any one of claims 1 to 5 when executing the program.
  10. A computer-readable storage medium, characterized in that a computer program is stored on the computer-readable storage medium, and the computer program, when executed by a processor, implements the steps of the speech recognition method according to any one of claims 1 to 5.
PCT/CN2021/096848 2020-09-29 2021-05-28 Speech recognition method, apparatus, and computer-readable storage medium WO2022068233A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011046734.5A CN111933113B (zh) 2020-09-29 2020-09-29 Speech recognition method, apparatus, device, and medium
CN202011046734.5 2020-09-29

Publications (1)

Publication Number Publication Date
WO2022068233A1 true WO2022068233A1 (zh) 2022-04-07

Family

ID=73333712

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/096848 WO2022068233A1 (zh) 2020-09-29 2021-05-28 一种语音识别的方法、装置及计算机可读存储介质

Country Status (2)

Country Link
CN (1) CN111933113B (zh)
WO (1) WO2022068233A1 (zh)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111933113B (zh) * 2020-09-29 2021-03-02 北京捷通华声科技股份有限公司 Speech recognition method, apparatus, device, and medium
CN114078475B (zh) * 2021-11-08 2023-07-25 北京百度网讯科技有限公司 Speech recognition and updating method, apparatus, device, and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108281139A (zh) * 2016-12-30 2018-07-13 深圳光启合众科技有限公司 Speech transcription method and apparatus, and robot
US20200051583A1 (en) * 2018-08-08 2020-02-13 Google Llc Synthesizing speech from text using neural networks
CN111145729A (zh) * 2019-12-23 2020-05-12 厦门快商通科技股份有限公司 Speech recognition model training method, system, mobile terminal, and storage medium
CN111210807A (zh) * 2020-02-21 2020-05-29 厦门快商通科技股份有限公司 Speech recognition model training method, system, mobile terminal, and storage medium
CN111933113A (zh) * 2020-09-29 2020-11-13 北京捷通华声科技股份有限公司 Speech recognition method, apparatus, device, and medium
CN112349289A (zh) * 2020-09-28 2021-02-09 北京捷通华声科技股份有限公司 Speech recognition method, apparatus, device, and storage medium

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017217412A1 (ja) * 2016-06-16 2017-12-21 日本電気株式会社 Signal processing device, signal processing method, and computer-readable recording medium
CN111599363B (zh) * 2019-02-01 2023-03-31 浙江大学 Speech recognition method and apparatus
CN111063342B (zh) * 2020-01-02 2022-09-30 腾讯科技(深圳)有限公司 Speech recognition method and apparatus, computer device, and storage medium
CN111292727B (zh) * 2020-02-03 2023-03-24 北京声智科技有限公司 Speech recognition method and electronic device
CN111681669A (zh) * 2020-05-14 2020-09-18 上海眼控科技股份有限公司 Neural-network-based speech data recognition method and device


Also Published As

Publication number Publication date
CN111933113B (zh) 2021-03-02
CN111933113A (zh) 2020-11-13

Similar Documents

Publication Publication Date Title
Casanova et al. SC-GlowTTS: An efficient zero-shot multi-speaker text-to-speech model
AU2019347734B2 (en) Conversational agent pipeline trained on synthetic data
US11514891B2 (en) Named entity recognition method, named entity recognition equipment and medium
US11908451B2 (en) Text-based virtual object animation generation method, apparatus, storage medium, and terminal
EP4018437B1 (en) Optimizing a keyword spotting system
US8478591B2 (en) Phonetic variation model building apparatus and method and phonetic recognition system and method thereof
US10515292B2 (en) Joint acoustic and visual processing
US20170154640A1 (en) Method and electronic device for voice recognition based on dynamic voice model selection
WO2022068233A1 (zh) 一种语音识别的方法、装置及计算机可读存储介质
CN106847259B (zh) 一种音频关键词模板的筛选和优化方法
JP2015187684A (ja) N−gram言語モデルの教師無し学習方法、学習装置、および学習プログラム
JP2019215500A (ja) 音声変換学習装置、音声変換装置、方法、及びプログラム
CN112735404A (zh) 一种语音反讽检测方法、系统、终端设备和存储介质
Ziedan et al. A unified approach for arabic language dialect detection
KR20200102309A (ko) 단어 유사도를 이용한 음성 인식 시스템 및 그 방법
CN114203180A (zh) 会议纪要的生成方法、装置、电子设备及存储介质
Duong Development of accent recognition systems for Vietnamese speech
CN113990325A (zh) 流式语音识别方法及装置、电子设备、存储介质
CN112908358B (zh) 一种开放式的语音评测方法和设备
CN113838467B (zh) 语音处理方法、装置及电子设备
CN113035247B (zh) 一种音频文本对齐方法、装置、电子设备及存储介质
CN112185346B (zh) 多语种语音关键词检测、模型生成方法及电子设备
US20230061505A1 (en) Method and apparatus for training data augmentation for end-to-end speech recognition
Tang Research on the Evaluation of Chinese Articulation Using Machine Learning and Algorithm Optimization
Vijaya et al. An Efficient System for Audio-Based Sign Language Translator Through MFCC Feature Extraction

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21873895

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21873895

Country of ref document: EP

Kind code of ref document: A1