WO2021169023A1 - Speech recognition method, apparatus, device, and storage medium - Google Patents

Speech recognition method, apparatus, device, and storage medium (语音识别方法、装置、设备及存储介质)

Info

Publication number
WO2021169023A1
Authority: WIPO (PCT)
Prior art keywords: feature, fusion, voice, image, sequence
Application number: PCT/CN2020/087115
Other languages: English (en), French (fr)
Inventors: 吴华鑫, 景子君, 刘迪源, 胡金水, 潘嘉
Original Assignee: iFLYTEK Co., Ltd. (科大讯飞股份有限公司)
Application filed by iFLYTEK Co., Ltd. (科大讯飞股份有限公司)
Publication of WO2021169023A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 Training
    • G10L 15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/24 Speech recognition using non-acoustical features
    • G10L 15/25 Speech recognition using non-acoustical features using position of the lips, movement of the lips or face analysis
    • G10L 15/26 Speech to text systems
    • G10L 21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L 21/0208 Noise filtering

Definitions

  • This application relates to the field of natural language processing technology, and more specifically, to a speech recognition method, device, equipment, and storage medium.
  • Traditional speech recognition is single-modality speech recognition; that is, the recognition result is obtained by processing only the speech signal.
  • This approach can already achieve a high recognition rate in environments where the speech is clean.
  • In noisy environments, however, the recognition rate of traditional speech recognition drops rapidly.
  • Existing multi-modal speech recognition methods use lip-motion video for lip reading and then determine the final speech recognition result from the lip-reading result and the accuracy of the single-modality speech recognition result, and their recognition performance is still limited.
  • the present application provides a voice recognition method, device, equipment, and storage medium to improve the recognition rate of the multi-modal voice recognition method.
  • A speech recognition method includes: acquiring a voice signal and an image sequence collected synchronously with the voice signal, where each image in the image sequence is an image of a region related to lip movement; taking the voice information of the denoised voice signal as the acquisition direction, obtaining information that fuses the voice signal and the image sequence as fusion information; and performing speech recognition using the fusion information to obtain the speech recognition result of the voice signal.
  • a speech recognition device includes:
  • the acquisition module is used to acquire a voice signal and an image sequence collected synchronously with the voice signal; the image in the image sequence is an image of a region related to lip movement;
  • the feature extraction module is configured to obtain information that fuses the voice signal and the image sequence as fusion information, taking the voice information of the denoised voice signal as the acquisition direction;
  • the recognition module is used to perform voice recognition using the fusion information to obtain the voice recognition result of the voice signal.
  • a speech recognition device including a memory and a processor
  • the memory is used to store programs
  • the processor is configured to execute the program to implement each step of the voice recognition method described in any one of the above.
  • a computer-readable storage medium has a computer program stored thereon, and when the computer program is executed by a processor, each step of the speech recognition method as described in any one of the above is realized.
  • With the speech recognition method, apparatus, device, and storage medium provided by the embodiments of the present application, after the voice signal and the image sequence collected synchronously with the voice signal are acquired, information that fuses the voice signal and the image sequence is obtained as fusion information, taking the voice information of the denoised voice signal as the acquisition direction; speech recognition is then performed using the fusion information to obtain the speech recognition result of the voice signal.
  • Because the fusion information is obtained with the denoised voice information as the acquisition direction, the obtained fusion information approaches the voice information of a noise-free voice signal, which reduces the interference of noise in the voice signal with speech recognition and thereby improves the recognition rate.
  • FIG. 1 is a flowchart of an implementation of the speech recognition method disclosed in an embodiment of the application
  • FIG. 2 is a schematic diagram of a structure of a multi-modal speech recognition model disclosed in an embodiment of the application;
  • FIG. 3 is a schematic structural diagram of a fusion feature acquisition module disclosed in an embodiment of this application.
  • FIG. 4a is a schematic diagram of an architecture for training a multi-modal speech recognition model disclosed in an embodiment of this application;
  • FIG. 4b is a flowchart of an implementation of training a multi-modal speech recognition model disclosed in an embodiment of the application
  • FIG. 5a is a schematic diagram of an architecture for training a multi-modal speech recognition model disclosed in an embodiment of the application
  • FIG. 5b is a flowchart of an implementation of training a multi-modal speech recognition model disclosed in an embodiment of the application
  • Fig. 6a is a schematic diagram of an architecture for training a multi-modal speech recognition model disclosed in an embodiment of the application;
  • FIG. 6b is a flowchart of an implementation of training a multi-modal speech recognition model disclosed in an embodiment of this application;
  • Fig. 7a is a schematic diagram of an architecture for training a multi-modal speech recognition model disclosed in an embodiment of the application;
  • FIG. 7b is a flowchart of an implementation of training a multi-modal speech recognition model disclosed in an embodiment of the application.
  • FIG. 8a is a schematic diagram of another architecture for training a multi-modal speech recognition model disclosed in an embodiment of the application.
  • FIG. 8b is another implementation flowchart of training a multi-modal speech recognition model disclosed in an embodiment of the application.
  • FIG. 9a is a schematic diagram of another architecture for training a multi-modal speech recognition model disclosed in an embodiment of the application.
  • FIG. 9b is another implementation flowchart of training a multi-modal speech recognition model disclosed in an embodiment of the application.
  • FIG. 10a is a schematic diagram of another architecture for training a multi-modal speech recognition model disclosed in an embodiment of the application.
  • FIG. 10b is another implementation flowchart of training a multi-modal speech recognition model disclosed in an embodiment of the application.
  • FIG. 11a is a schematic diagram of another architecture for training a multi-modal speech recognition model disclosed in an embodiment of the application.
  • Fig. 11b is another implementation flow chart of training a multi-modal speech recognition model disclosed in an embodiment of the application.
  • FIG. 12a is a schematic diagram of another architecture for training a multi-modal speech recognition model disclosed in an embodiment of the application.
  • FIG. 12b is another implementation flowchart of training a multi-modal speech recognition model disclosed in an embodiment of the application.
  • FIG. 13a is a schematic diagram of another architecture for training a multi-modal speech recognition model disclosed in an embodiment of the application.
  • FIG. 13b is another implementation flowchart of training a multi-modal speech recognition model disclosed in an embodiment of this application.
  • FIG. 14 is a schematic structural diagram of a speech recognition device disclosed in an embodiment of this application.
  • Fig. 15 is a block diagram of the hardware structure of a speech recognition device disclosed in an embodiment of the application.
  • The inventors of the present application found that current multi-modal speech recognition methods that assist speech recognition with lip-motion video compare the accuracy of the lip-reading result with that of the single-modality speech recognition result and take the more accurate result as the final speech recognition result, which improves the recognition rate to a certain extent.
  • In essence, however, such multi-modal speech recognition only lets the lip-reading result correct the speech recognition result; it does not exploit the ability of the video signal to correct a highly noisy speech signal, so it is difficult to obtain high-quality recognition results.
  • The basic idea of this application is therefore to explicitly add noise reduction to the multi-modal speech recognition task, so as to better exploit the corrective effect of the video information on the speech information and achieve a better recognition result.
  • An implementation flowchart of the speech recognition method provided by an embodiment of the present application is shown in FIG. 1 and may include:
  • Step S11 Acquire a voice signal and an image sequence collected synchronously with the voice signal; the image in the image sequence is an image of a region related to lip movement.
  • While the voice signal is collected, a face video of the speaker is also collected.
  • The above-mentioned image sequence is obtained by cropping the lip-movement-related region from each frame of the speaker's face video.
  • The cropped region may be a region of fixed size, for example 80 × 80 pixels (a minimal cropping sketch is given after this list).
  • The lip-movement-related area may refer to only the lip area; or,
  • the lip-movement-related area may be the lips and their surrounding areas, such as the lips and chin area; or,
  • the lip-movement-related area may be the entire face area.
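  • The following is a minimal, illustrative sketch of cropping a fixed-size (80 × 80) lip-region image sequence from a face video. It is not part of the patent disclosure: `detect_mouth_center` is a hypothetical placeholder for any face/landmark detector, and OpenCV is only one possible toolkit.

```python
# Illustrative only: crop an 80 x 80 lip-region image from each frame of a face video.
# `detect_mouth_center` is a hypothetical helper (e.g. backed by a facial-landmark detector).
import cv2
import numpy as np

def detect_mouth_center(frame):
    """Placeholder: return (x, y) of the mouth center, e.g. from facial landmarks."""
    h, w = frame.shape[:2]
    return w // 2, int(h * 0.75)   # crude guess; replace with a real landmark detector

def extract_lip_sequence(video_path, size=80):
    frames = []
    cap = cv2.VideoCapture(video_path)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        cx, cy = detect_mouth_center(frame)
        half = size // 2
        crop = frame[max(cy - half, 0):cy + half, max(cx - half, 0):cx + half]
        crop = cv2.resize(crop, (size, size))                  # fixed-size lip-region image
        frames.append(cv2.cvtColor(crop, cv2.COLOR_BGR2GRAY))
    cap.release()
    return np.stack(frames)                                    # (T_video, 80, 80) image sequence
```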
  • Step S12: Taking the voice information of the denoised voice signal as the acquisition direction, obtain information that fuses the voice signal and the image sequence as the fusion information.
  • Here, the voice information of the denoised voice signal may refer to information extracted from the noise-reduced voice signal obtained by performing noise-removal processing on the voice signal.
  • Obtaining fusion information that approaches the voice information of the denoised voice signal is therefore equivalent to performing noise reduction on the voice signal.
  • Step S13 Use the fusion information to perform voice recognition to obtain a voice recognition result of the voice signal.
  • the use of fusion information for voice recognition reduces the interference of the noise in the voice signal on the voice recognition, thereby improving the accuracy of the voice recognition.
  • a multi-modal speech recognition model may be used to obtain the fusion information, and the fusion information may be used for speech recognition to obtain the speech recognition result of the speech signal.
  • the multi-modal speech recognition model can be used to process speech signals and image sequences to obtain the speech recognition results output by the multi-modal speech recognition model;
  • That is, the multi-modal speech recognition model has the ability to obtain information that fuses the speech signal and the image sequence as fusion information, taking the voice information of the denoised speech signal as the acquisition direction, and to perform speech recognition using the fusion information to obtain the speech recognition result of the speech signal.
  • As shown in FIG. 2, the multi-modal speech recognition model provided by this embodiment of the application may include:
  • a fusion feature acquisition module 21, configured to acquire a fusion feature that fuses the voice signal and the image sequence, taking the voice information of the denoised voice signal as the acquisition direction; and
  • the recognition module 22 is configured to perform voice recognition based on the fusion feature acquired by the fusion feature acquisition module 21 to obtain a voice recognition result of the voice signal.
  • the foregoing process of using the multi-modal speech recognition model to process the speech signal and image sequence to obtain the speech recognition result output by the multi-modal speech recognition model can be as follows:
  • use the fusion feature acquisition module 21 of the multi-modal speech recognition model to acquire the fusion feature that fuses the voice signal and the image sequence, taking the voice information of the denoised voice signal as the acquisition direction;
  • speech recognition is performed based on the fusion feature acquired by the fusion feature acquisition module 21 to obtain the speech recognition result of the speech signal.
  • A schematic structural diagram of the fusion feature acquisition module 21 is shown in FIG. 3 and may include:
  • a voice information extraction module 31, an image feature extraction module 32, and a feature fusion module 33, where:
  • The voice information extraction module 31 is configured to extract voice information from the voice signal, taking as the extraction direction that the feature obtained by fusing the extracted voice information with the image feature sequence extracted from the image sequence by the image feature extraction module 32 approaches the voice information of the denoised voice signal.
  • In other words, when the voice information extraction module 31 extracts voice information from the voice signal, the extraction is driven so that the fusion of the extracted voice information with the image feature sequence extracted by the image feature extraction module 32 approaches the voice information of the denoised voice signal.
  • The image feature extraction module 32 is configured to extract an image feature sequence from the image sequence, taking as the extraction direction that the feature obtained by fusing the extracted image feature sequence with the voice information extracted from the voice signal by the voice information extraction module 31 approaches the voice information of the denoised voice signal.
  • The feature fusion module 33 is configured to fuse the extracted voice information and image feature sequence to obtain the fusion feature, taking the voice information of the denoised voice signal as the fusion direction; that is, the fusion is performed so that the fused feature approaches the voice information of the denoised voice signal.
  • One implementation of using the fusion feature acquisition module 21 to acquire the fusion feature of the voice signal and the image sequence, with the voice information of the denoised voice signal as the acquisition direction, is as follows:
  • taking the voice information of the denoised voice signal as the acquisition direction, use the voice information extraction module 31 to extract voice information from the voice signal and the image feature extraction module 32 to extract an image feature sequence from the image sequence; then use the feature fusion module 33 to fuse the voice information extracted by the voice information extraction module 31 and the image feature sequence extracted by the image feature extraction module 32 to obtain the fusion feature of the voice signal and the image sequence.
  • That is, the voice information extraction module 31 is used to extract voice information from the voice signal,
  • the image feature extraction module 32 is used to extract the image feature sequence from the image sequence,
  • and the feature fusion module 33 is used to fuse the extracted voice information and image feature sequence to obtain the fusion feature.
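  • The following PyTorch skeleton is an illustrative sketch of the structure described for FIG. 2 and FIG. 3 (modules 21, 22, 31, 32, 33). The specific layer types, dimensions, frame alignment, and concatenation-based fusion are assumptions made for illustration; the patent does not prescribe particular network architectures.

```python
# Illustrative skeleton only; all architectural choices are assumptions.
import torch
import torch.nn as nn

class FusionFeatureAcquisition(nn.Module):          # fusion feature acquisition module 21
    def __init__(self, acoustic_dim=80, img_feat_dim=256, hidden=256):
        super().__init__()
        # 31: voice information extraction (hidden-layer features of the input acoustic features)
        self.voice_enc = nn.GRU(acoustic_dim, hidden, batch_first=True, bidirectional=True)
        # 32: image feature extraction over the lip-region image sequence
        self.image_enc = nn.Sequential(
            nn.Conv3d(1, 32, kernel_size=(3, 5, 5), padding=(1, 2, 2)), nn.ReLU(),
            nn.AdaptiveAvgPool3d((None, 1, 1)))
        self.image_proj = nn.Linear(32, img_feat_dim)
        # 33: feature fusion (simple concatenation + projection here)
        self.fuse = nn.Linear(2 * hidden + img_feat_dim, hidden)

    def forward(self, acoustic_feats, lip_images):
        # acoustic_feats: (B, T_audio, acoustic_dim); lip_images: (B, 1, T_video, H, W)
        voice_info, _ = self.voice_enc(acoustic_feats)               # (B, T_audio, 2*hidden)
        img = self.image_enc(lip_images).squeeze(-1).squeeze(-1)     # (B, 32, T_video)
        img = self.image_proj(img.transpose(1, 2))                   # (B, T_video, img_feat_dim)
        # naive alignment of video frames to audio frames by nearest-neighbour upsampling
        img = nn.functional.interpolate(img.transpose(1, 2),
                                        size=voice_info.size(1)).transpose(1, 2)
        return self.fuse(torch.cat([voice_info, img], dim=-1))       # fusion feature

class MultiModalASR(nn.Module):
    def __init__(self, vocab_size=5000, hidden=256):
        super().__init__()
        self.fusion_feature_module = FusionFeatureAcquisition(hidden=hidden)  # module 21
        self.recognition_rnn = nn.GRU(hidden, hidden, batch_first=True)       # module 22
        self.classifier = nn.Linear(hidden, vocab_size)

    def forward(self, acoustic_feats, lip_images):
        fusion = self.fusion_feature_module(acoustic_feats, lip_images)
        out, _ = self.recognition_rnn(fusion)
        return self.classifier(out)              # per-frame token scores (e.g. for a CTC loss)
```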
  • The voice information extracted from the voice signal may be of N types, where N is a positive integer greater than or equal to 1. The process of extracting voice information from the voice signal by the voice information extraction module 31 may then use either of the following two extraction methods.
  • Extraction method 1: use the voice information extraction module 31 to extract the N types of voice information from the voice signal, taking as the extraction direction that the feature obtained by fusing the extracted voice information with the image feature sequence extracted from the image sequence by the image feature extraction module 32 approaches one type of voice information of the denoised voice signal.
  • One specific implementation of extraction method 1 is: extract the voice information of a target type from the voice signal such that the feature obtained by fusing the extracted target-type voice information with the image feature sequence extracted from the image sequence by the image feature extraction module 32 approaches the target-type voice information of the denoised voice signal.
  • Another specific implementation of extraction method 1 is: use the voice information extraction module 31 to extract the N types of voice information from the voice signal such that the feature obtained by fusing all N extracted types of voice information with the image feature sequence extracted from the image sequence approaches one of the types of voice information of the denoised voice signal.
  • For example, suppose two types of voice information are extracted, type A and type B. The voice information extraction module 31 can then extract the type-A voice information and the type-B voice information from the voice signal such that the feature obtained by fusing the type-A voice information, the type-B voice information, and the image feature sequence extracted from the image sequence by the image feature extraction module 32 approaches one type of voice information (for example, the type-A or the type-B voice information) of the denoised voice signal.
  • Extraction method 2: if N is greater than 1, use the voice information extraction module 31 to extract the N types of voice information from the voice signal, taking as the extraction direction that, for each type of voice information, the feature obtained by fusing that type of voice information with the image feature sequence extracted from the image sequence by the image feature extraction module 32 approaches the same type of voice information of the denoised voice signal.
  • In other words, for each extracted type of voice information, the fusion of that voice information with the image feature sequence is driven to approach the corresponding type of voice information of the denoised voice signal.
  • Here, fusing a type of voice information with the image feature sequence may mean fusing that voice information only with the image feature sequence, or fusing that voice information with the image feature sequence and the fusion features of the other extracted types of voice information.
  • The voice information extracted from the voice signal may be only acoustic features (for example, fbank features or Mel-frequency cepstral coefficient (MFCC) features), only spectrogram features, or both acoustic features and spectrogram features, as illustrated below.
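  • As a concrete (non-normative) illustration of these two kinds of voice information, the sketch below computes fbank/MFCC "acoustic features" and an STFT magnitude "spectrogram feature" with librosa; the toolkit, window sizes, and dimensions are assumptions, not part of the patent.

```python
# Illustrative feature extraction; parameter values are assumptions.
import librosa
import numpy as np

y, sr = librosa.load("utterance.wav", sr=16000)           # hypothetical input file

# acoustic features: 80-dim log-mel fbank (25 ms window, 10 ms shift), or 13-dim MFCC
fbank = librosa.power_to_db(
    librosa.feature.melspectrogram(y=y, sr=sr, n_fft=400, hop_length=160, n_mels=80))
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=400, hop_length=160)

# spectrogram feature: magnitude of the short-time Fourier transform
spectrogram = np.abs(librosa.stft(y, n_fft=400, hop_length=160))

print(fbank.shape, mfcc.shape, spectrogram.shape)          # each is (n_feats, T_frames)
```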
  • Taking the voice information of the denoised voice signal as the fusion direction, the process of using the feature fusion module 33 to fuse the voice information and the image feature sequence to obtain the fusion feature of the voice signal and the image sequence may include any of the following:
  • Fusion method 1: use the feature fusion module 33 to fuse the extracted acoustic features and the image feature sequence, taking the acoustic features of the denoised speech signal as the fusion direction, to obtain the fusion feature corresponding to fusion method 1;
  • Fusion method 2: use the feature fusion module 33 to fuse the extracted spectrogram features and the image feature sequence, taking the spectrogram features of the denoised speech signal as the fusion direction, to obtain the fusion feature corresponding to fusion method 2;
  • Fusion method 3: use the feature fusion module 33 to fuse the extracted acoustic features, spectrogram features, and image feature sequence, taking the acoustic features or spectrogram features of the denoised speech signal as the fusion direction, to obtain the fusion feature corresponding to fusion method 3.
  • If the fusion feature of the voice signal and the image sequence is obtained according to only one of these fusion methods, the fusion feature corresponding to that method is the fusion feature of the voice signal and the image sequence: for fusion method 1 it is the fusion feature corresponding to fusion method 1, for fusion method 2 the fusion feature corresponding to fusion method 2, and likewise for fusion method 3.
  • Alternatively, the fusion feature corresponding to fusion method 1 and the fusion feature corresponding to fusion method 2 may themselves be fused to obtain the fusion feature of the voice signal and the image sequence.
  • the process of extracting the voice information and obtaining the fusion feature of the fused voice signal and image sequence will be explained by taking the voice information as the acoustic feature and/or the spectrogram feature as an example.
  • If the voice information of the target type is an acoustic feature, then the voice information extraction module 31, when used to extract the target-type voice information from the voice signal, can specifically be used to:
  • extract acoustic features from the voice signal, taking as the extraction direction that the feature obtained by fusing the extracted acoustic features with the image feature sequence extracted from the image sequence by the image feature extraction module 32 approaches the acoustic features of the denoised voice signal.
  • Specifically, the acoustic feature extraction module of the voice information extraction module 31 can be used to extract the acoustic features from the voice signal. That is, in this embodiment the voice information extraction module 31 includes an acoustic feature extraction module, which extracts acoustic features from the voice signal, taking as the extraction direction that the feature obtained by fusing the extracted acoustic features with the image feature sequence extracted from the image sequence by the image feature extraction module 32 approaches the acoustic features of the denoised voice signal.
  • In this case, the voice signal input to the multi-modal voice recognition model may be an acoustic feature extracted from the original voice signal (that is, the voice signal collected by the audio collection device) through a sliding window (recorded as the initial acoustic feature for ease of description), and the acoustic feature extracted from the voice signal by the voice information extraction module 31 may be a hidden-layer feature of the initial acoustic feature.
  • Correspondingly, one way of using the feature fusion module 33 to fuse the voice information and the image feature sequence to obtain the fusion feature, with the denoised voice information as the fusion direction, is: use the feature fusion module 33 to fuse the extracted acoustic features and the image feature sequence, taking the acoustic features of the denoised voice signal as the fusion direction, to obtain the fusion feature of the voice signal and the image sequence.
  • If the voice information of the target type is a spectrogram feature,
  • then the voice information extraction module 31, when used to extract the target-type voice information from the voice signal, may specifically:
  • extract spectrogram features from the voice signal, taking as the extraction direction that the feature obtained by fusing the extracted spectrogram features with the image feature sequence extracted from the image sequence by the image feature extraction module 32 approaches the spectrogram features of the denoised voice signal.
  • Specifically, the spectrogram feature extraction module of the voice information extraction module 31 can be used to extract the spectrogram features from the voice signal. That is, in this embodiment the voice information extraction module 31 includes a spectrogram feature extraction module, which extracts spectrogram features from the voice signal, taking as the extraction direction that the feature obtained by fusing the extracted spectrogram features with the image feature sequence extracted from the image sequence by the image feature extraction module 32 approaches the spectrogram features of the denoised voice signal.
  • In this case, the speech signal input to the multi-modal speech recognition model may be a spectrogram obtained by performing a short-time Fourier transform on the original speech signal, and the spectrogram feature extracted from the speech signal by the speech information extraction module 31 may be a hidden-layer feature of the spectrogram.
  • Correspondingly, another way of using the feature fusion module 33 to fuse the voice information and the image feature sequence to obtain the fusion feature, with the denoised voice information as the fusion direction, is: use the feature fusion module 33 to fuse the extracted spectrogram features and the image feature sequence, taking the spectrogram features of the denoised voice signal as the fusion direction, to obtain the fusion feature of the voice signal and the image sequence.
  • If both acoustic features and spectrogram features are extracted, one implementation of extracting the two types of voice information from the voice signal by the voice information extraction module 31 may be:
  • use the voice information extraction module 31 to extract the spectrogram features and the acoustic features from the voice signal, taking as the extraction direction that the feature obtained by fusing the extracted spectrogram features, the extracted acoustic features, and the image feature sequence extracted from the image sequence by the image feature extraction module 32 approaches the spectrogram features of the denoised voice signal.
  • Specifically, the acoustic feature extraction module of the voice information extraction module 31 may be used to extract the acoustic features from the voice signal, and the spectrogram feature extraction module of the voice information extraction module 31 may be used to extract the spectrogram features from the voice signal. That is, the voice information extraction module 31 includes an acoustic feature extraction module and a spectrogram feature extraction module, where the spectrogram feature extraction module extracts the spectrogram features from the voice signal, taking as the extraction direction that the feature obtained by fusing the extracted spectrogram features, the acoustic features extracted from the voice signal by the acoustic feature extraction module, and the image feature sequence extracted from the image sequence by the image feature extraction module 32 approaches the spectrogram features of the denoised voice signal.
  • Correspondingly, one way of using the feature fusion module 33 to fuse the voice information and the image feature sequence to obtain the fusion feature, with the denoised voice information as the fusion direction, is: taking the spectrogram features of the denoised voice signal as the fusion direction, fuse the extracted spectrogram features and the first fusion feature to obtain the fusion feature of the voice signal and the image sequence.
  • Another implementation of extracting the two types of voice information from the voice signal by the voice information extraction module 31 may be:
  • use the voice information extraction module 31 to extract the spectrogram features and the acoustic features from the voice signal, taking as the extraction direction that the feature obtained by fusing the extracted spectrogram features, the extracted acoustic features, and the image feature sequence extracted from the image sequence by the image feature extraction module 32 approaches the acoustic features of the denoised voice signal.
  • Specifically, the acoustic feature extraction module of the voice information extraction module 31 may be used to extract the acoustic features from the voice signal, and the spectrogram feature extraction module of the voice information extraction module 31 may be used to extract the spectrogram features from the voice signal. That is, the voice information extraction module 31 includes an acoustic feature extraction module and a spectrogram feature extraction module, where:
  • the acoustic feature extraction module extracts the acoustic features from the voice signal, taking as the extraction direction that the feature obtained by fusing the extracted acoustic features, the spectrogram features extracted from the voice signal by the spectrogram feature extraction module, and the image feature sequence extracted from the image sequence by the image feature extraction module 32 approaches the acoustic features of the denoised voice signal;
  • the spectrogram feature extraction module extracts the spectrogram features from the voice signal, taking as the extraction direction that the feature obtained by fusing the extracted spectrogram features, the acoustic features extracted from the voice signal by the acoustic feature extraction module, and the image feature sequence extracted from the image sequence by the image feature extraction module 32 approaches the acoustic features of the denoised voice signal.
  • Correspondingly, one way of using the feature fusion module 33 to fuse the extracted voice information and the image feature sequence to obtain the fusion feature, with the denoised voice information as the fusion direction, is: use the second feature fusion module of the feature fusion module 33 to fuse the extracted acoustic features and the second fusion feature, taking the acoustic features of the denoised voice signal as the fusion direction, to obtain the fusion feature of the voice signal and the image sequence.
  • Another implementation of extracting the two types of voice information from the voice signal by the voice information extraction module 31 may be:
  • use the voice information extraction module 31 to extract the spectrogram features and the acoustic features from the voice signal, taking as the extraction direction that the feature obtained by fusing the extracted spectrogram features, the extracted acoustic features, and the image feature sequence extracted from the image sequence by the image feature extraction module 32 approaches the acoustic features of the denoised voice signal, and that the feature obtained by fusing the extracted acoustic features with that image feature sequence approaches the acoustic features of the denoised voice signal.
  • Specifically, the acoustic feature extraction module of the voice information extraction module 31 may be used to extract the acoustic features from the voice signal, and the spectrogram feature extraction module of the voice information extraction module 31 may be used to extract the spectrogram features from the voice signal. That is, the voice information extraction module 31 includes an acoustic feature extraction module and a spectrogram feature extraction module, where:
  • the acoustic feature extraction module extracts the acoustic features from the voice signal, taking as the extraction direction that the feature obtained by fusing the extracted acoustic features, the spectrogram features extracted from the voice signal by the spectrogram feature extraction module, and the image feature sequence extracted from the image sequence by the image feature extraction module 32 approaches the acoustic features of the denoised voice signal;
  • the spectrogram feature extraction module extracts the spectrogram features from the voice signal, taking as the extraction direction that the feature obtained by fusing the extracted spectrogram features, the acoustic features extracted from the voice signal by the acoustic feature extraction module, and the image feature sequence extracted from the image sequence by the image feature extraction module 32 approaches the acoustic features of the denoised voice signal.
  • Correspondingly, one way of using the feature fusion module 33 to fuse the voice information and the image feature sequence to obtain the fusion feature, with the denoised voice information as the fusion direction, is: use the second feature fusion module of the feature fusion module 33 to fuse the extracted acoustic features and the second fusion feature, taking the acoustic features of the denoised voice signal as the fusion direction, to obtain the fusion feature of the voice signal and the image sequence.
  • Another implementation of extracting the two types of voice information from the voice signal by the voice information extraction module 31 may be:
  • use the voice information extraction module 31 to extract the spectrogram features and the acoustic features from the voice signal, taking as the extraction direction that the feature obtained by fusing the extracted voice information with the image feature sequence extracted from the image sequence approaches the spectrogram features of the denoised voice signal, and that the feature obtained by fusing the extracted spectrogram features with that image feature sequence approaches the spectrogram features of the denoised voice signal.
  • Specifically, the acoustic feature extraction module of the voice information extraction module 31 may be used to extract the acoustic features from the voice signal, and the spectrogram feature extraction module of the voice information extraction module 31 may be used to extract the spectrogram features from the voice signal. That is, the voice information extraction module 31 includes an acoustic feature extraction module and a spectrogram feature extraction module, where:
  • the acoustic feature extraction module extracts the acoustic features from the voice signal, taking as the extraction direction that the feature obtained by fusing the extracted acoustic features, the spectrogram features extracted from the voice signal by the spectrogram feature extraction module, and the image feature sequence extracted from the image sequence by the image feature extraction module 32 approaches the spectrogram features of the denoised voice signal;
  • the spectrogram feature extraction module extracts the spectrogram features from the voice signal, taking as the extraction direction that the feature obtained by fusing the extracted spectrogram features, the acoustic features extracted from the voice signal by the acoustic feature extraction module, and the image feature sequence extracted from the image sequence by the image feature extraction module 32 approaches the spectrogram features of the denoised voice signal, and that the feature obtained by fusing the extracted spectrogram features with that image feature sequence approaches the spectrogram features of the denoised voice signal.
  • Correspondingly, one way of using the feature fusion module 33 to fuse the voice information and the image feature sequence to obtain the fusion feature, with the denoised voice information as the fusion direction, is: use the No. 5 feature fusion module of the feature fusion module 33 to fuse the extracted spectrogram features and the first fusion feature, taking the spectrogram features of the denoised voice signal as the fusion direction, to obtain the fusion feature of the voice signal and the image sequence.
  • Another implementation of extracting the two types of voice information from the voice signal by the voice information extraction module 31 may be:
  • use the voice information extraction module 31 to extract the spectrogram features and the acoustic features from the voice signal, taking as the extraction direction that the feature obtained by fusing the extracted spectrogram features with the image feature sequence extracted from the image sequence by the image feature extraction module 32 approaches the spectrogram features of the denoised voice signal, and that the feature obtained by fusing the extracted acoustic features with that image feature sequence approaches the acoustic features of the denoised voice signal.
  • Specifically, the acoustic feature extraction module of the voice information extraction module 31 may be used to extract the acoustic features from the voice signal, and the spectrogram feature extraction module of the voice information extraction module 31 may be used to extract the spectrogram features from the voice signal. That is, the voice information extraction module 31 includes an acoustic feature extraction module and a spectrogram feature extraction module, where:
  • the acoustic feature extraction module extracts the acoustic features from the voice signal, taking as the extraction direction that the feature obtained by fusing the extracted acoustic features with the image feature sequence extracted from the image sequence by the image feature extraction module 32 approaches the acoustic features of the denoised voice signal;
  • the spectrogram feature extraction module extracts the spectrogram features from the voice signal, taking as the extraction direction that the feature obtained by fusing the extracted spectrogram features with the image feature sequence extracted from the image sequence by the image feature extraction module 32 approaches the spectrogram features of the denoised voice signal.
  • Correspondingly, another way of using the feature fusion module 33 to fuse the voice information and the image feature sequence to obtain the fusion feature, with the denoised voice information as the fusion direction, is: taking the spectrogram features of the denoised voice signal as the fusion direction, fuse the extracted spectrogram features and the image feature sequence to obtain a second fusion feature; then use the No. 4 feature fusion module of the feature fusion module 33 to fuse the first fusion feature and the second fusion feature to obtain the fusion feature of the voice signal and the image sequence.
  • Another implementation of extracting the two types of voice information from the voice signal by the voice information extraction module 31 may be:
  • use the voice information extraction module 31 to extract the spectrogram features and the acoustic features from the voice signal, taking as the extraction direction that the feature obtained by fusing the extracted acoustic features with the image feature sequence extracted from the image sequence by the image feature extraction module 32 approaches the acoustic features of the denoised voice signal, and that the feature obtained by fusing the extracted spectrogram features, the extracted acoustic features, and that image feature sequence approaches the spectrogram features of the denoised voice signal.
  • Specifically, the acoustic feature extraction module of the voice information extraction module 31 may be used to extract the acoustic features from the voice signal, and the spectrogram feature extraction module of the voice information extraction module 31 may be used to extract the spectrogram features from the voice signal. That is, the voice information extraction module 31 includes an acoustic feature extraction module and a spectrogram feature extraction module, where:
  • the acoustic feature extraction module extracts the acoustic features from the voice signal, taking as the extraction direction that the feature obtained by fusing the extracted acoustic features, the spectrogram features extracted from the voice signal by the spectrogram feature extraction module, and the image feature sequence extracted from the image sequence by the image feature extraction module 32 approaches the spectrogram features of the denoised voice signal, and that the feature obtained by fusing the extracted acoustic features with that image feature sequence approaches the acoustic features of the denoised voice signal;
  • the spectrogram feature extraction module extracts the spectrogram features from the voice signal, taking as the extraction direction that the feature obtained by fusing the extracted spectrogram features, the acoustic features extracted from the voice signal by the acoustic feature extraction module, and the image feature sequence extracted from the image sequence by the image feature extraction module 32 approaches the spectrogram features of the denoised voice signal.
  • Correspondingly, one way of using the feature fusion module 33 to fuse the voice information and the image feature sequence to obtain the fusion feature, with the denoised voice information as the fusion direction, is: taking the spectrogram features of the denoised voice signal as the fusion direction, fuse the extracted spectrogram features and the first fusion feature obtained by the No. 3 feature fusion module to obtain the fusion feature of the voice signal and the image sequence.
  • Another implementation of extracting the two types of voice information from the voice signal by the voice information extraction module 31 may be:
  • use the voice information extraction module 31 to extract the spectrogram features and the acoustic features from the voice signal, taking as the extraction direction that the feature obtained by fusing the extracted spectrogram features with the image feature sequence extracted from the image sequence approaches the spectrogram features of the denoised voice signal, and that the feature obtained by fusing the extracted spectrogram features, the extracted acoustic features, and the image feature sequence extracted from the image sequence by the image feature extraction module 32 approaches the acoustic features of the denoised voice signal.
  • Specifically, the acoustic feature extraction module of the voice information extraction module 31 may be used to extract the acoustic features from the voice signal, and the spectrogram feature extraction module of the voice information extraction module 31 may be used to extract the spectrogram features from the voice signal. That is, the voice information extraction module 31 includes an acoustic feature extraction module and a spectrogram feature extraction module, where:
  • the acoustic feature extraction module extracts the acoustic features from the voice signal, taking as the extraction direction that the feature obtained by fusing the extracted acoustic features, the spectrogram features extracted from the voice signal by the spectrogram feature extraction module, and the image feature sequence extracted from the image sequence by the image feature extraction module 32 approaches the acoustic features of the denoised voice signal;
  • the spectrogram feature extraction module extracts the spectrogram features from the voice signal, taking as the extraction direction that the feature obtained by fusing the extracted spectrogram features, the acoustic features extracted from the voice signal by the acoustic feature extraction module, and the image feature sequence extracted from the image sequence by the image feature extraction module 32 approaches the acoustic features of the denoised voice signal, and that the feature obtained by fusing the extracted spectrogram features with that image feature sequence approaches the spectrogram features of the denoised voice signal.
  • Correspondingly, one way of using the feature fusion module 33 to fuse the voice information and the image feature sequence to obtain the fusion feature, with the denoised voice information as the fusion direction, is:
  • use the No. 1 feature fusion module of the feature fusion module 33 to fuse the extracted spectrogram features and the image feature sequence, taking the spectrogram features of the denoised speech signal as the fusion direction, to obtain a second fusion feature;
  • then use the second feature fusion module of the feature fusion module 33 to fuse the extracted acoustic features and the second fusion feature, taking the acoustic features of the denoised voice signal as the fusion direction, to obtain the fusion feature of the voice signal and the image sequence.
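  • The sketch below illustrates the two-stage fusion just described (a first module fusing the spectrogram features with the image feature sequence into an intermediate fusion feature, and a second module fusing the acoustic features with that intermediate feature). Concatenation plus linear projection and the dimensions are assumptions for illustration only.

```python
# Illustrative two-stage fusion; layer choices and dimensions are assumptions.
import torch
import torch.nn as nn

class TwoStageFusion(nn.Module):
    def __init__(self, spec_dim=201, img_dim=256, ac_dim=512, hidden=256):
        super().__init__()
        self.fuse_spec_img = nn.Linear(spec_dim + img_dim, hidden)   # spectrogram + image
        self.fuse_ac = nn.Linear(ac_dim + hidden, hidden)            # acoustic + intermediate

    def forward(self, spec_feats, img_feats, ac_feats):
        # all inputs assumed time-aligned with shape (B, T, dim)
        intermediate = torch.relu(self.fuse_spec_img(torch.cat([spec_feats, img_feats], dim=-1)))
        return self.fuse_ac(torch.cat([ac_feats, intermediate], dim=-1))   # final fusion feature
```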
  • In the implementations above, the voice signal input to the multi-modal voice recognition model may consist of the initial acoustic features extracted from the original voice signal through a sliding window and the spectrogram obtained by performing a short-time Fourier transform on the original voice signal;
  • the acoustic features extracted from the voice signal by the voice information extraction module 31 may be hidden-layer features of the initial acoustic features,
  • and the spectrogram features extracted from the voice signal may be hidden-layer features of the spectrogram.
  • Refer to FIG. 4a and FIG. 4b, where FIG. 4a is a schematic diagram of an architecture for training the multi-modal speech recognition model provided by an embodiment of this application, and FIG. 4b is an implementation flowchart of training the multi-modal speech recognition model, which may include:
  • Step S41: Through the multi-modal speech recognition model, obtain the noise-free voice information (i.e., the clear voice information in FIG. 4a) of the noise-free voice signal (also called the clear voice signal) in the training sample, and the noisy voice information of the noisy voice signal corresponding to that noise-free voice signal in the training sample.
  • The noisy voice signal can be generated by adding noise to the noise-free voice signal, as sketched below.
  • Alternatively, the noise-free voice signal can be obtained by performing denoising processing on a noisy voice signal.
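  • A minimal sketch of constructing such a paired training sample by additive noise mixing is given below; mixing at a chosen signal-to-noise ratio is an assumption, since the patent only states that noise is added to the noise-free signal.

```python
# Illustrative additive-noise mixing at a target SNR; the SNR-based scaling is an assumption.
import numpy as np

def add_noise(clean, noise, snr_db):
    noise = np.resize(noise, clean.shape)                   # repeat/trim noise to match length
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    return clean + scale * noise                            # noisy signal paired with the clean one
```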
  • Step S42 Obtain the sample image feature sequence of the sample image sequence in the training sample through the multi-modal speech recognition model.
  • the sample image feature sequence can be extracted from the sample image sequence by the image feature extraction module 32.
  • Step S43 Fuse the noisy voice information and the feature sequence of the sample image through the multimodal voice recognition model to obtain the fusion feature of the training sample.
  • the noisy voice information and the sample image feature sequence can be fused by the feature fusion module 33 to obtain the fusion feature of the training sample.
  • Step S44 Use the fusion feature of the training sample to perform voice recognition through the multimodal voice recognition model, and obtain the voice recognition result corresponding to the training sample.
  • the recognition module 22 can use the fusion feature of the training sample to perform speech recognition, and obtain the speech recognition result corresponding to the training sample.
  • Step S45: Update the parameters of the multi-modal speech recognition model with the objectives that the fusion feature of the training sample approaches the noise-free voice information and that the speech recognition result corresponding to the training sample approaches the sample label of the training sample.
  • Specifically, a first loss function can be used to calculate the difference between the fusion feature of the training sample and the noise-free voice information (recorded as the first difference for ease of description), and a second loss function can be used to calculate the difference between the speech recognition result corresponding to the training sample and the sample label of the training sample (recorded as the second difference); the parameters of the multi-modal speech recognition model are then updated according to the weighted sum of the first difference and the second difference, as sketched below.
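  • The sketch below illustrates one possible realization of this update step, reusing the MultiModalASR skeleton above. The choice of mean-squared error for the first difference, cross-entropy for the second difference, and the weights w1 and w2 are assumptions; the patent only requires a weighted sum of the two differences.

```python
# Illustrative training step for step S45; loss choices and weights are assumptions.
import torch.nn.functional as F

def training_step(model, optimizer, noisy_feats, lip_images, clean_voice_info, labels,
                  w1=0.5, w2=1.0):
    fusion = model.fusion_feature_module(noisy_feats, lip_images)    # fusion feature of the sample
    logits = model.classifier(model.recognition_rnn(fusion)[0])      # speech recognition result

    first_diff = F.mse_loss(fusion, clean_voice_info)                # fusion feature vs. clear voice info
    second_diff = F.cross_entropy(logits.transpose(1, 2), labels)    # recognition result vs. sample label
    loss = w1 * first_diff + w2 * second_diff                        # weighted sum of the two differences

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```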
  • The multi-modal speech recognition model trained with the training method shown in FIG. 4a and FIG. 4b thus has the ability to obtain information that fuses the speech signal and the image sequence as fusion information, taking the voice information of the denoised speech signal as the acquisition direction, and to perform speech recognition using the fusion information to obtain the speech recognition result of the speech signal.
  • The following describes the training process of the multi-modal speech recognition model for different kinds of voice information.
  • Refer to FIG. 5a and FIG. 5b, where FIG. 5a is a schematic diagram of an architecture for training the multi-modal speech recognition model and FIG. 5b is an implementation flowchart of training the multi-modal speech recognition model, which may include:
  • Step S51: Through the multi-modal speech recognition model, acquire the acoustic features of the noise-free speech signal in the training sample (i.e., the clear acoustic features in FIG. 5a, also called the noise-free acoustic features) and the acoustic features of the noisy speech signal corresponding to that noise-free speech signal in the training sample (i.e., the noise acoustic features in FIG. 5a).
  • the acoustic feature extraction module of the speech information extraction module 31 can extract clear acoustic features from a noise-free speech signal, and extract noise acoustic features from a noisy speech signal.
  • the acquisition process of the noise voice signal and the noise-free voice signal can refer to the foregoing embodiment, and will not be repeated here.
  • Step S52 Obtain the sample image feature sequence of the sample image sequence in the training sample through the multi-modal speech recognition model.
  • Step S53 Fuse the noise acoustic feature and the sample image feature sequence through the multi-modal speech recognition model to obtain the fusion feature of the training sample.
  • Step S54 Perform voice recognition using the fusion feature of the training samples through the multimodal voice recognition model, and obtain the voice recognition results corresponding to the training samples.
  • Step S55: Update the parameters of the multi-modal speech recognition model with the objectives that the fusion feature of the training sample approaches the noise-free acoustic features and that the speech recognition result corresponding to the training sample approaches the sample label of the training sample.
  • Specifically, the first difference between the fusion feature of the training sample and the clear acoustic features can be calculated by the first loss function, the second difference between the speech recognition result corresponding to the training sample and the sample label of the training sample can be calculated by the second loss function, and the parameters of the multi-modal speech recognition model are updated according to the weighted sum of the first difference and the second difference.
  • After such training, the acoustic feature extraction module has the ability to extract acoustic features from the voice signal, taking as the extraction direction that the feature obtained by fusing the extracted acoustic features with the image feature sequence extracted from the image sequence by the image feature extraction module 32 approaches the acoustic features of the denoised voice signal;
  • the image feature extraction module 32 has the ability to extract the image feature sequence from the image sequence, taking as the extraction direction that the feature obtained by fusing the extracted image feature sequence with the acoustic features extracted from the voice signal by the acoustic feature extraction module approaches the acoustic features of the denoised voice signal;
  • and the feature fusion module 33 has the ability to fuse the extracted acoustic features and the image feature sequence to obtain the fusion feature, taking as the fusion direction that the fused feature approaches the acoustic features of the denoised speech signal.
• Figure 6a is a schematic diagram of an architecture for training a multi-modal speech recognition model, and Figure 6b is an implementation flow chart for training the multi-modal speech recognition model, which may include:
• Step S61: Obtain, through the multi-modal speech recognition model, the spectrogram feature of the noise-free speech signal in the training sample (i.e. the clear spectrogram feature in Figure 6a, which can also be called the noise-free spectrogram feature), and the spectrogram feature of the noisy speech signal contained in the training sample that corresponds to the noise-free speech signal (that is, the noise spectrogram feature in Figure 6a).
  • the spectrogram feature extraction module of the speech information extraction module 31 can extract the clear spectrogram feature from the noise-free speech signal, and extract the noise spectrogram feature from the noise speech signal.
  • the acquisition process of the noise voice signal and the noise-free voice signal can refer to the foregoing embodiment, and will not be repeated here.
  • Step S62 Obtain the sample image feature sequence of the sample image sequence in the training sample through the multi-modal speech recognition model.
  • Step S63 Fuse the noise spectrogram feature and the sample image feature sequence through the multi-modal speech recognition model to obtain the fusion feature of the training sample.
  • the feature fusion module 33 can be used to fuse the spectrogram feature of the noise speech signal and the feature sequence of the sample image to obtain the fusion feature of the training sample.
  • Step S64 Perform voice recognition using the fusion feature of the training samples through the multimodal voice recognition model, and obtain the voice recognition results corresponding to the training samples.
• Step S65: Update the parameters of the multi-modal speech recognition model with the goal that the fusion feature of the training sample approaches the noise-free spectrogram feature and the speech recognition result corresponding to the training sample approaches the sample label of the training sample.
• the first difference between the fusion feature and the clear spectrogram feature can be calculated by the first loss function, and the second difference between the speech recognition result corresponding to the training sample and the sample label of the training sample can be calculated by the second loss function; the weighted sum of the first difference and the second difference is used to update the parameters of the multi-modal speech recognition model.
• Through the above training, the spectrogram feature extraction module acquires the ability to extract spectrogram features from the speech signal, with the extraction direction being that the feature obtained by fusing the extracted spectrogram features with the image feature sequence extracted from the image sequence by the image feature extraction module 32 approaches the spectrogram feature of the speech signal after noise removal;
• the image feature extraction module 32 acquires the ability to extract an image feature sequence from the image sequence, with the extraction direction being that the feature obtained by fusing the extracted image feature sequence with the spectrogram features extracted from the speech signal by the spectrogram feature extraction module approaches the spectrogram feature of the speech signal after noise removal.
  • the feature fusion module 33 has the ability to fuse the extracted spectrogram features and the image feature sequence to obtain the fusion feature by taking the fused feature close to the spectrogram feature after noise removal from the speech signal as the fusion direction.
• Figure 7a is a schematic diagram of an architecture for training a multi-modal speech recognition model, and Figure 7b is an implementation flow chart for training the multi-modal speech recognition model, which may include:
• Step S71: Obtain, through the multi-modal speech recognition model, the spectrogram feature of the noise-free speech signal in the training sample (i.e. the clear spectrogram feature in Figure 7a), as well as the spectrogram feature (i.e. the noise spectrogram feature in Figure 7a) and the acoustic feature (i.e. the noise acoustic feature in Figure 7a) of the noisy speech signal contained in the training sample that corresponds to the noise-free speech signal.
  • Step S72 Obtain the sample image feature sequence of the sample image sequence in the training sample through the multi-modal speech recognition model.
  • Step S73 Fuse the noise acoustic feature and the image feature sequence through the multi-modal speech recognition model to obtain the first fusion feature of the training sample.
  • Step S74 Fuse the spectrogram feature of the noise speech signal and the first fusion feature of the training sample through the multimodal speech recognition model to obtain the fusion feature of the training sample.
  • Step S75 Perform voice recognition on the fusion feature of the training sample through the multi-modal voice recognition model to obtain a voice recognition result corresponding to the training sample.
• Step S76: Update the parameters of the multi-modal speech recognition model with the goal that the fusion feature of the training sample approaches the noise-free spectrogram feature and the speech recognition result corresponding to the training sample approaches the sample label of the training sample.
• the first difference between the fusion feature and the clear spectrogram feature can be calculated by the first loss function, and the second difference between the speech recognition result corresponding to the training sample and the sample label of the training sample can be calculated by the second loss function; the weighted sum of the first difference and the second difference is used to update the parameters of the multi-modal speech recognition model.
• Through the above training, the acoustic feature extraction module acquires the ability to extract acoustic features from the voice signal, with the extraction direction being that the feature obtained by fusing the extracted acoustic features, the spectrogram features extracted from the voice signal by the spectrogram feature extraction module, and the image feature sequence extracted from the image sequence by the image feature extraction module 32 approaches the spectrogram feature of the voice signal after noise removal;
• the spectrogram feature extraction module acquires the ability to extract spectrogram features from the voice signal, with the same extraction direction;
• the image feature extraction module 32 acquires the ability to extract an image feature sequence from the image sequence, with the same extraction direction;
  • the feature fusion module 33 has the ability to fuse acoustic features, spectrogram features, and image feature sequences with the fused features approaching the spectrogram feature after noise removal from the speech signal as the fusion direction to obtain fusion features.
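• As an illustration of this two-stage fusion (the acoustic feature fused with the image feature sequence to obtain the first fusion feature, then the spectrogram feature fused with the first fusion feature), here is a minimal sketch; the concatenate-then-project design and the module names are assumptions, not specified by the patent.

```python
import torch
import torch.nn as nn

class TwoStageFusion(nn.Module):
    """Stage 1: fuse the acoustic feature with the image feature sequence (first fusion feature).
    Stage 2: fuse the spectrogram feature with the first fusion feature (final fusion feature)."""
    def __init__(self, dim=512):
        super().__init__()
        self.stage1 = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU())
        self.stage2 = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU())

    def forward(self, acoustic_feat, spectrogram_feat, image_feat):
        # All inputs: (batch, frames, dim), frame-aligned at the same rate (e.g. 25 fps).
        first_fusion = self.stage1(torch.cat([acoustic_feat, image_feat], dim=-1))
        fusion = self.stage2(torch.cat([spectrogram_feat, first_fusion], dim=-1))
        return first_fusion, fusion
```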
  • Figure 8a is a schematic diagram of another architecture for training a multimodal speech recognition model.
  • Figure 8b is another implementation flow chart for training the multi-modal speech recognition model, which can include:
• Step S81: Acquire, through the multi-modal speech recognition model, the acoustic feature of the noise-free speech signal in the training sample (i.e. the clear acoustic feature in Figure 8a, that is, the noise-free acoustic feature), as well as the spectrogram feature (i.e. the noise spectrogram feature in Figure 8a) and the acoustic feature (i.e. the noise acoustic feature in Figure 8a) of the noisy speech signal contained in the training sample that corresponds to the aforementioned noise-free speech signal.
  • Step S82 Obtain the sample image feature sequence of the sample image sequence in the training sample through the multi-modal speech recognition model.
  • Step S83 Fuse the noise spectrogram feature and the image feature sequence through the multi-modal speech recognition model to obtain the second fusion feature of the training sample.
  • Step S84 Fusion the acoustic features of the noisy speech signal and the second fusion feature of the training sample through the multi-modal speech recognition model to obtain the fusion feature of the training sample.
  • Step S85 Perform voice recognition on the fusion feature of the training sample through the multi-modal voice recognition model to obtain a voice recognition result corresponding to the training sample.
• Step S86: Update the parameters of the multi-modal speech recognition model with the goal that the fusion feature of the training sample approaches the noise-free acoustic feature and the speech recognition result corresponding to the training sample approaches the sample label of the training sample.
• the first difference between the fusion feature and the clear acoustic feature can be calculated by the first loss function, and the second difference between the speech recognition result corresponding to the training sample and the sample label of the training sample can be calculated by the second loss function; the weighted sum of the first difference and the second difference is used to update the parameters of the multi-modal speech recognition model.
• Through the above training, the acoustic feature extraction module acquires the ability to extract acoustic features from the voice signal, with the extraction direction being that the feature obtained by fusing the extracted acoustic features, the spectrogram features extracted from the voice signal by the spectrogram feature extraction module, and the image feature sequence extracted from the image sequence by the image feature extraction module 32 approaches the acoustic feature of the voice signal after noise removal;
• the spectrogram feature extraction module acquires the ability to extract spectrogram features from the voice signal, with the same extraction direction;
• the image feature extraction module 32 acquires the ability to extract an image feature sequence from the image sequence, with the same extraction direction;
  • the feature fusion module 33 has the ability to fuse the extracted acoustic features, spectrogram features, and image feature sequences with the fused features approaching the acoustic features after noise removal from the speech signal as the fusion direction to obtain the fused features.
  • Figure 9a is a schematic diagram of another architecture for training a multimodal speech recognition model.
  • Figure 9b is another implementation flow chart for training the multi-modal speech recognition model, which may include:
• Step S91: Acquire, through the multi-modal speech recognition model, the acoustic feature of the noise-free speech signal in the training sample (i.e. the clear acoustic feature in Figure 9a, that is, the noise-free acoustic feature), as well as the spectrogram feature (i.e. the noise spectrogram feature in Figure 9a) and the acoustic feature (i.e. the noise acoustic feature in Figure 9a) of the noisy speech signal contained in the training sample that corresponds to the aforementioned noise-free speech signal.
  • Step S92 Obtain the sample image feature sequence of the sample image sequence in the training sample through the multi-modal speech recognition model.
  • Step S93 Fuse the acoustic features of the noise voice signal and the image feature sequence through the multi-modal voice recognition model to obtain the first fusion feature of the training sample.
  • Step S94 Fuse the noise spectrogram feature and the image feature sequence through the multi-modal speech recognition model to obtain the second fusion feature of the training sample.
  • Step S95 Fuse the noise acoustic feature and the second fusion feature of the training sample through the multimodal speech recognition model to obtain the fusion feature of the training sample.
  • Step S96 Perform voice recognition on the fusion feature of the training sample through the multi-modal voice recognition model to obtain a voice recognition result corresponding to the training sample.
• Step S97: Update the parameters of the multi-modal speech recognition model with the goal that the first fusion feature of the training sample approaches the noise-free acoustic feature, the fusion feature of the training sample approaches the noise-free acoustic feature, and the speech recognition result corresponding to the training sample approaches the sample label of the training sample.
• the first difference between the first fusion feature of the training sample and the clear acoustic feature can be calculated through the first loss function, the second difference between the fusion feature of the training sample and the clear acoustic feature can also be calculated through the first loss function, and the third difference between the speech recognition result corresponding to the training sample and the sample label of the training sample can be calculated through the second loss function; the parameters of the multi-modal speech recognition model are updated according to the weighted sum of the first difference, the second difference, and the third difference.
  • the loss functions used to calculate the first difference and the second difference are the same. In an optional embodiment, the loss functions used to calculate the first difference and the second difference may also be different, which is not specifically limited in this application.
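• A hedged sketch of the corresponding three-term objective (first fusion feature vs. clear acoustic feature, final fusion feature vs. clear acoustic feature, recognition result vs. sample label); the module names (including the three separate fusion sub-modules), loss choices, and weights are illustrative assumptions only, following the same style as the earlier sketch.

```python
import torch
import torch.nn.functional as F

def three_term_loss(model, noisy_audio, clean_audio, image_seq, labels,
                    w1=0.1, w2=0.1, w3=0.8):   # example weights; the weighting is not fixed by the patent
    noise_acoustic = model.acoustic_encoder(noisy_audio)
    noise_spec = model.spectrogram_encoder(noisy_audio)
    with torch.no_grad():                                    # target treated as fixed (assumption)
        clean_acoustic = model.acoustic_encoder(clean_audio)
    image_feat = model.image_encoder(image_seq)

    first_fusion = model.fuse_acoustic_image(noise_acoustic, image_feat)     # acoustic + image
    second_fusion = model.fuse_spectrogram_image(noise_spec, image_feat)     # spectrogram + image
    fusion = model.fuse_acoustic_second(noise_acoustic, second_fusion)       # acoustic + second fusion
    logits = model.recognizer(fusion)

    first_diff = F.mse_loss(first_fusion, clean_acoustic)    # first fusion feature -> clear acoustic feature
    second_diff = F.mse_loss(fusion, clean_acoustic)         # final fusion feature -> clear acoustic feature
    third_diff = F.cross_entropy(logits.flatten(0, 1), labels.flatten())
    return w1 * first_diff + w2 * second_diff + w3 * third_diff
```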
• Through the above training, the acoustic feature extraction module acquires the ability to extract acoustic features from the voice signal, with the extraction direction being that the features obtained by the fusions described above approach the acoustic feature of the voice signal after noise removal;
• the spectrogram feature extraction module acquires the ability to extract spectrogram features from the voice signal, with the same extraction direction;
• the image feature extraction module 32 acquires the ability to extract an image feature sequence from the image sequence, with the same extraction direction;
• the feature fusion module 33 acquires the ability, with the fused features approaching the acoustic feature of the voice signal after noise removal as the fusion direction, to fuse the acoustic features and the image feature sequence to obtain the first fusion feature, to fuse the spectrogram features and the image feature sequence to obtain the second fusion feature, and to fuse the acoustic features and the second fusion feature to obtain the fusion feature.
  • Figure 10a is a schematic diagram of another architecture for training a multimodal speech recognition model.
  • Figure 10b is another implementation flow chart for training the multi-modal speech recognition model, which may include:
• Step S101: Obtain, through the multi-modal speech recognition model, the spectrogram feature of the noise-free speech signal in the training sample (i.e. the clear spectrogram feature in Figure 10a, that is, the noise-free spectrogram feature), as well as the spectrogram feature (i.e. the noise spectrogram feature in Figure 10a) and the acoustic feature (i.e. the noise acoustic feature in Figure 10a) of the noisy speech signal contained in the training sample that corresponds to the above-mentioned noise-free speech signal.
  • Step S102 Obtain the sample image feature sequence of the sample image sequence in the training sample through the multi-modal speech recognition model.
  • Step S103 Fuse the noise spectrogram feature and the image feature sequence through the multi-modal speech recognition model to obtain the second fusion feature of the training sample.
  • Step S104 Fuse the noise acoustic feature and the image feature sequence through the multi-modal speech recognition model to obtain the first fusion feature of the training sample.
  • Step S105 Fuse the noise spectrogram feature and the first fusion feature of the training sample through the multimodal speech recognition model to obtain the fusion feature of the training sample.
  • Step S106 Perform voice recognition on the fusion feature of the training sample through the multimodal voice recognition model, and obtain a voice recognition result corresponding to the training sample.
• Step S107: Update the parameters of the multi-modal speech recognition model with the goal that the second fusion feature of the training sample approaches the noise-free spectrogram feature, the fusion feature of the training sample approaches the noise-free spectrogram feature, and the speech recognition result corresponding to the training sample approaches the sample label of the training sample.
• the first difference between the second fusion feature of the training sample and the noise-free spectrogram feature can be calculated by the first loss function, the second difference between the fusion feature of the training sample and the noise-free spectrogram feature can also be calculated by the first loss function, and the third difference between the speech recognition result corresponding to the training sample and the sample label of the training sample can be calculated by the second loss function; the parameters of the multi-modal speech recognition model are updated according to the weighted sum of the first difference, the second difference, and the third difference.
  • the loss functions used to calculate the first difference and the second difference are the same. In an optional embodiment, the loss functions used to calculate the first difference and the second difference may also be different, which is not specifically limited in this application.
• Through the above training, the acoustic feature extraction module acquires the ability to extract acoustic features from the voice signal, with the extraction direction being that the features obtained by the fusions described above approach the spectrogram feature of the voice signal after noise removal;
• the spectrogram feature extraction module acquires the ability to extract spectrogram features from the voice signal, with the same extraction direction;
• the image feature extraction module 32 acquires the ability to extract an image feature sequence from the image sequence, with the same extraction direction;
• the feature fusion module 33 acquires the ability, with the fused features approaching the spectrogram feature of the voice signal after noise removal as the fusion direction, to fuse the spectrogram features and the image feature sequence to obtain the second fusion feature, to fuse the acoustic features and the image feature sequence to obtain the first fusion feature, and to fuse the spectrogram features and the first fusion feature to obtain the fusion feature.
  • Figure 11a is a schematic diagram of another architecture for training a multimodal speech recognition model.
  • Figure 11b is another implementation flow chart for training the multi-modal speech recognition model, which may include:
• Step S111: Obtain, through the multi-modal speech recognition model, the spectrogram feature (i.e. the clear spectrogram feature in Figure 11a, that is, the noise-free spectrogram feature) and the acoustic feature (i.e. the clear acoustic feature in Figure 11a, that is, the noise-free acoustic feature) of the noise-free speech signal in the training sample, as well as the spectrogram feature (i.e. the noise spectrogram feature in Figure 11a) and the acoustic feature (i.e. the noise acoustic feature in Figure 11a) of the noisy speech signal contained in the training sample that corresponds to the above-mentioned noise-free speech signal.
  • Step S112 Obtain the sample image feature sequence of the sample image sequence in the training sample through the multi-modal speech recognition model.
  • Step S113 Fuse the noise acoustic feature and the image feature sequence through the multi-modal speech recognition model to obtain the first fusion feature of the training sample.
  • Step S114 Fuse the noise spectrogram feature and the image feature sequence through the multi-modal speech recognition model to obtain the second fusion feature of the training sample.
  • Step S115 The first fusion feature of the training sample and the second fusion feature of the training sample are fused through the multimodal speech recognition model to obtain the fusion feature of the training sample.
  • Step S116 Perform voice recognition on the fusion feature of the training sample through the multi-modal voice recognition model to obtain a voice recognition result corresponding to the training sample.
• Step S117: Update the parameters of the multi-modal speech recognition model with the goal that the first fusion feature of the training sample approaches the noise-free acoustic feature, the second fusion feature of the training sample approaches the noise-free spectrogram feature, and the speech recognition result corresponding to the training sample approaches the sample label of the training sample.
• the first difference between the first fusion feature of the training sample and the noise-free acoustic feature can be calculated by the first loss function, the second difference between the second fusion feature of the training sample and the noise-free spectrogram feature can also be calculated by the first loss function, and the third difference between the speech recognition result corresponding to the training sample and the sample label of the training sample can be calculated by the second loss function; the parameters of the multi-modal speech recognition model are updated according to the weighted sum of the first difference, the second difference, and the third difference.
  • the loss functions used to calculate the first difference and the second difference are the same. In an optional embodiment, the loss functions used to calculate the first difference and the second difference may also be different, which is not specifically limited in this application.
• Through the above training, the acoustic feature extraction module acquires the ability to extract acoustic features from the voice signal, with the extraction direction being that the feature obtained by fusing the extracted acoustic features with the image feature sequence extracted from the image sequence by the image feature extraction module 32 approaches the acoustic feature of the voice signal after noise removal;
• the spectrogram feature extraction module acquires the ability to extract spectrogram features from the voice signal, with the extraction direction being that the feature obtained by fusing the extracted spectrogram features with the image feature sequence extracted by the image feature extraction module 32 approaches the spectrogram feature of the voice signal after noise removal;
• the image feature extraction module 32 acquires the ability to extract an image feature sequence from the image sequence, with the extraction directions being that the feature obtained by fusing the extracted image feature sequence with the acoustic features approaches the acoustic feature of the voice signal after noise removal, and that the feature obtained by fusing it with the spectrogram features approaches the spectrogram feature of the voice signal after noise removal;
• the feature fusion module 33 acquires the ability to fuse the spectrogram features and the image feature sequence, with the second fusion feature obtained by the fusion approaching the spectrogram feature of the voice signal after noise removal as the fusion direction, to obtain the second fusion feature; to fuse the acoustic features and the image feature sequence, with the first fusion feature obtained by the fusion approaching the acoustic feature of the voice signal after noise removal as the fusion direction, to obtain the first fusion feature; and to fuse the first fusion feature and the second fusion feature to obtain the fusion feature.
  • Figure 12a is a schematic diagram of another architecture for training a multimodal speech recognition model.
  • Figure 12b is another implementation flow chart for training the multi-modal speech recognition model, which may include:
• Step S121: Obtain, through the multi-modal speech recognition model, the spectrogram feature (i.e. the clear spectrogram feature in Figure 12a, that is, the noise-free spectrogram feature) and the acoustic feature (i.e. the clear acoustic feature in Figure 12a, that is, the noise-free acoustic feature) of the noise-free speech signal in the training sample, as well as the spectrogram feature (i.e. the noise spectrogram feature in Figure 12a) and the acoustic feature (i.e. the noise acoustic feature in Figure 12a) of the noisy speech signal contained in the training sample that corresponds to the noise-free speech signal.
  • Step S122 Obtain the sample image feature sequence of the sample image sequence in the training sample through the multi-modal speech recognition model.
  • Step S123 Fuse the noise acoustic feature and the image feature sequence through the multi-modal speech recognition model to obtain the first fusion feature of the training sample.
  • Step S124 Fuse the noise spectrogram feature and the first fusion feature of the training sample through the multimodal speech recognition model to obtain the fusion feature of the training sample.
  • Step S125 Perform voice recognition on the fusion feature of the training sample through the multi-modal voice recognition model to obtain a voice recognition result corresponding to the training sample.
• Step S126: Update the parameters of the multi-modal speech recognition model with the goal that the first fusion feature of the training sample approaches the noise-free acoustic feature, the fusion feature of the training sample approaches the noise-free spectrogram feature, and the speech recognition result corresponding to the training sample approaches the sample label of the training sample.
• the first difference between the first fusion feature of the training sample and the noise-free acoustic feature can be calculated by the first loss function, the second difference between the fusion feature of the training sample and the noise-free spectrogram feature can also be calculated by the first loss function, and the third difference between the speech recognition result corresponding to the training sample and the sample label of the training sample can be calculated by the second loss function; the parameters of the multi-modal speech recognition model are updated according to the weighted sum of the first difference, the second difference, and the third difference.
  • the loss functions used to calculate the first difference and the second difference are the same. In an optional embodiment, the loss functions used to calculate the first difference and the second difference may also be different, which is not specifically limited in this application.
• Through the above training, the acoustic feature extraction module acquires the ability to extract acoustic features from the voice signal, with the extraction direction being that the feature obtained by fusing the extracted acoustic features with the image feature sequence extracted from the image sequence by the image feature extraction module 32 approaches the acoustic feature of the voice signal after noise removal;
• the spectrogram feature extraction module acquires the ability to extract spectrogram features from the voice signal, with the extraction direction being that the features obtained by the fusions described above approach the spectrogram feature of the voice signal after noise removal;
• the image feature extraction module 32 acquires the ability to extract an image feature sequence from the image sequence, with the extraction direction being that the feature obtained by fusing the extracted image feature sequence with the acoustic features extracted from the voice signal by the acoustic feature extraction module approaches the acoustic feature of the voice signal after noise removal;
• the feature fusion module 33 acquires the ability to fuse the acoustic features and the image feature sequence, with the first fusion feature obtained by the fusion approaching the acoustic feature of the voice signal after noise removal as the fusion direction, to obtain the first fusion feature; and to fuse the spectrogram features and the first fusion feature, with the fused feature approaching the spectrogram feature of the voice signal after noise removal as the fusion direction, to obtain the fusion feature.
  • Figure 13a is a schematic diagram of another architecture for training a multimodal speech recognition model.
  • Figure 13b is another implementation flow chart for training the multi-modal speech recognition model, which may include:
• Step S131: Obtain, through the multi-modal speech recognition model, the spectrogram feature (i.e. the clear spectrogram feature in Figure 13a, that is, the noise-free spectrogram feature) and the acoustic feature (i.e. the clear acoustic feature in Figure 13a, that is, the noise-free acoustic feature) of the noise-free speech signal in the training sample, as well as the spectrogram feature (i.e. the noise spectrogram feature in Figure 13a) and the acoustic feature (i.e. the noise acoustic feature in Figure 13a) of the noisy speech signal contained in the training sample that corresponds to the noise-free speech signal.
  • Step S132 Obtain the sample image feature sequence of the sample image sequence in the training sample through the multi-modal speech recognition model.
  • Step S133 Fuse the noise spectrogram feature and the image feature sequence through the multimodal speech recognition model to obtain the second fusion feature of the training sample.
  • Step S134 Fuse the noise acoustic feature and the second fusion feature of the training sample through the multimodal speech recognition model to obtain the fusion feature of the training sample.
  • Step S135 Perform voice recognition on the fusion feature of the training sample through the multi-modal voice recognition model to obtain a voice recognition result corresponding to the training sample.
• Step S136: Update the parameters of the multi-modal speech recognition model with the goal that the second fusion feature of the training sample approaches the noise-free spectrogram feature, the fusion feature of the training sample approaches the noise-free acoustic feature, and the speech recognition result corresponding to the training sample approaches the sample label of the training sample.
• the first difference between the second fusion feature of the training sample and the noise-free spectrogram feature can be calculated through the first loss function, the second difference between the fusion feature of the training sample and the noise-free acoustic feature can also be calculated through the first loss function, and the third difference between the speech recognition result corresponding to the training sample and the sample label of the training sample can be calculated through the second loss function; the parameters of the multi-modal speech recognition model are updated according to the weighted sum of the first difference, the second difference, and the third difference.
  • the loss functions used to calculate the first difference and the second difference are the same. In an optional embodiment, the loss functions used to calculate the first difference and the second difference may also be different, which is not specifically limited in this application.
• Through the above training, the acoustic feature extraction module acquires the ability to extract acoustic features from the voice signal, with the extraction direction being that the features obtained by the fusions described above approach the acoustic feature of the voice signal after noise removal;
• the spectrogram feature extraction module acquires the ability to extract spectrogram features from the voice signal, with the extraction direction being that the feature obtained by fusing the extracted spectrogram features with the image feature sequence extracted by the image feature extraction module 32 approaches the spectrogram feature of the voice signal after noise removal;
• the image feature extraction module 32 acquires the ability to extract an image feature sequence from the image sequence, with the extraction direction being that the feature obtained by fusing the extracted image feature sequence with the spectrogram features approaches the spectrogram feature of the voice signal after noise removal;
• the feature fusion module 33 acquires the ability to fuse the spectrogram features and the image feature sequence, with the second fusion feature obtained by the fusion approaching the spectrogram feature of the voice signal after noise removal as the fusion direction, to obtain the second fusion feature; and to fuse the acoustic features and the second fusion feature, with the fused feature approaching the acoustic feature of the voice signal after noise removal as the fusion direction, to obtain the fusion feature.
  • the weight of each difference is not limited, and the weight corresponding to each difference may be the same or different.
  • the weight of each difference can be set in advance, or it can be learned during the training process of the multi-modal speech recognition model. Taking the embodiment shown in FIG. 5a as an example, optionally, the weight of the first difference may be 0.2, and the weight of the second difference may be 0.8.
  • the first loss function may be an L2 norm or an L1 norm
  • the second loss function may be a cross-entropy function
  • the inventor of the present application found that the amount of audio/video data collected synchronously is usually small, and the multi-modal speech recognition model obtained by training with only synchronously collected audio/video data as a training sample may cause overfitting. To avoid over-fitting, and to further improve the recognition accuracy of the multi-modal speech recognition model, some functional modules can be pre-trained before training the multi-modal speech recognition model.
• the initial parameters of the acoustic feature extraction module of the speech information extraction module 31 are: the parameters of the feature extraction module, used for feature extraction on the acoustic features of the speech signal, in a speech recognition model trained with speech signals and their corresponding speech content as training data.
  • the initial parameters of the acoustic feature extraction module are the parameters of the feature extraction module in the speech recognition model trained with pure speech samples.
  • the specific architecture of the speech recognition model is not limited, but no matter what the architecture of the speech recognition model is, the feature extraction module is a necessary functional module.
• the speech recognition model may include: a feature extraction module for extracting hidden layer features of the acoustic features input to the speech recognition model; and a recognition module for performing speech recognition based on the hidden layer features extracted by the feature extraction module.
  • the training process of the speech recognition model can refer to the existing training methods, which will not be described in detail here.
  • the speech samples used to train the speech recognition model may include the speech samples used to train the aforementioned multi-modal speech recognition model, or may not include the aforementioned speech samples used to train the aforementioned multi-modal speech recognition model. There is no specific limitation.
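• A minimal sketch of such pre-training-based initialization, assuming PyTorch-style state dictionaries and a hypothetical "feature_extractor." parameter prefix in the pre-trained speech recognition model checkpoint; the actual checkpoint layout and module names are not specified by the patent.

```python
import torch

def init_from_pretrained_asr(multimodal_model, asr_checkpoint_path):
    """Copy feature-extractor weights from a pre-trained speech recognition model
    into the acoustic feature extraction module (module names are hypothetical)."""
    asr_state = torch.load(asr_checkpoint_path, map_location="cpu")  # assumed to be a plain state_dict
    # Keep only the parameters belonging to the ASR model's feature extraction module.
    prefix = "feature_extractor."
    extractor_state = {k[len(prefix):]: v for k, v in asr_state.items() if k.startswith(prefix)}
    # Use them as the initial parameters of the acoustic feature extraction module.
    multimodal_model.acoustic_encoder.load_state_dict(extractor_state, strict=False)
    return multimodal_model
```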
• the initial parameters of the spectrogram feature extraction module are: the parameters of the spectrogram feature extraction module, used for feature extraction on the spectrogram of the speech signal, in a speech separation model trained with speech signals and their corresponding spectrogram labels as training data.
  • the initial parameters of the spectrogram feature extraction module are the parameters of the spectrogram feature extraction module in the speech separation model trained with pure speech samples.
  • the specific architecture of the speech separation model is not limited, but no matter what the architecture of the speech separation model is, the spectrogram feature extraction module is a necessary functional module.
• the voice separation model may include: a spectrogram feature extraction module for extracting hidden layer features of the spectrogram input to the voice separation model; and a separation module for performing speech separation based on the hidden layer features extracted by the spectrogram feature extraction module.
  • the training process of the speech separation model can refer to the existing training methods, which will not be detailed here.
• the voice samples used to train the voice separation model may include the voice samples used to train the multi-modal voice recognition model, or may not include them; this application does not specifically limit this.
• the initial parameters of the image feature extraction module are: the parameters of the image feature extraction module, used for feature extraction on the image sequence, in a trained lip language recognition model.
  • the initial parameters of the image feature extraction module are the parameters of the image feature extraction module in the lip language recognition model trained with pure image sequence samples.
  • the specific architecture of the lip language recognition model is not limited, but no matter what the architecture of the lip language recognition model is, the image feature extraction module is a necessary functional module.
• the lip language recognition model may include: an image feature extraction module for extracting the hidden layer feature sequence of the image sequence input to the lip language recognition model; and a recognition module for performing lip language recognition based on the hidden layer feature sequence extracted by the image feature extraction module.
  • the training process of the lip recognition model can refer to the existing training methods, which will not be detailed here.
  • the image sequence samples used to train the lip language recognition model may include the image sequence samples used to train the above-mentioned multi-modal speech recognition model, or may not include the above-mentioned image sequence samples used to train the above-mentioned multi-modal speech recognition model. This application does not specifically limit this.
• the recognition module 22 uses the fusion feature to perform speech recognition, and the obtained speech recognition result is usually a phoneme-level recognition result, such as a triphone. After the triphones are obtained, the phoneme sequence can be decoded into a text sequence by the Viterbi algorithm.
  • the specific decoding process can refer to existing methods, which will not be described in detail here.
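• For reference, a minimal generic Viterbi decoder over per-frame phoneme scores; real decoding typically also involves a pronunciation lexicon and a language model, and the variable names here are purely illustrative.

```python
import numpy as np

def viterbi_decode(log_probs, log_trans, log_init):
    """Minimal Viterbi decoding.
    log_probs: (T, N) per-frame log scores for N phoneme states (e.g. triphones).
    log_trans: (N, N) log transition scores between states.
    log_init:  (N,) log initial-state scores.
    Returns the most likely state sequence of length T."""
    T, N = log_probs.shape
    delta = log_init + log_probs[0]            # best score of any path ending in each state
    backpointers = np.zeros((T, N), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_trans    # scores[i, j]: come from state i, go to state j
        backpointers[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_probs[t]
    # Backtrack the best path.
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(backpointers[t, path[-1]]))
    return path[::-1]
```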
  • the voice signal input to the multi-modal voice recognition model may be an acoustic feature extracted from the original voice signal and/or a spectrogram obtained from the original voice signal through short-time Fourier transform.
• In one case, the input to the multi-modal speech recognition model is an acoustic feature extracted from the original speech signal (for example, the fbank feature). Taking the fbank feature as an example, the fbank feature can be extracted with a sliding window, where the window length can be 25 ms and the frame shift 10 ms, that is, the speech signals at two adjacent sliding window positions overlap by 15 ms; at each window position a 40-dimensional fbank feature vector is extracted (other dimensions are of course possible, and this application does not specifically limit this), so the fbank features obtained in this way form a 100 fps fbank feature vector sequence.
  • the feature extracted by the multi-modal speech recognition model from the input fbank feature is the hidden layer feature of the fbank feature.
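• A minimal sketch of such an fbank front end (25 ms window, 10 ms frame shift, 40 mel bins, giving a 100 fps feature sequence), using torchaudio's Kaldi-compatible fbank purely as an example; the patent does not mandate any particular toolkit.

```python
import torchaudio

def extract_fbank(wav_path):
    """Extract a 100 fps, 40-dimensional fbank feature vector sequence
    (25 ms window, 10 ms frame shift)."""
    waveform, sample_rate = torchaudio.load(wav_path)       # (channels, samples)
    fbank = torchaudio.compliance.kaldi.fbank(
        waveform,
        sample_frequency=sample_rate,
        frame_length=25.0,    # window length in ms
        frame_shift=10.0,     # frame shift in ms -> 100 frames per second
        num_mel_bins=40,      # 40-dimensional fbank feature
    )
    return fbank              # shape: (num_frames, 40)
```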
• In another case, the input to the multi-modal speech recognition model is the spectrogram obtained from the original speech signal through a short-time Fourier transform; what the multi-modal speech recognition model extracts from the input spectrogram is the hidden layer feature of the spectrogram. In yet another case, the input to the multi-modal speech recognition model is both the acoustic features extracted from the original speech signal and the spectrogram obtained from the original speech signal by a short-time Fourier transform.
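• Similarly, a short sketch of obtaining a magnitude spectrogram via a short-time Fourier transform with the same 25 ms / 10 ms framing; the FFT size and window choice here are illustrative assumptions.

```python
import torch

def compute_spectrogram(waveform, sample_rate=16000):
    """Magnitude spectrogram via short-time Fourier transform
    (25 ms window, 10 ms hop, matching the fbank framing above)."""
    win_length = int(0.025 * sample_rate)   # 25 ms
    hop_length = int(0.010 * sample_rate)   # 10 ms
    stft = torch.stft(
        waveform,                            # (channels, samples) or (samples,)
        n_fft=512,
        hop_length=hop_length,
        win_length=win_length,
        window=torch.hann_window(win_length),
        return_complex=True,
    )
    return stft.abs()                        # (freq_bins, num_frames) per channel
```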
  • the frame rate of the video is usually 25fps.
• The text annotation of the sample speech signal is also preprocessed. Specifically, force alignment can be used to align the pronunciation phonemes of the text to the speech signal, where every 4 frames of the speech signal (each time the sliding window slides to a position, one frame of the speech signal is determined) correspond to one triphone, so that the text label is effectively converted into a triphone label. The frame rate of the labels is therefore 25 fps, which is a quarter of the audio frame rate and is synchronized with the video frame rate.
• The noisy speech signal input to the multi-modal speech recognition model can be represented as 100 fps speech frames (for ease of description, recorded as noisy speech frames). The noisy speech frames yield an initial fbank feature vector sequence (for ease of description, recorded as the initial noise fbank feature vector sequence), obtained by sliding a window with a length of 25 ms and a frame shift of 10 ms over the original noisy speech signal; each initial noise fbank feature vector is a 40-dimensional feature vector.
• The noise-free speech signal input to the multi-modal speech recognition model can likewise be represented as 100 fps speech frames (for ease of description, recorded as noise-free speech frames). The corresponding initial fbank feature vector sequence (for ease of description, recorded as the initial noise-free fbank feature vector sequence) is obtained by sliding a window with a length of 25 ms and a frame shift of 10 ms over the original noise-free speech signal; each initial noise-free fbank feature vector is a 40-dimensional feature vector.
• After passing through the acoustic feature extraction module, the initial noise fbank feature vector sequence is down-sampled by a factor of 4 in the time dimension, yielding a 25 fps, 512-dimensional noise fbank feature vector sequence; likewise, the initial noise-free fbank feature vector sequence is down-sampled by a factor of 4 in the time dimension by the acoustic feature extraction module, yielding a 25 fps, 512-dimensional noise-free fbank feature vector sequence.
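• A minimal sketch of an acoustic feature extraction module with this behaviour, using two stride-2 convolutions for the 4x temporal down-sampling (100 fps to 25 fps) and a 512-dimensional output; the internal architecture is not specified by the patent and is assumed here.

```python
import torch
import torch.nn as nn

class AcousticEncoder(nn.Module):
    """Maps a 100 fps, 40-dim fbank sequence to a 25 fps, 512-dim hidden feature sequence."""
    def __init__(self, in_dim=40, hidden_dim=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(in_dim, hidden_dim, kernel_size=3, stride=2, padding=1),      # 100 fps -> 50 fps
            nn.ReLU(),
            nn.Conv1d(hidden_dim, hidden_dim, kernel_size=3, stride=2, padding=1),  # 50 fps -> 25 fps
            nn.ReLU(),
        )

    def forward(self, fbank):                 # fbank: (batch, frames, 40)
        x = fbank.transpose(1, 2)             # (batch, 40, frames)
        x = self.net(x)                       # (batch, 512, frames // 4)
        return x.transpose(1, 2)              # (batch, frames // 4, 512)

# Example: 4 seconds of audio -> 400 fbank frames -> 100 output frames at 25 fps.
feats = AcousticEncoder()(torch.randn(1, 400, 40))
assert feats.shape == (1, 100, 512)
```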
• the image sequence input to the multi-modal speech recognition model can be a 25 fps image sequence of 80×80 RGB three-channel images, which yields a 25 fps, 512-dimensional image feature vector sequence after passing through the image feature extraction module.
  • the 25fps 512-dimensional noise fbank feature vector sequence and the 25fps 512-dimensional image feature vector sequence are input to the feature fusion module.
• Each time the feature fusion module receives a noise fbank feature vector and an image feature vector, it fuses them (for example, by splicing the noise fbank feature vector and the image feature vector), then generates a 512-dimensional fusion feature vector through a small fusion neural network and outputs the 512-dimensional fusion feature vector to the recognition module.
• the recognition module performs phoneme recognition on the received 512-dimensional fusion feature vector via softmax classification and obtains the triphone recognition result.
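• A minimal sketch of this splice-and-fuse step followed by triphone classification; the size of the small fusion network and the number of triphone classes are chosen purely for illustration.

```python
import torch
import torch.nn as nn

class FusionAndRecognition(nn.Module):
    """Concatenate a 512-dim fbank feature vector with a 512-dim image feature vector,
    map the result back to 512 dims with a small fusion network, and classify triphones."""
    def __init__(self, feat_dim=512, num_triphones=9000):   # num_triphones is illustrative
        super().__init__()
        self.fusion = nn.Sequential(
            nn.Linear(2 * feat_dim, feat_dim),
            nn.ReLU(),
            nn.Linear(feat_dim, feat_dim),
        )
        self.classifier = nn.Linear(feat_dim, num_triphones)

    def forward(self, fbank_feat, image_feat):               # both: (batch, frames, 512)
        fused = self.fusion(torch.cat([fbank_feat, image_feat], dim=-1))  # (batch, frames, 512)
        logits = self.classifier(fused)                       # per-frame triphone scores
        return fused, logits.log_softmax(dim=-1)
```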
• The loss function used to update the parameters of the multi-modal speech recognition model consists of two parts: the L2 norm between the 512-dimensional fusion feature vector and the corresponding 512-dimensional noise-free fbank feature vector is used as one part of the loss function, so that the fused feature vector is drawn closer to the corresponding 512-dimensional noise-free fbank feature vector, thereby achieving noise reduction at the feature level; the cross-entropy between the recognition result of the recognition module and the triphone label is calculated as the other part of the loss function.
• The speech signal input to the multi-modal speech recognition model can be the initial fbank feature vector sequence of 100 fps speech frames; after passing through the acoustic feature extraction module, the initial fbank feature vector sequence is down-sampled by a factor of 4 in the time dimension to obtain a 25 fps, 512-dimensional fbank feature vector sequence. The image sequence input to the multi-modal speech recognition model can be a 25 fps sequence of 80×80 RGB three-channel images, which yields a 25 fps, 512-dimensional image feature vector sequence after passing through the image feature extraction module. The 25 fps, 512-dimensional fbank feature vector sequence and the 25 fps, 512-dimensional image feature vector sequence are input to the feature fusion module. Each time the feature fusion module receives an fbank feature vector and an image feature vector, it fuses them to generate a 512-dimensional fusion feature vector, which is output to the recognition module. The recognition module performs phoneme recognition on the received 512-dimensional fusion feature vector via softmax classification and obtains the triphone recognition result.
• The inventor of the present application found that current multi-modal speech recognition methods that assist speech recognition with lip motion videos are extremely sensitive to the training data set. For example, if most of the data in the training set is English data and only a small amount is Chinese data, adding lip motion information may cause Chinese speech to be recognized as English under high noise, which instead reduces the speech recognition effect.
  • the solution based on this application can significantly alleviate the recognition confusion caused by the imbalance of the language of the training data set, and further improve the multi-modal speech recognition effect in a high-noise environment.
• The multi-modal speech recognition model of the present application has low dependence on the training set. Even if the language distribution of the samples in the training data set is not uniform, the trained multi-modal speech recognition model can accurately perform multi-lingual speech recognition (the recognizable languages being the languages included in the training sample set), which greatly reduces the problem of recognition confusion.
  • the training sample set used for training the above-mentioned multi-modal speech recognition model may only include training samples of a single language, or may include training samples of two or more languages.
  • the proportion of training samples of each language in the training sample set is randomly determined, or is a preset proportion.
• The test set used for testing may be mainly an English corpus, with only a small part being a Chinese corpus.
  • the embodiment of the present application also provides a voice recognition device.
  • a schematic structural diagram of the voice recognition device provided in the embodiment of the present application is shown in FIG. 14, and may include:
  • the acquiring module 141 is configured to acquire a voice signal and an image sequence collected synchronously with the voice signal; the image in the image sequence is an image of a region related to lip movement;
• the feature extraction module 142 is configured to obtain the information fusing the voice signal and the image sequence as the fusion information, with the fusion information approaching the voice information of the voice signal after noise removal as the acquisition direction;
  • the recognition module 143 is configured to perform voice recognition using the fusion information to obtain a voice recognition result of the voice signal.
• When acquiring the fusion information of the speech signal and the image sequence, the speech recognition device takes the fusion information approaching the speech information of the denoised speech signal as the acquisition direction; that is, the obtained fusion information is close to the voice information of the noise-free voice signal, which reduces the interference of noise in the voice signal on speech recognition and thereby improves the speech recognition rate.
  • the functions of the feature extraction module 142 and the recognition module 143 can be implemented by a multi-modal speech recognition model, specifically:
• the feature extraction module 142 may be specifically used to: through the multi-modal speech recognition model, obtain the information fusing the speech signal and the image sequence as the fusion information, with the fusion information approaching the voice information of the noise-removed speech signal as the acquisition direction;
• the recognition module 143 may be specifically used to: perform voice recognition using the fusion information through the multi-modal voice recognition model, and obtain the voice recognition result of the voice signal.
• the feature extraction module 142 may be specifically configured to: with the fused feature approaching the voice information of the voice signal after noise removal as the acquisition direction, use the voice information extraction module of the multi-modal voice recognition model to extract voice information from the voice signal, use the image feature extraction module of the multi-modal voice recognition model to extract an image feature sequence from the image sequence, and use the feature fusion module of the multi-modal voice recognition model to fuse the voice information and the image feature sequence to obtain a fusion feature that merges the voice signal and the image sequence;
  • the recognition module 143 may be specifically configured to: use a recognition module of a multi-modal speech recognition model to perform speech recognition based on the fusion feature to obtain a speech recognition result of the speech signal.
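  • purely as an illustrative sketch (not the claimed implementation), the pipeline carried out by the feature extraction module 142 and the recognition module 143 could be organized as below, assuming PyTorch-style modules; the layer types, the 512-dimensional hidden features and the assumption that the audio features are already aligned to the video frame rate are illustrative choices, not requirements of the model.

```python
import torch
import torch.nn as nn

class MultiModalASR(nn.Module):
    """Sketch: voice information extraction, image feature extraction,
    feature fusion, and recognition (triphone classification)."""
    def __init__(self, audio_dim=40, feat_dim=512, num_phones=1000):
        super().__init__()
        self.voice_extractor = nn.GRU(audio_dim, feat_dim, batch_first=True)    # voice information extraction module
        self.image_extractor = nn.GRU(3 * 80 * 80, feat_dim, batch_first=True)  # image feature extraction module
        self.fusion = nn.Sequential(nn.Linear(2 * feat_dim, feat_dim), nn.ReLU())  # feature fusion module
        self.recognizer = nn.Linear(feat_dim, num_phones)                       # recognition module

    def forward(self, audio_feats, image_seq):
        # audio_feats: (B, T, audio_dim); image_seq: (B, T, 3, 80, 80), assumed time-aligned
        v, _ = self.voice_extractor(audio_feats)            # voice information
        i, _ = self.image_extractor(image_seq.flatten(2))   # image feature sequence
        fused = self.fusion(torch.cat([v, i], dim=-1))      # fusion feature, trained to approach denoised voice information
        return self.recognizer(fused), fused                # recognition logits and the fusion feature
```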
  • the feature extraction module 142 may specifically include:
  • the extraction module is configured to extract voice information from the voice signal using the voice information extraction module of the multi-modal speech recognition model, and to extract an image feature sequence from the image sequence using the image feature extraction module of the multi-modal speech recognition model, with the extraction direction that the feature obtained by fusing the extracted voice information and the extracted image feature sequence approaches the voice information obtained after removing noise from the voice signal;
  • the fusion module is configured to use the feature fusion module of the multi-modal speech recognition model to fuse the voice information and the image feature sequence, with the fusion direction of approaching the voice information obtained after removing noise from the voice signal, to obtain the fusion feature.
  • the voice information is of N types, and the N is a positive integer greater than or equal to 1.
  • when the extraction module extracts voice information from the voice signal using the voice information extraction module of the multi-modal speech recognition model, it is specifically used for:
  • using the voice information extraction module of the multi-modal speech recognition model to extract N types of voice information from the voice signal, with the extraction direction that the feature obtained by fusing the extracted N types of voice information and the image feature sequence extracted from the image sequence approaches one type of voice information obtained after removing noise from the voice signal; or,
  • if N is greater than 1, using the voice information extraction module of the multi-modal speech recognition model to extract N types of voice information from the voice signal, with the extraction direction that, for each type of extracted voice information, the feature obtained by fusing that type of voice information and the image feature sequence extracted from the image sequence approaches that type of voice information obtained after removing noise from the voice signal.
  • the voice information is an acoustic feature and/or a spectrogram feature
  • the fusion module may be specifically used for:
  • Fusion mode 1: use the feature fusion module of the multi-modal speech recognition model to fuse the acoustic feature and the image feature sequence, with the fusion direction of approaching the acoustic feature obtained by denoising the voice signal, to obtain the fusion feature corresponding to fusion mode 1;
  • Fusion mode 2: use the feature fusion module of the multi-modal speech recognition model to fuse the spectrogram feature and the image feature sequence, with the fusion direction of approaching the spectrogram feature obtained by denoising the voice signal, to obtain the fusion feature corresponding to fusion mode 2;
  • Fusion mode 3: use the feature fusion module of the multi-modal speech recognition model to fuse the acoustic feature, the spectrogram feature and the image feature sequence, with the fusion direction of approaching the acoustic feature or the spectrogram feature obtained by denoising the voice signal, to obtain the fusion feature corresponding to fusion mode 3 (a non-limiting sketch of these three fusion modes is given below).
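  • the following sketch illustrates, under assumed tensor shapes and with hypothetical fusion networks passed in as arguments, how the three fusion modes above could be organized; it is an illustration of the fusion pathways rather than a definitive implementation.

```python
import torch

def fuse_mode_one(acoustic, image_seq_feat, fusion_net):
    """Mode 1: fuse the acoustic feature with the image feature sequence."""
    return fusion_net(torch.cat([acoustic, image_seq_feat], dim=-1))

def fuse_mode_two(spectrogram, image_seq_feat, fusion_net):
    """Mode 2: fuse the spectrogram feature with the image feature sequence."""
    return fusion_net(torch.cat([spectrogram, image_seq_feat], dim=-1))

def fuse_mode_three(acoustic, spectrogram, image_seq_feat, inner_net, outer_net):
    """Mode 3: fuse all three; here the acoustic feature is fused with the image
    feature sequence first, and the spectrogram feature is then fused with that result."""
    first = inner_net(torch.cat([acoustic, image_seq_feat], dim=-1))
    return outer_net(torch.cat([spectrogram, first], dim=-1))
```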
  • the speech recognition device further includes a training module for:
  • obtaining, through the multi-modal speech recognition model, the noise-free voice information of the noise-free voice signal in a training sample and the noisy voice information of the noisy voice signal (which contains the noise-free voice signal) in the training sample; obtaining the sample image feature sequence of the sample image sequence in the training sample; fusing the noisy voice information and the sample image feature sequence to obtain the fusion feature of the training sample; and performing voice recognition using the fusion feature of the training sample to obtain the voice recognition result corresponding to the training sample;
  • updating the parameters of the multi-modal speech recognition model with the targets that the fusion feature of the training sample approaches the noise-free voice information and that the voice recognition result corresponding to the training sample approaches the sample label of the training sample.
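  • a minimal sketch of the training objective described above, assuming the first loss is an L2-type distance between the fusion feature and the noise-free voice information and the second loss is cross-entropy on triphone labels; the weights 0.2 and 0.8 follow the optional example given in the description and are not fixed values.

```python
import torch
import torch.nn.functional as F

def training_loss(fusion_feat, clean_voice_info, logits, labels, w1=0.2, w2=0.8):
    """Weighted sum of the two training targets:
    - first loss: the fusion feature approaches the noise-free voice information
      (feature-level denoising constraint, L2-type distance here);
    - second loss: the recognition result approaches the sample label (cross-entropy)."""
    first_loss = F.mse_loss(fusion_feat, clean_voice_info)
    second_loss = F.cross_entropy(logits.transpose(1, 2), labels)  # logits (B, T, C) -> (B, C, T); labels (B, T)
    return w1 * first_loss + w2 * second_loss
```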
  • the initial parameters of the acoustic feature extraction module are the parameters of the feature extraction module, used for acoustic feature extraction of a voice signal, in a speech recognition model trained with voice signals and their corresponding speech content as training data.
  • the initial parameters of the spectrogram feature extraction module are the parameters of the spectrogram feature extraction module, used for feature extraction of the spectrogram of a voice signal, in a speech separation model trained with voice signals and their corresponding spectrogram labels as training data.
  • the initial parameters of the image feature extraction module are the parameters of the image feature extraction module, used for feature extraction of an image sequence, in a lip-reading recognition model trained with image sequences and their corresponding pronunciation content as training data.
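  • the initialization strategy described above could be sketched as follows; the checkpoint paths, state-dict keys and module attribute names are assumptions for illustration, since the description does not prescribe a storage format.

```python
import torch

def init_from_pretrained(model, asr_ckpt, separation_ckpt, lipreading_ckpt):
    """Initialize the three extraction modules from separately pretrained models
    (hypothetical checkpoint layouts)."""
    model.acoustic_extractor.load_state_dict(
        torch.load(asr_ckpt)["feature_extractor"])              # from a trained speech recognition model
    model.spectrogram_extractor.load_state_dict(
        torch.load(separation_ckpt)["spectrogram_extractor"])   # from a trained speech separation model
    model.image_extractor.load_state_dict(
        torch.load(lipreading_ckpt)["image_feature_extractor"]) # from a trained lip-reading model
    return model
```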
  • the training sample set used for training the multi-modal speech recognition model includes training samples of different languages, and the proportion of training samples of each language in the training sample set is randomly determined or is a preset proportion.
  • the speech recognition apparatus provided in the embodiments of the present application can be applied to speech recognition devices such as PC terminals, cloud platforms, servers and server clusters. Optionally, FIG. 15 shows a block diagram of the hardware structure of the speech recognition device.
  • the hardware structure of the speech recognition device may include: at least one processor 1, at least one communication interface 2, at least one memory 3, and at least one communication bus 4;
  • the number of the processor 1, the communication interface 2, the memory 3, and the communication bus 4 is at least one, and the processor 1, the communication interface 2, and the memory 3 communicate with each other through the communication bus 4;
  • the processor 1 may be a central processing unit CPU, or a specific integrated circuit ASIC (Application Specific Integrated Circuit), or one or more integrated circuits configured to implement the embodiments of the present invention, etc.;
  • the memory 3 may include a high-speed RAM memory, and may also include a non-volatile memory, for example, at least one disk memory;
  • the memory stores a program
  • the processor can call the program stored in the memory, and the program is used for:
  • acquiring a voice signal and an image sequence collected synchronously with the voice signal, the images in the image sequence being images of a region related to lip movement; acquiring information fusing the voice signal and the image sequence as fusion information, with the acquisition direction of approaching the voice information obtained after removing noise from the voice signal; and performing voice recognition using the fusion information to obtain the voice recognition result of the voice signal.
  • the embodiments of the present application also provide a storage medium, the storage medium may store a program suitable for execution by a processor, and the program is used for:
  • acquiring a voice signal and an image sequence collected synchronously with the voice signal, the images in the image sequence being images of a region related to lip movement; acquiring information fusing the voice signal and the image sequence as fusion information, with the acquisition direction of approaching the voice information obtained after removing noise from the voice signal; and performing voice recognition using the fusion information to obtain the voice recognition result of the voice signal.
  • in the several embodiments provided in this application, it should be understood that the disclosed system, device, and method may be implemented in other ways.
  • the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, devices or units, and may be in electrical, mechanical or other forms.
  • the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they may be located in one place, or they may be distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
  • the functional units in the various embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
  • if the function is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium.
  • the technical solution of the present application, in essence, or the part that contributes to the prior art, or a part of the technical solution, can be embodied in the form of a software product; the computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of the present application.
  • the aforementioned storage media include: USB flash drives, mobile hard disks, read-only memory (ROM, Read-Only Memory), random access memory (RAM, Random Access Memory), magnetic disks, optical disks, and other media that can store program code.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Image Analysis (AREA)

Abstract

A speech recognition method, apparatus, device and storage medium. After a voice signal and an image sequence collected synchronously with the voice signal are acquired (S11), information fusing the voice signal and the image sequence is acquired as fusion information, with the acquisition direction of approaching the voice information obtained after removing noise from the voice signal (S12); voice recognition is performed using the fusion information to obtain a voice recognition result of the voice signal (S13). In this speech recognition solution, when the fusion feature of the voice signal and the image sequence is acquired, the acquisition direction is that the fusion information approaches the voice information obtained by denoising the voice signal; that is, the obtained fusion information approaches the voice information of a noise-free voice signal, which reduces the interference of noise in the voice signal on voice recognition and thereby improves the voice recognition rate.

Description

语音识别方法、装置、设备及存储介质
本申请要求于2020年02月28日提交中国专利局、申请号为202010129952.9、发明创造名称为“语音识别方法、装置、设备及存储介质”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请涉及自然语言处理技术领域,更具体地说,涉及一种语音识别方法、装置、设备及存储介质。
背景技术
传统的语音识别技术是单语音识别,即通过仅对语音信号进行处理得到识别结果,这种语音识别方法在语音清晰的环境下已经能够达到很高的识别效果。然而,在一些高噪声,远场的环境下,传统的语音识别技术的识别率会迅速下降。为了提高语音识别率,有方案提出借助唇部动作视频协助进行语音识别的多模态语音识别方法,在一定程度上提高了高噪声场景下语音的识别率。
然而,现有的多模态语音识别方法是利用唇部动作视频进行唇语识别,然后根据唇语识别结果和单语音识别结果准确度确定最终的语音识别结果,其语音识别效果仍然较低。
因此,如何提高多模态语音识别方法的识别率成为亟待解决的技术问题。
发明内容
有鉴于此,本申请提供了一种语音识别方法、装置、设备及存储介质,以提高多模态语音识别方法的识别率。
为了实现上述目的,现提出的方案如下:
一种语音识别方法,包括:
获取语音信号和与所述语音信号同步采集的图像序列;所述图像序列中的图像为唇动相关区域的图像;
以趋近于对所述语音信号去除噪声后的语音信息为获取方向,获取融合所述语音信号和所述图像序列的信息,作为融合信息;
利用所述融合信息进行语音识别,得到所述语音信号的语音识别结果。
一种语音识别装置,包括:
获取模块,用于获取语音信号和与所述语音信号同步采集的图像序列;所述图像序列 中的图像为唇动相关区域的图像;
特征提取模块,用于以趋近于对所述语音信号去除噪声后的语音信息为获取方向,获取融合所述语音信号和所述图像序列的信息,作为融合信息;
识别模块,用于利用所述融合信息进行语音识别,得到所述语音信号的语音识别结果。
一种语音识别设备,包括存储器和处理器;
所述存储器,用于存储程序;
所述处理器,用于执行所述程序,实现如上任一项所述的语音识别方法的各个步骤。
一种计算机可读存储介质,其上存储有计算机程序,所述计算机程序被处理器执行时,实现如上任一项所述的语音识别方法的各个步骤。
从上述的技术方案可以看出,本申请实施例提供的语音识别方法、装置、设备及存储介质,在获取语音信号和与语音信号同步采集的图像序列后,以趋近于对语音信号去除噪声后的语音信息为获取方向,获取融合语音信号和图像序列的信息,作为融合信息;利用融合信息进行语音识别,得到语音信号的语音识别结果。本申请实施例提供的语音识别方案,在获取语音信号和图像序列的融合特征时,是以融合信息趋近于对语音信号去噪后的语音信息为获取方向的,即所获得到的融合信息趋近于无噪声语音信号的语音信息,降低了语音信号中的噪声对语音识别的干扰,从而提高语音识别率。
附图说明
为了更清楚地说明本申请实施例或现有技术中的技术方案,下面将对实施例或现有技术描述中所需要使用的附图作简单地介绍,显而易见地,下面描述中的附图仅仅是本申请的实施例,对于本领域普通技术人员来讲,在不付出创造性劳动的前提下,还可以根据提供的附图获得其他的附图。
图1为本申请实施例公开的语音识别方法的一种实现流程图;
图2为本申请实施例公开的多模态语音识别模型的一种结构示意图;
图3为本申请实施例公开的融合特征获取模块的一种结构示意图;
图4a为本申请实施例公开的对多模态语音识别模型进行训练的一种架构示意图;
图4b为本申请实施例公开的对多模态语音识别模型进行训练的一种实现流程图;
图5a为本申请实施例公开的对多模态语音识别模型进行训练的一种架构示意图;
图5b为本申请实施例公开的对多模态语音识别模型进行训练的一种实现流程图;
图6a为本申请实施例公开的对多模态语音识别模型进行训练的一种架构示意图;
图6b为本申请实施例公开的对多模态语音识别模型进行训练的一种实现流程图;
图7a为本申请实施例公开的对多模态语音识别模型进行训练的一种架构示意图;
图7b为本申请实施例公开的对多模态语音识别模型进行训练的一种实现流程图;
图8a为本申请实施例公开的对多模态语音识别模型进行训练的另一种架构示意图;
图8b为本申请实施例公开的对多模态语音识别模型进行训练的另一种实现流程图;
图9a为本申请实施例公开的对多模态语音识别模型进行训练的又一种架构示意图;
图9b为本申请实施例公开的对多模态语音识别模型进行训练的又一种实现流程图;
图10a为本申请实施例公开的对多模态语音识别模型进行训练的又一种架构示意图;
图10b为本申请实施例公开的对多模态语音识别模型进行训练的又一种实现流程图;
图11a为本申请实施例公开的对多模态语音识别模型进行训练的又一种架构示意图;
图11b为本申请实施例公开的对多模态语音识别模型进行训练的又一种实现流程图;
图12a为本申请实施例公开的对多模态语音识别模型进行训练的又一种架构示意图;
图12b为本申请实施例公开的对多模态语音识别模型进行训练的又一种实现流程图;
图13a为本申请实施例公开的对多模态语音识别模型进行训练的又一种架构示意图;
图13b为本申请实施例公开的对多模态语音识别模型进行训练的又一种实现流程图;
图14为本申请实施例公开的语音识别装置的一种结构示意图;
图15为本申请实施例公开的语音识别设备的硬件结构框图。
具体实施方式
下面将结合本申请实施例中的附图,对本申请实施例中的技术方案进行清楚、完整地描述,显然,所描述的实施例仅仅是本申请一部分实施例,而不是全部的实施例。基于本申请中的实施例,本领域普通技术人员在没有做出创造性劳动前提下所获得的所有其他实施例,都属于本申请保护的范围。
本申请的发明人研究发现,目前的借助唇部动作视频协助进行语音识别的多模态语音识别方法,是利用唇语识别结果的准确度和单语音识别结果的准确度对比,将准确度高的结果作为最终的语音识别结果,从而在一定程度上提高语音识别率。但是,该多模特语音识别方法的实质是唇语识别结果对语音识别结果的修正能力,其并没有发掘视频信号对高噪声语音信号的修正能力,因而难以获得高质量的识别效果。
为了提高高噪声场景下的语音识别效果,本申请的基本思想是,把降噪的思想显式的加入到多模态语音识别任务中,从而能更好的提取视频信息对语音信息的修正作用,达到更好的识别效果。
基于上述基本思想,本申请实施例提供的语音识别方法的一种实现流程图如图1所示,可以包括:
步骤S11:获取语音信号和与语音信号同步采集的图像序列;该图像序列中的图像为唇动相关区域的图像。
本申请实施例中,在采集讲话者的语音信号的同时,还采集该讲话者的脸部视频。上述图像序列即为对讲话者的脸部视频中的各帧图像裁剪唇动相关区域得到的图像序列。比如,可以在脸部视频的各帧图像中,以嘴部中心点为中心,取固定大小(比如,80×80)的区域作为目标图像序列。
其中,唇动相关区域可以是指仅唇部区域;或者,
唇动相关区域可以是唇部及其周围区域,比如,唇部和下巴区域;或者,
唇动相关区域可以是整个脸部区域。
步骤S12:以趋近于对语音信号去除噪声后的语音信息为获取方向,获取融合语音信号和图像序列的信息,作为融合信息。
对语音信号去除噪声后的语音信息可以是指:从对语音信号进行去噪处理得到的降噪语音信号中提取的信息。
本申请实施例中,通过融合语音信号和图像序列,得到趋近于降噪语音信号中的语音信息的融合信息,相当于对语音信号进行了降噪处理。
步骤S13:利用融合信息进行语音识别,得到语音信号的语音识别结果。
由于融合信息趋近于降噪后的语音信号中的语音信息,因此,利用融合信息进行语音识别降低了语音信号中的噪声对语音识别的干扰,从而提高语音识别的准确率。
在一可选的实施例中,可以利用多模态语音识别模型获取融合信息,并利用融合信息进行语音识别,得到语音信号的语音识别结果。具体的,
可以利用多模态语音识别模型处理语音信号和图像序列,得到多模态语音识别模型输出的语音识别结果;
其中,多模态语音识别模型具备以趋近于对语音信号去除噪声后的信息为获取方向, 获取融合语音信号和图像序列的信息,作为融合信息;利用该融合信息进行语音识别,得到语音信号的语音识别结果的能力。
如图2所示,为本申请实施例提供的多模态语音识别模型的一种结构示意图,可以包括:
融合特征获取模块21和识别模块22;其中,
融合特征获取模块21用于以趋近于对语音信号去除噪声后的语音信息为获取方向,获取融合语音信号和图像序列的融合特征。
识别模块22用于基于融合特征获取模块21获取的融合特征进行语音识别,得到语音信号的语音识别结果。
基于图2所示多模态语音识别模型,前述利用多模态语音识别模型处理语音信号和图像序列,得到多模态语音识别模型输出的语音识别结果的具体实现过程可以为:
利用多模态语音识别模型的融合特征获取模块21,以趋近于对语音信号去除噪声后的语音信息为获取方向,获取融合语音信号和图像序列的融合特征;
利用多模态语音识别模型的识别模块22,基于融合特征获取模块21获取的融合特征进行语音识别,得到语音信号的语音识别结果。
在一可选的实施例中,融合特征获取模块21的一种结构示意图如图3所示,可以包括:
语音信息提取模块31,图像特征提取模块32和特征融合模块33;其中,
语音信息提取模块31用于以对语音信号提取的语音信息与图像特征提取模块32对图像序列提取的图像特征序列融合后的特征趋近于对语音信号去除噪声后的语音信息为提取方向,从语音信号中提取语音信息。
本申请实施例中,语音信息提取模块31在从语音信号中提取语音信息时,以从语音信号中提取的语音信息与图像特征提取模块32从图像序列中提取的图像特征序列融合后的特征趋近于对语音信号去噪后的语音信息为提取方向,从语音信号中提取语音信息。
图像特征提取模块32用于以对图像序列提取的图像特征序列与语音信息提取模块31对语音信号提取的语音信息融合后的特征趋近于对语音信号去除噪声后的语音信息为提取方向,从图像序列中提取图像特征序列。
本申请实施例中,图像特征提取模块32在从图像序列中提取图像特征序列时,以从图像序列中提取的图像特征序列与语音信息提取模块31从语音信号中提取的语音信息融合 后的特征趋近于对语音信号去噪后的语音信息为提取方向,从图像序列中提取图像特征序列。
特征融合模块33用于以趋近于对语音信号去除噪声后的语音信息为融合方向,对提取的语音信息和图像特征序列进行融合,得到融合特征。
本申请实施例中,特征融合模块33在对语音信号和图像特征序列进行融合时,以融合特征趋近于对语音信号去除噪声后的语音信息为融合方向,对提取的语音信号和图像特征序列进行融合。
本申请实施例中,不管是进行语音信息提取,还是进行图像特征提取,还是对提取的语音信息和图像特征序列进行融合,均以提取的语音信息和图像特征序列融合后的特征趋近于对语音信号去除噪声后的语音信息为方向而执行。
基于上述融合特征获取模块21的结构,上述利用融合特征获取模块21以趋近于对语音信号去除噪声后的语音信息为获取方向,获取融合语音信号和图像序列的融合特征的一种实现方式可以为:
以趋近于对语音信号去除噪声后的语音信息为获取方向,利用语音信息提取模块31从语音信号中提取语音信息,利用图像特征提取模块32从图像序列中提取图像特征序列;利用特征融合模块33对语音信息提取模块31提取的语音信息和图像特征提取模块32提取的图像特征序列进行融合,获取融合语音信号和图像序列的融合特征。具体可以为:
以对语音信号提取的语音信息与从图像序列中提取的图像特征序列融合后的特征趋近于对语音信号去除噪声后的语音信息为提取方向,利用语音信息提取模块31从语音信号中提取语音信息,利用图像特征提取模块32从图像序列中提取图像特征序列。
以趋近于对语音信号去除噪声后的语音信息为融合方向,利用特征融合模块33,对提取的语音信息和图像特征序列进行融合,得到融合特征。
在一可选的实施例中,从语音信号中提取的语音信息可以为N种,N为大于或等于1的正整数。则上述利用语音信息提取模块31从语音信号中提取语音信息的过程可以包括如下两种提取方式中的任意一种:
提取方式一:利用语音信息提取模块31,以提取的N种语音信息与图像特征提取模块32从图像序列中提取的图像特征序列融合后的特征趋近于对语音信号去除噪声后的一种语音信息为提取方向,从语音信号中提取N种语音信息。
该提取方式一中,不管语音信息提取模块31提取的语音信息为几种,均以融合后的特征趋近于对语音信号去除噪声后的一种语音信息为提取方向。具体的,
若从语音信号中提取的语音信息为一种(为便于叙述记为目标种类),则提取方式一的具体实现方式可以为:
利用语音信息提取模块31以提取的该目标种类的语音信息与图像特征提取模块32从图像序列中提取的图像特征序列融合后的特征趋近于对语音信号去除噪声后的该目标种类的语音信息为提取方向,从语音信号中提取该目标种类的语音信息。
若从语音信号中提取的语音信息为至少两种,即N大于1,则提取方式一的具体实现方式可以为:
利用语音信息提取模块31以提取的N种语音信息与从图像序列中提取的图像特征序列融合后的特征趋近于对语音信号去除噪声后的其中一种语音信息为提取方向,从语音信号中提取N种语音信息。
本申请实施例中,虽然需要提取至少两种语音信息,但在提取该至少两种语音信息时,是以其中一种语音信息(去噪后的)为提取方向进行提取的。比如,假设提取的语音信息为两种,分别为A类语音信息和B类语音信息,则本申请实施例中,
可以利用语音信息提取模块31以提取的A类语音信息和B类语音信息与图像特征提取模块32从图像序列中提取的图像特征序列融合后的特征趋近于对语音信号去除噪声后的A类语音信息为提取方向,从语音信号中提取A类语音信息和B类语音信息。
或者,
可以利用语音信息提取模块31以提取的A类语音信息和B类语音信息与图像特征提取模块32从图像序列中提取的图像特征序列融合后的特征趋近于对语音信号去除噪声后的B类语音信息为提取方向,从语音信号中提取A类语音信息和B类语音信息。
提取方式二:若N大于1,则利用语音信息提取模块31以提取的每一种语音信息与图像特征提取模块32从图像序列中提取的图像特征序列融合后的特征趋近于对语音信号去除噪声后的该种语音信息为提取方向,从语音信号中提取N种语音信息。
该提取方式二中,对于每一种语音信息,以语音信息提取模块31提取的该种语音信息与图像特征序列融合后的特征趋近于对语音信号去除噪声后的该种语音信息为提取方向,从语音信号中提取N种语音信息。其中,
该种语音信息与图像特征序列融合包括:该种语音信息仅与图像特征序列融合。或者, 将该种语音信息,以及,图像特征序列和提取的其它种语音信息的融合特征进行融合。
在一可选的实施例中,从语音信号中提取的语音信息可以为仅为声学特征(比如,fbank特征,或者,Mel频率倒谱系数MFCC特征),或者,可以仅为频谱图特征,或者,可以包括声学特征和频谱图特征。
上述以趋近于对语音信号去除噪声后的语音信息为融合方向,利用特征融合模块33对语音信息和图像特征序列进行融合,获取融合语音信号和图像序列的融合特征的过程可以包括:
根据如下三种融合方式中的任意一种或任意两种的组合得到的融合特征获取融合语音信号和图像序列的融合特征:
融合方式一:利用特征融合模块33,以趋近于对语音信号去噪后的声学特征为融合方向,对提取的声学特征和图像特征序列进行融合,得到融合方式一对应的融合特征;
融合方式二:利用特征融合模块33,以趋近于对语音信号去噪后的频谱图特征为融合方向,对提取的频谱图特征和图像特征序列进行融合,得到融合方式二对应的融合特征;
融合方式三:利用特征融合模块33,以趋近于对语音信号去噪后的声学特征或频谱图特征为融合方向,对提取的声学特征、频谱图特征和图像特征序列进行融合,得到融合方式三对应的融合特征。
当根据上述任意一种融合方式得到的融合特征获取融合语音信号和图像序列的融合特征时,该种融合方式对应的融合特征即为融合语音信号和图像序列的融合特征。比如,若根据融合方式一得到的融合特征获取融合语音信号和图像序列的融合特征,则上述融合方式一对应的融合特征即为融合语音信号和图像序列的融合特征;若根据融合方式二得到的融合特征获取融合语音信号和图像序列的融合特征,则上述融合方式二对应的融合特征即为融合语音信号和图像序列的融合特征;同理,若根据融合方式三得到的融合特征获取融合语音信号和图像序列的融合特征,则上述融合方式三对应的融合特征即为融合语音信号和图像序列的融合特征。
当根据融合方式一和融合方式二得到的融合特征获取融合语音信号和图像序列的融合特征时,将融合方式一对应的融合特征和融合方式二对应的融合特征进行融合,得到融合语音信号和图像序列的融合特征;
当根据融合方式一和融合方式三得到的融合特征获取融合语音信号和图像序列的融合特征,或者,根据融合方式二和融合方式三得到的融合特征获取融合语音信号和图像序列 的融合特征时,融合方式三对应的融合特征即为融合语音信号和图像序列的融合特征。
下面以语音信息为声学特征和/或频谱图特征为例对提取语音信息和获取融合语音信号和图像序列的融合特征的过程进行解释说明。
可选的,若上述目标种类的语音信息为声学特征,则利用语音信息提取模块31从语音信号中提取目标种类的语音信息时,具体可以用于:
利用语音信息提取模块31以从语音信号中提取的声学特征与图像特征提取模块32从图像序列中提取的图像特征序列融合后的特征趋近于对语音信号去除噪声后的声学特征为提取方向,从语音信号中提取声学特征。可以利用语音信息提取模块31的声学特征提取模块从语音信号中提取声学特征。也就是说,本申请实施例中,语音信息提取模块31包括声学特征提取模块,用于以从语音信号中提取的声学特征与图像特征提取模块32从图像序列中提取的图像特征序列融合后的特征趋近于对语音信号去除噪声后的声学特征为提取方向,从语音信号中提取声学特征。
该示例中,输入多模态语音识别模型的语音信号可以是通过滑窗从原始语音信号(即音频采集装置采集的语音信号)中提取的声学特征(为便于叙述,记为初始声学特征),语音信息提取模块31从语音信号中提取的声学特征可以为初始声学特征的隐层特征。通过滑窗从原始语音信号中提取初始声学特征的具体实现过程可以参看已有的方案,这里不再详述。
相应的,上述以趋近于对语音信号去除噪声后的语音信息为融合方向,利用特征融合模块33,对语音信息和图像特征序列进行融合,得到融合特征的一种实现方式可以为:
利用特征融合模块33以趋近于对语音信号去除噪声后的声学特征为融合方向,对提取的声学特征和图像特征序列进行融合,得到融合语音信号和图像序列的融合特征。
可选的,若上述目标种类的语音信息为频谱图特征,则利用语音信息提取模块31从语音信号中提取目标种类的语音信息时,具体可以包括:
利用语音信息提取模块31以从语音信号中提取的频谱图特征与图像特征提取模块32从图像序列中提取的图像特征序列融合后的特征趋近于对语音信号去除噪声后的频谱图特征为提取方向,从语音信号中提取频谱图特征。可以利用语音信息提取模块31的频谱图特征提取模块从语音信号中提取频谱图特征。也就是说,本申请实施例中,语音信息提取模 块31包括频谱图特征提取模块,用于以从语音信号中提取的频谱图特征与图像特征提取模块32从图像序列中提取的图像特征序列融合后的特征趋近于对语音信号去除噪声后的频谱图特征为提取方向,从语音信号中提取频谱图特征。
该示例中,输入多模态语音识别模型的语音信号可以是通过对原始语音信号进行短时傅里叶变换得到的频谱图,语音信息提取模块31从语音信号中提取的频谱图特征可以为频谱图的隐层特征。
相应的,上述以趋近于对语音信号去除噪声后的语音信息为融合方向,利用特征融合模块33,对语音信息和图像特征序列进行融合,得到融合特征的另一种实现方式可以为:
利用特征融合模块33以趋近于对语音信号去除噪声后的频谱图特征为融合方向,对提取的频谱图特征和图像特征序列进行融合,得到融合语音信号和图像序列的融合特征。
可选的,若从语音信号中提取两种语音信息,分别为声学特征和频谱图特征,则利用语音信息提取模块31从语音信号中提取两种语音信息的一种实现方式可以为:
利用语音信息提取模块31以提取的频谱图特征、声学特征,以及图像特征提取模块32从图像序列中提取的图像特征序列融合后的特征趋近于对语音信号去除噪声后的频谱图特征为提取方向,从语音信号中提取频谱图特征和声学特征。可选的,可以利用语音信息提取模块31的声学特征提取模块从语音信号中提取声学特征,利用语音信息提取模块31的频谱图特征提取模块从语音信号中提取频谱图特征。也就是说,本申请实施例中,语音信息提取模块31包括声学特征提取模块和频谱图特征提取模块,其中,声学特征提取模块用于以从语音信号中提取的声学特征,频谱图特征提取模块从语音信号中提取的频谱图特征,以及图像特征提取模块32从图像序列中提取的图像特征序列融合后的特征趋近于对语音信号去除噪声后的频谱图特征为提取方向,从语音信号中提取声学特征;频谱图特征提取模块用于以从语音信号中提取的频谱图特征,声学特征提取模块从语音信号中提取的声学特征,以及图像特征提取模块32从图像序列中提取的图像特征序列融合后的特征趋近于对语音信号去除噪声后的频谱图特征为提取方向,从语音信号中提取频谱图特征。
相应的,上述以趋近于对语音信号去除噪声后的语音信息为融合方向,利用特征融合模块33,对语音信息和图像特征序列进行融合,得到融合特征的又一种实现方式可以为:
利用特征融合模块33的三号特征融合模块,对提取的声学特征和图像特征序列进行融合,得到第一融合特征;
利用特征融合模块33的五号特征融合模块,以趋近于对语音信号去除噪声后的频谱图特征为融合方向,对频谱图特征和第一融合特征进行融合,得到融合语音信号和图像序列的融合特征。
可选的,若从语音信号中提取两种语音信息,分别为声学特征和频谱图特征,则利用语音信息提取模块31从语音信号中提取两种语音信息的另一种实现方式可以为:
利用语音信息提取模块31以提取的频谱图特征、声学特征,以及图像特征提取模块32从图像序列中提取的图像特征序列融合后的特征趋近于对语音信号去除噪声后的声学特征为提取方向,从语音信号中提取频谱图特征和声学特征。可选的,可以利用语音信息提取模块31的声学特征提取模块从语音信号中提取声学特征,利用语音信息提取模块31的频谱图特征提取模块从语音信号中提取频谱图特征。也就是说,本申请实施例中,语音信息提取模块31包括声学特征提取模块和频谱图特征提取模块,其中,
声学特征提取模块用于以提取的声学特征,频谱图提取模块从语音信号中提取的频谱图特征,以及图像特征提取模块从图像序列中提取的图像特征序列融合后的特征趋近于对语音信号去除噪声后的声学特征为提取方向,从语音信号中提取声学特征;
频谱图特征提取模块用于以提取的频谱图特征,声学特征提取模块从语音信号中提取的声学特征,以及图像特征提取模块从图像序列中提取的图像特征序列融合后的特征趋近于对语音信号去除噪声后的声学特征为提取方向,从语音信号中提取频谱图特征。
相应的,上述以趋近于对语音信号去除噪声后的语音信息为融合方向,利用特征融合模块33,对提取的语音信息和图像特征序列进行融合,得到融合特征的一种实现方式可以为:
利用特征融合模块33的一号特征融合模块,对提取的频谱图特征和图像特征序列进行融合,得到第二融合特征;
利用特征融合模块33的二号特征融合模块,以趋近于对语音信号去除噪声后的声学特征为融合方向,对提取的声学特征和第二融合特征进行融合,得到融合语音信号和图像序列的融合特征。
可选的,若从语音信号中提取两种语音信息,分别为声学特征和频谱图特征,则利用语音信息提取模块31从语音信号中提取两种语音信息的又一种实现方式可以为:
利用语音信息提取模块31以提取的频谱图特征、声学特征,以及图像特征提取模块32从图像序列中提取的图像特征序列融合后的特征趋近于对语音信号去除噪声后的声学特征,以及提取的声学特征和图像特征提取模块32从图像序列中提取的图像特征序列融合后的特征趋近于去除噪声后的声学特征为提取方向,从语音信号中提取频谱图特征和声学特征。可选的,可以利用语音信息提取模块31的声学特征提取模块从语音信号中提取声学特征,利用语音信息提取模块31的频谱图特征提取模块从语音信号中提取频谱图特征。也就是说,本申请实施例中,语音信息提取模块31包括声学特征提取模块和频谱图特征提取模块,其中,
声学特征提取模块用于以提取的声学特征,频谱图特征提取模块从语音信号中提取的频谱图特征,以及图像特征提取模块32从图像序列中提取的图像特征序列融合后的特征趋近于对语音信号去除噪声后的声学特征,以及提取的声学特征和图像特征提取模块32从图像序列中提取的图像特征序列融合后的特征趋近于去除噪声后的声学特征为提取方向,从语音信号中提取声学特征;
频谱图特征提取模块用于以提取的频谱图特征,声学特征提取模块从语音信号中提取的声学特征,以及图像特征提取模块32从图像序列中提取的图像特征序列融合后的特征趋近于对语音信号去除噪声后的声学特征为提取方向,从语音信号中提取频谱图特征。
相应的,上述以趋近于对语音信号去除噪声后的语音信息为融合方向,利用特征融合模块33,对语音信息和图像特征序列进行融合,得到融合特征的又一种实现方式可以为:
利用特征融合模块33的一号特征融合模块,对提取的频谱图特征和图像特征序列进行融合,得到第二融合特征;
利用特征融合模块33的二号特征融合模块,以趋近于对语音信号去除噪声后的声学特征为融合方向,对声学特征和第二融合特征进行融合,得到融合语音信号和图像序列的融合特征。
可选的,若从语音信号中提取两种语音信息,分别为声学特征和频谱图特征,则利用语音信息提取模块31从语音信号中提取两种语音信息的又一种实现方式可以为:
利用语音信息提取模块31以提取的频谱图特征、声学特征,以及图像特征提取模块32从图像序列中提取的图像特征序列融合后的特征趋近于对语音信号去除噪声后的频谱图特征,以及提取的频谱图特征和从图像序列中提取的图像特征序列融合后的特征趋近于去 除噪声后的频谱图特征为提取方向,从语音信号中提取频谱图特征和声学特征。可选的,可以利用语音信息提取模块31的声学特征提取模块从语音信号中提取声学特征,利用语音信息提取模块31的频谱图特征提取模块从语音信号中提取频谱图特征。也就是说,本申请实施例中,语音信息提取模块31包括声学特征提取模块和频谱图特征提取模块,其中,
声学特征提取模块用于以提取的声学特征,频谱图特征提取模块从语音信号中提取的频谱图特征,以及图像特征提取模块32从图像序列中提取的图像特征序列融合后的特征趋近于对语音信号去除噪声后的频谱图特征为提取方向,从语音信号中提取声学特征;
频谱图特征提取模块用于以提取的频谱图特征,声学特征提取模块从语音信号中提取的声学特征,以及图像特征提取模块32从图像序列中提取的图像特征序列融合后的特征趋近于对语音信号去除噪声后的频谱图特征,以及提取的频谱图特征和图像特征提取模块32从图像序列中提取的图像特征序列融合后的特征趋近于去除噪声后的频谱图特征为提取方向,从语音信号中提取频谱图特征。
相应的,上述以趋近于对语音信号去除噪声后的语音信息为融合方向,利用特征融合模块33,对语音信息和图像特征序列进行融合,得到融合特征的又一种实现方式可以为:
利用特征融合模块33的三号特征融合模块,对提取的声学特征和图像特征序列进行融合,得到第一融合特征;
利用特征融合模块33的五号特征融合模块,以趋近于对语音信号去除噪声后的频谱图特征为融合方向,对提取的频谱图特征和第一融合特征进行融合,得到融合语音信号和图像序列的融合特征。
可选的,若从语音信号中提取两种语音信息,分别为声学特征和频谱图特征,则利用语音信息提取模块31从语音信号中提取两种语音信息的又一种实现方式可以为:
利用语音信息提取模块31以提取的频谱图特征和图像特征提取模块32从图像序列中提取的图像特征序列融合后的特征趋近于对语音信号去除噪声后的频谱图特征,以及提取的声学特征和图像特征提取模块32从图像序列中提取的图像特征序列融合后的特征趋近于去除噪声后的声学特征为提取方向,从语音信号中提取频谱图特征和声学特征。可选的,可以利用语音信息提取模块31的声学特征提取模块从语音信号中提取声学特征,利用语音信息提取模块31的频谱图特征提取模块从语音信号中提取频谱图特征。也就是说,本申请实施例中,语音信息提取模块31包括声学特征提取模块和频谱图特征提取模块,其中,
声学特征提取模块用于以提取的声学特征和图像特征提取模块32从图像序列中提取的图像特征序列融合后的特征趋近于去除噪声后的声学特征为提取方向,从语音信号中提取声学特征;
频谱图特征提取模块用于以提取的频谱图特征和图像特征提取模块32从图像序列中提取的图像特征序列融合后的特征趋近于去除噪声后的频谱图特征为提取方向,从语音信号中提取频谱图特征。
相应的,上述以趋近于对语音信号去除噪声后的语音信息为融合方向,利用特征融合模块33,对语音信息和图像特征序列进行融合,得到融合特征的另一种实现方式可以为:
利用特征融合模块33的三号特征融合模块,以趋近于对语音信号去除噪声后的声学特征为融合方向,对声学特征和图像特征序列进行融合,得到第一融合特征;
利用特征融合模块33的一号特征融合模块,以趋近于对语音信号去除噪声后的频谱图特征为融合方向,对频谱图特征和图像特征序列进行融合,得到第二融合特征;
利用特征融合模块33的四号特征融合模块,将第一融合特征和第二融合特征进行融合,得到融合语音信号和图像序列的融合特征。
可选的,若从语音信号中提取两种语音信息,分别为声学特征和频谱图特征,则利用语音信息提取模块31从语音信号中提取两种语音信息的又一种实现方式可以为:
利用语音信息提取模块31以提取的声学特征和图像特征提取模块32从图像序列中提取的图像特征序列融合后的特征趋近于去除噪声后的声学特征,以及提取的频谱图特征、声学特征和图像特征提取模块32从图像序列中提取的图像特征序列融合后的特征趋近于对语音信号去除噪声后的频谱图特征为提取方向,从语音信号中提取频谱图特征和声学特征。可选的,可以利用语音信息提取模块31的声学特征提取模块从语音信号中提取声学特征,利用语音信息提取模块31的频谱图特征提取模块从语音信号中提取频谱图特征。也就是说,本申请实施例中,语音信息提取模块31包括声学特征提取模块和频谱图特征提取模块,其中,
声学特征提取模块用于以提取的声学特征,频谱图特征提取模块从语音信号中提取的频谱图特征,以及图像特征提取模块32从图像序列中提取的图像特征序列融合后的特征趋近于对语音信号去除噪声后的频谱图特征,提取的声学特征与图像特征提取模块32提取的图像特征序列融合后的特征趋近于去除噪声后的声学特征为提取方向,从语音信号中提取 声学特征;
频谱图特征提取模块用于以提取的频谱图特征,声学特征提取模块从语音信号中提取的声学特征,以及图像特征提取模块32从图像序列中提取的图像特征序列融合后的特征趋近于对语音信号去除噪声后的频谱图特征为提取方向,从语音信号中提取频谱图特征。
相应的,上述以趋近于对语音信号去除噪声后的语音信息为融合方向,利用特征融合模块33,对语音信息和图像特征序列进行融合,得到融合特征的又一种实现方式可以为:
利用特征融合模块33的三号特征融合模块,以趋近于对语音信号去除噪声后的声学特征为融合方向,对声学特征和图像特征序列进行融合,得到第一融合特征;
利用特征融合模块33的五号特征融合模块,以趋近于对语音信号去除噪声后的频谱图特征为融合方向,对频谱图特征和三号特征融合模块得到的第一融合特征进行融合,得到融合语音信号和图像序列的融合特征。
可选的,若从语音信号中提取两种语音信息,分别为声学特征和频谱图特征,则利用语音信息提取模块31从语音信号中提取两种语音信息的又一种实现方式可以为:
利用语音信息提取模块31以提取的频谱图特征和从图像序列中提取的图像特征序列融合后的特征趋近于去除噪声后的频谱图特征,以及提取的频谱图特征、声学特征和图像特征提取模块32从图像序列中提取的图像特征序列融合后的特征趋近于对语音信号去除噪声后的声学特征为提取方向,从语音信号中提取频谱图特征和声学特征。可选的,可以利用语音信息提取模块31的声学特征提取模块从语音信号中提取声学特征,利用语音信息提取模块31的频谱图特征提取模块从语音信号中提取频谱图特征。也就是说,本申请实施例中,语音信息提取模块31包括声学特征提取模块和频谱图特征提取模块,其中,
声学特征提取模块用于以提取的声学特征,频谱图特征提取模块从语音信号中提取的频谱图特征,以及图像特征提取模块32从图像序列中提取的图像特征序列融合后的特征趋近于对语音信号去除噪声后的声学特征为提取方向,从语音信号中提取声学特征;
频谱图特征提取模块用于以提取的频谱图特征,声学特征提取模块从语音信号中提取的声学特征,以及图像特征提取模块32从图像序列中提取的图像特征序列融合后的特征趋近于对语音信号去除噪声后的声学特征,提取的频谱图特征与图像特征提取模块32提取的图像特征序列融合后的特征趋近于去除噪声后的频谱图特征为提取方向,从语音信号中提取频谱图特征。
相应的,上述以趋近于对语音信号去除噪声后的语音信息为融合方向,利用特征融合模块33,对语音信息和图像特征序列进行融合,得到融合特征的又一种实现方式可以为:
利用特征融合模块33的一号特征融合模块,以趋近于对语音信号去除噪声后的频谱图特征为融合方向,对提取的频谱图特征和图像特征序列进行融合,得到第二融合特征;
利用特征融合模块33的二号特征融合模块,以趋近于对语音信号去除噪声后的声学特征为融合方向,对声学特征和第二融合特征进行融合,得到融合语音信号和图像序列的融合特征。
本申请的上述各个实施例中,输入多模态语音识别模型的语音信号可以是通过滑窗从原始语音信号中提取的初始声学特征,以及通过对原始语音信号进行短时傅里叶变换得到的频谱图,则语音信息提取模块31从语音信号中提取的声学特征可以是初始声学特征的隐层特征,从语音信号中提取的频谱图特征可以是频谱图的隐层特征。
下面说明多模态语音识别模型的训练过程。
在一可选的实施例中,请参阅图4a和图4b,其中,图4a为本申请实施例提供的对多模态语音识别模型进行训练的一种架构示意图,图4b为对多模态语音识别模型进行训练的一种实现流程图,可以包括:
步骤S41:通过多模态语音识别模型分别获取训练样本中的无噪声语音信号(也可称为清晰语音信号)的无噪声语音信息(即图4a中的清晰语音信息),和训练样本中包含上述无噪声语音信号的噪声语音信号的噪声语音信息。
其中,可以通过对无噪声语音信号添加噪声生成噪声语音信号,比如,对无噪声语音信号分别加噪到信噪比snr=10、snr=5、snr=0三个程度来模拟真实场景中的噪声程度。
或者,可以通过对噪声语音信号进行去噪处理,得到无噪声语音信号。
步骤S42:通过多模态语音识别模型获取训练样本中的样本图像序列的样本图像特征序列。可以通过图像特征提取模块32从样本图像序列中提取样本图像特征序列。
步骤S43:通过多模态语音识别模型将噪声语音信息和样本图像特征序列进行融合,得到训练样本的融合特征。可以通过特征融合模块33将噪声语音信息和样本图像特征序列进行融合,得到训练样本的融合特征。
步骤S44:通过多模态语音识别模型利用训练样本的融合特进行语音识别,得到训练样本对应的语音识别结果。可以通过识别模块22利用训练样本的融合特进行语音识别,得 到训练样本对应的语音识别结果。
步骤S45:通过多模态语音识别模型以训练样本的融合特征趋近于无噪声语音信息,训练样本对应的语音识别结果趋近于训练样本的样本标签为目标,对多模态语音识别模型的参数进行更新。具体的,可以通过第一损失函数计算训练样本的融合特征与无噪声语音信息的差异(为便于叙述,记为第一差异),通过第二损失函数计算训练样本对应的语音识别结果与训练样本的样本标签的差异(为便于叙述,记为第二差异),根据第一差异和第二差异的加权和对多模态语音识别模型的参数进行更新。
基于图4a-图4b所示的多模态语音识别模型训练方法训练得到的多模态语音识别模型,具备以趋近于对语音信号去除噪声后的信息为获取方向,获取融合语音信号和图像序列的信息,作为融合信息;利用该融合信息进行语音识别,得到语音信号的语音识别结果的能力。
下面根据语音信息的不同分别说明多模态语音识别模型的训练过程。
在一可选的实施例中,若语音信息仅为声学特征,请参看图5a和图5b,其中,图5a为对多模态语音识别模型进行训练的一种架构示意图,图5b为对多模态语音识别模型进行训练的一种实现流程图,可以包括:
步骤S51:通过多模态语音识别模型分别获取训练样本中的无噪声语音信号的声学特征(即图5a中的清晰声学特征,也可称为无噪声声学特征),和训练样本中包含上述无噪声语音信号的噪声语音信号的声学特征(即图5a中的噪声声学特征)。可以通过语音信息提取模块31的声学特征提取模块从无噪声语音信号中提取清晰声学特征,从噪声语音信号中提取噪声声学特征。噪声语音信号和无噪声语音信号的获取过程可以参看前述实施例,这里不再赘述。
步骤S52:通过多模态语音识别模型获取训练样本中的样本图像序列的样本图像特征序列。
步骤S53:通过多模态语音识别模型将噪声声学特征和样本图像特征序列进行融合,得到训练样本的融合特征。
步骤S54:通过多模态语音识别模型利用训练样本的融合特进行语音识别,得到训练样本对应的语音识别结果。
步骤S55:通过多模态语音识别模型以训练样本的融合特征趋近于无噪声声学特征, 训练样本对应的语音识别结果趋近于训练样本的样本标签为目标,对多模态语音识别模型的参数进行更新。
可选的,可以通过第一损失函数计算训练样本的融合特征和清晰声学特征的第一差异,通过第二损失函数计算训练样本对应的语音识别结果与训练样本的样本标签的第二差异,根据第一差异和第二差异的加权和对多模态语音识别模型的参数进行更新。
基于图5a-图5b所示的多模态语音识别模型训练方法训练得到的多模态语音识别模型,声学特征提取模块具备以对语音信号提取的声学特征与图像特征提取模块32对图像序列提取的图像特征序列融合后的特征趋近于对语音信号去除噪声后的声学特征为提取方向,从语音信号中提取声学特征的能力;
图像特征提取模块32具备以对图像序列提取的图像特征序列与声学特征提取模块对语音信号提取的声学特征融合后的特征趋近于对语音信号去除噪声后的声学特征为提取方向,从图像序列中提取图像特征序列的能力;
特征融合模块33具备以融合后的特征趋近于对语音信号去除噪声后的声学特征为融合方向,对提取的声学特征和图像特征序列进行融合,得到融合特征的能力。
在一可选的实施例中,若语音信息仅为频谱图特征,请参看图6a和图6b,其中,图6a为对多模态语音识别模型进行训练的一种架构示意图,图6b为对多模态语音识别模型进行训练的一种实现流程图,可以包括:
步骤S61:通过多模态语音识别模型分别获取训练样本中的无噪声语音信号的频谱图特征(即图6a中的清晰频谱图特征,也可称为无噪声频谱图特征),训练样本中包含上述无噪声语音信号的噪声语音信号的频谱图特征(即图6a中的噪声频谱图特征)。可以通过语音信息提取模块31的频谱图特征提取模块从无噪声语音信号中提取清晰频谱图特征,从噪声语音信号中提取噪声频谱图特征。噪声语音信号和无噪声语音信号的获取过程可以参看前述实施例,这里不再赘述。
步骤S62:通过多模态语音识别模型获取训练样本中的样本图像序列的样本图像特征序列。
步骤S63:通过多模态语音识别模型将噪声频谱图特征和样本图像特征序列进行融合,得到训练样本的融合特征。可以通过特征融合模块33将噪声语音信号的频谱图特征和样本图像特征序列进行融合,得到训练样本的融合特征。
步骤S64:通过多模态语音识别模型利用训练样本的融合特进行语音识别,得到训练样本对应的语音识别结果。
步骤S65:通过多模态语音识别模型以训练样本的融合特征趋近于无噪声频谱图特征,训练样本对应的语音识别结果趋近于训练样本的样本标签为目标,对多模态语音识别模型的参数进行更新。
可选的,可以通过第一损失函数计算融合特征和清晰频谱图特征的第一差异,通过第二损失函数计算训练样本对应的语音识别结果与训练样本的样本标签的第二差异,根据第一差异和第二差异的加权和对多模态语音识别模型的参数进行更新。
基于图6a-图6b所示的多模态语音识别模型训练方法训练得到的多模态语音识别模型,频谱图特征提取模块具备以对语音信号提取的频谱图特征与图像特征提取模块32对图像序列提取的图像特征序列融合后的特征趋近于对语音信号去除噪声后的频谱图特征为提取方向,从语音信号中提取频谱图特征的能力;
图像特征提取模块32具备以对图像序列提取的图像特征序列与频谱图特征提取模块对语音信号提取的频谱图特征融合后的特征趋近于对语音信号去除噪声后的频谱图特征为提取方向,从图像序列中提取图像特征序列的能力;
特征融合模块33具备以融合后的特征趋近于对语音信号去除噪声后的频谱图特征为融合方向,对提取的频谱图特征和图像特征序列进行融合,得到融合特征的能力。
在一可选的实施例中,若语音信息包括声学特征和频谱图特征,请参看图7a和图7b,其中,图7a为对多模态语音识别模型进行训练的一种架构示意图,图7b为对多模态语音识别模型进行训练的一种实现流程图,可以包括:
步骤S71:通过多模态语音识别模型分别获取训练样本中的无噪声语音信号的频谱图特征(即图7a中的清晰频谱图特征,即无噪声频谱图特征),以及训练样本中包含上述无噪声语音信号的噪声语音信号的频谱图特征(即图7a中的噪声频谱图特征)和声学特征(即图7a中的噪声声学特征)。具体获取过程可以参看前述实施例,这里不再赘述。
步骤S72:通过多模态语音识别模型获取训练样本中的样本图像序列的样本图像特征序列。
步骤S73:通过多模态语音识别模型对噪声声学特征和图像特征序列进行融合,得到训练样本的第一融合特征。
步骤S74:通过多模态语音识别模型对噪声语音信号的频谱图特征和训练样本的第一融合特征进行融合,得到训练样本的融合特征。
步骤S75:通过多模态语音识别模型对训练样本的融合特征进行语音识别,得到训练样本对应的语音识别结果。
步骤S76:通过多模态语音识别模型以训练样本的融合特征趋近于无噪声频谱图特征,训练样本对应的语音识别结果趋近于训练样本的样本标签为目标,对多模态语音识别模型的参数进行更新。
可选的,可以通过第一损失函数计算融合特征和清晰频谱图特征的第一差异,通过第二损失函数计算训练样本对应的语音识别结果与训练样本的样本标签的第二差异,根据第一差异和第二差异的加权和对多模态语音识别模型的参数进行更新。
基于图7a-图7b所示的多模态语音识别模型训练方法训练得到的多模态语音识别模型,声学特征提取模块具备以对语音信号提取的声学特征与频谱图特征提取模块对语音信号提取的频谱图特征和图像特征提取模块32对图像序列提取的图像特征序列融合后的特征趋近于对语音信号去除噪声后的频谱图特征为提取方向,从语音信号中提取声学特征的能力;
频谱图特征提取模块具备以对语音信号提取的频谱图特征与声学特征提取模块对语音信号提取的声学特征和图像特征提取模块32对图像序列提取的图像特征序列融合后的特征趋近于对语音信号去除噪声后的频谱图特征为提取方向,从语音信号中提取频谱图特征的能力;
图像特征提取模块32具备以对图像序列提取的图像特征序列与声学特征提取模块对语音信号提取的声学特征和频谱图特征提取模块对语音信号提取的频谱图特征融合后的特征趋近于对语音信号去除噪声后的频谱图特征为提取方向,从图像序列中提取图像特征序列的能力;
特征融合模块33具备以融合后的特征趋近于对语音信号去除噪声后的频谱图特征为融合方向,对声学特征、频谱图特征和图像特征序列进行融合,得到融合特征的能力。
在一可选的实施例中,若语音信息包括声学特征和频谱图特征,请参看图8a和图8b,其中,图8a为对多模态语音识别模型进行训练的另一种架构示意图,图8b为对多模态语音识别模型进行训练的另一种实现流程图,可以包括:
步骤S81:通过多模态语音识别模型分别获取训练样本中的无噪声语音信号的声学特 征(即图8a中的清晰声学特征,即无噪声声学特征),以及训练样本中包含上述无噪声语音信号的噪声语音信号的频谱图特征(即图8a中的噪声频谱图特征)和声学特征(即图8a中的噪声声学特征)。具体获取过程可以参看前述实施例,这里不再赘述。
步骤S82:通过多模态语音识别模型获取训练样本中的样本图像序列的样本图像特征序列。
步骤S83:通过多模态语音识别模型对噪声频谱图特征和图像特征序列进行融合,得到训练样本的第二融合特征。
步骤S84:通过多模态语音识别模型对噪声语音信号的声学特征和训练样本的第二融合特征进行融合,得到训练样本的融合特征。
步骤S85:通过多模态语音识别模型对训练样本的融合特征进行语音识别,得到训练样本对应的语音识别结果。
步骤S86:通过多模态语音识别模型以训练样本的融合特征趋近于无噪声声学特征,训练样本对应的语音识别结果趋近于训练样本的样本标签为目标,对多模态语音识别模型的参数进行更新。
可选的,可以通过第一损失函数计算融合特征和清晰声学特征的第一差异,通过第二损失函数计算训练样本对应的语音识别结果与训练样本的样本标签的第二差异,根据第一差异和第二差异的加权和对多模态语音识别模型的参数进行更新。
基于图8a-图8b所示的多模态语音识别模型训练方法训练得到的多模态语音识别模型,声学特征提取模块具备以对语音信号提取的声学特征与频谱图特征提取模块对语音信号提取的频谱图特征和图像特征提取模块32对图像序列提取的图像特征序列融合后的特征趋近于对语音信号去除噪声后的声学特征为提取方向,从语音信号中提取声学特征的能力;
频谱图特征提取模块具备以对语音信号提取的频谱图特征与声学特征提取模块对语音信号提取的声学特征和图像特征提取模块32对图像序列提取的图像特征序列融合后的特征趋近于对语音信号去除噪声后的声学特征为提取方向,从语音信号中提取频谱图特征的能力;
图像特征提取模块32具备以对图像序列提取的图像特征序列与声学特征提取模块对语音信号提取的声学特征和频谱图特征提取模块对语音信号提取的频谱图特征融合后的特征趋近于对语音信号去除噪声后的声学特征为提取方向,从图像序列中提取图像特征序列的能力;
特征融合模块33具备以融合后的特征趋近于对语音信号去除噪声后的声学特征为融合方向,对提取的声学特征、频谱图特征和图像特征序列进行融合,得到融合特征的能力。
在一可选的实施例中,若语音信息包括声学特征和频谱图特征,请参看图9a和图9b,其中,图9a为对多模态语音识别模型进行训练的又一种架构示意图,图9b为对多模态语音识别模型进行训练的又一种实现流程图,可以包括:
步骤S91:通过多模态语音识别模型分别获取训练样本中的无噪声语音信号的声学特征(即图9a中的清晰声学特征,即无噪声声学特征),以及训练样本中包含上述无噪声语音信号的噪声语音信号的频谱图特征(即图9a中的噪声频谱图特征)和声学特征(即图9a中的噪声声学特征)。具体获取过程可以参看前述实施例,这里不再赘述。
步骤S92:通过多模态语音识别模型获取训练样本中的样本图像序列的样本图像特征序列。
步骤S93:通过多模态语音识别模型对噪声语音信号的声学特征和图像特征序列进行融合,得到训练样本的第一融合特征。
步骤S94:通过多模态语音识别模型对噪声频谱图特征和图像特征序列进行融合,得到训练样本的第二融合特征。
步骤S95:通过多模态语音识别模型对噪声声学特征和训练样本的第二融合特征进行融合,得到训练样本的融合特征。
步骤S96:通过多模态语音识别模型对训练样本的融合特征进行语音识别,得到训练样本对应的语音识别结果。
步骤S97:通过多模态语音识别模型以训练样本的第一融合特征趋近于无噪声声学特征,训练样本的融合特征趋近于无噪声声学特征,训练样本对应的语音识别结果趋近于训练样本的样本标签为目标,对多模态语音识别模型的参数进行更新。
可选的,可以通过第一损失函数计算训练样本的第一融合特征和清晰声学特征的第一差异,通过第一损失函数计算训练样本的融合特征和清晰声学特征的第二差异,通过第二损失函数计算训练样本对应的语音识别结果与训练样本的样本标签的第三差异,根据第一差异、第二差异和第三差异的加权和对多模态语音识别模型的参数进行更新。
本示例中,计算第一差异和第二差异使用的损失函数相同,在一可选的实施例中,计算第一差异和第二差异使用的损失函数也可以不同,本申请不做具体限定。
基于图9a-图9b所示的多模态语音识别模型训练方法训练得到的多模态语音识别模型,声学特征提取模块具备以对语音信号提取的声学特征与频谱图特征提取模块对语音信号提取的频谱图特征和图像特征提取模块32对图像序列提取的图像特征序列融合后的特征趋近于对语音信号去除噪声后的声学特征,对语音信号提取的声学特征与图像特征提取模块32对图像序列提取的图像特征序列融合后的特征趋近于对语音信号去除噪声后的声学特征为提取方向,从语音信号中提取声学特征的能力;
频谱图特征提取模块具备以对语音信号提取的频谱图特征与声学特征提取模块对语音信号提取的声学特征和图像特征提取模块32对图像序列提取的图像特征序列融合后的特征趋近于对语音信号去除噪声后的声学特征为提取方向,从语音信号中提取频谱图特征的能力;
图像特征提取模块32具备以对图像序列提取的图像特征序列与声学特征提取模块对语音信号提取的声学特征和频谱图特征提取模块对语音信号提取的频谱图特征融合后的特征趋近于对语音信号去除噪声后的声学特征,对图像序列提取的图像特征序列与声学特征提取模块对语音信号提取的声学特征融合后的特征趋近于对语音信号去除噪声后的声学特征为提取方向,从图像序列中提取图像特征序列的能力;
特征融合模块33具备以融合得到的特征趋近于对语音信号去除噪声后的声学特征为融合方向,对声学特征和图像特征序列进行融合,得到第一融合特征,对频谱图特征和图像特征序列进行融合,得到第二融合特征,对声学特征和第二融合特征进行融合,得到融合特征的能力。
在一可选的实施例中,若语音信息包括声学特征和频谱图特征,请参看图10a和图10b,其中,图10a为对多模态语音识别模型进行训练的又一种架构示意图,图10b为对多模态语音识别模型进行训练的又一种实现流程图,可以包括:
步骤S101:通过多模态语音识别模型分别获取训练样本中的无噪声语音信号的频谱图特征(即图10a中的清晰频谱图特征,即无噪声频谱图特征),以及训练样本中包含上述无噪声语音信号的噪声语音信号的频谱图特征(即图10a中的噪声频谱图特征)和声学特征(即图10a中的噪声声学特征)。具体获取过程可以参看前述实施例,这里不再赘述。
步骤S102:通过多模态语音识别模型获取训练样本中的样本图像序列的样本图像特征序列。
步骤S103:通过多模态语音识别模型对噪声频谱图特征和图像特征序列进行融合,得到训练样本的第二融合特征。
步骤S104:通过多模态语音识别模型对噪声声学特征和图像特征序列进行融合,得到训练样本的第一融合特征。
步骤S105:通过多模态语音识别模型对噪声频谱图特征和训练样本的第一融合特征进行融合,得到训练样本的融合特征。
步骤S106:通过多模态语音识别模型对训练样本的融合特征进行语音识别,得到训练样本对应的语音识别结果。
步骤S107:通过多模态语音识别模型以训练样本的第二融合特征趋近于无噪声频谱图特征,训练样本的融合特征趋近于无噪声频谱图特征,训练样本对应的语音识别结果趋近于训练样本的样本标签为目标,对多模态语音识别模型的参数进行更新。
可选的,可以通过第一损失函数计算训练样本的第二融合特征和无噪声频谱图特征的第一差异,通过第一损失函数计算训练样本的融合特征和无噪声频谱图特征的第二差异,通过第二损失函数计算训练样本对应的语音识别结果与训练样本的样本标签的第三差异,根据第一差异、第二差异和第三差异的加权和对多模态语音识别模型的参数进行更新。
本示例中,计算第一差异和第二差异使用的损失函数相同,在一可选的实施例中,计算第一差异和第二差异使用的损失函数也可以不同,本申请不做具体限定。
基于图10a-图10b所示的多模态语音识别模型训练方法训练得到的多模态语音识别模型,声学特征提取模块具备以对语音信号提取的声学特征与频谱图特征提取模块对语音信号提取的频谱图特征和图像特征提取模块32对图像序列提取的图像特征序列融合后的特征趋近于对语音信号去除噪声后的频谱图特征为提取方向,从语音信号中提取声学特征的能力;
频谱图特征提取模块具备以对语音信号提取的频谱图特征与声学特征提取模块对语音信号提取的声学特征和图像特征提取模块32对图像序列提取的图像特征序列融合后的特征趋近于对语音信号去除噪声后的频谱图特征,对语音信号提取的频谱图特征与图像特征提取模块32对图像序列提取的图像特征序列融合后的特征趋近于对语音信号去除噪声后的频谱图特征为提取方向,从语音信号中提取频谱图特征的能力;
图像特征提取模块32具备以对图像序列提取的图像特征序列与声学特征提取模块对语音信号提取的声学特征和频谱图特征提取模块对语音信号提取的频谱图特征融合后的特 征趋近于对语音信号去除噪声后的频谱图特征,对图像序列提取的图像特征序列与频谱图特征提取模块对语音信号提取的频谱图特征融合后的特征趋近于对语音信号去除噪声后的频谱图特征为提取方向,从图像序列中提取图像特征序列的能力;
特征融合模块33具备以融合后的特征趋近于对语音信号去除噪声后的频谱图特征为融合方向,对频谱图特征和图像特征序列进行融合,得到第二融合特征,对声学特征和图像特征序列进行融合,得到第一融合特征,对频谱图特征和第一融合特征进行融合,得到融合特征的能力。
在一可选的实施例中,若语音信息包括声学特征和频谱图特征,请参看图11a和图11b,其中,图11a为对多模态语音识别模型进行训练的又一种架构示意图,图11b为对多模态语音识别模型进行训练的又一种实现流程图,可以包括:
步骤S111:通过多模态语音识别模型分别获取训练样本中的无噪声语音信号的频谱图特征(即图11a中的清晰频谱图特征,即无噪声频谱图特征)和声学特征(即图11a中的清晰声学特征),以及训练样本中包含上述无噪声语音信号的噪声语音信号的频谱图特征(即图11a中的噪声频谱图特征)和声学特征(即图11a中的噪声声学特征)。具体获取过程可以参看前述实施例,这里不再赘述。
步骤S112:通过多模态语音识别模型获取训练样本中的样本图像序列的样本图像特征序列。
步骤S113:通过多模态语音识别模型对噪声声学特征和图像特征序列进行融合,得到训练样本的第一融合特征。
步骤S114:通过多模态语音识别模型对噪声频谱图特征和图像特征序列进行融合,得到训练样本的第二融合特征。
步骤S115:通过多模态语音识别模型对训练样本的第一融合特征和训练样本的第二融合特征进行融合,得到训练样本的融合特征。
步骤S116:通过多模态语音识别模型对训练样本的融合特征进行语音识别,得到训练样本对应的语音识别结果。
步骤S117:通过多模态语音识别模型以训练样本的第一融合特征趋近于无噪声声学特征,训练样本的第二融合特征趋近于无噪声频谱图特征,训练样本对应的语音识别结果趋近于训练样本的样本标签为目标,对多模态语音识别模型的参数进行更新。
可选的,可以通过第一损失函数计算训练样本的第一融合特征和无噪声声学特征的第一差异,通过第一损失函数计算训练样本的第二融合特征和无噪声频谱图特征的第二差异,通过第二损失函数计算训练样本对应的语音识别结果与训练样本的样本标签的第三差异,根据第一差异、第二差异和第三差异的加权和对多模态语音识别模型的参数进行更新。
本示例中,计算第一差异和第二差异使用的损失函数相同,在一可选的实施例中,计算第一差异和第二差异使用的损失函数也可以不同,本申请不做具体限定。
基于图11a-图11b所示的多模态语音识别模型训练方法训练得到的多模态语音识别模型,声学特征提取模块具备以对语音信号提取的声学特征与图像特征提取模块32对图像序列提取的图像特征序列融合后的特征趋近于对语音信号去除噪声后的声学特征为提取方向,从语音信号中提取声学特征的能力;
频谱图特征提取模块具备以对语音信号提取的频谱图特征与图像特征提取模块32对图像序列提取的图像特征序列融合后的特征趋近于对语音信号去除噪声后的频谱图特征为提取方向,从语音信号中提取频谱图特征的能力;
图像特征提取模块32具备以对图像序列提取的图像特征序列与声学特征提取模块对语音信号提取的声学特征融合后的特征趋近于对语音信号去除噪声后的声学特征,对图像序列提取的图像特征序列与频谱图特征提取模块对语音信号提取的频谱图特征融合后的特征趋近于对语音信号去除噪声后的频谱图特征为提取方向,从图像序列中提取图像特征序列的能力;
特征融合模块33具备以融合得到的第二融合特征趋近于对语音信号去除噪声后的频谱图特征为融合方向,对频谱图特征和图像特征序列进行融合,得到第二融合特征;以融合得到的第一融合特征趋近于对语音信号去除噪声后的声学特征为融合方向,对声学特征和图像特征序列进行融合,得到第一融合特征能力,还具有对第一融合特征和第二融合特征进行融合,得到融合特征的能力。
在一可选的实施例中,若语音信息包括声学特征和频谱图特征,请参看图12a和图12b,其中,图12a为对多模态语音识别模型进行训练的又一种架构示意图,图12b为对多模态语音识别模型进行训练的又一种实现流程图,可以包括:
步骤S121:通过多模态语音识别模型分别获取训练样本中的无噪声语音信号的频谱图特征(即图12a中的清晰频谱图特征,即无噪声频谱图特征)和声学特征(即图12a中的 清晰声学特征),以及训练样本中包含上述无噪声语音信号的噪声语音信号的频谱图特征(即图12a中的噪声频谱图特征)和声学特征(即图12a中的噪声声学特征)。具体获取过程可以参看前述实施例,这里不再赘述。
步骤S122:通过多模态语音识别模型获取训练样本中的样本图像序列的样本图像特征序列。
步骤S123:通过多模态语音识别模型对噪声声学特征和图像特征序列进行融合,得到训练样本的第一融合特征。
步骤S124:通过多模态语音识别模型对噪声频谱图特征和训练样本的第一融合特征进行融合,得到训练样本的融合特征。
步骤S125:通过多模态语音识别模型对训练样本的融合特征进行语音识别,得到训练样本对应的语音识别结果。
步骤S126:通过多模态语音识别模型以训练样本的第一融合特征趋近于无噪声声学特征,训练样本的融合特征趋近于无噪声频谱图特征,训练样本对应的语音识别结果趋近于训练样本的样本标签为目标,对多模态语音识别模型的参数进行更新。
可选的,可以通过第一损失函数计算训练样本的第一融合特征和无噪声声学特征的第一差异,通过第一损失函数计算训练样本的融合特征和无噪声频谱图特征的第二差异,通过第二损失函数计算训练样本对应的语音识别结果与训练样本的样本标签的第三差异,根据第一差异、第二差异和第三差异的加权和对多模态语音识别模型的参数进行更新。
本示例中,计算第一差异和第二差异使用的损失函数相同,在一可选的实施例中,计算第一差异和第二差异使用的损失函数也可以不同,本申请不做具体限定。
基于图12a-图12b所示的多模态语音识别模型训练方法训练得到的多模态语音识别模型,声学特征提取模块具备以对语音信号提取的声学特征与频谱图特征提取模块对语音信号提取的频谱图特征和图像特征提取模块32对图像序列提取的图像特征序列融合后的特征趋近于对语音信号去除噪声后的频谱图特征,对语音信号提取的声学特征与图像特征提取模块32对图像序列提取的图像特征序列融合后的特征趋近于对语音信号去除噪声后的声学特征为提取方向,从语音信号中提取声学特征的能力;
频谱图特征提取模块具备以对语音信号提取的频谱图特征与声学特征提取模块对语音信号提取的声学特征和图像特征提取模块32对图像序列提取的图像特征序列融合后的特征趋近于对语音信号去除噪声后的频谱图特征的能力;
图像特征提取模块32具备以对图像序列提取的图像特征序列与声学特征提取模块对语音信号提取的声学特征和频谱图特征提取模块对语音信号提取的频谱图特征融合后的特征趋近于对语音信号去除噪声后的频谱图特征,对图像序列提取的图像特征序列与声学特征提取模块对语音信号提取的声学特征融合后的特征趋近于对语音信号去除噪声后的声学特征为提取方向,从图像序列中提取图像特征序列的能力;
特征融合模块33具备以融合得到的第一融合特征趋近于对语音信号去除噪声后的声学特征为融合方向,对声学特征和图像特征序列进行融合,得到第一融合特征,以融合后的融合特征趋近于对语音信号去除噪声后的频谱图特征为融合方向,对频谱图特征和第一融合特征进行融合,得到融合特征的能力。
在一可选的实施例中,若语音信息包括声学特征和频谱图特征,请参看图13a和图13b,其中,图13a为对多模态语音识别模型进行训练的又一种架构示意图,图13b为对多模态语音识别模型进行训练的又一种实现流程图,可以包括:
步骤S131:通过多模态语音识别模型分别获取训练样本中的无噪声语音信号的频谱图特征(即图13a中的清晰频谱图特征,即无噪声频谱图特征)和声学特征(即图13a中的清晰声学特征),以及训练样本中包含上述无噪声语音信号的噪声语音信号的频谱图特征(即图13a中的噪声频谱图特征)和声学特征(即图13a中的噪声声学特征)。具体获取过程可以参看前述实施例,这里不再赘述。
步骤S132:通过多模态语音识别模型获取训练样本中的样本图像序列的样本图像特征序列。
步骤S133:通过多模态语音识别模型对噪声频谱图特征和图像特征序列进行融合,得到训练样本的第二融合特征。
步骤S134:通过多模态语音识别模型对噪声声学特征和训练样本的第二融合特征进行融合,得到训练样本的融合特征。
步骤S135:通过多模态语音识别模型对训练样本的融合特征进行语音识别,得到训练样本对应的语音识别结果。
步骤S136:通过多模态语音识别模型以训练样本的第二融合特征趋近于无噪声频谱图特征,训练样本的融合特征趋近于无噪声声学特征,训练样本对应的语音识别结果趋近于训练样本的样本标签为目标,对多模态语音识别模型的参数进行更新。
可选的,可以通过第一损失函数计算训练样本的第二融合特征和无噪声频谱图特征的第一差异,通过第一损失函数计算训练样本的融合特征和无噪声声学特征的第二差异,通过第二损失函数计算训练样本对应的语音识别结果与训练样本的样本标签的第三差异,根据第一差异、第二差异和第三差异的加权和对多模态语音识别模型的参数进行更新。
本示例中,计算第一差异和第二差异使用的损失函数相同,在一可选的实施例中,计算第一差异和第二差异使用的损失函数也可以不同,本申请不做具体限定。
基于图13a-图13b所示的多模态语音识别模型训练方法训练得到的多模态语音识别模型,声学特征提取模块具备以对语音信号提取的声学特征与频谱图特征提取模块对语音信号提取的频谱图特征和图像特征提取模块32对图像序列提取的图像特征序列融合后的特征趋近于对语音信号去除噪声后的声学特征为提取方向,从语音信号中提取声学特征的能力;
频谱图特征提取模块具备以对语音信号提取的频谱图特征与声学特征提取模块对语音信号提取的声学特征和图像特征提取模块32对图像序列提取的图像特征序列融合后的特征趋近于对语音信号去除噪声后的声学特征,对语音信号提取的频谱图特征和图像特征提取模块32对图像序列提取的图像特征序列融合后的特征趋近于对语音信号去除噪声后的频谱图特征的能力;
图像特征提取模块32具备以对图像序列提取的图像特征序列与声学特征提取模块对语音信号提取的声学特征和频谱图特征提取模块对语音信号提取的频谱图特征融合后的特征趋近于对语音信号去除噪声后的声学特征,对图像序列提取的图像特征序列与频谱图特征提取模块对语音信号提取的频谱图特征融合后的特征趋近于对语音信号去除噪声后的频谱图特征为提取方向,从图像序列中提取图像特征序列的能力;
特征融合模块33具备以融合得到的第二融合特征趋近于对语音信号去除噪声后的频谱图特征为融合方向,对频谱图特征和图像特征序列进行融合,得到第二融合特征,以融合得到的融合特征趋近于对语音信号去除噪声后的声学特征为融合方向,对声学特征和第二融合特征进行融合,得到融合特征的能力。
前述各个实施例中,不对各个差异的权重进行限定,各个差异对应的权重可以相同,也可以不同。各个差异的权重可以预先设置好,也可以在多模态语音识别模型训练过程中学习得到。以图5a所示实施例为例,可选的,第一差异的权重可以为0.2,第二差异的权重可以为0.8。
可选的,第一损失函数可以为L2范数或L1范数,而第二损失函数可以为交叉熵函数。
本申请的发明人研究发现,同步采集的音/视频的数据量通常较少,只以同步采集的音/视频数据作为训练样本训练得到多模态语音识别模型可能会出现过拟合现象,为了避免过拟合现象,同时为了近一步提高多模态语音识别模型的识别准确率,在对多模态语音识别模型训练之前,可以对一些功能模块进行预训练。
在一可选的实施例中,在对多模态语音识别模型训练之前,语音信息提取模块31的声学特征提取模块的初始参数为,以语音信号及其对应的语音内容为训练数据训练好的语音识别模型中,用于对语音信号进行声学特征提取的特征提取模块的参数。
也就是说,声学特征提取模块的初始参数是利用纯语音样本训练好的语音识别模型中的特征提取模块的参数。
本申请实施例中,不对语音识别模型的具体架构进行限定,但不管语音识别模型的架构是怎样的,特征提取模块是必须的功能模块。比如,在一可选的实施例中,语音识别模型可以包括:特征提取模块,用于提取输入语音识别模型的声学特征的隐层特征;识别模块,用于根据特征提取模块提取的隐层特征进行语音识别。语音识别模型的训练过程可以参看已有的训练方法,这里不再详述。
这里用于训练语音识别模型的语音样本中可以包含用于训练上述多模态语音识别模型的语音样本,也可以不包含上述用于训练上述多模态语音识别模型的语音样本,本申请对此不做具体限定。
在一可选的实施例中,在对多模态语音识别模型训练之前,频谱图特征提取模块的初始参数为,以语音信号及其对应的频谱图标签为训练数据训练好的语音分离模型中,用于对语音信号的频谱图进行特征提取的频谱图特征提取模块的参数。
也就是说,频谱图特征提取模块的初始参数是利用纯语音样本训练好的语音分离模型中的频谱图特征提取模块的参数。
本申请实施例中,不对语音分离模型的具体架构进行限定,但不管语音分离模型的架构是怎样的,频谱图特征提取模块是必须的功能模块。比如,在一可选的实施例中,语音分离模型可以包括:频谱图特征提取模块,用于提取输入语音分离模型的频谱图的隐层特征;分离模块,用于根据频谱图特征提取模块提取的隐层特征进行语音分离。语音分离模 型的训练过程可以参看已有的训练方法,这里不再详述。
这里用于训练语音分离模型的语音样本中可以包含用于训练上述多模态语音识别模型的语音样本,也可以不包含上述于训练上述多模态语音识别模型的语音样本,本申请对此不做具体限定。
在一可选的实施例中,在对多模态语音识别模型训练之前,图像特征提取模块的初始参数为,以图像序列及其对应的发音内容为训练数据训练好的唇语识别模型中,用于对图像序列进行特征提取的图像特征提取模块的参数。
也就是说,图像特征提取模块的初始参数是利用纯图像序列样本训练好的唇语识别模型中的图像特征提取模块的参数。
本申请实施例中,不对唇语识别模型的具体架构进行限定,但不管唇语识别模型的架构是怎样的,图像特征提取模块是必须的功能模块。比如,在一可选的实施例中,唇语识别模型可以包括:图像特征提取模块,用于提取输入唇语识别模型的图像序列的隐层特征序列;识别模块,用于根据图像特征提取模块提取的隐层特征序列进行唇语识别。唇语识别模型的训练过程可以参看已有的训练方法,这里不再详述。
这里用于训练唇语识别模型的图像序列样本中可以包含用于训练上述多模态语音识别模型的图像序列样本,也可以不包含上述用于训练上述多模态语音识别模型的图像序列样本,本申请对此不做具体限定。
需要说明的是,识别模块22利用融合特征进行语音识别,得到的语音识别结果通常为音素级识别结果,比如为三音素(triphone),在得到三音素后,可以将音素通过维特比算法解码成文字序列。具体解码过程可以参已有的方法,这里不再详述。
另外,本申请实施例中,输入多模态语音识别模型的语音信号可以为从原始的语音信号中提取的声学特征和/或由原始的语音信号通过短时傅里叶变换得到的频谱图。
若多模态语音识别模型仅需要提取语音信号的声学特征,则输入多模态语音识别模型的是从原始语音信号中提取的声学特征(比如,fbank特征);以fbank特征为例,可以通过滑动窗口提取fbank特征,其中,窗长可以为25ms,帧移为10ms,即相邻两个滑动窗口位置的语音信号有15ms的重叠,滑动窗口每滑动到一个位置,提取该位置处的语音信号的40维fbank特征(当然也可以是其它维度,本申请不做具体限定)向量,这样得到的fbank特征为100fps的fbank特征向量序列。多模态语音识别模型从输入的fbank特征中提取的 特征为fbank特征的隐层特征。
若多模态语音识别模型仅需要提取语音信号的频谱图特征,则输入多模态语音识别模型的是由原始的语音信号通过短时傅里叶变换得到的频谱图;多模态语音识别模型从输入的频谱图中提取的是频谱图的隐层特征。
若多模态语音识别模型既需要提取语音信号的声学特征,又需要提取语音信号的频谱图特征,则输入多模态语音识别模型的是从原始语音信号中提取的声学特征和由原始的语音信号通过短时傅里叶变换得到的频谱图。
视频的帧率通常为25fps。为了简化多模态语音识别模型的数据处理流程,本申请实施例中,在对多模态语音识别模型进行训练之前,还对样本语音信号的文字标注进行预处理,具体可以使用forcealignment将文字发音音素对齐到语音信号上,其中,每4帧语音信号(滑动窗口每滑动到一个位置,确定一帧语音信号)对应到一个三音素(triphone)上,这样实际上文字标注被转化为triphone标注,标注帧率为25fps,是音频帧率的四分之一,与视频帧率同步。具体对齐方式可以参看已有的实现方式,这里不再赘述。
以基于图5a所示实施例为例,在模型的训练阶段,输入多模态语音识别模型的噪声语音信号可以是100fps的语音帧(为便于叙述,记为噪声语音帧,该噪声语音帧通过窗长为25ms,帧移为10ms的滑动窗口在原始噪声语音信号中进行滑动得到)的初始fbank特征向量序列(为便于叙述,记为初始噪声fbank特征向量序列),初始噪声fbank特征向量序列中的每个初始噪声fbank特征向量均为40维的特征向量。同理,输入多模态语音识别模型的无噪声语音信号可以是100fps的语音帧(为便于叙述,记为无噪声语音帧,该无噪声语音帧通过窗长为25ms,帧移为10ms的滑动窗口在原始无噪声语音信号中进行滑动得到)的初始fbank特征向量序列(为便于叙述,记为初始无噪声fbank特征向量序列),初始无噪声fbank特征向量序列中的每个初始无噪声fbank特征向量均为40维的特征向量。
初始噪声fbank特征向量序列经过声学特征提取模块后会在时间维度下采样4倍,得到25fps的512维的噪声fbank特征向量序列;初始无噪声fbank特征向量序列经过声学特征提取模块后会在时间维度下采样4倍,得到25fps的512维的无噪声fbank特征向量序列。
输入多模态语音识别模型的图像序列可以是25fps的图像序列,图像大小为80×80的RGB三通道图像,经过图像特征提取模块后得到25fps的512维的图像特征向量序列。
25fps的512维的噪声fbank特征向量序列和25fps的512维的图像特征向量序列输入特征融合模块,特征融合模块每接收一个噪声fbank特征向量和一个图像特征向量,将该 噪声fbank特征向量和该图像特征向量进行融合(如,将噪声fbank特征向量和图像特征向量进行拼接),再通过一个小的融合神经网络,生成512维的融合特征向量,该512维的融合特征向量输出到识别模块。
识别模块经过softmax分类将接收到的512维的融合特征向量进行音素识别,得到三音素识别结果。
本示例中,用于对多模态语音识别模型的参数进行更新的损失函数由两部分构成:为了显式表达图像信息对高噪声语音信息的降噪功能,将512维的融合特征向量与对应的512维的无噪声fbank特征向量做L2范数作为损失函数的一部分,使得融合后的特征向量与对应的512维的无噪声fbank特征向量更接近,从而起到在特征层面上降噪约束效果。同时,计算识别模块的识别结果与三音素标签的交叉熵函数作为损失函数的另一部分。
在多模态语音识别模型的训练或使用阶段,输入多模态语音识别模型的语音信号可以是100fps的语音帧的初始fbank特征向量序列;初始fbank特征向量序列经过声学特征提取模块后会在时间维度下采样4倍,得到25fps的512维的fbank特征向量序列;输入多模态语音识别模型的图像序列可以是25fps的图像序列,图像大小为80×80的RGB三通道图像,经过图像特征提取模块后得到25fps的512维的图像特征向量序列;25fps的512维的fbank特征向量序列和25fps的512维的图像特征向量序列输入特征融合模块,特征融合模块每接收一个fbank特征向量和一个图像特征向量,将该fbank特征向量和该图像特征向量进行融合,生成512维的融合特征向量,该512维的融合特征向量输出到识别模块。
识别模块经过softmax分类将接收到的512维的融合特征向量进行音素识别,得到三音素识别结果。
此外,本申请的发明人研究发现,目前的借助唇部动作视频协助进行语音识别的多模态语音识别方法,对训练数据集极其敏感,比如,如果训练集中大部分数据为英文数据,少量为中文数据,唇部动作信息的加入可能使高噪声下的中文识别成英文,反而降低了语音识别效果。
而由于降噪本身是与语种无关的,因而基于本申请的方案能够显著缓解训练数据集语种不均衡带来的识别混乱问题,进一步提升了高噪声环境下的多模态语音识别效果。
也就是说,本申请的多模态语音识别模型对训练集的依赖性较低,即便训练数据集中样本的语种分布不均匀,训练好的多模态语音识别模型也可以准确进行多语种(可识别的语种为训练样本集中包含的语种)的语音识别,大大减轻了识别混乱问题。
因而,基于本申请的方案,训练上述多模态语音识别模型所使用的训练样本集合中,可以仅包含单一语种的训练样本,也可以包含两种或多种语种的训练样本。当训练样本集合中包含两种或多种语种的训练样本时,训练样本集合中各个语种的训练样本所占的比例随机确定,或为预置比例。
如表1所示,为基于本申请公开的方案(具体为图5a所示实施例)与现有技术中的语音识别效果的对比。这里进行测试的测试集以英文语料为主,中文语料只有一小部分。
表1
Figure PCTCN2020087115-appb-000001
从表1可以看出,如果单纯对语音信号进行处理实现语音识别(即表1中的单语音识别网络),不管是清晰语音还是高噪声语音,识别错误率都较高。
而在语音识别过程中加入唇部动作视频辅助语音识别(即表1中的已有的多模态识别网络)后,清晰语音和高噪声语音的识别错误率均降低了。
而基于本申请的方案,在多模态语音识别过程中加入降噪的思想后,清晰语音和高噪声语音的识别错误率进一步降低了。
与方法实施例相对应,本申请实施例还提供一种语音识别装置,本申请实施例提供的语音识别装置的一种结构示意图如图14所示,可以包括:
获取模块141,特征提取模块142和识别模块143;其中,
获取模块141用于获取语音信号和与所述语音信号同步采集的图像序列;所述图像序列中的图像为唇动相关区域的图像;
特征提取模块142用于以趋近于对所述语音信号去除噪声后的语音信息为获取方向,获取融合所述语音信号和所述图像序列的信息,作为融合信息;
识别模块143用于利用所述融合信息进行语音识别,得到所述语音信号的语音识别结果。
本申请实施例提供的语音识别装置,在获取语音信号和图像序列的融合特征时,是以融合信息趋近于对语音信号去噪后的语音信息为获取方向的,即所获得到的融合信息趋近于无噪声语音信号的语音信息,降低了语音信号中的噪声对语音识别的干扰,从而提高语音识别率。
在一可选的实施例中,特征提取模块142和识别模块143的功能可以通过多模态语音识别模型实现,具体的:
特征提取模块142具体可以用于:通过多模态语音识别模型以趋近于对所述语音信号去除噪声后的信息为获取方向,获取融合所述语音信号和所述图像序列的信息,作为融合信息;
识别模块143具体可以用于:通过多模态语音识别模型利用所述融合信息进行语音识别,得到所述语音信号的语音识别结果的能力。
在一可选的实施例中,特征提取模块142具体可以用于:以趋近于对所述语音信号去除噪声后的语音信息为获取方向,利用所述多模态语音识别模型的语音信息提取模块从所述语音信号中提取语音信息,利用所述多模态语音识别模型的图像特征提取模块从所述图像序列中提取图像特征序列;利用所述多模态语音识别模型的特征融合模块对所述语音信息和所述图像特征序列进行融合,获取融合所述语音信号和所述图像序列的融合特征;
识别模块143具体可以用于:利用多模态语音识别模型的识别模块,基于所述融合特征进行语音识别,得到所述语音信号的语音识别结果。
在一可选的实施例中,特征提取模块142具体可以包括:
提取模块,用于以对所述语音信号提取的语音信息与对所述图像序列提取的图像特征序列融合后的特征趋近于对所述语音信号去除噪声后的语音信息为提取方向,利用所述多模态语音识别模型的语音信息提取模块从所述语音信号中提取语音信息,利用所述多模态语音识别模型的图像特征提取模块从所述图像序列中提取图像特征序列;
融合模块,用于以趋近于对所述语音信号去除噪声后的语音信息为融合方向,利用所述多模态语音识别模型的特征融合模块,对所述语音信息和所述图像特征序列进行融合,得到融合特征。
在一可选的实施例中,所述语音信息为N种,所述N为大于或等于1的正整数;提取 模块在利用所述多模态语音识别模型的语音信息提取模块从所述语音信号中提取语音信息时,具体用于:
利用所述多模态语音识别模型的语音信息提取模块,以提取的N种语音信息与对所述图像序列提取的图像特征序列融合后的特征趋近于对所述语音信号去除噪声后的一种语音信息为提取方向,从所述语音信号中提取N种语音信息;或者,
若所述N大于1,则利用所述多模态语音识别模型的语音信息提取模块,以提取的每一种语音信息与对所述图像序列提取的图像特征序列融合后的特征趋近于对所述语音信号去除噪声后的该种语音信息为提取方向,从所述语音信号中提取N种语音信息。
在一可选的实施例中,所述语音信息为声学特征和/或频谱图特征,所述融合模块具体可以用于:
根据如下三种融合方式中的任意一种或任意两种的组合得到的融合特征获取融合所述语音信号和所述图像序列的融合特征:
融合方式一:利用所述多模态语音识别模型的特征融合模块,以趋近于对所述语音信号去噪后的声学特征为融合方向,对所述声学特征和所述图像特征序列进行融合,得到融合方式一对应的融合特征;
融合方式二:利用所述多模态语音识别模型的特征融合模块,以趋近于对所述语音信号去噪后的频谱图特征为融合方向,对所述频谱图特征和所述图像特征序列进行融合,得到融合方式二对应的融合特征;
融合方式三:利用所述多模态语音识别模型的特征融合模块,以趋近于对所述语音信号去噪后的声学特征或频谱图特征为融合方向,对所述声学特征、所述频谱图特征和所述图像特征序列进行融合,得到融合方式三对应的融合特征。
在一可选的实施例中,所述语音识别装置还包括训练模块,用于:
通过所述多模态语音识别模型分别获取训练样本中的无噪声语音信号的无噪声语音信息,和所述训练样本中包含所述无噪声语音信号的噪声语音信号的噪声语音信息;
通过所述多模态语音识别模型获取所述训练样本中的样本图像序列的样本图像特征序列;
通过所述多模态语音识别模型将所述噪声语音信息和所述样本图像特征序列进行融合,得到所述训练样本的融合特征;
通过所述多模态语音识别模型利用所述训练样本的融合特进行语音识别,得到所述训练样本对应的语音识别结果;
通过所述多模态语音识别模型以所述训练样本的融合特征趋近于所述无噪声语音信息,所述训练样本对应的语音识别结果趋近于所述训练样本的样本标签为目标,对所述多模态语音识别模型的参数进行更新。
在一可选的实施例中,在训练多模态语音识别模型之前,所述声学特征提取模块的初始参数为,以语音信号及其对应的语音内容为训练数据训练好的语音识别模型中,用于对语音信号进行声学特征提取的特征提取模块的参数。
在一可选的实施例中,在训练多模态语音识别模型之前,所述频谱图特征提取模块的初始参数为,以语音信号及其对应的频谱图标签为训练数据训练好的语音分离模型中,用于对语音信号的频谱图进行特征提取的频谱图特征提取模块的参数。
在一可选的实施例中,在训练多模态语音识别模型之前,所述图像特征提取模块的初始参数为,以图像序列及其对应的发音内容为训练数据训练好的唇语识别模型中,用于对图像序列进行特征提取的图像特征提取模块的参数。
在一可选的实施例中,训练所述多模态语音识别模型所使用的训练样本集合中,包括不同语种的训练样本,所述训练样本集合中各个语种的训练样本所占的比例随机确定,或为预置比例。
本申请实施例提供的语音识别装置可应用于语音识别设备,如PC终端、云平台、服务器及服务器集群等。可选的,图15示出了语音识别设备的硬件结构框图,参照图15,语音识别设备的硬件结构可以包括:至少一个处理器1,至少一个通信接口2,至少一个存储器3和至少一个通信总线4;
在本申请实施例中,处理器1、通信接口2、存储器3、通信总线4的数量为至少一个,且处理器1、通信接口2、存储器3通过通信总线4完成相互间的通信;
处理器1可能是一个中央处理器CPU,或者是特定集成电路ASIC(ApplicationSpecific IntegratedCircuit),或者是被配置成实施本发明实施例的一个或多个集成电路等;
存储器3可能包含高速RAM存储器,也可能还包括非易失性存储器(non-volatile memory)等,例如至少一个磁盘存储器;
其中,存储器存储有程序,处理器可调用存储器存储的程序,所述程序用于:
获取语音信号和与所述语音信号同步采集的图像序列;所述图像序列中的图像为唇动相关区域的图像;
以趋近于对所述语音信号去除噪声后的语音信息为获取方向,获取融合所述语音信号和所述图像序列的信息,作为融合信息;
利用所述融合信息进行语音识别,得到所述语音信号的语音识别结果。
可选的,所述程序的细化功能和扩展功能可参照上文描述。
本申请实施例还提供一种存储介质,该存储介质可存储有适于处理器执行的程序,所述程序用于:
获取语音信号和与所述语音信号同步采集的图像序列;所述图像序列中的图像为唇动相关区域的图像;
以趋近于对所述语音信号去除噪声后的语音信息为获取方向,获取融合所述语音信号和所述图像序列的信息,作为融合信息;
利用所述融合信息进行语音识别,得到所述语音信号的语音识别结果。
可选的,所述程序的细化功能和扩展功能可参照上文描述。
本领域普通技术人员可以意识到,结合本文中所公开的实施例描述的各示例的单元及算法步骤,能够以电子硬件、或者计算机软件和电子硬件的结合来实现。这些功能究竟以硬件还是软件方式来执行,取决于技术方案的特定应用和设计约束条件。专业技术人员可以对每个特定的应用来使用不同方法来实现所描述的功能,但是这种实现不应认为超出本申请的范围。
在本申请所提供的几个实施例中,应该理解到,所揭露的系统、装置和方法,可以通过其它的方式实现。另一点,所显示或讨论的相互之间的耦合或直接耦合或通信连接可以是通过一些接口,装置或单元的间接耦合或通信连接,可以是电性,机械或其它的形式。
所述作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部单元来实现本实施例方案的目的。
另外,在本申请各个实施例中的各功能单元可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以两个或两个以上单元集成在一个单元中。
所述功能如果以软件功能单元的形式实现并作为独立的产品销售或使用时,可以存储 在一个计算机可读取存储介质中。基于这样的理解,本申请的技术方案本质上或者说对现有技术做出贡献的部分或者该技术方案的部分可以以软件产品的形式体现出来,该计算机软件产品存储在一个存储介质中,包括若干指令用以使得一台计算机设备(可以是个人计算机,服务器,或者网络设备等)执行本申请各个实施例所述方法的全部或部分步骤。而前述的存储介质包括:U盘、移动硬盘、只读存储器(ROM,Read-Only Memory)、随机存取存储器(RAM,Random Access Memory)、磁碟或者光盘等各种可以存储程序代码的介质。
最后,还需要说明的是,在本文中,诸如第一和第二等之类的关系术语仅仅用来将一个实体或者操作与另一个实体或操作区分开来,而不一定要求或者暗示这些实体或操作之间存在任何这种实际的关系或者顺序。而且,术语“包括”、“包含”或者其任何其他变体意在涵盖非排他性的包含,从而使得包括一系列要素的过程、方法、物品或者设备不仅包括那些要素,而且还包括没有明确列出的其他要素,或者是还包括为这种过程、方法、物品或者设备所固有的要素。在没有更多限制的情况下,由语句“包括一个……”限定的要素,并不排除在包括所述要素的过程、方法、物品或者设备中还存在另外的相同要素。
本说明书中各个实施例采用递进的方式描述,每个实施例重点说明的都是与其他实施例的不同之处,各个实施例之间相同相似部分互相参见即可。
对所公开的实施例的上述说明,使本领域专业技术人员能够实现或使用本申请。对这些实施例的多种修改对本领域的专业技术人员来说将是显而易见的,本文中所定义的一般原理可以在不脱离本申请的精神或范围的情况下,在其它实施例中实现。因此,本申请将不会被限制于本文所示的这些实施例,而是要符合与本文所公开的原理和新颖特点相一致的最宽的范围。

Claims (15)

  1. 一种语音识别方法,其特征在于,包括:
    获取语音信号和与所述语音信号同步采集的图像序列;所述图像序列中的图像为唇动相关区域的图像;
    以趋近于对所述语音信号去除噪声后的语音信息为获取方向,获取融合所述语音信号和所述图像序列的信息,作为融合信息;
    利用所述融合信息进行语音识别,得到所述语音信号的语音识别结果。
  2. 根据权利要求1所述的方法,其特征在于,获取融合信息,利用所述融合信息进行语音识别,得到所述语音信号的语音识别结果的过程,包括:
    利用多模态语音识别模型处理所述语音信号和所述图像序列,得到所述多模态语音识别模型输出的语音识别结果;
    其中,所述多模态语音识别模型具备以趋近于对所述语音信号去除噪声后的信息为获取方向,获取融合所述语音信号和所述图像序列的信息,作为融合信息;利用所述融合信息进行语音识别,得到所述语音信号的语音识别结果的能力。
  3. 根据权利要求2所述的方法,其特征在于,所述利用多模态语音识别模型处理所述语音信号和所述图像序列,得到所述多模态语音识别模型输出的语音识别结果,包括:
    以趋近于对所述语音信号去除噪声后的语音信息为获取方向,利用所述多模态语音识别模型的语音信息提取模块从所述语音信号中提取语音信息,利用所述多模态语音识别模型的图像特征提取模块从所述图像序列中提取图像特征序列;利用所述多模态语音识别模型的特征融合模块对所述语音信息和所述图像特征序列进行融合,获取融合所述语音信号和所述图像序列的融合特征;
    利用多模态语音识别模型的识别模块,基于所述融合特征进行语音识别,得到所述语音信号的语音识别结果。
  4. 根据权利要求3所述的方法,其特征在于,所述语音信息为N种,所述N为大于或等于1的正整数;所述利用所述多模态语音识别模型的语音信息提取模块从所述语音信号中提取语音信息,包括:
    利用所述多模态语音识别模型的语音信息提取模块,以提取的N种语音信息与对所述图像序列提取的图像特征序列融合后的特征趋近于对所述语音信号去除噪声后的一种语音信息为提取方向,从所述语音信号中提取N种语音信息;或者,
    若所述N大于1,则利用所述多模态语音识别模型的语音信息提取模块,以提取的每一种语音信息与对所述图像序列提取的图像特征序列融合后的特征趋近于对所述语音信号去除噪声后的该种语音信息为提取方向,从所述语音信号中提取N种语音信息。
  5. 根据权利要求4所述的方法,其特征在于,所述语音信息为声学特征和/或频谱图特征,所述以趋近于对所述语音信号去除噪声后的语音信息为融合方向,利用所述多模态语音识别模型的特征融合模块,对所述语音信息和所述图像特征序列进行融合,获取融合所述语音信号和所述图像序列的融合特征,包括:
    根据如下三种融合方式中的任意一种或任意两种的组合得到的融合特征获取融合所述语音信号和所述图像序列的融合特征:
    融合方式一:利用所述多模态语音识别模型的特征融合模块,以趋近于对所述语音信号去噪后的声学特征为融合方向,对所述声学特征和所述图像特征序列进行融合,得到融合方式一对应的融合特征;
    融合方式二:利用所述多模态语音识别模型的特征融合模块,以趋近于对所述语音信号去噪后的频谱图特征为融合方向,对所述频谱图特征和所述图像特征序列进行融合,得到融合方式二对应的融合特征;
    融合方式三:利用所述多模态语音识别模型的特征融合模块,以趋近于对所述语音信号去噪后的声学特征或频谱图特征为融合方向,对所述声学特征、所述频谱图特征和所述图像特征序列进行融合,得到融合方式三对应的融合特征。
  6. 根据权利要求2所述的方法,其特征在于,所述多模态语音识别模型的训练过程包括:
    分别获取训练样本中的无噪声语音信号的无噪声语音信息,和所述训练样本中包含所述无噪声语音信号的噪声语音信号的噪声语音信息;
    获取所述训练样本中的样本图像序列的样本图像特征序列;
    将所述噪声语音信息和所述样本图像特征序列进行融合,得到所述训练样本的融合特征;
    利用所述训练样本的融合特进行语音识别,得到所述训练样本对应的语音识别结果;
    以所述训练样本的融合特征趋近于所述无噪声语音信息,所述训练样本对应的语音识别结果趋近于所述训练样本的样本标签为目标,对所述多模态语音识别模型的参数进行更新。
  7. 根据权利要求6所述的方法,其特征在于,分别获取无噪声语音信息和噪声语音信息的过程,包括:
    利用所述多模态语音识别模型中的声学特征提取模块获取所述无噪声语音信号的无噪声声学特征和所述噪声语音信号的噪声声学特征;和/或,利用所述多模态语音识别模型中的频谱图特征提取模块获取所述无噪声语音信号的无噪声频谱图特征和所述噪声语音信号的噪声频谱图特征;
    所述声学特征提取模块的初始参数为,以语音信号及其对应的语音内容为训练数据训练好的语音识别模型中,用于对语音信号进行声学特征提取的特征提取模块的参数;
    所述频谱图特征提取模块的初始参数为,以语音信号及其对应的频谱图标签为训练数据训练好的语音分离模型中,用于对语音信号的频谱图进行特征提取的频谱图特征提取模块的参数。
  8. 根据权利要求6所述的方法,其特征在于,所述获取所述训练样本中的样本图像序列的样本图像特征序列,包括:
    利用所述多模态语音识别模型中的图像特征提取模块获取所述样本图像序列的样本图像特征序列;
    所述图像特征提取模块的初始参数为,以图像序列及其对应的发音内容为训练数据训练好的唇语识别模型中,用于对图像序列进行特征提取的图像特征提取模块的参数。
  9. 根据权利要求6所述的方法,其特征在于,训练所述多模态语音识别模型所使用的训练样本集合中,包括不同语种的训练样本,所述训练样本集合中各个语种的训练样本所占的比例随机确定,或为预置比例。
  10. 一种语音识别装置,其特征在于,包括:
    获取模块,用于获取语音信号和与所述语音信号同步采集的图像序列;所述图像序列中的图像为唇动相关区域的图像;
    特征提取模块,用于以趋近于对所述语音信号去除噪声后的语音信息为获取方向,获取融合所述语音信号和所述图像序列的信息,作为融合信息;
    识别模块,用于利用所述融合信息进行语音识别,得到所述语音信号的语音识别结果。
  11. 根据权利要求10所述的装置,其特征在于,所述特征提取模块具体用于:通过多模态语音识别模型以趋近于对所述语音信号去除噪声后的信息为获取方向,获取融合所述语音信号和所述图像序列的信息,作为融合信息;
    所述识别模块具体用于:通过所述多模态语音识别模型利用所述融合信息进行语音识别,得到所述语音信号的语音识别结果。
  12. 根据权利要求11所述的装置,其特征在于,所述特征提取模块具体用于:以趋近于对所述语音信号去除噪声后的语音信息为获取方向,利用所述多模态语音识别模型的语音信息提取模块从所述语音信号中提取语音信息,利用所述多模态语音识别模型的图像特征提取模块从所述图像序列中提取图像特征序列;利用所述多模态语音识别模型的特征融合模块,对所述语音信息和所述图像特征序列进行融合,获取融合所述语音信号和所述图像序列的融合特征;
    所述识别模块具体用于:利用所述多模态语音识别模型的识别模块,基于所述融合特征进行语音识别,得到所述语音信号的语音识别结果。
  13. 根据权利要求12所述的装置,其特征在于,所述语音信息为N种,所述N为大于或等于1的正整数;所述提取模块在利用所述多模态语音识别模型的语音信息提取模块从所述语音信号中提取语音信息时,具体用于:
    利用所述多模态语音识别模型的语音信息提取模块,以提取的N种语音信息与对所述图像序列提取的图像特征序列融合后的特征趋近于对所述语音信号去除噪声后的一种语音信息为提取方向,从所述语音信号中提取N种语音信息;或者,
    若所述N大于1,则利用所述多模态语音识别模型的语音信息提取模块,以提取的每一种语音信息与对所述图像序列提取的图像特征序列融合后的特征趋近于对所述语音信号去除噪声后的该种语音信息为提取方向,从所述语音信号中提取N种语音信息。
  14. 一种语音识别设备,其特征在于,包括存储器和处理器;
    所述存储器,用于存储程序;
    所述处理器,用于执行所述程序,实现如权利要求1-9中任一项所述的语音识别方法的各个步骤。
  15. 一种计算机可读存储介质,其上存储有计算机程序,其特征在于,所述计算机程序被处理器执行时,实现如权利要求1-9中任一项所述的语音识别方法的各个步骤。
PCT/CN2020/087115 2020-02-28 2020-04-27 语音识别方法、装置、设备及存储介质 WO2021169023A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010129952.9A CN111312217A (zh) 2020-02-28 2020-02-28 语音识别方法、装置、设备及存储介质
CN202010129952.9 2020-02-28

Publications (1)

Publication Number Publication Date
WO2021169023A1 true WO2021169023A1 (zh) 2021-09-02

Family

ID=71159496

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/087115 WO2021169023A1 (zh) 2020-02-28 2020-04-27 语音识别方法、装置、设备及存储介质

Country Status (2)

Country Link
CN (1) CN111312217A (zh)
WO (1) WO2021169023A1 (zh)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111883130A (zh) * 2020-08-03 2020-11-03 上海茂声智能科技有限公司 一种融合式语音识别方法、装置、系统、设备和存储介质
CN112786052B (zh) * 2020-12-30 2024-05-31 科大讯飞股份有限公司 语音识别方法、电子设备和存储装置
CN113470617B (zh) * 2021-06-28 2024-05-31 科大讯飞股份有限公司 语音识别方法以及电子设备、存储装置
CN117116253B (zh) * 2023-10-23 2024-01-12 摩尔线程智能科技(北京)有限责任公司 初始模型的训练方法、装置、语音识别方法及装置

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105389097A (zh) * 2014-09-03 2016-03-09 中兴通讯股份有限公司 一种人机交互装置及方法
CN106328156A (zh) * 2016-08-22 2017-01-11 华南理工大学 一种音视频信息融合的麦克风阵列语音增强系统及方法
US20170236516A1 (en) * 2016-02-16 2017-08-17 Carnegie Mellon University, A Pennsylvania Non-Profit Corporation System and Method for Audio-Visual Speech Recognition
CN109814718A (zh) * 2019-01-30 2019-05-28 天津大学 一种基于Kinect V2的多模态信息采集系统
CN110111783A (zh) * 2019-04-10 2019-08-09 天津大学 一种基于深度神经网络的多模态语音识别方法
CN110503957A (zh) * 2019-08-30 2019-11-26 上海依图信息技术有限公司 一种基于图像去噪的语音识别方法及装置
CN110544479A (zh) * 2019-08-30 2019-12-06 上海依图信息技术有限公司 一种去噪的语音识别方法及装置
CN110545396A (zh) * 2019-08-30 2019-12-06 上海依图信息技术有限公司 一种基于定位去噪的语音识别方法及装置

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10178301B1 (en) * 2015-06-25 2019-01-08 Amazon Technologies, Inc. User identification based on voice and face
CN109147763B (zh) * 2018-07-10 2020-08-11 深圳市感动智能科技有限公司 一种基于神经网络和逆熵加权的音视频关键词识别方法和装置

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105389097A (zh) * 2014-09-03 2016-03-09 中兴通讯股份有限公司 一种人机交互装置及方法
US20170236516A1 (en) * 2016-02-16 2017-08-17 Carnegie Mellon University, A Pennsylvania Non-Profit Corporation System and Method for Audio-Visual Speech Recognition
CN106328156A (zh) * 2016-08-22 2017-01-11 华南理工大学 一种音视频信息融合的麦克风阵列语音增强系统及方法
CN109814718A (zh) * 2019-01-30 2019-05-28 天津大学 一种基于Kinect V2的多模态信息采集系统
CN110111783A (zh) * 2019-04-10 2019-08-09 天津大学 一种基于深度神经网络的多模态语音识别方法
CN110503957A (zh) * 2019-08-30 2019-11-26 上海依图信息技术有限公司 一种基于图像去噪的语音识别方法及装置
CN110544479A (zh) * 2019-08-30 2019-12-06 上海依图信息技术有限公司 一种去噪的语音识别方法及装置
CN110545396A (zh) * 2019-08-30 2019-12-06 上海依图信息技术有限公司 一种基于定位去噪的语音识别方法及装置

Also Published As

Publication number Publication date
CN111312217A (zh) 2020-06-19

Similar Documents

Publication Publication Date Title
WO2021169023A1 (zh) 语音识别方法、装置、设备及存储介质
Zhou et al. Modality attention for end-to-end audio-visual speech recognition
Czyzewski et al. An audio-visual corpus for multimodal automatic speech recognition
US7454342B2 (en) Coupled hidden Markov model (CHMM) for continuous audiovisual speech recognition
US7472063B2 (en) Audio-visual feature fusion and support vector machine useful for continuous speech recognition
US9123347B2 (en) Apparatus and method for eliminating noise
Kinnunen et al. Voice activity detection using MFCC features and support vector machine
TW502249B (en) Segmentation approach for speech recognition systems
JP6501260B2 (ja) 音響処理装置及び音響処理方法
JP4220449B2 (ja) インデキシング装置、インデキシング方法およびインデキシングプログラム
US10748544B2 (en) Voice processing device, voice processing method, and program
WO2014117547A1 (en) Method and device for keyword detection
JP6464005B2 (ja) 雑音抑圧音声認識装置およびそのプログラム
WO2022121155A1 (zh) 基于元学习的自适应语音识别方法、装置、设备及介质
JP2011191423A (ja) 発話認識装置、発話認識方法
Almajai et al. Using audio-visual features for robust voice activity detection in clean and noisy speech
CN111554279A (zh) 一种基于Kinect的多模态人机交互系统
Saenko et al. Articulatory features for robust visual speech recognition
WO2023035969A1 (zh) 语音与图像同步性的衡量方法、模型的训练方法及装置
CN111027675B (zh) 一种多媒体播放设置自动调节方法及系统
Maka et al. An analysis of the influence of acoustical adverse conditions on speaker gender identification
JP2011033879A (ja) サンプルを用いずあらゆる言語を識別可能な識別方法
Bagi et al. Improved recognition rate of language identification system in noisy environment
Karpagavalli et al. A hierarchical approach in tamil phoneme classification using support vector machine
Verma et al. Performance analysis of speaker identification using gaussian mixture model and support vector machine

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20921398

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20921398

Country of ref document: EP

Kind code of ref document: A1

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 03.02.2023)

122 Ep: pct application non-entry in european phase

Ref document number: 20921398

Country of ref document: EP

Kind code of ref document: A1