WO2021169023A1 - Speech recognition method, device, equipment, and storage medium - Google Patents
Speech recognition method, device, equipment, and storage medium
- Publication number
- WO2021169023A1 (PCT/CN2020/087115; application CN2020087115W)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- feature
- fusion
- voice
- image
- sequence
- Prior art date
Classifications
- G10L15/063 — Speech recognition; creation of reference templates; training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/02 — Speech recognition; feature extraction for speech recognition; selection of recognition unit
- G10L15/25 — Speech recognition using non-acoustical features: position of the lips, movement of the lips or face analysis
- G10L15/26 — Speech recognition; speech to text systems
- G10L21/0208 — Speech enhancement, e.g. noise reduction or echo cancellation; noise filtering
(All classes fall under G: Physics; G10: Musical instruments, acoustics; G10L: Speech analysis techniques or speech synthesis, speech recognition, speech or voice processing techniques, speech or audio coding or decoding.)
Definitions
- This application relates to the field of natural language processing technology, and more specifically, to a speech recognition method, device, equipment, and storage medium.
- Traditional speech recognition is single-modality speech recognition, that is, the recognition result is obtained by processing only the speech signal.
- This approach can already achieve a high recognition rate when the speech is clean.
- In a noisy environment, however, the recognition rate of traditional speech recognition drops rapidly.
- Existing multi-modal speech recognition methods use lip-motion video for lip reading and then determine the final speech recognition result from the lip-reading result and the accuracy of the single-modality speech recognition result; their recognition performance is still limited.
- In view of this, the present application provides a speech recognition method, device, equipment, and storage medium to improve the recognition rate of multi-modal speech recognition.
- A speech recognition method includes: acquiring a voice signal and an image sequence collected synchronously with the voice signal, where the images in the image sequence are images of a region related to lip movement; taking the voice information obtained after removing noise from the voice signal as the acquisition direction, obtaining information that fuses the voice signal and the image sequence as fusion information; and performing speech recognition with the fusion information to obtain the speech recognition result of the voice signal.
- a speech recognition device includes:
- the acquisition module is used to acquire a voice signal and an image sequence collected synchronously with the voice signal; the image in the image sequence is an image of a region related to lip movement;
- the feature extraction module is configured to obtain, as the fusion information, information that fuses the voice signal and the image sequence, taking the voice information obtained after removing noise from the voice signal as the acquisition direction;
- the recognition module is used to perform voice recognition using the fusion information to obtain the voice recognition result of the voice signal.
- a speech recognition device includes a memory and a processor, where:
- the memory is used to store programs
- the processor is configured to execute the program to implement each step of the voice recognition method described in any one of the above.
- a computer-readable storage medium has a computer program stored thereon, and when the computer program is executed by a processor, each step of the speech recognition method as described in any one of the above is realized.
- With the speech recognition method, device, equipment, and storage medium provided by the embodiments of this application, after the voice signal and the image sequence collected synchronously with it are acquired, the voice information obtained after removing noise from the voice signal is taken as the acquisition direction, and information that fuses the voice signal and the image sequence is obtained as fusion information; the fusion information is then used for speech recognition to obtain the speech recognition result of the voice signal.
- Because the fusion information is obtained with the denoised voice information as the acquisition direction, the obtained fusion information approaches the voice information of a noise-free voice signal, which reduces the interference of noise in the voice signal on recognition and thereby improves the recognition rate.
- FIG. 1 is an implementation flowchart of the speech recognition method disclosed in an embodiment of this application;
- FIG. 2 is a schematic structural diagram of a multi-modal speech recognition model disclosed in an embodiment of this application;
- FIG. 3 is a schematic structural diagram of a fusion feature acquisition module disclosed in an embodiment of this application;
- FIG. 4a is a schematic diagram of an architecture for training a multi-modal speech recognition model disclosed in an embodiment of this application;
- FIG. 4b is an implementation flowchart of training a multi-modal speech recognition model disclosed in an embodiment of this application;
- FIG. 5a is a schematic diagram of an architecture for training a multi-modal speech recognition model disclosed in an embodiment of this application;
- FIG. 5b is an implementation flowchart of training a multi-modal speech recognition model disclosed in an embodiment of this application;
- FIG. 6a is a schematic diagram of an architecture for training a multi-modal speech recognition model disclosed in an embodiment of this application;
- FIG. 6b is an implementation flowchart of training a multi-modal speech recognition model disclosed in an embodiment of this application;
- FIG. 7a is a schematic diagram of an architecture for training a multi-modal speech recognition model disclosed in an embodiment of this application;
- FIG. 7b is an implementation flowchart of training a multi-modal speech recognition model disclosed in an embodiment of this application;
- FIG. 8a is a schematic diagram of another architecture for training a multi-modal speech recognition model disclosed in an embodiment of this application;
- FIG. 8b is another implementation flowchart of training a multi-modal speech recognition model disclosed in an embodiment of this application;
- FIG. 9a is a schematic diagram of another architecture for training a multi-modal speech recognition model disclosed in an embodiment of this application;
- FIG. 9b is another implementation flowchart of training a multi-modal speech recognition model disclosed in an embodiment of this application;
- FIG. 10a is a schematic diagram of another architecture for training a multi-modal speech recognition model disclosed in an embodiment of this application;
- FIG. 10b is another implementation flowchart of training a multi-modal speech recognition model disclosed in an embodiment of this application;
- FIG. 11a is a schematic diagram of another architecture for training a multi-modal speech recognition model disclosed in an embodiment of this application;
- FIG. 11b is another implementation flowchart of training a multi-modal speech recognition model disclosed in an embodiment of this application;
- FIG. 12a is a schematic diagram of another architecture for training a multi-modal speech recognition model disclosed in an embodiment of this application;
- FIG. 12b is another implementation flowchart of training a multi-modal speech recognition model disclosed in an embodiment of this application;
- FIG. 13a is a schematic diagram of another architecture for training a multi-modal speech recognition model disclosed in an embodiment of this application;
- FIG. 13b is another implementation flowchart of training a multi-modal speech recognition model disclosed in an embodiment of this application;
- FIG. 14 is a schematic structural diagram of a speech recognition device disclosed in an embodiment of this application;
- FIG. 15 is a block diagram of the hardware structure of a speech recognition device disclosed in an embodiment of this application.
- The inventor of the present application found that current multi-modal speech recognition methods that assist recognition with lip-motion video compare the accuracy of the lip-reading result with that of the single-modality speech recognition result and take the more accurate of the two as the final speech recognition result, which improves the recognition rate only to a certain extent.
- In essence, such multi-modal speech recognition relies on the ability of the lip-reading result to correct the speech recognition result; it does not exploit the ability of the video signal to correct a highly noisy speech signal, so it is difficult to obtain high-quality recognition results.
- The basic idea of this application is therefore to explicitly introduce noise reduction into the multi-modal speech recognition task, so as to better exploit the corrective effect of the video information on the speech information and achieve a better recognition result.
- An implementation flowchart of the speech recognition method provided by an embodiment of the present application is shown in FIG. 1, and may include:
- Step S11 Acquire a voice signal and an image sequence collected synchronously with the voice signal; the image in the image sequence is an image of a region related to lip movement.
- While the voice signal is collected, the face video of the speaker is also collected synchronously.
- The above-mentioned image sequence is obtained by cropping the lip-motion-related region from each frame of the speaker's face video.
- The cropped region may be a region of fixed size (for example, 80×80 pixels).
- The lip-movement-related area may refer to only the lip area; or,
- it may be the lips and their surrounding areas, such as the lips and chin area; or,
- it may be the entire face area. A minimal cropping sketch is given below.
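- As an illustration only (not part of the patent), the following sketch crops a fixed 80×80 lip-related region from each video frame; the lip bounding boxes are assumed to come from an external face or landmark detector.

```python
# Minimal sketch: build the image sequence by cropping a fixed-size
# lip-related region from each frame and resizing it to 80x80.
import cv2
import numpy as np

def crop_lip_regions(frames, lip_boxes, size=80):
    """frames: list of HxWx3 uint8 images; lip_boxes: list of (x, y, w, h) boxes."""
    sequence = []
    for frame, (x, y, w, h) in zip(frames, lip_boxes):
        roi = frame[y:y + h, x:x + w]          # cut out the lip-movement-related region
        roi = cv2.resize(roi, (size, size))    # normalize to a fixed 80x80 size
        sequence.append(roi)
    return np.stack(sequence)                  # (num_frames, 80, 80, 3)
```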
- Step S12 Taking the voice information obtained after removing noise from the voice signal as the acquisition direction, obtain information that fuses the voice signal and the image sequence as the fusion information.
- The voice information obtained after removing noise from the voice signal may refer to information extracted from the noise-reduced voice signal obtained by performing noise removal processing on the voice signal.
- Obtaining fusion information that approaches the voice information of the noise-reduced voice signal is therefore equivalent to performing noise reduction on the voice signal.
- Step S13 Use the fusion information to perform voice recognition to obtain a voice recognition result of the voice signal.
- the use of fusion information for voice recognition reduces the interference of the noise in the voice signal on the voice recognition, thereby improving the accuracy of the voice recognition.
- a multi-modal speech recognition model may be used to obtain the fusion information, and the fusion information may be used for speech recognition to obtain the speech recognition result of the speech signal.
- the multi-modal speech recognition model can be used to process speech signals and image sequences to obtain the speech recognition results output by the multi-modal speech recognition model;
- the multi-modal speech recognition model has the ability to obtain, with the voice information of the denoised speech signal as the acquisition direction, information that fuses the speech signal and the image sequence as fusion information, and to use the fusion information for speech recognition to obtain the speech recognition result of the speech signal.
- As shown in FIG. 2, a schematic structural diagram of the multi-modal speech recognition model provided by this embodiment of the application may include:
- the fusion feature acquisition module 21 is used to acquire a fusion feature that fuses the voice signal and the image sequence, taking the voice information obtained after removing noise from the voice signal as the acquisition direction.
- the recognition module 22 is configured to perform voice recognition based on the fusion feature acquired by the fusion feature acquisition module 21 to obtain a voice recognition result of the voice signal.
- the foregoing process of using the multi-modal speech recognition model to process the speech signal and image sequence to obtain the speech recognition result output by the multi-modal speech recognition model can be as follows:
- use the fusion feature acquisition module 21 of the multi-modal speech recognition model to obtain, with the voice information of the denoised voice signal as the acquisition direction, the fusion feature that fuses the voice signal and the image sequence;
- then perform speech recognition based on the fusion feature acquired by the fusion feature acquisition module 21 to obtain the speech recognition result of the speech signal. A minimal code sketch of this two-stage flow is given below.
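- As an illustration only, the following PyTorch sketch wires the two stages together; the concrete fusion and recognition sub-networks are placeholders supplied by the caller.

```python
# Minimal sketch of the two-stage flow: fusion feature acquisition (module 21)
# followed by recognition (module 22).
import torch.nn as nn

class MultiModalSpeechRecognizer(nn.Module):
    def __init__(self, fusion_feature_module: nn.Module, recognition_module: nn.Module):
        super().__init__()
        self.fusion_feature_module = fusion_feature_module  # module 21: fuses speech and lip images
        self.recognition_module = recognition_module        # module 22: decodes text from fusion features

    def forward(self, speech, image_sequence):
        fusion_feature = self.fusion_feature_module(speech, image_sequence)
        return self.recognition_module(fusion_feature)
```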
- A schematic structural diagram of the fusion feature acquisition module 21 is shown in FIG. 3, and may include:
- a voice information extraction module 31, an image feature extraction module 32, and a feature fusion module 33; among them,
- the voice information extraction module 31 is used to extract voice information from the voice signal, taking as the extraction direction that the feature obtained by fusing the extracted voice information with the image feature sequence extracted from the image sequence by the image feature extraction module 32 approaches the voice information of the denoised voice signal.
- In other words, when the voice information extraction module 31 extracts voice information from the voice signal, the extraction direction is that the fusion of the extracted voice information with the image feature sequence extracted from the image sequence by the image feature extraction module 32 approaches the voice information of the denoised voice signal.
- the image feature extraction module 32 is used to extract the image feature sequence from the image sequence, taking as the extraction direction that the feature obtained by fusing the extracted image feature sequence with the voice information extracted from the voice signal by the voice information extraction module 31 approaches the voice information of the denoised voice signal.
- In other words, when the image feature extraction module 32 extracts the image feature sequence from the image sequence, the extraction direction is that the fusion of the extracted image feature sequence with the voice information extracted from the voice signal by the voice information extraction module 31 approaches the voice information of the denoised voice signal.
- the feature fusion module 33 is used to fuse the extracted voice information and the image feature sequence, taking the voice information of the denoised voice signal as the fusion direction, to obtain the fusion feature.
- In other words, the fusion of the extracted voice information and the image feature sequence is performed such that the fused feature approaches the voice information of the denoised voice signal.
- One implementation of using the fusion feature acquisition module 21 to acquire, with the denoised voice information as the acquisition direction, the fusion feature that fuses the voice signal and the image sequence can be:
- taking the voice information of the denoised voice signal as the acquisition direction, use the voice information extraction module 31 to extract voice information from the voice signal and the image feature extraction module 32 to extract the image feature sequence from the image sequence; then use the feature fusion module 33 to fuse the extracted voice information and image feature sequence to obtain the fusion feature that fuses the voice signal and the image sequence.
- That is, the voice information extraction module 31 is used to extract voice information from the voice signal,
- the image feature extraction module 32 is used to extract the image feature sequence from the image sequence,
- and the feature fusion module 33 is used to fuse the extracted voice information and the image feature sequence to obtain the fusion feature. A minimal sketch of such a module is given below.
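- As an illustration only, the following sketch assembles the three sub-modules under simple assumptions: the speech stream and the image stream are already framed into time-aligned sequences, the per-frame image embeddings (e.g., from a CNN) have dimension image_dim, and fusion is a concatenation followed by a linear projection.

```python
# Minimal sketch of the fusion feature acquisition module 21:
# voice encoder (module 31), image encoder (module 32), fusion (module 33).
import torch
import torch.nn as nn

class FusionFeatureAcquisition(nn.Module):
    def __init__(self, speech_dim=80, image_dim=256, hidden_dim=256):
        super().__init__()
        self.voice_encoder = nn.GRU(speech_dim, hidden_dim, batch_first=True)  # module 31 (sketch)
        self.image_encoder = nn.GRU(image_dim, hidden_dim, batch_first=True)   # module 32 (sketch)
        self.fusion = nn.Linear(2 * hidden_dim, hidden_dim)                    # module 33 (sketch)

    def forward(self, speech_feats, image_feats):
        # speech_feats: (B, T, speech_dim); image_feats: (B, T, image_dim), time-aligned
        voice_info, _ = self.voice_encoder(speech_feats)
        image_seq, _ = self.image_encoder(image_feats)
        fused = torch.cat([voice_info, image_seq], dim=-1)  # fuse the two streams frame by frame
        return self.fusion(fused)                           # fusion feature, (B, T, hidden_dim)
```

- During training, the fusion feature produced by such a module would be pulled toward the voice information of the denoised voice signal, as described in the training embodiments below.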
- the voice information extracted from the voice signal may be N types, and N is a positive integer greater than or equal to 1. Then, the foregoing process of extracting voice information from the voice signal by using the voice information extraction module 31 may include any one of the following two extraction methods:
- Extraction method 1: Use the voice information extraction module 31 to extract the N types of voice information from the voice signal, taking as the extraction direction that the feature obtained by fusing the extracted N types of voice information with the image feature sequence extracted from the image sequence by the image feature extraction module 32 approaches one type of voice information of the denoised voice signal.
- In other words, the extraction direction is that the fused feature approaches one type of voice information of the denoised voice signal.
- the specific implementation of the first extraction method can be:
- For a target type of voice information, the voice information extraction module 31 extracts the target-type voice information from the voice signal, taking as the extraction direction that the feature obtained by fusing the extracted target-type voice information with the image feature sequence extracted from the image sequence by the image feature extraction module 32 approaches the target-type voice information of the denoised voice signal.
- Alternatively, the voice information extraction module 31 extracts the N types of voice information from the voice signal, taking as the extraction direction that the feature obtained by fusing the extracted N types of voice information with the image feature sequence extracted from the image sequence approaches one type of voice information of the denoised voice signal.
- For example, suppose there are two types of extracted voice information, type A voice information and type B voice information. Then, in the embodiment of the present application,
- the voice information extraction module 31 can be used to extract the type-A voice information and the type-B voice information from the voice signal, taking as the extraction direction that the feature obtained by fusing the extracted type-A and type-B voice information with the image feature sequence extracted from the image sequence by the image feature extraction module 32 approaches one type of voice information of the denoised voice signal.
- Extraction method 2: If N is greater than 1, use the voice information extraction module 31 to extract the N types of voice information from the voice signal, taking as the extraction direction that, for each type of extracted voice information, the feature obtained by fusing that voice information with the image feature sequence extracted from the image sequence by the image feature extraction module 32 approaches the same type of voice information of the denoised voice signal.
- In other words, for each type of voice information, the extraction direction is that the feature obtained by fusing the extracted voice information of that type with the image feature sequence approaches the corresponding type of voice information of the denoised voice signal; the N types of voice information are extracted from the voice signal accordingly.
- Here, fusing the voice information with the image feature sequence may mean fusing the voice information with only the image feature sequence, or fusing the voice information with the image feature sequence together with the fusion features of other extracted voice information.
- The voice information extracted from the voice signal may be only acoustic features (for example, fbank features or Mel-frequency cepstral coefficient (MFCC) features), only spectrogram features, or both acoustic features and spectrogram features. A minimal extraction sketch is given below.
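- As an illustration only, the following sketch computes the kinds of voice information named above with librosa (a library choice the patent does not mandate), assuming a 16 kHz signal, a 25 ms window (400 samples), and a 10 ms frame shift (160 samples); the file name is hypothetical.

```python
# Minimal sketch: fbank (log-mel filterbank) features, MFCC features,
# and an STFT magnitude spectrogram for one utterance.
import librosa
import numpy as np

y, sr = librosa.load("utterance.wav", sr=16000)   # hypothetical input file

# fbank: log-mel filterbank energies (an acoustic feature)
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=400, hop_length=160, n_mels=40)
fbank = librosa.power_to_db(mel)                  # shape: (40, num_frames)

# MFCC: Mel-frequency cepstral coefficients (an acoustic feature)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=400, hop_length=160)

# Spectrogram: magnitude of the short-time Fourier transform
spectrogram = np.abs(librosa.stft(y, n_fft=400, hop_length=160))  # shape: (201, num_frames)
```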
- Taking the voice information of the denoised voice signal as the fusion direction, the process of using the feature fusion module 33 to fuse the voice information and the image feature sequence to obtain the fusion feature that fuses the voice signal and the image sequence may include:
- Fusion method 1: Use the feature fusion module 33 to fuse the extracted acoustic features and the image feature sequence, taking the acoustic features of the denoised speech signal as the fusion direction, to obtain the fusion feature corresponding to fusion method 1;
- Fusion method 2: Use the feature fusion module 33 to fuse the extracted spectrogram features and the image feature sequence, taking the spectrogram features of the denoised speech signal as the fusion direction, to obtain the fusion feature corresponding to fusion method 2;
- Fusion method 3: Use the feature fusion module 33 to fuse the extracted acoustic features, spectrogram features, and image feature sequence, taking the acoustic features or spectrogram features of the denoised speech signal as the fusion direction, to obtain the fusion feature corresponding to fusion method 3.
- If the fusion feature of the voice signal and the image sequence is obtained by a single fusion method, the fusion feature corresponding to that fusion method is the fusion feature of the voice signal and the image sequence.
- That is, if it is obtained by fusion method 1, the fusion feature corresponding to fusion method 1 is the fusion feature of the voice signal and the image sequence; if it is obtained by fusion method 2, the fusion feature corresponding to fusion method 2 is the fusion feature of the voice signal and the image sequence; likewise, if it is obtained by fusion method 3, the fusion feature corresponding to fusion method 3 is the fusion feature of the voice signal and the image sequence.
- Alternatively, the fusion feature corresponding to fusion method 1 and the fusion feature corresponding to fusion method 2 may be further fused to obtain the fusion feature of the voice signal and the image sequence.
- the process of extracting the voice information and obtaining the fusion feature of the fused voice signal and image sequence will be explained by taking the voice information as the acoustic feature and/or the spectrogram feature as an example.
- When the target type of voice information is an acoustic feature, the process of using the voice information extraction module 31 to extract the target-type voice information from the voice signal can specifically be:
- using the voice information extraction module 31 to extract acoustic features from the voice signal, taking as the extraction direction that the feature obtained by fusing the extracted acoustic features with the image feature sequence extracted from the image sequence by the image feature extraction module 32 approaches the acoustic features of the denoised voice signal.
- Specifically, an acoustic feature extraction module of the voice information extraction module 31 can be used to extract the acoustic features from the voice signal. That is to say, in the embodiment of the present application, the voice information extraction module 31 includes an acoustic feature extraction module that extracts acoustic features from the voice signal, taking as the extraction direction that the feature obtained by fusing the extracted acoustic features with the image feature sequence extracted from the image sequence by the image feature extraction module 32 approaches the acoustic features of the denoised voice signal.
- In this case, the voice signal input to the multi-modal voice recognition model may be an acoustic feature extracted from the original voice signal (that is, the voice signal collected by the audio collection device) through a sliding window (for ease of description, recorded as the initial acoustic feature), and the acoustic feature extracted from the voice signal by the voice information extraction module 31 may be a hidden-layer feature of the initial acoustic feature.
- Correspondingly, one implementation of using the feature fusion module 33 to fuse the voice information and the image feature sequence, taking the voice information of the denoised voice signal as the fusion direction, to obtain the fusion feature can be:
- using the feature fusion module 33 to fuse the extracted acoustic features and the image feature sequence, taking the acoustic features of the denoised voice signal as the fusion direction, to obtain the fusion feature of the voice signal and the image sequence.
- When the target type of voice information is a spectrogram feature, the process of using the voice information extraction module 31 to extract the target-type voice information from the voice signal may specifically include:
- using the voice information extraction module 31 to extract spectrogram features from the voice signal, taking as the extraction direction that the feature obtained by fusing the extracted spectrogram features with the image feature sequence extracted from the image sequence by the image feature extraction module 32 approaches the spectrogram features of the denoised voice signal.
- Specifically, a spectrogram feature extraction module of the voice information extraction module 31 can be used to extract the spectrogram features from the voice signal. That is to say, in the embodiment of the present application, the voice information extraction module 31 includes a spectrogram feature extraction module that extracts spectrogram features from the voice signal, taking as the extraction direction that the feature obtained by fusing the extracted spectrogram features with the image feature sequence extracted from the image sequence by the image feature extraction module 32 approaches the spectrogram features of the denoised voice signal.
- In this case, the speech signal input to the multi-modal speech recognition model may be a spectrogram obtained by performing a short-time Fourier transform on the original speech signal, and the spectrogram feature extracted from the speech signal by the speech information extraction module 31 may be a hidden-layer feature of the spectrogram.
- Correspondingly, another implementation of using the feature fusion module 33 to fuse the voice information and the image feature sequence, taking the voice information of the denoised voice signal as the fusion direction, to obtain the fusion feature can be:
- using the feature fusion module 33 to fuse the extracted spectrogram features and the image feature sequence, taking the spectrogram features of the denoised voice signal as the fusion direction, to obtain the fusion feature of the voice signal and the image sequence.
- one implementation manner of extracting two types of voice information from the voice signal by using the voice information extraction module 31 may be:
- use the voice information extraction module 31 to extract the spectrogram features and acoustic features from the voice signal, taking as the extraction direction that the feature obtained by fusing the extracted spectrogram features, the extracted acoustic features, and the image feature sequence extracted from the image sequence by the image feature extraction module 32 approaches the spectrogram features of the denoised voice signal.
- the acoustic feature extraction module of the voice information extraction module 31 may be used to extract acoustic features from the voice signal
- the spectrogram feature extraction module of the voice information extraction module 31 may be used to extract the spectrogram features from the voice signal.
- That is to say, the voice information extraction module 31 includes an acoustic feature extraction module and a spectrogram feature extraction module, where the acoustic feature extraction module is used to extract acoustic features from the speech signal, taking as the extraction direction that the feature obtained by fusing the extracted acoustic features, the spectrogram features extracted from the speech signal by the spectrogram feature extraction module, and the image feature sequence extracted from the image sequence by the image feature extraction module 32 approaches the spectrogram features of the denoised speech signal;
- and the spectrogram feature extraction module is used to extract spectrogram features from the speech signal, taking as the extraction direction that the feature obtained by fusing the extracted spectrogram features, the acoustic features extracted from the speech signal by the acoustic feature extraction module, and the image feature sequence extracted from the image sequence by the image feature extraction module 32 approaches the spectrogram features of the denoised speech signal.
- Correspondingly, one implementation of using the feature fusion module 33 to fuse the voice information and the image feature sequence, taking the voice information of the denoised voice signal as the fusion direction, to obtain the fusion feature can be:
- taking the spectrogram features of the denoised speech signal as the fusion direction, fuse the extracted spectrogram features and the first fusion feature to obtain the fusion feature of the voice signal and the image sequence.
- another implementation manner of extracting the two types of voice information from the voice signal by using the voice information extraction module 31 may be:
- use the voice information extraction module 31 to extract the spectrogram features and acoustic features from the voice signal, taking as the extraction direction that the feature obtained by fusing the extracted spectrogram features, the extracted acoustic features, and the image feature sequence extracted from the image sequence by the image feature extraction module 32 approaches the acoustic features of the denoised voice signal.
- the acoustic feature extraction module of the voice information extraction module 31 may be used to extract acoustic features from the voice signal
- the spectrogram feature extraction module of the voice information extraction module 31 may be used to extract the spectrogram features from the voice signal. That is to say, in the embodiment of the present application, the voice information extraction module 31 includes an acoustic feature extraction module and a spectrogram feature extraction module, where:
- the acoustic feature extraction module is used to extract acoustic features from the speech signal, taking as the extraction direction that the feature obtained by fusing the extracted acoustic features, the spectrogram features extracted from the voice signal by the spectrogram feature extraction module, and the image feature sequence extracted from the image sequence by the image feature extraction module approaches the acoustic features of the denoised voice signal;
- the spectrogram feature extraction module is used to extract spectrogram features from the speech signal, taking as the extraction direction that the feature obtained by fusing the extracted spectrogram features, the acoustic features extracted from the speech signal by the acoustic feature extraction module, and the image feature sequence extracted from the image sequence by the image feature extraction module approaches the acoustic features of the denoised speech signal.
- Correspondingly, one implementation of using the feature fusion module 33 to fuse the extracted voice information and the image feature sequence, taking the voice information of the denoised voice signal as the fusion direction, to obtain the fusion feature can be:
- using the second feature fusion module of the feature fusion module 33 to fuse the extracted acoustic features and the second fusion feature, taking the acoustic features of the denoised voice signal as the fusion direction, to obtain the fusion feature of the voice signal and the image sequence.
- Another implementation of using the voice information extraction module 31 to extract the two types of voice information from the voice signal may be:
- use the voice information extraction module 31 to extract the spectrogram features and acoustic features from the voice signal, taking as the extraction direction that the feature obtained by fusing the extracted spectrogram features, the extracted acoustic features, and the image feature sequence extracted from the image sequence by the image feature extraction module 32 approaches the acoustic features of the denoised voice signal, and that the feature obtained by fusing the extracted acoustic features and the image feature sequence also approaches the acoustic features of the denoised voice signal.
- the acoustic feature extraction module of the voice information extraction module 31 may be used to extract acoustic features from the voice signal
- the spectrogram feature extraction module of the voice information extraction module 31 may be used to extract the spectrogram features from the voice signal. That is to say, in the embodiment of the present application, the voice information extraction module 31 includes an acoustic feature extraction module and a spectrogram feature extraction module, where:
- the acoustic feature extraction module is used to extract acoustic features from the voice signal, taking as the extraction direction that the feature obtained by fusing the extracted acoustic features, the spectrogram features extracted from the speech signal by the spectrogram feature extraction module, and the image feature sequence extracted from the image sequence by the image feature extraction module 32 approaches the acoustic features of the denoised voice signal;
- the spectrogram feature extraction module is used to extract spectrogram features from the voice signal, taking as the extraction direction that the feature obtained by fusing the extracted spectrogram features, the acoustic features extracted from the speech signal by the acoustic feature extraction module, and the image feature sequence extracted from the image sequence by the image feature extraction module 32 approaches the acoustic features of the denoised voice signal.
- Correspondingly, one implementation of using the feature fusion module 33 to fuse the voice information and the image feature sequence, taking the voice information of the denoised voice signal as the fusion direction, to obtain the fusion feature can be:
- using the second feature fusion module of the feature fusion module 33 to fuse the extracted acoustic features and the second fusion feature, taking the acoustic features of the denoised voice signal as the fusion direction, to obtain the fusion feature of the voice signal and the image sequence.
- Another implementation of using the voice information extraction module 31 to extract the two types of voice information from the voice signal may be:
- use the voice information extraction module 31 to extract the spectrogram features and acoustic features from the voice signal, taking as the extraction direction that the feature obtained by fusing the extracted spectrogram features, the extracted acoustic features, and the image feature sequence extracted from the image sequence by the image feature extraction module 32 approaches the spectrogram features of the denoised voice signal, and that the feature obtained by fusing the extracted spectrogram features and the image feature sequence also approaches the spectrogram features of the denoised voice signal.
- the acoustic feature extraction module of the voice information extraction module 31 may be used to extract acoustic features from the voice signal
- the spectrogram feature extraction module of the voice information extraction module 31 may be used to extract the spectrogram features from the voice signal. That is to say, in the embodiment of the present application, the voice information extraction module 31 includes an acoustic feature extraction module and a spectrogram feature extraction module, where:
- the acoustic feature extraction module is used to extract acoustic features from the voice signal, taking as the extraction direction that the feature obtained by fusing the extracted acoustic features, the spectrogram features extracted from the speech signal by the spectrogram feature extraction module, and the image feature sequence extracted from the image sequence by the image feature extraction module 32 approaches the spectrogram features of the denoised voice signal;
- the spectrogram feature extraction module is used to extract spectrogram features from the voice signal, taking as the extraction direction that the feature obtained by fusing the extracted spectrogram features, the acoustic features extracted from the speech signal by the acoustic feature extraction module, and the image feature sequence extracted from the image sequence by the image feature extraction module 32 approaches the spectrogram features of the denoised voice signal, and that the feature obtained by fusing the extracted spectrogram features and the image feature sequence also approaches the spectrogram features of the denoised voice signal.
- Correspondingly, one implementation of using the feature fusion module 33 to fuse the voice information and the image feature sequence, taking the voice information of the denoised voice signal as the fusion direction, to obtain the fusion feature can be:
- using the No. 5 feature fusion module of the feature fusion module 33 to fuse the extracted spectrogram features and the first fusion feature, taking the spectrogram features of the denoised voice signal as the fusion direction, to obtain the fusion feature of the voice signal and the image sequence.
- Another implementation of using the voice information extraction module 31 to extract the two types of voice information from the voice signal may be:
- use the voice information extraction module 31 to extract the spectrogram features and acoustic features from the voice signal, taking as the extraction direction that the feature obtained by fusing the extracted spectrogram features and the image feature sequence extracted from the image sequence by the image feature extraction module 32 approaches the spectrogram features of the denoised voice signal, and that the feature obtained by fusing the extracted acoustic features and the image feature sequence approaches the acoustic features of the denoised voice signal.
- the acoustic feature extraction module of the voice information extraction module 31 may be used to extract acoustic features from the voice signal
- the spectrogram feature extraction module of the voice information extraction module 31 may be used to extract the spectrogram features from the voice signal. That is to say, in the embodiment of the present application, the voice information extraction module 31 includes an acoustic feature extraction module and a spectrogram feature extraction module, where:
- the acoustic feature extraction module is used to extract acoustic features from the speech signal, taking as the extraction direction that the feature obtained by fusing the extracted acoustic features and the image feature sequence extracted from the image sequence by the image feature extraction module 32 approaches the acoustic features of the denoised speech signal;
- the spectrogram feature extraction module is used to extract spectrogram features from the speech signal, taking as the extraction direction that the feature obtained by fusing the extracted spectrogram features and the image feature sequence extracted from the image sequence by the image feature extraction module 32 approaches the spectrogram features of the denoised speech signal.
- Correspondingly, another implementation of using the feature fusion module 33 to fuse the voice information and the image feature sequence, taking the voice information of the denoised voice signal as the fusion direction, to obtain the fusion feature can be:
- taking the spectrogram features of the denoised voice signal as the fusion direction, fuse the extracted spectrogram features and the image feature sequence to obtain the second fusion feature;
- then use the No. 4 feature fusion module of the feature fusion module 33 to fuse the first fusion feature and the second fusion feature to obtain the fusion feature of the voice signal and the image sequence.
- Another implementation of using the voice information extraction module 31 to extract the two types of voice information from the voice signal may be:
- use the voice information extraction module 31 to extract the spectrogram features and acoustic features from the voice signal, taking as the extraction direction that the feature obtained by fusing the extracted acoustic features and the image feature sequence approaches the acoustic features of the denoised voice signal, and that the feature obtained by fusing the extracted spectrogram features, the extracted acoustic features, and the image feature sequence extracted from the image sequence by the image feature extraction module 32 approaches the spectrogram features of the denoised speech signal.
- the acoustic feature extraction module of the voice information extraction module 31 may be used to extract acoustic features from the voice signal
- the spectrogram feature extraction module of the voice information extraction module 31 may be used to extract the spectrogram features from the voice signal. That is to say, in the embodiment of the present application, the voice information extraction module 31 includes an acoustic feature extraction module and a spectrogram feature extraction module, where:
- the acoustic feature extraction module is used to extract acoustic features from the voice signal, taking as the extraction direction that the feature obtained by fusing the extracted acoustic features, the spectrogram features extracted from the speech signal by the spectrogram feature extraction module, and the image feature sequence extracted from the image sequence by the image feature extraction module 32 approaches the spectrogram features of the denoised voice signal, and that the feature obtained by fusing the extracted acoustic features and the image feature sequence approaches the acoustic features of the denoised voice signal;
- the spectrogram feature extraction module is used to extract spectrogram features from the voice signal, taking as the extraction direction that the feature obtained by fusing the extracted spectrogram features, the acoustic features extracted from the speech signal by the acoustic feature extraction module, and the image feature sequence extracted from the image sequence by the image feature extraction module 32 approaches the spectrogram features of the denoised voice signal.
- Correspondingly, one implementation of using the feature fusion module 33 to fuse the voice information and the image feature sequence, taking the voice information of the denoised voice signal as the fusion direction, to obtain the fusion feature can be:
- taking the spectrogram features of the denoised speech signal as the fusion direction, fuse the extracted spectrogram features and the first fusion feature obtained by the No. 3 feature fusion module to obtain the fusion feature of the voice signal and the image sequence.
- Another implementation of using the voice information extraction module 31 to extract the two types of voice information from the voice signal may be:
- use the voice information extraction module 31 to extract the spectrogram features and acoustic features from the voice signal, taking as the extraction direction that the feature obtained by fusing the extracted spectrogram features and the image feature sequence extracted from the image sequence by the image feature extraction module 32 approaches the spectrogram features of the denoised voice signal, and that the feature obtained by fusing the extracted spectrogram features, the extracted acoustic features, and the image feature sequence approaches the acoustic features of the denoised voice signal.
- the acoustic feature extraction module of the voice information extraction module 31 may be used to extract acoustic features from the voice signal
- the spectrogram feature extraction module of the voice information extraction module 31 may be used to extract the spectrogram features from the voice signal. That is to say, in the embodiment of the present application, the voice information extraction module 31 includes an acoustic feature extraction module and a spectrogram feature extraction module, where:
- the acoustic feature extraction module is used to extract acoustic features from the voice signal, taking as the extraction direction that the feature obtained by fusing the extracted acoustic features, the spectrogram features extracted from the speech signal by the spectrogram feature extraction module, and the image feature sequence extracted from the image sequence by the image feature extraction module 32 approaches the acoustic features of the denoised voice signal;
- the spectrogram feature extraction module is used to extract spectrogram features from the voice signal, taking as the extraction direction that the feature obtained by fusing the extracted spectrogram features, the acoustic features extracted from the speech signal by the acoustic feature extraction module, and the image feature sequence extracted from the image sequence by the image feature extraction module 32 approaches the acoustic features of the denoised voice signal, and that the feature obtained by fusing the extracted spectrogram features and the image feature sequence approaches the spectrogram features of the denoised voice signal.
- Correspondingly, one implementation of using the feature fusion module 33 to fuse the voice information and the image feature sequence, taking the voice information of the denoised voice signal as the fusion direction, to obtain the fusion feature can be:
- use the No. 1 feature fusion module of the feature fusion module 33 to fuse the extracted spectrogram features and the image feature sequence, taking the spectrogram features of the denoised speech signal as the fusion direction, to obtain the second fusion feature;
- use the second feature fusion module of the feature fusion module 33 to fuse the extracted acoustic features and the second fusion feature, taking the acoustic features of the denoised voice signal as the fusion direction, to obtain the fusion feature of the voice signal and the image sequence. A minimal sketch of this two-stage fusion is given below.
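- As an illustration only, the following sketch implements the two-stage fusion just described under the assumption that each feature fusion module is a concatenation followed by a linear projection; all feature streams are assumed to be time-aligned.

```python
# Minimal sketch of the cascaded fusion: stage 1 fuses spectrogram features with
# the image feature sequence (second fusion feature); stage 2 fuses acoustic
# features with the stage-1 result to give the final fusion feature.
import torch
import torch.nn as nn

class TwoStageFusion(nn.Module):
    def __init__(self, spec_dim, acoustic_dim, image_dim, hidden_dim=256):
        super().__init__()
        self.fusion_stage1 = nn.Linear(spec_dim + image_dim, hidden_dim)       # spectrogram + image
        self.fusion_stage2 = nn.Linear(acoustic_dim + hidden_dim, hidden_dim)  # acoustic + second fusion feature

    def forward(self, spec_feats, acoustic_feats, image_feats):
        # all inputs: (B, T, dim)
        second_fusion = self.fusion_stage1(torch.cat([spec_feats, image_feats], dim=-1))
        return self.fusion_stage2(torch.cat([acoustic_feats, second_fusion], dim=-1))
```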
- In this case, the voice signal input to the multi-modal voice recognition model may include the initial acoustic features extracted from the original voice signal through a sliding window and the spectrogram obtained by performing a short-time Fourier transform on the original voice signal;
- the acoustic features extracted from the voice signal by the voice information extraction module 31 may be hidden-layer features of the initial acoustic features,
- and the spectrogram features extracted from the voice signal may be hidden-layer features of the spectrogram.
- Refer to FIGS. 4a and 4b, where FIG. 4a is a schematic diagram of an architecture for training the multi-modal speech recognition model provided by an embodiment of this application, and FIG. 4b is an implementation flowchart of training the multi-modal speech recognition model, which may include:
- Step S41 Obtain, through the multi-modal speech recognition model, the noise-free speech information (i.e., the clear speech information in FIG. 4a) of the noise-free speech signal (also called the clear speech signal) in the training sample, and the noisy speech information of the noisy speech signal in the training sample that contains the noise-free speech signal.
- The noisy voice signal can be generated by adding noise to the noise-free voice signal.
- Alternatively, the noise-free speech signal can be obtained by performing denoising processing on the noisy speech signal. A minimal data-pairing sketch is given below.
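- As an illustration only, the following sketch builds a (clean, noisy) training pair by mixing noise into a clean speech signal at a chosen signal-to-noise ratio; the arrays and SNR value are hypothetical.

```python
# Minimal sketch: create a noisy speech signal from a noise-free one so that each
# training sample has both the clear signal (target) and the noisy signal (input).
import numpy as np

def add_noise(clean: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix `noise` into `clean` so the result has roughly `snr_db` dB SNR."""
    # Tile or trim the noise so it matches the clean signal's length.
    if len(noise) < len(clean):
        noise = np.tile(noise, int(np.ceil(len(clean) / len(noise))))
    noise = noise[:len(clean)]

    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    # Scale the noise to reach the requested signal-to-noise ratio.
    scale = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10.0)))
    return clean + scale * noise

# Usage (hypothetical 1-D float arrays at the same sampling rate):
# noisy = add_noise(clean, babble_noise, snr_db=5.0)
```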
- Step S42 Obtain the sample image feature sequence of the sample image sequence in the training sample through the multi-modal speech recognition model.
- the sample image feature sequence can be extracted from the sample image sequence by the image feature extraction module 32.
- Step S43 Fuse the noisy voice information and the feature sequence of the sample image through the multimodal voice recognition model to obtain the fusion feature of the training sample.
- the noisy voice information and the sample image feature sequence can be fused by the feature fusion module 33 to obtain the fusion feature of the training sample.
- Step S44 Use the fusion feature of the training sample to perform voice recognition through the multimodal voice recognition model, and obtain the voice recognition result corresponding to the training sample.
- the recognition module 22 can use the fusion feature of the training sample to perform speech recognition, and obtain the speech recognition result corresponding to the training sample.
- Step S45 Through the multi-modal speech recognition model, update the model parameters with the targets that the fusion feature of the training sample approaches the noise-free speech information and that the speech recognition result corresponding to the training sample approaches the sample label of the training sample.
- Specifically, a first loss function can be used to calculate the difference between the fusion feature of the training sample and the noise-free speech information (for ease of description, recorded as the first difference), and a second loss function can be used to calculate the difference between the speech recognition result corresponding to the training sample and the sample label of the training sample (recorded as the second difference); the parameters of the multi-modal speech recognition model are then updated according to the weighted sum of the first difference and the second difference. A minimal loss sketch is given below.
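- As an illustration only, the following sketch expresses the weighted-sum objective; the patent specifies only a first and a second loss combined by a weighted sum, so the use of mean-squared error, cross-entropy, and the weights alpha and beta are assumptions, and `model` is a hypothetical network returning both the fusion feature and the recognition logits.

```python
# Minimal sketch: first loss pulls the fusion feature toward the clear-speech
# information, second loss pulls the recognition output toward the label,
# and the parameters are updated from their weighted sum.
import torch
import torch.nn.functional as F

def training_step(model, noisy_speech, image_seq, clear_speech_info, labels,
                  alpha=1.0, beta=1.0):
    # fusion_feat: (batch, time, dim); logits: (batch, time, vocab)
    fusion_feat, logits = model(noisy_speech, image_seq)

    # First difference: fusion feature should approach the noise-free speech information.
    first_loss = F.mse_loss(fusion_feat, clear_speech_info)

    # Second difference: recognition result should approach the sample label.
    second_loss = F.cross_entropy(logits.transpose(1, 2), labels)

    # Weighted sum drives the parameter update.
    return alpha * first_loss + beta * second_loss
```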
- The multi-modal speech recognition model trained with the method shown in FIGS. 4a-4b has the ability to obtain, with the voice information of the denoised speech signal as the acquisition direction, information that fuses the speech signal and the image sequence as fusion information, and the ability to use the fusion information for speech recognition to obtain the speech recognition result of the speech signal.
- The following describes the training process of the multi-modal speech recognition model for different types of speech information.
- Refer to FIGS. 5a and 5b, where FIG. 5a is a schematic diagram of an architecture for training the multi-modal speech recognition model and FIG. 5b is an implementation flowchart of training the multi-modal speech recognition model, which may include:
- Step S51 Acquire, through the multi-modal speech recognition model, the acoustic features of the noise-free speech signal in the training sample (that is, the clear acoustic features in FIG. 5a, which can also be called the noise-free acoustic features), and the acoustic features of the noisy speech signal in the training sample that contains the above-mentioned noise-free speech signal (i.e., the noise acoustic features in FIG. 5a).
- the acoustic feature extraction module of the speech information extraction module 31 can extract clear acoustic features from a noise-free speech signal, and extract noise acoustic features from a noisy speech signal.
- the acquisition process of the noise voice signal and the noise-free voice signal can refer to the foregoing embodiment, and will not be repeated here.
- Step S52 Obtain the sample image feature sequence of the sample image sequence in the training sample through the multi-modal speech recognition model.
- Step S53 Fuse the noise acoustic feature and the sample image feature sequence through the multi-modal speech recognition model to obtain the fusion feature of the training sample.
- Step S54 Perform voice recognition using the fusion feature of the training samples through the multimodal voice recognition model, and obtain the voice recognition results corresponding to the training samples.
- Step S55 Update the parameters of the multi-modal speech recognition model, taking as the goal that the fusion feature of the training sample approaches the noise-free acoustic feature and that the speech recognition result corresponding to the training sample approaches the sample label of the training sample.
- Specifically, the first difference between the fusion feature of the training sample and the clean acoustic feature can be calculated by the first loss function, the second difference between the speech recognition result corresponding to the training sample and the sample label of the training sample can be calculated by the second loss function, and the parameters of the multi-modal speech recognition model are updated according to the weighted sum of the first difference and the second difference.
- After this training, the acoustic feature extraction module has the ability to extract acoustic features from the speech signal, with the extraction direction that the feature obtained by fusing the extracted acoustic features with the image feature sequence extracted by the image feature extraction module 32 from the image sequence approaches the acoustic feature of the speech signal after noise removal;
- the image feature extraction module 32 has the ability to extract the image feature sequence from the image sequence, with the same extraction direction that the feature obtained by fusing the extracted image feature sequence with the acoustic features extracted by the acoustic feature extraction module from the speech signal approaches the acoustic feature of the speech signal after noise removal;
- the feature fusion module 33 has the ability to fuse the extracted acoustic features and the image feature sequence, with the fused feature approaching the acoustic feature of the speech signal after noise removal as the fusion direction, to obtain the fusion feature.
- Figure 6a is a schematic diagram of an architecture for training the multi-modal speech recognition model, and Figure 6b is an implementation flow chart for training the multi-modal speech recognition model, which may include:
- Step S61 Through the multi-modal speech recognition model, obtain the spectrogram feature of the noise-free speech signal in the training sample (i.e. the clean spectrogram feature in Figure 6a, which may also be called the noise-free spectrogram feature) and the spectrogram feature of the noisy speech signal corresponding to the noise-free speech signal contained in the training sample (i.e. the noise spectrogram feature in Figure 6a).
- the spectrogram feature extraction module of the speech information extraction module 31 can extract the clear spectrogram feature from the noise-free speech signal, and extract the noise spectrogram feature from the noise speech signal.
- the acquisition process of the noise voice signal and the noise-free voice signal can refer to the foregoing embodiment, and will not be repeated here.
- Step S62 Obtain the sample image feature sequence of the sample image sequence in the training sample through the multi-modal speech recognition model.
- Step S63 Fuse the noise spectrogram feature and the sample image feature sequence through the multi-modal speech recognition model to obtain the fusion feature of the training sample.
- the feature fusion module 33 can be used to fuse the spectrogram feature of the noise speech signal and the feature sequence of the sample image to obtain the fusion feature of the training sample.
- Step S64 Perform voice recognition using the fusion feature of the training samples through the multimodal voice recognition model, and obtain the voice recognition results corresponding to the training samples.
- Step S65 Update the parameters of the multi-modal speech recognition model, taking as the goal that the fusion feature of the training sample approaches the noise-free spectrogram feature and that the speech recognition result corresponding to the training sample approaches the sample label of the training sample.
- Specifically, the first difference between the fusion feature and the clean spectrogram feature can be calculated by the first loss function, the second difference between the speech recognition result corresponding to the training sample and the sample label of the training sample can be calculated by the second loss function, and the parameters of the multi-modal speech recognition model are updated according to the weighted sum of the first difference and the second difference.
- After this training, the spectrogram feature extraction module has the ability to extract spectrogram features from the speech signal, with the extraction direction that the feature obtained by fusing the extracted spectrogram features with the image feature sequence extracted by the image feature extraction module 32 from the image sequence approaches the spectrogram feature of the speech signal after noise removal;
- the image feature extraction module 32 has the ability to extract the image feature sequence from the image sequence, with the same extraction direction that the feature obtained by fusing the extracted image feature sequence with the spectrogram features extracted by the spectrogram feature extraction module from the speech signal approaches the spectrogram feature of the speech signal after noise removal;
- the feature fusion module 33 has the ability to fuse the extracted spectrogram features and the image feature sequence, with the fused feature approaching the spectrogram feature of the speech signal after noise removal as the fusion direction, to obtain the fusion feature.
- Figure 7a is a schematic diagram of an architecture for training the multi-modal speech recognition model, and Figure 7b is an implementation flow chart for training the multi-modal speech recognition model, which may include:
- Step S71 Through the multi-modal speech recognition model, obtain the spectrogram feature of the noise-free speech signal in the training sample (i.e. the clean spectrogram feature in Figure 7a, which may also be called the noise-free spectrogram feature), and the spectrogram feature (i.e. the noise spectrogram feature in Figure 7a) and the acoustic feature (i.e. the noise acoustic feature in Figure 7a) of the noisy speech signal corresponding to the noise-free speech signal contained in the training sample.
- Step S72 Obtain the sample image feature sequence of the sample image sequence in the training sample through the multi-modal speech recognition model.
- Step S73 Fuse the noise acoustic feature and the image feature sequence through the multi-modal speech recognition model to obtain the first fusion feature of the training sample.
- Step S74 Fuse the spectrogram feature of the noise speech signal and the first fusion feature of the training sample through the multimodal speech recognition model to obtain the fusion feature of the training sample.
- Step S75 Perform voice recognition on the fusion feature of the training sample through the multi-modal voice recognition model to obtain a voice recognition result corresponding to the training sample.
- Step S76 Update the parameters of the multi-modal speech recognition model, taking as the goal that the fusion feature of the training sample approaches the noise-free spectrogram feature and that the speech recognition result corresponding to the training sample approaches the sample label of the training sample.
- Specifically, the first difference between the fusion feature and the clean spectrogram feature can be calculated by the first loss function, the second difference between the speech recognition result corresponding to the training sample and the sample label of the training sample can be calculated by the second loss function, and the parameters of the multi-modal speech recognition model are updated according to the weighted sum of the first difference and the second difference.
- After this training, the acoustic feature extraction module has the ability to extract acoustic features from the speech signal, with the extraction direction that the feature obtained by fusing the extracted acoustic features, the spectrogram features extracted by the spectrogram feature extraction module from the speech signal, and the image feature sequence extracted by the image feature extraction module 32 from the image sequence approaches the spectrogram feature of the speech signal after noise removal;
- the spectrogram feature extraction module has the ability to extract spectrogram features from the speech signal, with the same extraction direction that the feature obtained by fusing the extracted spectrogram features, the acoustic features and the image feature sequence approaches the spectrogram feature of the speech signal after noise removal;
- the image feature extraction module 32 has the ability to extract the image feature sequence from the image sequence, with the same extraction direction that the feature obtained by fusing the extracted image feature sequence, the acoustic features and the spectrogram features approaches the spectrogram feature of the speech signal after noise removal;
- the feature fusion module 33 has the ability to fuse the acoustic features, the spectrogram features and the image feature sequence, with the fused feature approaching the spectrogram feature of the speech signal after noise removal as the fusion direction, to obtain the fusion feature.
- Figure 8a is a schematic diagram of another architecture for training a multimodal speech recognition model.
- Figure 8b is another implementation flow chart for training the multi-modal speech recognition model, which can include:
- Step S81 Through the multi-modal speech recognition model, acquire the acoustic feature of the noise-free speech signal in the training sample (i.e. the clean acoustic feature in Figure 8a, that is, the noise-free acoustic feature), and the spectrogram feature (i.e. the noise spectrogram feature in Figure 8a) and the acoustic feature (i.e. the noise acoustic feature in Figure 8a) of the noisy speech signal corresponding to the noise-free speech signal.
- Step S82 Obtain the sample image feature sequence of the sample image sequence in the training sample through the multi-modal speech recognition model.
- Step S83 Fuse the noise spectrogram feature and the image feature sequence through the multi-modal speech recognition model to obtain the second fusion feature of the training sample.
- Step S84 Fuse the acoustic features of the noisy speech signal and the second fusion feature of the training sample through the multi-modal speech recognition model to obtain the fusion feature of the training sample.
- Step S85 Perform voice recognition on the fusion feature of the training sample through the multi-modal voice recognition model to obtain a voice recognition result corresponding to the training sample.
- Step S86 Update the parameters of the multi-modal speech recognition model, taking as the goal that the fusion feature of the training sample approaches the noise-free acoustic feature and that the speech recognition result corresponding to the training sample approaches the sample label of the training sample.
- Specifically, the first difference between the fusion feature and the clean acoustic feature can be calculated by the first loss function, the second difference between the speech recognition result corresponding to the training sample and the sample label of the training sample can be calculated by the second loss function, and the parameters of the multi-modal speech recognition model are updated according to the weighted sum of the first difference and the second difference.
- After this training, the acoustic feature extraction module has the ability to extract acoustic features from the speech signal, with the extraction direction that the feature obtained by fusing the extracted acoustic features, the spectrogram features extracted by the spectrogram feature extraction module from the speech signal, and the image feature sequence extracted by the image feature extraction module 32 from the image sequence approaches the acoustic feature of the speech signal after noise removal;
- the spectrogram feature extraction module has the ability to extract spectrogram features from the speech signal, with the same extraction direction that the feature obtained by fusing the extracted spectrogram features, the acoustic features and the image feature sequence approaches the acoustic feature of the speech signal after noise removal;
- the image feature extraction module 32 has the ability to extract the image feature sequence from the image sequence, with the same extraction direction that the feature obtained by fusing the extracted image feature sequence, the acoustic features and the spectrogram features approaches the acoustic feature of the speech signal after noise removal;
- the feature fusion module 33 has the ability to fuse the extracted acoustic features, spectrogram features and image feature sequence, with the fused feature approaching the acoustic feature of the speech signal after noise removal as the fusion direction, to obtain the fusion feature.
- Figure 9a is a schematic diagram of another architecture for training a multimodal speech recognition model.
- Figure 9b is another implementation flow chart for training the multi-modal speech recognition model, which may include:
- Step S91 Through the multi-modal speech recognition model, acquire the acoustic feature of the noise-free speech signal in the training sample (i.e. the clean acoustic feature in Figure 9a, that is, the noise-free acoustic feature), and the spectrogram feature (i.e. the noise spectrogram feature in Figure 9a) and the acoustic feature (i.e. the noise acoustic feature in Figure 9a) of the noisy speech signal corresponding to the noise-free speech signal.
- Step S92 Obtain the sample image feature sequence of the sample image sequence in the training sample through the multi-modal speech recognition model.
- Step S93 Fuse the acoustic features of the noise voice signal and the image feature sequence through the multi-modal voice recognition model to obtain the first fusion feature of the training sample.
- Step S94 Fuse the noise spectrogram feature and the image feature sequence through the multi-modal speech recognition model to obtain the second fusion feature of the training sample.
- Step S95 Fuse the noise acoustic feature and the second fusion feature of the training sample through the multimodal speech recognition model to obtain the fusion feature of the training sample.
- Step S96 Perform voice recognition on the fusion feature of the training sample through the multi-modal voice recognition model to obtain a voice recognition result corresponding to the training sample.
- Step S97 Update the parameters of the multi-modal speech recognition model, taking as the goal that the first fusion feature of the training sample approaches the noise-free acoustic feature, that the fusion feature of the training sample approaches the noise-free acoustic feature, and that the speech recognition result corresponding to the training sample approaches the sample label of the training sample.
- Specifically, the first difference between the first fusion feature of the training sample and the clean acoustic feature can be calculated through the first loss function, the second difference between the fusion feature of the training sample and the clean acoustic feature can also be calculated through the first loss function, and the third difference between the speech recognition result corresponding to the training sample and the sample label of the training sample is calculated through the second loss function; the parameters of the multi-modal speech recognition model are updated according to the weighted sum of the first difference, the second difference, and the third difference.
- the loss functions used to calculate the first difference and the second difference are the same. In an optional embodiment, the loss functions used to calculate the first difference and the second difference may also be different, which is not specifically limited in this application.
- After this training, the acoustic feature extraction module has the ability to extract acoustic features from the speech signal, with the extraction directions that the feature obtained by fusing the extracted acoustic features, the spectrogram features extracted by the spectrogram feature extraction module from the speech signal, and the image feature sequence extracted by the image feature extraction module 32 from the image sequence approaches the acoustic feature of the speech signal after noise removal, and that the feature obtained by fusing the extracted acoustic features with the image feature sequence also approaches the acoustic feature of the speech signal after noise removal;
- the spectrogram feature extraction module has the ability to extract spectrogram features from the speech signal, with the extraction direction that the feature obtained by fusing the extracted spectrogram features, the acoustic features and the image feature sequence approaches the acoustic feature of the speech signal after noise removal;
- the image feature extraction module 32 has the ability to extract the image feature sequence from the image sequence, with the extraction directions that the feature obtained by fusing the extracted image feature sequence, the acoustic features and the spectrogram features approaches the acoustic feature of the speech signal after noise removal, and that the feature obtained by fusing the extracted image feature sequence with the acoustic features also approaches the acoustic feature of the speech signal after noise removal;
- the feature fusion module 33, taking the fused feature approaching the acoustic feature of the speech signal after noise removal as the fusion direction, has the ability to fuse the acoustic features and the image feature sequence to obtain the first fusion feature, to fuse the spectrogram features and the image feature sequence to obtain the second fusion feature, and to fuse the acoustic features and the second fusion feature to obtain the fusion feature.
- Figure 10a is a schematic diagram of another architecture for training a multimodal speech recognition model.
- Figure 10b is another implementation flow chart for training the multi-modal speech recognition model, which may include:
- Step S101 Through the multi-modal speech recognition model, obtain the spectrogram feature of the noise-free speech signal in the training sample (i.e. the clean spectrogram feature in Figure 10a, that is, the noise-free spectrogram feature), and the spectrogram feature (i.e. the noise spectrogram feature in Figure 10a) and the acoustic feature (i.e. the noise acoustic feature in Figure 10a) of the noisy speech signal corresponding to the noise-free speech signal.
- Step S102 Obtain the sample image feature sequence of the sample image sequence in the training sample through the multi-modal speech recognition model.
- Step S103 Fuse the noise spectrogram feature and the image feature sequence through the multi-modal speech recognition model to obtain the second fusion feature of the training sample.
- Step S104 Fuse the noise acoustic feature and the image feature sequence through the multi-modal speech recognition model to obtain the first fusion feature of the training sample.
- Step S105 Fuse the noise spectrogram feature and the first fusion feature of the training sample through the multimodal speech recognition model to obtain the fusion feature of the training sample.
- Step S106 Perform voice recognition on the fusion feature of the training sample through the multimodal voice recognition model, and obtain a voice recognition result corresponding to the training sample.
- Step S107 Update the parameters of the multi-modal speech recognition model, taking as the goal that the second fusion feature of the training sample approaches the noise-free spectrogram feature, that the fusion feature of the training sample approaches the noise-free spectrogram feature, and that the speech recognition result corresponding to the training sample approaches the sample label of the training sample.
- Specifically, the first difference between the second fusion feature of the training sample and the noise-free spectrogram feature can be calculated by the first loss function, the second difference between the fusion feature of the training sample and the noise-free spectrogram feature can also be calculated by the first loss function, and the third difference between the speech recognition result corresponding to the training sample and the sample label of the training sample is calculated by the second loss function; the parameters of the multi-modal speech recognition model are updated according to the weighted sum of the first difference, the second difference, and the third difference.
- the loss functions used to calculate the first difference and the second difference are the same. In an optional embodiment, the loss functions used to calculate the first difference and the second difference may also be different, which is not specifically limited in this application.
- After this training, the acoustic feature extraction module has the ability to extract acoustic features from the speech signal, with the extraction direction that the feature obtained by fusing the extracted acoustic features, the spectrogram features extracted by the spectrogram feature extraction module from the speech signal, and the image feature sequence extracted by the image feature extraction module 32 from the image sequence approaches the spectrogram feature of the speech signal after noise removal;
- the spectrogram feature extraction module has the ability to extract spectrogram features from the speech signal, with the extraction directions that the feature obtained by fusing the extracted spectrogram features, the acoustic features and the image feature sequence approaches the spectrogram feature of the speech signal after noise removal, and that the feature obtained by fusing the extracted spectrogram features with the image feature sequence also approaches the spectrogram feature of the speech signal after noise removal;
- the image feature extraction module 32 has the ability to extract the image feature sequence from the image sequence, with the extraction directions that the feature obtained by fusing the extracted image feature sequence, the acoustic features and the spectrogram features approaches the spectrogram feature of the speech signal after noise removal, and that the feature obtained by fusing the extracted image feature sequence with the spectrogram features also approaches the spectrogram feature of the speech signal after noise removal;
- the feature fusion module 33, taking the fused feature approaching the spectrogram feature of the speech signal after noise removal as the fusion direction, has the ability to fuse the spectrogram features and the image feature sequence to obtain the second fusion feature, to fuse the acoustic features and the image feature sequence to obtain the first fusion feature, and to fuse the spectrogram features and the first fusion feature to obtain the fusion feature.
- Figure 11a is a schematic diagram of another architecture for training a multimodal speech recognition model.
- Figure 11b is another implementation flow chart for training the multi-modal speech recognition model, which may include:
- Step S111 Through the multi-modal speech recognition model, obtain the spectrogram feature of the noise-free speech signal in the training sample (i.e. the clean spectrogram feature in Figure 11a, that is, the noise-free spectrogram feature) and the acoustic feature of the noise-free speech signal (i.e. the clean acoustic feature in Figure 11a, that is, the noise-free acoustic feature), as well as the spectrogram feature (i.e. the noise spectrogram feature in Figure 11a) and the acoustic feature (i.e. the noise acoustic feature in Figure 11a) of the noisy speech signal corresponding to the noise-free speech signal.
- Step S112 Obtain the sample image feature sequence of the sample image sequence in the training sample through the multi-modal speech recognition model.
- Step S113 Fuse the noise acoustic feature and the image feature sequence through the multi-modal speech recognition model to obtain the first fusion feature of the training sample.
- Step S114 Fuse the noise spectrogram feature and the image feature sequence through the multi-modal speech recognition model to obtain the second fusion feature of the training sample.
- Step S115 The first fusion feature of the training sample and the second fusion feature of the training sample are fused through the multimodal speech recognition model to obtain the fusion feature of the training sample.
- Step S116 Perform voice recognition on the fusion feature of the training sample through the multi-modal voice recognition model to obtain a voice recognition result corresponding to the training sample.
- Step S117 Update the parameters of the multi-modal speech recognition model, taking as the goal that the first fusion feature of the training sample approaches the noise-free acoustic feature, that the second fusion feature of the training sample approaches the noise-free spectrogram feature, and that the speech recognition result corresponding to the training sample approaches the sample label of the training sample.
- Specifically, the first difference between the first fusion feature of the training sample and the noise-free acoustic feature can be calculated by the first loss function, the second difference between the second fusion feature of the training sample and the noise-free spectrogram feature can also be calculated by the first loss function, and the third difference between the speech recognition result corresponding to the training sample and the sample label of the training sample is calculated by the second loss function; the parameters of the multi-modal speech recognition model are updated according to the weighted sum of the first difference, the second difference, and the third difference.
- the loss functions used to calculate the first difference and the second difference are the same. In an optional embodiment, the loss functions used to calculate the first difference and the second difference may also be different, which is not specifically limited in this application.
- After this training, the acoustic feature extraction module has the ability to extract acoustic features from the speech signal, with the extraction direction that the feature obtained by fusing the extracted acoustic features with the image feature sequence extracted by the image feature extraction module 32 from the image sequence approaches the acoustic feature of the speech signal after noise removal;
- the spectrogram feature extraction module has the ability to extract spectrogram features from the speech signal, with the extraction direction that the feature obtained by fusing the extracted spectrogram features with the image feature sequence approaches the spectrogram feature of the speech signal after noise removal;
- the image feature extraction module 32 has the ability to extract the image feature sequence from the image sequence, with the extraction directions that the feature obtained by fusing the extracted image feature sequence with the acoustic features extracted by the acoustic feature extraction module from the speech signal approaches the acoustic feature of the speech signal after noise removal, and that the feature obtained by fusing the extracted image feature sequence with the spectrogram features extracted by the spectrogram feature extraction module from the speech signal approaches the spectrogram feature of the speech signal after noise removal;
- the feature fusion module 33 has the ability to fuse the spectrogram features and the image feature sequence, with the second fusion feature obtained by the fusion approaching the spectrogram feature of the speech signal after noise removal as the fusion direction, to obtain the second fusion feature; the ability to fuse the acoustic features and the image feature sequence, with the first fusion feature obtained by the fusion approaching the acoustic feature of the speech signal after noise removal as the fusion direction, to obtain the first fusion feature; and the ability to fuse the first fusion feature and the second fusion feature to obtain the fusion feature.
- Figure 12a is a schematic diagram of another architecture for training a multimodal speech recognition model.
- Figure 12b is another implementation flow chart for training the multi-modal speech recognition model, which may include:
- Step S121 Through the multi-modal speech recognition model, obtain the spectrogram feature of the noise-free speech signal in the training sample (i.e. the clean spectrogram feature in Figure 12a, that is, the noise-free spectrogram feature) and the acoustic feature of the noise-free speech signal (i.e. the clean acoustic feature in Figure 12a, that is, the noise-free acoustic feature), as well as the spectrogram feature (i.e. the noise spectrogram feature in Figure 12a) and the acoustic feature (i.e. the noise acoustic feature in Figure 12a) of the noisy speech signal corresponding to the noise-free speech signal.
- Step S122 Obtain the sample image feature sequence of the sample image sequence in the training sample through the multi-modal speech recognition model.
- Step S123 Fuse the noise acoustic feature and the image feature sequence through the multi-modal speech recognition model to obtain the first fusion feature of the training sample.
- Step S124 Fuse the noise spectrogram feature and the first fusion feature of the training sample through the multimodal speech recognition model to obtain the fusion feature of the training sample.
- Step S125 Perform voice recognition on the fusion feature of the training sample through the multi-modal voice recognition model to obtain a voice recognition result corresponding to the training sample.
- Step S126 Update the parameters of the multi-modal speech recognition model, taking as the goal that the first fusion feature of the training sample approaches the noise-free acoustic feature, that the fusion feature of the training sample approaches the noise-free spectrogram feature, and that the speech recognition result corresponding to the training sample approaches the sample label of the training sample.
- Specifically, the first difference between the first fusion feature of the training sample and the noise-free acoustic feature can be calculated by the first loss function, the second difference between the fusion feature of the training sample and the noise-free spectrogram feature can also be calculated by the first loss function, and the third difference between the speech recognition result corresponding to the training sample and the sample label of the training sample is calculated by the second loss function; the parameters of the multi-modal speech recognition model are updated according to the weighted sum of the first difference, the second difference, and the third difference.
- the loss functions used to calculate the first difference and the second difference are the same. In an optional embodiment, the loss functions used to calculate the first difference and the second difference may also be different, which is not specifically limited in this application.
- After this training, the acoustic feature extraction module has the ability to extract acoustic features from the speech signal, with the extraction directions that the feature obtained by fusing the extracted acoustic features, the spectrogram features extracted by the spectrogram feature extraction module from the speech signal, and the image feature sequence extracted by the image feature extraction module 32 from the image sequence approaches the spectrogram feature of the speech signal after noise removal, and that the feature obtained by fusing the extracted acoustic features with the image feature sequence approaches the acoustic feature of the speech signal after noise removal;
- the spectrogram feature extraction module has the ability to extract spectrogram features from the speech signal, with the extraction direction that the feature obtained by fusing the extracted spectrogram features, the acoustic features and the image feature sequence approaches the spectrogram feature of the speech signal after noise removal;
- the image feature extraction module 32 has the ability to extract the image feature sequence from the image sequence, with the extraction directions that the feature obtained by fusing the extracted image feature sequence, the acoustic features and the spectrogram features approaches the spectrogram feature of the speech signal after noise removal, and that the feature obtained by fusing the extracted image feature sequence with the acoustic features approaches the acoustic feature of the speech signal after noise removal;
- the feature fusion module 33 has the ability to fuse the acoustic features and the image feature sequence, with the first fusion feature obtained by the fusion approaching the acoustic feature of the speech signal after noise removal as the fusion direction, to obtain the first fusion feature, and the ability to fuse the spectrogram features and the first fusion feature, with the fused feature approaching the spectrogram feature of the speech signal after noise removal as the fusion direction, to obtain the fusion feature.
- Figure 13a is a schematic diagram of another architecture for training a multimodal speech recognition model.
- Figure 13b is another implementation flow chart for training the multi-modal speech recognition model, which may include:
- Step S131 Through the multi-modal speech recognition model, obtain the spectrogram feature of the noise-free speech signal in the training sample (i.e. the clean spectrogram feature in Figure 13a, that is, the noise-free spectrogram feature) and the acoustic feature of the noise-free speech signal (i.e. the clean acoustic feature in Figure 13a, that is, the noise-free acoustic feature), as well as the spectrogram feature (i.e. the noise spectrogram feature in Figure 13a) and the acoustic feature (i.e. the noise acoustic feature in Figure 13a) of the noisy speech signal corresponding to the noise-free speech signal.
- Step S132 Obtain the sample image feature sequence of the sample image sequence in the training sample through the multi-modal speech recognition model.
- Step S133 Fuse the noise spectrogram feature and the image feature sequence through the multimodal speech recognition model to obtain the second fusion feature of the training sample.
- Step S134 Fuse the noise acoustic feature and the second fusion feature of the training sample through the multimodal speech recognition model to obtain the fusion feature of the training sample.
- Step S135 Perform voice recognition on the fusion feature of the training sample through the multi-modal voice recognition model to obtain a voice recognition result corresponding to the training sample.
- Step S136 Update the parameters of the multi-modal speech recognition model, taking as the goal that the second fusion feature of the training sample approaches the noise-free spectrogram feature, that the fusion feature of the training sample approaches the noise-free acoustic feature, and that the speech recognition result corresponding to the training sample approaches the sample label of the training sample.
- Specifically, the first difference between the second fusion feature of the training sample and the noise-free spectrogram feature can be calculated through the first loss function, the second difference between the fusion feature of the training sample and the noise-free acoustic feature can also be calculated through the first loss function, and the third difference between the speech recognition result corresponding to the training sample and the sample label of the training sample is calculated through the second loss function; the parameters of the multi-modal speech recognition model are updated according to the weighted sum of the first difference, the second difference, and the third difference.
- the loss functions used to calculate the first difference and the second difference are the same. In an optional embodiment, the loss functions used to calculate the first difference and the second difference may also be different, which is not specifically limited in this application.
- After this training, the acoustic feature extraction module has the ability to extract acoustic features from the speech signal, with the extraction direction that the feature obtained by fusing the extracted acoustic features, the spectrogram features extracted by the spectrogram feature extraction module from the speech signal, and the image feature sequence extracted by the image feature extraction module 32 from the image sequence approaches the acoustic feature of the speech signal after noise removal;
- the spectrogram feature extraction module has the ability to extract spectrogram features from the speech signal, with the extraction directions that the feature obtained by fusing the extracted spectrogram features, the acoustic features and the image feature sequence approaches the acoustic feature of the speech signal after noise removal, and that the feature obtained by fusing the extracted spectrogram features with the image feature sequence approaches the spectrogram feature of the speech signal after noise removal;
- the image feature extraction module 32 has the ability to extract the image feature sequence from the image sequence, with the extraction directions that the feature obtained by fusing the extracted image feature sequence, the acoustic features and the spectrogram features approaches the acoustic feature of the speech signal after noise removal, and that the feature obtained by fusing the extracted image feature sequence with the spectrogram features approaches the spectrogram feature of the speech signal after noise removal;
- the feature fusion module 33 has the ability to fuse the spectrogram features and the image feature sequence, with the second fusion feature obtained by the fusion approaching the spectrogram feature of the speech signal after noise removal as the fusion direction, to obtain the second fusion feature, and the ability to fuse the acoustic features and the second fusion feature, with the fused feature approaching the acoustic feature of the speech signal after noise removal as the fusion direction, to obtain the fusion feature.
- the weight of each difference is not limited, and the weight corresponding to each difference may be the same or different.
- the weight of each difference can be set in advance, or it can be learned during the training process of the multi-modal speech recognition model. Taking the embodiment shown in FIG. 5a as an example, optionally, the weight of the first difference may be 0.2, and the weight of the second difference may be 0.8.
- the first loss function may be an L2 norm or an L1 norm
- the second loss function may be a cross-entropy function
- the inventor of the present application found that the amount of audio/video data collected synchronously is usually small, and the multi-modal speech recognition model obtained by training with only synchronously collected audio/video data as a training sample may cause overfitting. To avoid over-fitting, and to further improve the recognition accuracy of the multi-modal speech recognition model, some functional modules can be pre-trained before training the multi-modal speech recognition model.
- Optionally, the initial parameters of the acoustic feature extraction module of the speech information extraction module 31 are the parameters of the feature extraction module, used to extract acoustic features from the speech signal, in a speech recognition model trained with speech signals and their corresponding speech content as training data. In other words, the initial parameters of the acoustic feature extraction module are the parameters of the feature extraction module in a speech recognition model trained with pure speech samples.
- the specific architecture of the speech recognition model is not limited, but no matter what the architecture of the speech recognition model is, the feature extraction module is a necessary functional module.
- the speech recognition model may include: a feature extraction module for extracting hidden layer features of the acoustic features input to the speech recognition model; and a recognition module for performing speech recognition based on the hidden layer features extracted by the feature extraction module.
- the training process of the speech recognition model can refer to the existing training methods, which will not be described in detail here.
- the speech samples used to train the speech recognition model may include the speech samples used to train the aforementioned multi-modal speech recognition model, or may not include the aforementioned speech samples used to train the aforementioned multi-modal speech recognition model. There is no specific limitation.
- Optionally, the initial parameters of the spectrogram feature extraction module are the parameters of the spectrogram feature extraction module, used for feature extraction on the spectrogram of the speech signal, in a speech separation model trained with speech signals and their corresponding spectrogram labels as training data.
- the initial parameters of the spectrogram feature extraction module are the parameters of the spectrogram feature extraction module in the speech separation model trained with pure speech samples.
- the specific architecture of the speech separation model is not limited, but no matter what the architecture of the speech separation model is, the spectrogram feature extraction module is a necessary functional module.
- the speech separation model may include: a spectrogram feature extraction module for extracting hidden layer features of the spectrogram input to the speech separation model; and a separation module for performing speech separation based on the hidden layer features extracted by the spectrogram feature extraction module.
- the training process of the speech separation model can refer to the existing training methods, which will not be detailed here.
- the voice samples used to train the voice separation model may include the voice samples used to train the multi-modal voice recognition model, or may not include them; this application does not specifically limit this.
- Optionally, the initial parameters of the image feature extraction module are the parameters of the image feature extraction module, used for feature extraction on the image sequence, in a trained lip language recognition model.
- the initial parameters of the image feature extraction module are the parameters of the image feature extraction module in the lip language recognition model trained with pure image sequence samples.
- the specific architecture of the lip language recognition model is not limited, but no matter what the architecture of the lip language recognition model is, the image feature extraction module is a necessary functional module.
- the lip language recognition model may include: an image feature extraction module for extracting the hidden layer feature sequence of the image sequence input to the lip language recognition model; and a recognition module for performing lip language recognition based on the hidden layer feature sequence extracted by the image feature extraction module.
- the training process of the lip recognition model can refer to the existing training methods, which will not be detailed here.
- the image sequence samples used to train the lip language recognition model may include the image sequence samples used to train the above-mentioned multi-modal speech recognition model, or may not include the above-mentioned image sequence samples used to train the above-mentioned multi-modal speech recognition model. This application does not specifically limit this.
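- As an illustrative sketch only (the module attribute names and checkpoint layout below are assumptions, not part of this application), the pre-trained parameters described above could be copied into the corresponding sub-modules before multi-modal training, for example:

```python
import torch

def init_from_pretrained(multimodal_model,
                         asr_ckpt="asr_clean.pt",    # assumed: speech recognition model trained on pure speech
                         sep_ckpt="speech_sep.pt",   # assumed: speech separation model
                         lip_ckpt="lip_reading.pt"): # assumed: lip language recognition model
    """Copy feature-extractor weights from pre-trained models (assumed checkpoint structure)."""
    multimodal_model.acoustic_feature_extractor.load_state_dict(
        torch.load(asr_ckpt)["feature_extractor"])
    multimodal_model.spectrogram_feature_extractor.load_state_dict(
        torch.load(sep_ckpt)["spectrogram_feature_extractor"])
    multimodal_model.image_feature_extractor.load_state_dict(
        torch.load(lip_ckpt)["image_feature_extractor"])
    return multimodal_model
```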
- the recognition module 22 uses the fusion feature to perform speech recognition, and the obtained speech recognition result is usually a phoneme-level recognition result, such as a triphone. After the triphones are obtained, they can be decoded into a text sequence by the Viterbi algorithm.
- the specific decoding process can refer to existing methods, which will not be described in detail here.
- the voice signal input to the multi-modal voice recognition model may be an acoustic feature extracted from the original voice signal and/or a spectrogram obtained from the original voice signal through short-time Fourier transform.
- In one case, the input to the multi-modal speech recognition model is the acoustic features extracted from the original speech signal (for example, fbank features). Taking the fbank feature as an example, the fbank features can be extracted with a sliding window, where the window length may be 25ms and the frame shift 10ms, that is, the speech signals at two adjacent sliding window positions overlap by 15ms; each window yields a 40-dimensional fbank feature vector (other dimensions are also possible, and this application does not specifically limit this), so the fbank features obtained in this way form a 100fps fbank feature vector sequence.
- the feature extracted by the multi-modal speech recognition model from the input fbank feature is the hidden layer feature of the fbank feature.
- In another case, the input to the multi-modal speech recognition model is the spectrogram obtained from the original speech signal through a short-time Fourier transform; what the multi-modal speech recognition model extracts from the input spectrogram is the hidden layer feature of the spectrogram.
- In yet another case, the input to the multi-modal speech recognition model is both the acoustic features extracted from the original speech signal and the spectrogram obtained from the original speech signal through a short-time Fourier transform.
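- For illustration, the two input representations described above (40-dimensional fbank features with a 25ms window and 10ms shift, and a spectrogram from a short-time Fourier transform) could be computed roughly as follows; the 16kHz sampling rate and the use of librosa are assumptions for this sketch:

```python
import librosa
import numpy as np

def extract_inputs(wav_path, sr=16000):
    y, sr = librosa.load(wav_path, sr=sr)
    win = int(0.025 * sr)   # 25 ms window length
    hop = int(0.010 * sr)   # 10 ms frame shift -> about 100 frames per second
    # 40-dimensional fbank (log-mel) features
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=win, win_length=win,
                                         hop_length=hop, n_mels=40)
    fbank = np.log(mel + 1e-6).T                      # (num_frames, 40), ~100 fps
    # Spectrogram magnitude via short-time Fourier transform
    spec = np.abs(librosa.stft(y, n_fft=win, win_length=win, hop_length=hop)).T
    return fbank, spec
```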
- the frame rate of the video is usually 25fps.
- The text annotation of the sample speech signal is also preprocessed. Specifically, force alignment can be used to align the pronunciation phonemes of the text to the speech signal, where every 4 frames of the speech signal (each sliding-window position determines one frame of the speech signal) correspond to one triphone, so the text label is in effect converted into a triphone label; the label frame rate is 25fps, which is a quarter of the audio frame rate and is synchronized with the video frame rate.
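- A minimal sketch of this label preprocessing, assuming frame-level triphone indices produced by forced alignment at the 100fps audio rate (the helper name is hypothetical):

```python
import numpy as np

def downsample_alignment(frame_triphones, factor=4):
    """Keep one triphone label per group of 4 audio frames, converting a 100 fps
    frame-level alignment into a 25 fps label sequence (matching the video rate)."""
    frame_triphones = np.asarray(frame_triphones)
    usable = len(frame_triphones) - len(frame_triphones) % factor
    groups = frame_triphones[:usable].reshape(-1, factor)
    return groups[:, 0]   # one label per group of 4 frames
```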
- During training, the noisy speech signal input to the multi-modal speech recognition model can be the initial fbank feature vector sequence of 100fps speech frames (for ease of description, recorded as the initial noise fbank feature vector sequence), obtained by sliding a window with a frame shift of 10ms over the original noisy speech signal; each initial noise fbank feature vector is a 40-dimensional feature vector.
- Similarly, the noise-free speech signal input to the multi-modal speech recognition model can be the initial fbank feature vector sequence of 100fps speech frames (for ease of description, recorded as the initial noise-free fbank feature vector sequence), obtained by sliding a window with a window length of 25ms and a frame shift of 10ms over the original noise-free speech signal; each initial noise-free fbank feature vector is a 40-dimensional feature vector.
- After passing through the acoustic feature extraction module, the initial noise fbank feature vector sequence is down-sampled by a factor of 4 in the time dimension to obtain a 25fps 512-dimensional noise fbank feature vector sequence; likewise, the initial noise-free fbank feature vector sequence is down-sampled by a factor of 4 in the time dimension to obtain a 25fps 512-dimensional noise-free fbank feature vector sequence.
- the image sequence input to the multi-modal speech recognition model can be a 25fps image sequence of 80×80 RGB three-channel images, which yields a 25fps 512-dimensional image feature vector sequence after the image feature extraction module (an illustrative sketch of both feature extraction modules follows).
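- The dimensions below follow the example in the text (100fps 40-dimensional fbank input down-sampled 4x in time to a 25fps 512-dimensional sequence; 25fps 80×80 RGB frames mapped to 25fps 512-dimensional vectors); the specific layers are illustrative assumptions, not the claimed architecture:

```python
import torch
import torch.nn as nn

class AcousticEncoder(nn.Module):
    """Sketch: 100 fps 40-dim fbank sequence -> 25 fps 512-dim sequence (4x temporal downsampling)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(40, 256, kernel_size=3, stride=2, padding=1),  # 100 fps -> 50 fps
            nn.ReLU(),
            nn.Conv1d(256, 512, kernel_size=3, stride=2, padding=1), # 50 fps -> 25 fps
            nn.ReLU(),
        )

    def forward(self, fbank):                 # fbank: (batch, frames, 40)
        x = fbank.transpose(1, 2)             # (batch, 40, frames)
        return self.net(x).transpose(1, 2)    # (batch, frames // 4, 512)

class ImageEncoder(nn.Module):
    """Sketch: 25 fps 80x80 RGB lip-region frames -> 25 fps 512-dim vectors."""
    def __init__(self):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),   # 80 -> 40
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),  # 40 -> 20
            nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1), nn.ReLU(), # 20 -> 10
            nn.AdaptiveAvgPool2d(1),
        )
        self.proj = nn.Linear(128, 512)

    def forward(self, frames):                                  # frames: (batch, T, 3, 80, 80)
        b, t = frames.shape[:2]
        x = self.cnn(frames.reshape(b * t, 3, 80, 80)).flatten(1)  # (b*t, 128)
        return self.proj(x).reshape(b, t, 512)                  # (batch, T, 512)
```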
- The 25fps 512-dimensional noise fbank feature vector sequence and the 25fps 512-dimensional image feature vector sequence are input to the feature fusion module. Each time the feature fusion module receives a noise fbank feature vector and an image feature vector, the two vectors are fused (for example, spliced), a 512-dimensional fusion feature vector is then generated through a small fusion neural network, and the 512-dimensional fusion feature vector is output to the recognition module.
- the recognition module performs phoneme recognition on the received 512-dimensional fusion feature vector via softmax classification and obtains the triphone recognition result.
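- Continuing the same illustrative sketch (the class name, layer sizes, and the number of triphone classes are assumptions), the fusion and recognition steps described above could look like this: the two 512-dimensional vectors for each time step are concatenated, fused to a 512-dimensional vector by a small network, and classified over triphone states:

```python
import torch
import torch.nn as nn

class FusionAndRecognition(nn.Module):
    """Sketch: splice a 512-dim acoustic vector and a 512-dim image vector per time step,
    fuse them to a 512-dim vector, then classify the fused vector over triphone classes."""
    def __init__(self, num_triphones=9000):   # number of triphone classes is an assumption
        super().__init__()
        self.fusion = nn.Sequential(
            nn.Linear(1024, 512), nn.ReLU(),
            nn.Linear(512, 512),
        )
        self.classifier = nn.Linear(512, num_triphones)  # softmax is applied in the loss / at inference

    def forward(self, acoustic_seq, image_seq):          # both: (batch, T, 512)
        fused = self.fusion(torch.cat([acoustic_seq, image_seq], dim=-1))  # (batch, T, 512)
        logits = self.classifier(fused)                  # (batch, T, num_triphones)
        return fused, logits
```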
- The loss function used to update the parameters of the multi-modal speech recognition model consists of two parts: the L2 norm between the 512-dimensional fusion feature vector and the corresponding 512-dimensional noise-free fbank feature vector is one part of the loss function, which makes the fused feature vector closer to the corresponding 512-dimensional noise-free fbank feature vector and thereby achieves noise reduction at the feature level; the cross-entropy between the recognition result of the recognition module and the triphone label is calculated as the other part of the loss function.
- Similarly, the speech signal input to the multi-modal speech recognition model can be the initial fbank feature vector sequence of 100fps speech frames; after passing through the acoustic feature extraction module, the initial fbank feature vector sequence is down-sampled by a factor of 4 in the time dimension to obtain a 25fps 512-dimensional fbank feature vector sequence. The image sequence input to the multi-modal speech recognition model can be a 25fps image sequence of 80×80 RGB three-channel images, which yields a 25fps 512-dimensional image feature vector sequence after the image feature extraction module. The 25fps 512-dimensional fbank feature vector sequence and the 25fps 512-dimensional image feature vector sequence are input to the feature fusion module; each time the feature fusion module receives an fbank feature vector and an image feature vector, the two vectors are fused to generate a 512-dimensional fusion feature vector, which is output to the recognition module. The recognition module performs phoneme recognition on the received 512-dimensional fusion feature vector via softmax classification and obtains the triphone recognition result.
- The inventor of the present application found that current multi-modal speech recognition methods that use lip motion videos to assist speech recognition are extremely sensitive to the training data set. For example, if most of the data in the training set is English data and only a small amount is Chinese data, adding lip motion information may cause Chinese speech to be recognized as English under high noise, which instead reduces the speech recognition effect.
- the solution based on this application can significantly alleviate the recognition confusion caused by the imbalance of the language of the training data set, and further improve the multi-modal speech recognition effect in a high-noise environment.
- the multi-modal speech recognition model of the present application has low dependence on the training set. Even if the language distribution of the samples in the training data set is not uniform, the trained multi-modal speech recognition model can accurately perform multi-lingual speech recognition (the recognizable languages being the languages included in the training sample set), which greatly reduces the problem of recognition confusion.
- the training sample set used for training the above-mentioned multi-modal speech recognition model may only include training samples of a single language, or may include training samples of two or more languages.
- the proportion of training samples of each language in the training sample set is randomly determined, or is a preset proportion.
- The test set used for testing is mainly an English corpus, with only a small proportion of Chinese corpus.
- the embodiment of the present application also provides a voice recognition device.
- a schematic structural diagram of the voice recognition device provided in the embodiment of the present application is shown in FIG. 14, and may include:
- the acquiring module 141 is configured to acquire a voice signal and an image sequence collected synchronously with the voice signal; the image in the image sequence is an image of a region related to lip movement;
- the feature extraction module 142 is configured to use the voice information that tends to remove the noise from the voice signal as the acquisition direction, and obtain the information that merges the voice signal and the image sequence as the fusion information;
- the recognition module 143 is configured to perform voice recognition using the fusion information to obtain a voice recognition result of the voice signal.
- The speech recognition device, when acquiring the fusion information of the speech signal and the image sequence, takes the fusion information approaching the speech information of the de-noised speech signal as the acquisition direction; that is, the obtained fusion information is close to the speech information of the noise-free speech signal, which reduces the interference of noise in the speech signal on speech recognition and thereby improves the speech recognition rate.
- the functions of the feature extraction module 142 and the recognition module 143 can be implemented by a multi-modal speech recognition model, specifically:
- the feature extraction module 142 may be specifically used to: through the multi-modal speech recognition model, take information approaching the de-noised speech signal as the acquisition direction and obtain information fusing the speech signal and the image sequence as the fusion information;
- the recognition module 143 may be specifically used to: perform speech recognition using the fusion information through the multi-modal speech recognition model to obtain the speech recognition result of the speech signal.
- the feature extraction module 142 may be specifically configured to: with the voice information obtained after removing noise from the voice signal as the acquisition direction, use the voice information extraction module of the multi-modal speech recognition model to extract voice information from the voice signal, and use the image feature extraction module of the multi-modal speech recognition model to extract an image feature sequence from the image sequence; and use the feature fusion module of the multi-modal speech recognition model to fuse the voice information and the image feature sequence to obtain a fusion feature that fuses the voice signal and the image sequence;
- the recognition module 143 may be specifically configured to: use a recognition module of a multi-modal speech recognition model to perform speech recognition based on the fusion feature to obtain a speech recognition result of the speech signal.
- the feature extraction module 142 may specifically include:
- the extraction module is used to extract voice information from the voice signal with the voice information extraction module of the multi-modal speech recognition model, and to extract an image feature sequence from the image sequence with the image feature extraction module of the multi-modal speech recognition model, taking as the extraction direction that the feature obtained by fusing the extracted voice information with the image feature sequence approaches the voice information obtained after removing noise from the voice signal;
- the fusion module is used to fuse the voice information and the image feature sequence with the feature fusion module of the multi-modal speech recognition model, taking the fused result approaching the voice information obtained after removing noise from the voice signal as the fusion direction, to obtain the fusion feature.
- the voice information is of N types, and the N is a positive integer greater than or equal to 1.
- when the extraction module uses the voice information extraction module of the multi-modal speech recognition model to extract voice information from the voice signal, it is specifically configured to:
- extract N types of voice information from the voice signal using the voice information extraction module of the multi-modal speech recognition model, with the extraction direction being that the feature obtained by fusing the extracted N types of voice information with the image feature sequence extracted from the image sequence approaches one type of voice information obtained after removing noise from the voice signal; or,
- if N is greater than 1, extract N types of voice information from the voice signal using the voice information extraction module of the multi-modal speech recognition model, with the extraction direction being that, for each extracted type of voice information, the feature obtained by fusing it with the image feature sequence extracted from the image sequence approaches that type of voice information obtained after removing noise from the voice signal.
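- the two extraction directions above can be illustrated, in a hedged way, by the following auxiliary-loss sketch; it assumes, hypothetically, that mean-squared error is used as the measure of "approaching" and that each denoised target has the same shape as the corresponding fused feature:

```python
import torch.nn.functional as F

def extraction_direction_loss(fused_per_type, clean_targets=None, shared_target=None):
    """Hypothetical auxiliary losses for the two extraction directions.

    fused_per_type: dict mapping a voice-information type (e.g. "acoustic",
        "spectrogram") to the feature obtained by fusing that type with the
        image feature sequence, each of shape (B, T, D).
    clean_targets: dict with the corresponding denoised voice information
        per type (used by the second variant).
    shared_target: a single denoised voice information (used by the first variant).
    """
    if shared_target is not None:
        # Variant 1: every fused feature approaches one denoised voice information.
        return sum(F.mse_loss(f, shared_target) for f in fused_per_type.values())
    # Variant 2 (N > 1): each fused feature approaches its own denoised counterpart.
    return sum(F.mse_loss(f, clean_targets[k]) for k, f in fused_per_type.items())
```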
- the voice information is an acoustic feature and/or a spectrogram feature;
- the fusion module may be specifically used for:
- Fusion method 1: using the feature fusion module of the multi-modal speech recognition model, fuse the acoustic feature and the image feature sequence, with the acoustic feature obtained after denoising the voice signal as the fusion direction, to obtain the fusion feature corresponding to fusion method 1;
- Fusion method 2: using the feature fusion module of the multi-modal speech recognition model, fuse the spectrogram feature and the image feature sequence, with the spectrogram feature obtained after denoising the voice signal as the fusion direction, to obtain the fusion feature corresponding to fusion method 2;
- Fusion method 3: using the feature fusion module of the multi-modal speech recognition model, fuse the acoustic feature, the spectrogram feature and the image feature sequence, with the acoustic feature or the spectrogram feature obtained after denoising the voice signal as the fusion direction, to obtain the fusion feature corresponding to fusion method 3.
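- a minimal sketch of the three fusion methods, assuming simple concatenation followed by a linear projection as the fusion operation (the present description does not prescribe this particular operator, and all dimensions are illustrative):

```python
import torch
import torch.nn as nn

class FusionSketch(nn.Module):
    """Hypothetical feature fusion module covering the three fusion methods."""

    def __init__(self, acoustic_dim=512, spec_dim=257, image_dim=256, out_dim=512):
        super().__init__()
        self.fuse_acoustic_image = nn.Linear(acoustic_dim + image_dim, out_dim)   # method 1
        self.fuse_spec_image = nn.Linear(spec_dim + image_dim, out_dim)           # method 2
        self.fuse_all = nn.Linear(acoustic_dim + spec_dim + image_dim, out_dim)   # method 3

    def forward(self, acoustic, spec, image, method=3):
        if method == 1:   # acoustic features + image feature sequence
            return self.fuse_acoustic_image(torch.cat([acoustic, image], dim=-1))
        if method == 2:   # spectrogram features + image feature sequence
            return self.fuse_spec_image(torch.cat([spec, image], dim=-1))
        # method 3: acoustic + spectrogram + image feature sequence
        return self.fuse_all(torch.cat([acoustic, spec, image], dim=-1))
```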
- the speech recognition device further includes a training module for:
- update the parameters of the multi-modal speech recognition model with the goal that the fusion feature of the training sample approaches the noise-free speech information and that the speech recognition result corresponding to the training sample approaches the sample label of the training sample.
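- as a hedged sketch of one such update step, assuming a mean-squared-error term pulls the fusion feature toward the noise-free voice information and a CTC term pulls the recognition output toward the sample label (the actual loss functions and weighting are not fixed by this description):

```python
import torch
import torch.nn.functional as F

def training_step(model, optimizer, noisy_feats, image_feats,
                  clean_voice_info, labels, label_lens, alpha=1.0):
    """Hypothetical joint update: fusion feature -> noise-free voice information,
    recognition result -> sample label."""
    fused, logits = model(noisy_feats, image_feats)          # see the earlier model sketch
    # Term 1: the fusion feature approaches the noise-free voice information
    # (clean_voice_info is assumed to share the fused feature's shape).
    fusion_loss = F.mse_loss(fused, clean_voice_info)
    # Term 2: the recognition output approaches the sample label (CTC as an example).
    log_probs = logits.log_softmax(dim=-1).transpose(0, 1)   # (T, B, V) layout for CTC
    input_lens = torch.full((logits.size(0),), logits.size(1), dtype=torch.long)
    rec_loss = F.ctc_loss(log_probs, labels, input_lens, label_lens)
    loss = rec_loss + alpha * fusion_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```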
- the initial parameters of the acoustic feature extraction module are: the parameters of the feature extraction module used for acoustic feature extraction of speech signals, in a speech recognition model trained with speech signals and their corresponding speech content as training data.
- the initial parameters of the spectrogram feature extraction module are: the parameters of the spectrogram feature extraction module used for feature extraction of the spectrogram of speech signals, in a speech separation model trained with speech signals and their corresponding spectrogram labels as training data.
- the initial parameters of the image feature extraction module are: the parameters of the image feature extraction module used for feature extraction of image sequences, in a lip-reading model trained with image sequences and their corresponding pronunciation content as training data.
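- one hypothetical way to wire up this initialization is sketched below; the checkpoint file names, state-dict keys and attribute names are assumptions for illustration and do not come from the present application:

```python
import torch

def init_extraction_modules(model):
    """Hypothetically copy pretrained encoder weights into the extraction modules."""
    # Acoustic feature extraction module <- pretrained speech recognition model.
    asr_ckpt = torch.load("pretrained_asr.pt", map_location="cpu")
    model.acoustic_extractor.load_state_dict(asr_ckpt["encoder"])
    # Spectrogram feature extraction module <- pretrained speech separation model.
    sep_ckpt = torch.load("pretrained_separation.pt", map_location="cpu")
    model.spectrogram_extractor.load_state_dict(sep_ckpt["spectrogram_encoder"])
    # Image feature extraction module <- pretrained lip-reading model.
    lip_ckpt = torch.load("pretrained_lipreading.pt", map_location="cpu")
    model.image_extractor.load_state_dict(lip_ckpt["visual_encoder"])
    return model
```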
- the training sample set used for training the multi-modal speech recognition model includes training samples of different languages, and the proportion of training samples of each language in the training sample set is randomly determined or is a preset proportion.
- FIG. 15 shows a block diagram of the hardware structure of the speech recognition device.
- the hardware structure of the speech recognition device may include: at least one processor 1, at least one communication interface 2, at least one memory 3, and at least one communication Bus 4;
- the number of the processor 1, the communication interface 2, the memory 3, and the communication bus 4 is at least one, and the processor 1, the communication interface 2, and the memory 3 communicate with each other through the communication bus 4;
- the processor 1 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present application;
- the memory 3 may include a high-speed RAM memory, and may also include a non-volatile memory, for example, at least one disk memory;
- the memory stores a program
- the processor can call the program stored in the memory, and the program is used for: acquiring a voice signal and an image sequence collected synchronously with the voice signal, where the image in the image sequence is an image of a region related to lip movement; obtaining, with the voice information obtained after removing noise from the voice signal as the acquisition direction, information fusing the voice signal and the image sequence as fusion information; and performing voice recognition using the fusion information to obtain a voice recognition result of the voice signal.
- the embodiments of the present application also provide a storage medium; the storage medium may store a program suitable for execution by a processor, and the program is used for: acquiring a voice signal and an image sequence collected synchronously with the voice signal, where the image in the image sequence is an image of a region related to lip movement; obtaining, with the voice information obtained after removing noise from the voice signal as the acquisition direction, information fusing the voice signal and the image sequence as fusion information; and performing voice recognition using the fusion information to obtain a voice recognition result of the voice signal.
- the disclosed system, device, and method can be implemented in other ways.
- the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, devices or units, and may be in electrical, mechanical or other forms.
- the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they may be located in one place, or they may be distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
- the functional units in the various embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
- if the function is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium.
- the technical solution of the present application, in essence, or the part that contributes to the existing technology, or a part of the technical solution, can be embodied in the form of a software product; the computer software product is stored in a storage medium and includes several instructions used to cause a computer device (which may be a personal computer, a server, or a network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of the present application.
- the aforementioned storage media include: USB flash drives, mobile hard disks, read-only memory (ROM), random access memory (RAM), magnetic disks, optical disks, and other media that can store program code.
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Artificial Intelligence (AREA)
- Quality & Reliability (AREA)
- Signal Processing (AREA)
- Image Analysis (AREA)
Abstract
Description
Claims (15)
- A speech recognition method, characterized by comprising: acquiring a voice signal and an image sequence collected synchronously with the voice signal, wherein the images in the image sequence are images of a region related to lip movement; obtaining information fusing the voice signal and the image sequence as fusion information, with approaching the voice information obtained after removing noise from the voice signal as the acquisition direction; and performing voice recognition using the fusion information to obtain a voice recognition result of the voice signal.
- The method according to claim 1, characterized in that the process of obtaining the fusion information, performing voice recognition using the fusion information, and obtaining the voice recognition result of the voice signal comprises: processing the voice signal and the image sequence with a multi-modal speech recognition model to obtain the voice recognition result output by the multi-modal speech recognition model; wherein the multi-modal speech recognition model has the ability to obtain information fusing the voice signal and the image sequence as fusion information, with approaching the information obtained after removing noise from the voice signal as the acquisition direction, and to perform voice recognition using the fusion information to obtain the voice recognition result of the voice signal.
- The method according to claim 2, characterized in that processing the voice signal and the image sequence with the multi-modal speech recognition model to obtain the voice recognition result output by the multi-modal speech recognition model comprises: with approaching the voice information obtained after removing noise from the voice signal as the acquisition direction, extracting voice information from the voice signal using a voice information extraction module of the multi-modal speech recognition model, and extracting an image feature sequence from the image sequence using an image feature extraction module of the multi-modal speech recognition model; fusing the voice information and the image feature sequence using a feature fusion module of the multi-modal speech recognition model to obtain a fusion feature fusing the voice signal and the image sequence; and performing voice recognition based on the fusion feature using a recognition module of the multi-modal speech recognition model to obtain the voice recognition result of the voice signal.
- The method according to claim 3, characterized in that the voice information is of N types, N being a positive integer greater than or equal to 1; and extracting the voice information from the voice signal using the voice information extraction module of the multi-modal speech recognition model comprises: extracting N types of voice information from the voice signal using the voice information extraction module of the multi-modal speech recognition model, with the extraction direction being that the feature obtained by fusing the extracted N types of voice information with the image feature sequence extracted from the image sequence approaches one type of voice information obtained after removing noise from the voice signal; or, if N is greater than 1, extracting N types of voice information from the voice signal using the voice information extraction module of the multi-modal speech recognition model, with the extraction direction being that the feature obtained by fusing each extracted type of voice information with the image feature sequence extracted from the image sequence approaches that type of voice information obtained after removing noise from the voice signal.
- The method according to claim 4, characterized in that the voice information is an acoustic feature and/or a spectrogram feature, and fusing the voice information and the image feature sequence using the feature fusion module of the multi-modal speech recognition model, with approaching the voice information obtained after removing noise from the voice signal as the fusion direction, to obtain the fusion feature fusing the voice signal and the image sequence comprises: obtaining the fusion feature fusing the voice signal and the image sequence from the fusion feature obtained by any one of the following three fusion methods or a combination of any two of them: Fusion method 1: using the feature fusion module of the multi-modal speech recognition model, fusing the acoustic feature and the image feature sequence, with approaching the acoustic feature obtained after denoising the voice signal as the fusion direction, to obtain the fusion feature corresponding to fusion method 1; Fusion method 2: using the feature fusion module of the multi-modal speech recognition model, fusing the spectrogram feature and the image feature sequence, with approaching the spectrogram feature obtained after denoising the voice signal as the fusion direction, to obtain the fusion feature corresponding to fusion method 2; Fusion method 3: using the feature fusion module of the multi-modal speech recognition model, fusing the acoustic feature, the spectrogram feature and the image feature sequence, with approaching the acoustic feature or the spectrogram feature obtained after denoising the voice signal as the fusion direction, to obtain the fusion feature corresponding to fusion method 3.
- The method according to claim 2, characterized in that the training process of the multi-modal speech recognition model comprises: respectively obtaining noise-free voice information of a noise-free voice signal in a training sample, and noisy voice information of a noisy voice signal in the training sample that contains the noise-free voice signal; obtaining a sample image feature sequence of a sample image sequence in the training sample; fusing the noisy voice information and the sample image feature sequence to obtain a fusion feature of the training sample; performing voice recognition using the fusion feature of the training sample to obtain a voice recognition result corresponding to the training sample; and updating the parameters of the multi-modal speech recognition model with the goal that the fusion feature of the training sample approaches the noise-free voice information and that the voice recognition result corresponding to the training sample approaches a sample label of the training sample.
- The method according to claim 6, characterized in that the process of respectively obtaining the noise-free voice information and the noisy voice information comprises: obtaining a noise-free acoustic feature of the noise-free voice signal and a noisy acoustic feature of the noisy voice signal using an acoustic feature extraction module in the multi-modal speech recognition model; and/or, obtaining a noise-free spectrogram feature of the noise-free voice signal and a noisy spectrogram feature of the noisy voice signal using a spectrogram feature extraction module in the multi-modal speech recognition model; wherein the initial parameters of the acoustic feature extraction module are the parameters of the feature extraction module used for acoustic feature extraction of voice signals in a speech recognition model trained with voice signals and their corresponding speech content as training data; and the initial parameters of the spectrogram feature extraction module are the parameters of the spectrogram feature extraction module used for feature extraction of the spectrogram of voice signals in a speech separation model trained with voice signals and their corresponding spectrogram labels as training data.
- The method according to claim 6, characterized in that obtaining the sample image feature sequence of the sample image sequence in the training sample comprises: obtaining the sample image feature sequence of the sample image sequence using an image feature extraction module in the multi-modal speech recognition model; wherein the initial parameters of the image feature extraction module are the parameters of the image feature extraction module used for feature extraction of image sequences in a lip-reading model trained with image sequences and their corresponding pronunciation content as training data.
- The method according to claim 6, characterized in that the training sample set used for training the multi-modal speech recognition model includes training samples of different languages, and the proportion of training samples of each language in the training sample set is randomly determined or is a preset proportion.
- A speech recognition apparatus, characterized by comprising: an acquiring module, configured to acquire a voice signal and an image sequence collected synchronously with the voice signal, wherein the images in the image sequence are images of a region related to lip movement; a feature extraction module, configured to obtain information fusing the voice signal and the image sequence as fusion information, with approaching the voice information obtained after removing noise from the voice signal as the acquisition direction; and a recognition module, configured to perform voice recognition using the fusion information to obtain a voice recognition result of the voice signal.
- The apparatus according to claim 10, characterized in that the feature extraction module is specifically configured to: obtain, through a multi-modal speech recognition model, information fusing the voice signal and the image sequence as fusion information, with approaching the information obtained after removing noise from the voice signal as the acquisition direction; and the recognition module is specifically configured to: perform voice recognition using the fusion information through the multi-modal speech recognition model to obtain the voice recognition result of the voice signal.
- The apparatus according to claim 11, characterized in that the feature extraction module is specifically configured to: with approaching the voice information obtained after removing noise from the voice signal as the acquisition direction, extract voice information from the voice signal using a voice information extraction module of the multi-modal speech recognition model, and extract an image feature sequence from the image sequence using an image feature extraction module of the multi-modal speech recognition model; and fuse the voice information and the image feature sequence using a feature fusion module of the multi-modal speech recognition model to obtain a fusion feature fusing the voice signal and the image sequence; and the recognition module is specifically configured to: perform voice recognition based on the fusion feature using a recognition module of the multi-modal speech recognition model to obtain the voice recognition result of the voice signal.
- The apparatus according to claim 12, characterized in that the voice information is of N types, N being a positive integer greater than or equal to 1; and the extraction module, when extracting voice information from the voice signal using the voice information extraction module of the multi-modal speech recognition model, is specifically configured to: extract N types of voice information from the voice signal using the voice information extraction module of the multi-modal speech recognition model, with the extraction direction being that the feature obtained by fusing the extracted N types of voice information with the image feature sequence extracted from the image sequence approaches one type of voice information obtained after removing noise from the voice signal; or, if N is greater than 1, extract N types of voice information from the voice signal using the voice information extraction module of the multi-modal speech recognition model, with the extraction direction being that the feature obtained by fusing each extracted type of voice information with the image feature sequence extracted from the image sequence approaches that type of voice information obtained after removing noise from the voice signal.
- A speech recognition device, characterized by comprising a memory and a processor; the memory is configured to store a program; and the processor is configured to execute the program to implement the steps of the speech recognition method according to any one of claims 1 to 9.
- A computer-readable storage medium having a computer program stored thereon, characterized in that, when the computer program is executed by a processor, the steps of the speech recognition method according to any one of claims 1 to 9 are implemented.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
- CN202010129952.9A CN111312217A (zh) | 2020-02-28 | 2020-02-28 | Speech recognition method, apparatus, device and storage medium |
CN202010129952.9 | 2020-02-28 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2021169023A1 true WO2021169023A1 (zh) | 2021-09-02 |
Family
ID=71159496
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2020/087115 WO2021169023A1 (zh) | 2020-02-28 | 2020-04-27 | 语音识别方法、装置、设备及存储介质 |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN111312217A (zh) |
WO (1) | WO2021169023A1 (zh) |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
- CN111883130A (zh) * | 2020-08-03 | 2020-11-03 | 上海茂声智能科技有限公司 | Fused speech recognition method, apparatus, system, device and storage medium |
- CN112786052B (zh) * | 2020-12-30 | 2024-05-31 | 科大讯飞股份有限公司 | Speech recognition method, electronic device and storage apparatus |
- CN113470617B (zh) * | 2021-06-28 | 2024-05-31 | 科大讯飞股份有限公司 | Speech recognition method, electronic device and storage apparatus |
- CN117116253B (zh) * | 2023-10-23 | 2024-01-12 | 摩尔线程智能科技(北京)有限责任公司 | Training method and apparatus for an initial model, and speech recognition method and apparatus |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105389097A (zh) * | 2014-09-03 | 2016-03-09 | 中兴通讯股份有限公司 | 一种人机交互装置及方法 |
CN106328156A (zh) * | 2016-08-22 | 2017-01-11 | 华南理工大学 | 一种音视频信息融合的麦克风阵列语音增强系统及方法 |
US20170236516A1 (en) * | 2016-02-16 | 2017-08-17 | Carnegie Mellon University, A Pennsylvania Non-Profit Corporation | System and Method for Audio-Visual Speech Recognition |
CN109814718A (zh) * | 2019-01-30 | 2019-05-28 | 天津大学 | 一种基于Kinect V2的多模态信息采集系统 |
CN110111783A (zh) * | 2019-04-10 | 2019-08-09 | 天津大学 | 一种基于深度神经网络的多模态语音识别方法 |
CN110503957A (zh) * | 2019-08-30 | 2019-11-26 | 上海依图信息技术有限公司 | 一种基于图像去噪的语音识别方法及装置 |
CN110544479A (zh) * | 2019-08-30 | 2019-12-06 | 上海依图信息技术有限公司 | 一种去噪的语音识别方法及装置 |
CN110545396A (zh) * | 2019-08-30 | 2019-12-06 | 上海依图信息技术有限公司 | 一种基于定位去噪的语音识别方法及装置 |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10178301B1 (en) * | 2015-06-25 | 2019-01-08 | Amazon Technologies, Inc. | User identification based on voice and face |
CN109147763B (zh) * | 2018-07-10 | 2020-08-11 | 深圳市感动智能科技有限公司 | 一种基于神经网络和逆熵加权的音视频关键词识别方法和装置 |
- 2020-02-28: CN application CN202010129952.9A filed; published as CN111312217A (legal status: Pending)
- 2020-04-27: PCT application PCT/CN2020/087115 filed; published as WO2021169023A1 (legal status: Application Filing)
Also Published As
Publication number | Publication date |
---|---|
CN111312217A (zh) | 2020-06-19 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
- WO2021169023A1 (zh) | Speech recognition method, apparatus, device and storage medium | |
Zhou et al. | Modality attention for end-to-end audio-visual speech recognition | |
Czyzewski et al. | An audio-visual corpus for multimodal automatic speech recognition | |
US7454342B2 (en) | Coupled hidden Markov model (CHMM) for continuous audiovisual speech recognition | |
US7472063B2 (en) | Audio-visual feature fusion and support vector machine useful for continuous speech recognition | |
US9123347B2 (en) | Apparatus and method for eliminating noise | |
Kinnunen et al. | Voice activity detection using MFCC features and support vector machine | |
TW502249B (en) | Segmentation approach for speech recognition systems | |
JP6501260B2 (ja) | 音響処理装置及び音響処理方法 | |
JP4220449B2 (ja) | インデキシング装置、インデキシング方法およびインデキシングプログラム | |
US10748544B2 (en) | Voice processing device, voice processing method, and program | |
WO2014117547A1 (en) | Method and device for keyword detection | |
JP6464005B2 (ja) | 雑音抑圧音声認識装置およびそのプログラム | |
WO2022121155A1 (zh) | 基于元学习的自适应语音识别方法、装置、设备及介质 | |
JP2011191423A (ja) | 発話認識装置、発話認識方法 | |
Almajai et al. | Using audio-visual features for robust voice activity detection in clean and noisy speech | |
CN111554279A (zh) | 一种基于Kinect的多模态人机交互系统 | |
Saenko et al. | Articulatory features for robust visual speech recognition | |
WO2023035969A1 (zh) | 语音与图像同步性的衡量方法、模型的训练方法及装置 | |
CN111027675B (zh) | 一种多媒体播放设置自动调节方法及系统 | |
Maka et al. | An analysis of the influence of acoustical adverse conditions on speaker gender identification | |
JP2011033879A (ja) | サンプルを用いずあらゆる言語を識別可能な識別方法 | |
Bagi et al. | Improved recognition rate of language identification system in noisy environment | |
Karpagavalli et al. | A hierarchical approach in tamil phoneme classification using support vector machine | |
Verma et al. | Performance analysis of speaker identification using gaussian mixture model and support vector machine |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 20921398 Country of ref document: EP Kind code of ref document: A1 |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 20921398 Country of ref document: EP Kind code of ref document: A1 |
|
32PN | Ep: public notification in the ep bulletin as address of the adressee cannot be established |
Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 03.02.2023) |
|