CN113782048A - Multi-modal voice separation method, training method and related device - Google Patents

Multi-modal voice separation method, training method and related device

Info

Publication number
CN113782048A
CN113782048A (application number CN202111122074.9A)
Authority
CN
China
Prior art keywords
lip
network
voice
feature extraction
video
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111122074.9A
Other languages
Chinese (zh)
Other versions
CN113782048B (en)
Inventor
潘峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
iFlytek Co Ltd
Original Assignee
iFlytek Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by iFlytek Co Ltd filed Critical iFlytek Co Ltd
Priority to CN202111122074.9A priority Critical patent/CN113782048B/en
Priority claimed from CN202111122074.9A external-priority patent/CN113782048B/en
Publication of CN113782048A publication Critical patent/CN113782048A/en
Application granted granted Critical
Publication of CN113782048B publication Critical patent/CN113782048B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0272 Voice signal separating
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/2155 Generating training patterns; Bootstrap methods, e.g. bagging or boosting characterised by the incorporation of unlabelled data, e.g. multiple instance learning [MIL], semi-supervised techniques using expectation-maximisation [EM] or naïve labelling
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/088 Non-supervised learning, e.g. competitive learning
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/57 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for processing of video signals

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Multimedia (AREA)
  • Acoustics & Sound (AREA)
  • General Engineering & Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Quality & Reliability (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

The application discloses a multi-modal voice separation method, a training method, and a related device. The multi-modal voice separation method includes: obtaining lip video information and audio information containing a target user; and obtaining the voice of the target user according to the lip video information, the audio information, and a trained multi-modal voice separation network. The multi-modal voice separation network comprises a lip feature extraction sub-network, and the lip feature extraction sub-network is pre-trained with an unsupervised training set before the multi-modal voice separation network is trained. In this way, the generalization of the multi-modal voice separation network to various languages and dialects can be improved.

Description

Multi-modal voice separation method, training method and related device
Technical Field
The application belongs to the technical field of voice recognition, and particularly relates to a multi-modal voice separation method, a training method and a related device.
Background
As human-computer interaction has evolved from traditional touch interaction through voice interaction to today's multi-modal interaction, the efficiency, convenience, comfort, and safety it brings have become new pursuits of users. As one of the most important technologies of the multi-modal front end, multi-modal speech separation has become a research hotspot in related fields. The separation quality, the computational efficiency, and the generality across various languages and dialects are the most central issues of multi-modal speech separation.
Disclosure of Invention
The application provides a multi-modal voice separation method, a training method and a related device, which are used for improving the generalization of a multi-modal voice separation network to various languages and dialects.
In order to solve the above technical problem, one technical solution adopted by the present application is to provide a multi-modal speech separation method, including: obtaining lip video information and audio information containing a target user; and obtaining the voice of the target user according to the lip video information, the audio information, and a trained multi-modal voice separation network; wherein the multi-modal voice separation network comprises a lip feature extraction sub-network, and the lip feature extraction sub-network is pre-trained with an unsupervised training set before the multi-modal voice separation network is trained.
In order to solve the above technical problem, another technical solution adopted by the present application is to provide a multi-modal speech separation network training method, including: training a first lip recognition network comprising a lip feature extraction sub-network with an unsupervised training set; updating the parameters of the lip feature extraction sub-network in a second lip recognition network with the trained parameters of the lip feature extraction sub-network in the first lip recognition network, and training the second lip recognition network with a supervised training set; and updating the parameters of the lip feature extraction sub-network in the multi-modal voice separation network with the trained parameters of the lip feature extraction sub-network in the second lip recognition network, and training the multi-modal voice separation network with a separation network training set.
In order to solve the above technical problem, another technical solution adopted by the present application is to provide a multi-modal speech separation apparatus, comprising: a first obtaining module, used for obtaining lip video information and audio information containing a target user; and a second obtaining module, used for obtaining the voice of the target user according to the lip video information, the audio information, and a trained multi-modal voice separation network; wherein the multi-modal voice separation network comprises a lip feature extraction sub-network, and the lip feature extraction sub-network is pre-trained with an unsupervised training set before the multi-modal voice separation network is trained.
In order to solve the above technical problem, the present application adopts another technical solution: there is provided an electronic device comprising a memory and a processor coupled to each other, wherein the memory stores program instructions, and the processor is configured to execute the program instructions to implement the multimodal speech separation method described in any of the above embodiments or the multimodal speech separation network training method described in any of the above embodiments.
In order to solve the above technical problem, the present application adopts another technical solution: there is provided a memory device storing program instructions executable by a processor to implement the multimodal speech separation method described in any of the above embodiments, or the multimodal speech separation network training method described in any of the above embodiments.
Different from the prior art, the beneficial effects of the present application are as follows: in the multi-modal voice separation method provided by the present application, before the multi-modal voice separation network is trained, the lip feature extraction sub-network in the multi-modal voice separation network is pre-trained in an unsupervised manner with an unsupervised training set. Although the training data in the unsupervised training set carry no labels, the target of the lip feature training is still phonemes, and the training is still carried out at the feature level; after the lip feature extraction sub-network is pre-trained on a large amount of unsupervised data, the generalization of the model to various languages and dialects can be increased and overfitting avoided.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed in the description of the embodiments are briefly introduced below. The drawings in the following description are only some embodiments of the present application, and those skilled in the art can obtain other drawings from them without inventive effort, wherein:
FIG. 1 is a schematic flow chart diagram of an embodiment of a multi-modal speech separation method according to the present application;
FIG. 2 is a schematic diagram of a network structure of an embodiment of the multimodal speech separation network shown in FIG. 1;
FIG. 3 is a schematic flowchart illustrating an embodiment of a multi-modal speech separation network training method according to the present application;
FIG. 4 is a schematic network structure diagram of an embodiment of the first lip recognition network of FIG. 3;
FIG. 5 is a schematic diagram of a network structure of an embodiment of the second lip recognition network of FIG. 3;
FIG. 6 is a schematic diagram of a framework structure of an embodiment of the multi-modal speech separation apparatus according to the present application;
FIG. 7 is a schematic structural diagram of an embodiment of an electronic device according to the present application;
FIG. 8 is a schematic structural diagram of an embodiment of a memory device according to the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Referring to fig. 1, fig. 1 is a schematic flow chart of an embodiment of a multi-modal speech separation method according to the present application, where the multi-modal speech separation method specifically includes:
s101: lip video information and audio information containing the target user are obtained.
Specifically, in this embodiment, the implementation of step S101 may include: acquiring video data and audio data collected for the target user by a video acquisition device (such as a camera) and an audio acquisition device (such as a microphone), respectively; extracting the lip video information from the video data, and obtaining a mixed speech feature sequence from the audio data as the audio information.
Specifically, for the video data, the face region can be cropped from the video with a facial key-point detection tool to obtain a lip-region video image of the target user, and the three-channel RGB image can be converted into a single-channel grayscale image with the built-in RGB-to-grayscale conversion of OpenCV (cv2), yielding the lip video information of the target user. Obtaining the lip video information in this way removes redundant information from the video data and benefits the subsequent lip feature extraction. For the audio data, a short-time Fourier transform and a filterbank-domain conversion can be applied to extract a magnitude spectrum, which serves as the mixed speech feature sequence. When the audio data contain multiple channels, inter-channel phase difference features can be added to the extracted magnitude spectrum.
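The preprocessing described above can be sketched roughly as follows; this is an illustrative example rather than the patent's implementation, and it assumes a 68-point facial landmark detector, OpenCV for the image operations, and librosa for the audio front end (frame sizes and feature dimensions are placeholders):

import cv2
import numpy as np
import librosa

def extract_lip_frames(frames, landmark_detector, size=96):
    """Crop the lip region from each video frame and convert it to grayscale."""
    lips = []
    for frame in frames:
        pts = landmark_detector(frame)            # assumed: returns 68 facial landmarks (x, y)
        mouth = np.array(pts[48:68], dtype=np.int32)   # mouth points in the 68-point convention
        x, y, w, h = cv2.boundingRect(mouth)
        crop = frame[y:y + h, x:x + w]
        gray = cv2.cvtColor(crop, cv2.COLOR_RGB2GRAY)  # 3-channel RGB -> single-channel gray
        lips.append(cv2.resize(gray, (size, size)))
    return np.stack(lips)                         # (T_video, size, size)

def mixed_speech_features(wav, sr=16000, n_fft=512, hop=160, n_mels=80):
    """Magnitude spectrum in a filterbank (mel) domain, used as the mixed speech feature sequence."""
    spec = np.abs(librosa.stft(wav, n_fft=n_fft, hop_length=hop))        # (F, T_audio)
    fbank = librosa.feature.melspectrogram(S=spec ** 2, sr=sr, n_mels=n_mels)
    return np.log(fbank + 1e-8).T                 # (T_audio, n_mels)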
S102: Obtaining the voice of the target user according to the lip video information, the audio information, and the trained multi-modal voice separation network; the multi-modal voice separation network comprises a lip feature extraction sub-network, and the lip feature extraction sub-network is pre-trained with an unsupervised training set before the multi-modal voice separation network is trained.
Specifically, the training process mentioned in the above step S102 will be described in detail later.
Optionally, in this embodiment, a specific implementation process of the step S102 may be:
A. Obtaining a first speech presence probability of the target user according to the lip video information, the audio information, and the trained multi-modal voice separation network.
Referring to fig. 2, fig. 2 is a schematic network structure diagram of an embodiment of the multi-modal speech separation network in fig. 1. The inputs of the multi-modal voice separation network 1 are the lip video information and the audio information, and the output is a first speech presence probability mask of the target user. The multi-modal speech separation network 1 comprises a lip feature extraction sub-network 10, whose input is the lip video information; lip features corresponding to each video frame are extracted from the lip video information, and the lip features of a plurality of video frames form a lip feature sequence. Optionally, in this embodiment, the lip feature extraction sub-network 10 comprises a pseudo-3D convolution (e.g., P3D conv) and a residual network (e.g., ResNet18) connected to each other, and the lip video information is input to the pseudo-3D convolution. Existing lip feature extraction sub-networks generally adopt 3D convolution; replacing 3D convolution with pseudo-3D convolution reduces the parameter count and computation of the lip feature extraction sub-network 10 in the multi-modal speech separation network 1, so that the whole system can be rapidly deployed and widely applied.
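The patent does not spell out the exact factorization of the pseudo-3D convolution; a common P3D-style choice is to replace a 3 x k x k 3D kernel with a 1 x k x k spatial convolution followed by a 3 x 1 x 1 temporal convolution. A minimal PyTorch sketch under that assumption:

import torch.nn as nn

class Pseudo3DConv(nn.Module):
    """P3D-style factorized 3D convolution: a spatial conv followed by a temporal conv.

    Replaces a full 3 x k x k Conv3d with a cheaper factorized pair of convolutions.
    """
    def __init__(self, in_ch, out_ch, k=3):
        super().__init__()
        self.spatial = nn.Conv3d(in_ch, out_ch, kernel_size=(1, k, k),
                                 padding=(0, k // 2, k // 2), bias=False)
        self.temporal = nn.Conv3d(out_ch, out_ch, kernel_size=(3, 1, 1),
                                  padding=(1, 0, 0), bias=False)
        self.norm = nn.BatchNorm3d(out_ch)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):                 # x: (B, C, T, H, W)
        return self.act(self.norm(self.temporal(self.spatial(x))))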
With continued reference to fig. 2, besides the above lip feature extraction sub-network 10, the multi-modal speech separation network 1 includes an encoding layer 12, a fusion layer 14, and a separation sub-network 16; optionally, the separation sub-network 16 may comprise a CNN, a GRU, and an FC layer connected to each other. Further, the process of implementing step S102 with the multi-modal speech separation network 1 includes: the lip feature extraction sub-network 10 receives the lip video information and performs feature extraction on it to obtain a lip feature sequence; the encoding layer 12 takes the output of the lip feature extraction sub-network 10 and encodes the lip feature sequence, the purpose of the encoding layer 12 being to complete modality conversion and automatically adapt to audio-video frame offsets; the fusion layer 14 takes the encoded lip feature sequence and the audio information (i.e., the mixed speech feature sequence) and fuses them to obtain a fused feature sequence; the separation sub-network 16 takes the output of the fusion layer 14 and maps the fused feature sequence to the first speech presence probability mask. In addition, with continued reference to fig. 2, before the audio information is input to the fusion layer 14, it may also pass through a CNN convolutional neural network to further extract features.
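A rough PyTorch sketch of the separation path described above, assuming concatenation as the fusion operation and illustrative feature dimensions (the patent fixes neither); the lip feature extraction sub-network is assumed to have already produced per-frame lip features aligned to the audio frame rate:

import torch
import torch.nn as nn

class MultiModalSeparator(nn.Module):
    """Sketch of encoding layer, fusion layer, and separation sub-network: lip features
    and audio features are fused and mapped to a per-frame speech presence probability mask."""
    def __init__(self, lip_dim=512, audio_dim=80, hidden=256):
        super().__init__()
        self.encode = nn.Sequential(nn.Linear(lip_dim, hidden), nn.ReLU())   # encoding layer
        self.audio_cnn = nn.Sequential(                                      # optional CNN on audio
            nn.Conv1d(audio_dim, hidden, kernel_size=3, padding=1), nn.ReLU())
        self.separator = nn.GRU(hidden * 2, hidden, num_layers=2, batch_first=True)
        self.mask_fc = nn.Linear(hidden, audio_dim)

    def forward(self, lip_feats, audio_feats):
        # lip_feats: (B, T, lip_dim); audio_feats: (B, T, audio_dim)
        v = self.encode(lip_feats)
        a = self.audio_cnn(audio_feats.transpose(1, 2)).transpose(1, 2)
        fused = torch.cat([v, a], dim=-1)           # fusion layer (concatenation assumed)
        h, _ = self.separator(fused)
        return torch.sigmoid(self.mask_fc(h))       # speech presence probability mask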
B. Obtaining the voice of the target user from the audio information according to the first speech presence probability. Specifically, step B may be implemented as follows: the first speech presence probability mask is multiplied with the audio information, and a filterbank-to-frequency conversion and an inverse short-time Fourier transform are performed to obtain the separated voice of the target user.
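A minimal sketch of step B, assuming the predicted mask has already been mapped back from the filterbank domain to linear-frequency STFT bins and that librosa provides the inverse transform:

import numpy as np
import librosa

def apply_mask_and_reconstruct(mask, mixture_wav, n_fft=512, hop=160):
    """Multiply the speech presence probability mask with the mixture spectrum and
    reconstruct the target speech with an inverse STFT."""
    spec = librosa.stft(mixture_wav, n_fft=n_fft, hop_length=hop)   # complex (F, T)
    mask = np.clip(mask, 0.0, 1.0)                                  # (F, T), linear-frequency bins
    frames = min(mask.shape[1], spec.shape[1])
    return librosa.istft(spec[:, :frames] * mask[:, :frames], hop_length=hop)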
In the foregoing embodiment, before the multi-modal speech separation network is trained, the lip feature extraction sub-network in the multi-modal speech separation network is trained in an unsupervised manner with an unsupervised training set. Although the training data in the unsupervised training set carry no labels, the target of the lip feature training is still phonemes, and the training is still carried out at the feature level; after the lip feature extraction sub-network is pre-trained on a large amount of unsupervised data, the generalization of the model to various languages and dialects can be increased and overfitting avoided.
Steps S101 to S102 above are described mainly from the application perspective; the training process is described below. Referring to fig. 3, fig. 3 is a schematic flowchart of an embodiment of the multi-modal speech separation network training method of the present application. The training process mainly includes:
S201: Training a first lip recognition network comprising the lip feature extraction sub-network with an unsupervised training set.
Specifically, in this embodiment, training data may be prepared before step S201. Audio-video with consistent lip movements and speech can be obtained by actual collection or from open-source data sets, covering various dialects and languages. The collected audio-video data can be roughly divided into three categories: 1) Chinese and English audio-video data with text labels; optionally, in this embodiment, since Chinese and English are the most widely used, the text-labeled data are preferentially data labeled with Chinese or English text. 2) Chinese and English audio-video data without text labels. 3) Audio-video data of various dialects and other languages without text labels, for example, Cantonese, Sichuanese, German, French, and so on. The supervised training set used in step S202 may be constructed from the Chinese and English audio-video data with text labels, and the unsupervised training set used in step S201 may be constructed from the Chinese and English audio-video data without text labels. The separation network training set used in step S203 may be constructed from all of the above audio-video data. In addition, it should be noted that in other embodiments the languages corresponding to the unsupervised and supervised training sets may also be languages other than Chinese and English, which is not limited by this application. The audio-video with consistent lip movements and speech comprises single-person lip videos and corresponding single-person voices, where the meaning expressed by a unit of lip video is consistent with the meaning expressed by the corresponding unit of voice. The single-person lip video can be obtained as follows: the face region is cropped from the video with a facial key-point detection tool to obtain a lip-region video image, and the three-channel RGB image is converted into a single-channel grayscale image with the built-in RGB-to-grayscale conversion of OpenCV (cv2), yielding the single-person lip video.
With the above preparation of training data, the constructed unsupervised training set comprises a plurality of groups of mutually corresponding single-person voices and single-person lip videos, and the implementation of step S201 includes: A. performing feature extraction on the single-person lip video to obtain a lip feature sequence, and performing feature extraction on the single-person voice corresponding to the single-person lip video to obtain a first audio feature sequence; B. obtaining a loss from the lip feature sequence and the first audio feature sequence, and adjusting the parameters of the lip feature extraction sub-network in the first lip recognition network according to the loss. Optionally, in general, a second audio feature sequence corresponding to the lip feature sequence may be obtained from the lip feature sequence, a mean square error (MSE) loss between the second audio feature sequence and the extracted first audio feature sequence is computed, and the parameters of the lip feature extraction sub-network in the first lip recognition network are then adjusted according to this loss. The criterion for stopping the training of the first lip recognition network may be that the loss obtained from the lip feature sequence and the first audio feature sequence converges, or that the number of training rounds reaches a preset number, etc.; the present application does not limit this.
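One unsupervised pre-training step might look like the following sketch; the small head that maps lip features into the audio feature space (lip_to_audio) is an assumed component, the two sequences are assumed to be aligned to the same frame rate, and PyTorch is used for illustration:

import torch
import torch.nn.functional as F

def unsupervised_pretrain_step(lip_net, lip_to_audio, frozen_speech_net,
                               lip_video, speech, optimizer):
    """One pre-training step on an unlabeled pair of lip video and speech.

    lip_net: lip feature extraction sub-network (trainable)
    lip_to_audio: assumed head mapping lip features to the audio feature space
    frozen_speech_net: pre-trained speech feature extraction sub-network (parameters fixed)
    """
    with torch.no_grad():
        target = frozen_speech_net(speech)          # first audio feature sequence
    pred = lip_to_audio(lip_net(lip_video))         # second audio feature sequence
    loss = F.mse_loss(pred, target)                 # MSE loss between the two sequences
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()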
Optionally, referring to fig. 4, fig. 4 is a schematic network structure diagram of an embodiment of the first lip recognition network in fig. 3. The first lip recognition network 2 comprises the lip feature extraction sub-network 10 and a speech feature extraction sub-network 22. The lip feature extraction sub-network 10 performs feature extraction on the single-person lip video to obtain the lip feature sequence; optionally, in this embodiment, the lip feature extraction sub-network 10 in the first lip recognition network 2 has the same structure as in fig. 2, i.e., it comprises a pseudo-3D convolution and a residual network connected to each other. The speech feature extraction sub-network 22 performs feature extraction on the single-person voice corresponding to the single-person lip video to obtain the first audio feature sequence; its parameters are trained and fixed in advance, that is, during the training of the first lip recognition network 2 the parameters of the speech feature extraction sub-network 22 are not changed, and the specific structure of the speech feature extraction sub-network 22 may be any one from the prior art, which is not described again here. The architecture of the first lip recognition network 2 is simple and its computation is small; and fixing the speech feature extraction sub-network 22 before training the first lip recognition network 2 can improve the training accuracy of the lip feature extraction sub-network 10.
The above pre-training on the unsupervised training set can improve the recognition accuracy of the lip feature extraction sub-network 10 and the generalization of the model.
S202: Updating the parameters of the lip feature extraction sub-network in a second lip recognition network with the trained parameters of the lip feature extraction sub-network in the first lip recognition network, and training the second lip recognition network with a supervised training set.
Specifically, in this embodiment, the second lip recognition network includes the same lip feature extraction sub-network as the first lip recognition network, and the parameters of the lip feature extraction sub-network in the second lip recognition network are initialized with the parameter values of the trained lip feature extraction sub-network in the first lip recognition network.
Optionally, with the above preparation of training data, the constructed supervised training set comprises a plurality of single-person lip videos, and each video frame in a single-person lip video is provided with a corresponding acoustic label, where video frames whose lip similarity exceeds a threshold share the same acoustic label. In this embodiment, each single-person lip video in the supervised training set has a corresponding single-person voice, and the single-person voice has a corresponding text label. Before step S202, the acoustic labels may be set as follows: forced alignment (FA) is performed between the single-person lip video and the corresponding text label to obtain a phoneme label for each video frame; phoneme labels whose lip similarity exceeds the threshold are then set to the same acoustic label, and the specific threshold can be set according to actual requirements.
In the prior art, the acoustic label corresponding to a single-person lip video is set as follows: forced alignment (FA) is performed between the single-person lip video and the corresponding text label to obtain a phoneme label for each video frame, and the phoneme label is used directly as the acoustic label of that frame. In fact, however, the phoneme labels produced by forcibly aligning a single-person lip video with its text label are redundant for lip recognition: for example, among the Chinese initials, "b", "p" and "m" have the same lip shape, "d", "t", "l" and "n" have the same lip shape, and "zh", "ch" and "sh" have the same lip shape; among the finals, "āi", "ái" and "ài" have the same lip shape, and so on. In other words, one lip feature may correspond to multiple phoneme labels, so such a phoneme-level classification is not reasonable. The present application improves on this: the phoneme labels are clustered according to lip state or viseme, and phoneme labels whose clustered lip similarity exceeds the threshold are re-assigned the same acoustic label, which greatly reduces the number of output nodes, the parameter count, and the training difficulty of the lip recognition network. For example, when the single-person lip video is in Chinese or English, clustering may be performed on a tri-phone dictionary, a phoneme modeling method common to Chinese and English, and different phoneme labels with similar lip shapes are merged during clustering; for instance, "b", "p" and "m" are clustered to the same acoustic label, "d", "t", "l" and "n" to the same acoustic label, "zh", "ch" and "sh" to the same acoustic label, and so on.
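The clustering amounts to a phoneme-to-acoustic-label mapping; the sketch below only encodes the example groups mentioned in the text and is not the patent's full viseme clustering:

# Illustrative viseme-style groups; only the examples given in the text are listed.
VISEME_GROUPS = [
    {"b", "p", "m"},
    {"d", "t", "l", "n"},
    {"zh", "ch", "sh"},
]

def build_acoustic_label_map(all_phonemes):
    """Map each phoneme label to an acoustic label id; phonemes whose lip similarity
    exceeds the threshold share one id, and the remaining phonemes keep their own ids."""
    label_map, next_id = {}, 0
    for group in VISEME_GROUPS:
        for ph in group:
            label_map[ph] = next_id
        next_id += 1
    for ph in all_phonemes:
        if ph not in label_map:
            label_map[ph] = next_id
            next_id += 1
    return label_map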
Further, training the second lip recognition network with the supervised training set in step S202 includes: A. performing feature extraction on the single-person lip video to obtain the lip feature of each video frame; B. obtaining a corresponding predicted label from the lip feature of each video frame; C. computing a loss (e.g., a cross-entropy loss) between the predicted label and the corresponding acoustic label, and adjusting the parameters of the second lip recognition network according to the loss. The criterion for stopping the training of the second lip recognition network may be that the loss between the predicted labels and the corresponding acoustic labels converges, or that the number of training rounds reaches a preset number, etc.; the present application does not limit this. This supervised training of the second lip recognition network is simple and mature, and improves the accuracy of the lip recognition sub-network. In addition, before step A, image data augmentation, such as image rotation, pixel value perturbation, and noise addition, can be applied to the single-person lip video to enhance the robustness of the model.
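A minimal sketch of one supervised training step under the assumptions above (per-frame lip features, a single full connection layer over the clustered acoustic labels, cross-entropy loss):

import torch.nn.functional as F

def supervised_train_step(lip_net, fc_layer, lip_video, acoustic_labels, optimizer):
    """One supervised step: per-frame lip features -> predicted acoustic labels,
    with cross-entropy against the clustered acoustic labels."""
    feats = lip_net(lip_video)                    # (B, T, D) lip feature per video frame
    logits = fc_layer(feats)                      # (B, T, num_acoustic_labels)
    loss = F.cross_entropy(logits.flatten(0, 1), acoustic_labels.flatten())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()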
Optionally, referring to fig. 5, fig. 5 is a schematic network structure diagram of an embodiment of the second lip recognition network in fig. 3. The second lip recognition network 3 comprises the lip feature extraction sub-network 10 and a full connection layer 32. The lip feature extraction sub-network 10 performs feature extraction on the single-person lip video to obtain the lip feature of each video frame; the full connection layer 32 obtains the corresponding predicted label from the lip feature of each video frame; and the number of nodes in the full connection layer 32 equals the total number of acoustic labels obtained by clustering all phonemes of the language of the single-person lip video according to lip similarity. That is, the number of nodes in the full connection layer 32 of the second lip recognition network 3 provided by this application is much smaller than in the prior art, since it is set according to the total number of acoustic labels after clustering. This design greatly reduces the number of output nodes, the parameter count, and the training difficulty of the lip recognition network. In addition, when adjusting the parameters of the second lip recognition network 3, the parameters of the lip feature extraction sub-network 10 and the full connection layer 32 may be adjusted together.
S203: Updating the parameters of the lip feature extraction sub-network in the multi-modal voice separation network with the trained parameters of the lip feature extraction sub-network in the second lip recognition network, and training the multi-modal voice separation network with a separation network training set.
Specifically, in this embodiment, the multi-modal voice separation network includes the same lip feature extraction sub-network as the first lip recognition network, and the parameters of the lip feature extraction sub-network in the multi-modal voice separation network are initialized with the parameter values of the trained lip feature extraction sub-network in the second lip recognition network.
Optionally, with the above preparation of training data, the constructed separation network training set comprises a plurality of groups of single-person lip videos and mixed speech feature sequences; a mixed speech feature sequence contains the features of the single-person voice corresponding to the single-person lip video in the same group, and each mixed speech feature sequence is provided with a voice tag. In this embodiment, the separation network training set includes audio-video data of various dialects and languages, which enhances the generalization of the multi-modal voice separation network to various dialects and languages and makes it independent of dialect and language.
Optionally, before step S203, the voice tag of each mixed speech feature sequence is set as follows. A. All audio obtained during training data preparation is cleaned to remove background noise, reverberation, and the like, and endpoint detection (VAD) is performed to set the amplitude of non-speech segments to 0. B. Mixed speech (which may be multi-channel) is constructed from all of the cleaned audio and noise; the mixed speech can be constructed by mixing one single-person voice with a noise voice, or by mixing multiple single-person voices with a noise voice, where the noise voice may be taken from a noise bank. C. Feature extraction is performed on the mixed speech to obtain the mixed speech feature sequence; specifically, features may be extracted with a filterbank, MFCC, or the like, and in this embodiment the mixed speech feature sequence may be a magnitude spectrum in the filterbank domain. D. The voice tag corresponding to the mixed speech feature sequence is set as follows: the energies of all single-person voices and of the noise voice in the mixed speech are summed, and the ratio of the energy of the single-person voice corresponding to the current single-person lip video to this sum is used as the voice tag (an ideal ratio mask, IRM). For a single-speaker enhancement task, the voice tag is computed as IRM = S^2 / (S^2 + N^2), where S^2 is the energy of the single-person voice in the mixed speech and N^2 is the energy of the noise voice. For a multi-speaker separation task, the voice tag is computed as IRM = S_1^2 / (S_1^2 + S_2^2 + ... + S_n^2 + N^2), where S_1^2 is the energy of the first speaker's voice in the mixed speech, S_2^2 is the energy of the second speaker's voice, S_n^2 is the energy of the n-th speaker's voice, and N^2 is the energy of the noise voice.
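The voice tag computation can be sketched as follows; the function signature and the per-frame or per-bin granularity of the energies are illustrative assumptions:

import numpy as np

def irm_label(target_energy, other_speaker_energies, noise_energy):
    """IRM-style voice tag: target speaker energy over the summed energy of all
    single-person voices and the noise (per frame or per time-frequency bin)."""
    denom = target_energy + sum(other_speaker_energies) + noise_energy
    return target_energy / np.maximum(denom, 1e-8)

# Single-speaker enhancement: IRM = S^2 / (S^2 + N^2)
#   irm = irm_label(S ** 2, [], N ** 2)
# Multi-speaker separation:   IRM = S_1^2 / (S_1^2 + ... + S_n^2 + N^2)
#   irm = irm_label(S1 ** 2, [S2 ** 2, Sn ** 2], N ** 2)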
Further, referring to fig. 2, training the multi-modal speech separation network with the separation network training set in step S203 specifically includes: A. performing feature extraction on the single-person lip video to obtain a lip feature sequence; specifically, the lip feature extraction sub-network 10 may be employed for this. B. encoding the lip feature sequence, and fusing the encoded lip feature sequence with the mixed speech feature sequence to obtain a fused feature sequence; specifically, the encoding may be implemented by the encoding layer 12 and the fusion by the fusion layer 14. C. obtaining, from the fused feature sequence, a second speech presence probability for the target corresponding to the single-person lip video; specifically, this may be implemented by the separation sub-network 16. D. computing a loss from the second speech presence probability and the voice tag, and adjusting the parameters of the multi-modal voice separation network according to the loss; specifically, the parameters of the lip feature extraction sub-network 10, the encoding layer 12, the fusion layer 14, and the separation sub-network 16 in fig. 2 may be adjusted, and when a CNN convolutional neural network is further provided before the fusion layer 14 in fig. 2, its parameters may also be adjusted together. The criterion for stopping the training of the multi-modal speech separation network may be that the loss obtained from the second speech presence probability and the voice tag converges, or that the number of training rounds reaches a preset number, etc.; the present application does not limit this. This training procedure of the multi-modal speech separation network 1 is simple, mature, and easy to implement. During training, the lip embedding output by the lip feature extraction sub-network 10 is fed into the speech separation network as auxiliary information, and by exploiting the consistency between lip movements and speech, the multi-modal speech separation network 1 can learn the clean-speech tag of the target speaker more easily.
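One training step of the separation network might look like the sketch below; the patent does not name the exact loss between the second speech presence probability and the IRM voice tag, so MSE is assumed here, and the model is assumed to wrap the lip feature extraction sub-network together with the encoding, fusion, and separation components:

import torch.nn.functional as F

def separation_train_step(model, lip_video, mixed_feats, irm, optimizer):
    """One training step of the multi-modal separation network: the predicted
    second speech presence probability is regressed onto the IRM voice tag."""
    pred_mask = model(lip_video, mixed_feats)     # second speech presence probability
    loss = F.mse_loss(pred_mask, irm)             # MSE against the voice tag (assumed loss)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()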
Optionally, when the mixed speech contains speech from multiple channels, before the encoded lip feature sequence and the mixed speech feature sequence are fused in step B, the mixed speech feature sequence is combined with the inter-channel phase difference features of the mixed speech, and the subsequent fusion then combines the encoded lip feature sequence with the mixed speech feature sequence carrying the phase difference features. In this way, the spatial information of the different channels assists the training of the multi-modal voice separation network. In addition, the second speech presence probability mask of the multi-modal voice separation network 1 can guide adaptive multi-channel beamforming, such as MVDR or GEVD, which exploits spatial information to obtain a better separation effect. At the same time, the improvement that multi-modal cues bring to multi-channel speech separation also makes separation possible when spatial information is unavailable.
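Inter-channel phase difference features of the kind described here are often computed as the cosine and sine of the phase difference to a reference channel; a sketch under that assumption:

import numpy as np
import librosa

def ipd_features(multichannel_wav, ref_channel=0, n_fft=512, hop=160):
    """Inter-channel phase difference features: cos/sin of the phase difference between
    each channel and a reference channel, to be concatenated with the magnitude features."""
    specs = [librosa.stft(ch, n_fft=n_fft, hop_length=hop) for ch in multichannel_wav]
    ref_phase = np.angle(specs[ref_channel])
    feats = []
    for i, spec in enumerate(specs):
        if i == ref_channel:
            continue
        diff = np.angle(spec) - ref_phase
        feats.extend([np.cos(diff), np.sin(diff)])
    return np.concatenate(feats, axis=0)          # (n_pairs * 2 * F, T)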
In summary, during training, on the one hand, data with text labels (e.g., labeled Chinese and English data) are used as the supervised training set and a large amount of data without text labels (e.g., unlabeled Chinese and English data) is used as the unsupervised training set, which improves the accuracy of the lip feature extraction sub-network on specific languages (e.g., Chinese and English) and the generalization of the model; furthermore, when the multi-modal voice separation network is trained, multi-modal data of various languages and dialects for which labels are hard to obtain are used, which strengthens the generalization of the multi-modal voice separation network to those languages and dialects. On the other hand, during supervised training, phonemes whose lip similarity exceeds the threshold are clustered and different phonemes with similar lip states are merged, reducing the number of classification nodes, the difficulty of model learning, and the parameter count of the model. Finally, the lip feature extraction sub-network replaces conventional 3D convolution with pseudo-3D convolution to further reduce the parameter count and computation of the multi-modal voice separation network, which facilitates the rapid deployment and wide application of the whole system.
Referring to fig. 6, fig. 6 is a schematic diagram of a framework structure of an embodiment of the multi-modal speech separation apparatus 4 of the present application, wherein the multi-modal speech separation apparatus includes a first obtaining module 40 and a second obtaining module 42. The first obtaining module 40 is configured to obtain lip video information and audio information including a target user; the second obtaining module 42 is connected to the first obtaining module 40, and is configured to obtain the voice of the target user according to the lip video information, the audio information, and the trained multi-modal voice separation network; the multi-modal voice separation network comprises a lip feature extraction sub-network, and the lip feature extraction sub-network is pre-trained by adopting an unsupervised training set before the multi-modal voice separation network is trained.
In an embodiment, the second obtaining module 42 is specifically configured to obtain the first speech presence probability of the target user according to the lip video information, the audio information, and the trained multi-modal voice separation network, and to obtain the voice of the target user from the audio information according to the first speech presence probability.
In another embodiment, please continue to refer to fig. 6, the multi-modal speech separation apparatus may further include a training module 44, wherein the training module 44 is connected to the second obtaining module 42, and specifically includes a first training sub-module, a second training sub-module, and a third training sub-module; the first training submodule is used for training a first lip recognition network containing a lip feature extraction sub-network by adopting an unsupervised training set; the second training submodule is used for updating the parameters of the lip feature extraction sub-network in the second lip recognition network by using the trained parameters of the lip feature extraction sub-network in the first lip recognition network, and training the second lip recognition network by adopting a supervised training set; and the third training submodule is used for updating the parameters of the lip feature extraction sub-network in the multi-modal voice separation network by using the trained parameters of the lip feature extraction sub-network in the second lip recognition network, and training the multi-modal voice separation network by adopting a separation network training set.
In an application scene, the unsupervised training set comprises a plurality of groups of single voice and single lip videos which correspond to each other; the step of training the first lip recognition network including the sub-network for extracting lip features by using the unsupervised training set in the first training sub-module specifically includes: performing feature extraction on the single-person lip video to obtain a lip feature sequence, and performing feature extraction on single-person voice corresponding to the single-person lip video to obtain an audio feature sequence; and obtaining loss according to the lip feature sequence and the audio feature sequence, and adjusting parameters of the lip feature extraction sub-network in the first lip identification network according to the loss.
Optionally, in this embodiment, the first lip recognition network includes a lip feature extraction sub-network and a voice feature extraction sub-network; the lip feature extraction sub-network is used for performing feature extraction on the single lip video to obtain a lip feature sequence, and the voice feature extraction sub-network is used for performing feature extraction on the single voice corresponding to the single lip video to obtain an audio feature sequence; and the parameters in the speech feature extraction sub-network are trained and fixed in advance.
In yet another application scenario, the supervised training set includes a plurality of single-person lip videos, each video frame in the single-person lip videos is provided with a corresponding acoustic tag; wherein the video frames with lip similarity exceeding the threshold have the same acoustic label; the step of training the second lip shape recognition network by adopting the supervised training set in the second training submodule specifically comprises: performing feature extraction on the single lip video to obtain lip features of each video frame; obtaining a corresponding prediction label according to the lip-shaped feature of each video frame; and obtaining the loss between the prediction tag and the corresponding acoustic tag, and adjusting the parameters of the second lip recognition network according to the loss.
Optionally, in this embodiment, the second lip recognition network includes a lip feature extraction sub-network and a full connection layer; the lip feature extraction sub-network is used for carrying out feature extraction on the single lip video to obtain lip features of each video frame; the full connection layer is used for obtaining a corresponding prediction label according to the lip-shaped feature of each video frame; the total number of the acoustic labels clustered by all phonemes in the language corresponding to the single lip video according to the lip similarity is the same as the total number of the nodes in the full connection layer.
Further, each single lip video in the supervised training set has a corresponding single voice, and the single voice has a text label; the multi-modal voice separation device provided by the application can further comprise a first setting module, connected with the second training submodule and used for forcibly aligning the single lip video and the corresponding text labels to obtain the phoneme label of each video frame before the second training submodule trains; and setting a plurality of phoneme labels with lip similarity exceeding a threshold value as the same acoustic label.
In another application scenario, the separation network training set comprises a plurality of groups of single-person lip videos and mixed voice feature sequences, the mixed voice feature sequences in the same group contain the features of single-person voice corresponding to the single-person lip videos, and each mixed voice feature sequence is provided with a voice tag; the step of training the multimodal speech separation network by using the separation network training set in the third training submodule specifically includes: performing feature extraction on the single-person lip video to obtain a lip feature sequence; the lip-shaped feature sequence is coded, and the coded lip-shaped feature sequence and the mixed voice feature sequence are fused to obtain a fusion feature sequence; obtaining a second voice existence probability of the same target corresponding to the single-person lip video according to the fusion feature sequence; and obtaining the loss by utilizing the second voice existence probability and the voice tag, and adjusting parameters in the multi-modal voice separation network according to the loss.
Optionally, the multimodal speech separation network includes a lip feature extraction sub-network, an encoding layer, a fusion layer, and a separation sub-network; the lip feature extraction sub-network is used for extracting features of the single lip video to obtain a lip feature sequence, the coding layer is used for coding the lip feature sequence, the fusion layer is used for fusing the coded lip feature sequence and the mixed voice feature sequence to obtain a fusion feature sequence, and the separation sub-network is used for obtaining a second voice existence probability of the same target corresponding to the single lip video according to the fusion feature sequence.
Further, the multi-modal voice separation device provided by the application can further comprise a second setting module connected with the third training sub-module, and configured to obtain a sum of the energy of all single voices and the energy of noise voices in the mixed voice before the training of the third training sub-module, and use a ratio of the energy of the single voice corresponding to the current single lip video to the sum as a voice tag.
In addition, the mixed speech may include speech from different channels, and the third training sub-module, before the step of fusing the encoded lip feature sequence and the mixed speech feature sequence to obtain a fused feature sequence, includes: and mixing the mixed voice feature sequence and the phase difference features between the channels in the mixed voice.
In addition, it should be noted that the above-mentioned lip feature extraction sub-network includes a pseudo-3D convolution and a residual network connected to each other; the languages corresponding to the unsupervised training set and the supervised training set may include Chinese and English, and the languages corresponding to the separation network training set may include Chinese, English, and dialects. Of course, in other embodiments, the languages corresponding to the unsupervised and supervised training sets may be different, which is not limited by this application.
Referring to fig. 7, fig. 7 is a schematic structural diagram of an embodiment of an electronic device according to the present application, where the electronic device specifically includes: a memory 50 and a processor 52 coupled to each other, the memory 50 having program instructions stored therein, the processor 52 being configured to execute the program instructions to implement the steps of any one of the above-mentioned multi-modal speech separation methods, or the steps of the multi-modal speech separation network training method in any one of the above-mentioned embodiments. Specifically, electronic devices include, but are not limited to: desktop computers, notebook computers, tablet computers, servers, etc., without limitation thereto. Further, the processor 52 may also be referred to as a CPU (Central Processing Unit). Processor 52 may be an integrated circuit chip having signal processing capabilities. The Processor 52 may also be a general purpose Processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other Programmable logic device, discrete Gate or transistor logic, discrete hardware components. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like. In addition, the processor 52 may be commonly implemented by an integrated circuit chip.
Referring to fig. 8, fig. 8 is a schematic structural diagram of an embodiment of a storage device 60 of the present application, in which a program instruction 62 capable of being executed by a processor is stored in the storage device 60, and the program instruction 62 is used to implement steps in any one of the above-mentioned multimodal speech separation methods or steps in any one of the above-mentioned embodiments of the multimodal speech separation network training method.
In the several embodiments provided in the present application, it should be understood that the disclosed method and apparatus may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, a division of a module or a unit is merely a logical division, and an actual implementation may have another division, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of devices or units through some interfaces, and may be in an electrical, mechanical or other form.
Units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application may be substantially implemented or contributed to by the prior art, or all or part of the technical solution may be embodied in a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) or a processor (processor) to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
The above description is only an example of the present application and is not intended to limit the scope of the present application, and all modifications of equivalent structures and equivalent processes, which are made by the contents of the specification and the drawings, or which are directly or indirectly applied to other related technical fields, are intended to be included within the scope of the present application.

Claims (17)

1. A method for multimodal speech separation, comprising:
obtaining lip video information and audio information containing a target user;
obtaining the voice of the target user according to the lip video information, the audio information and the trained multi-modal voice separation network; the multi-modal voice separation network comprises a lip feature extraction sub-network, and the lip feature extraction sub-network is pre-trained by adopting an unsupervised training set before the multi-modal voice separation network is trained.
2. The method of claim 1, wherein the step of obtaining the voice of the target user from the lip video information, the audio information, and the trained multimodal voice separation network comprises:
obtaining a first voice existence probability of the target user according to the lip video information, the audio information and the trained multi-mode voice separation network;
and obtaining the voice of the target user from the audio information according to the first voice existence probability.
3. The method of claim 1, wherein the training process of the multimodal speech separation network comprises:
training a first lip recognition network comprising the lip feature extraction sub-network by using an unsupervised training set;
updating parameters of a lip feature extraction sub-network in a second lip recognition network by using the trained parameters of the lip feature extraction sub-network in the first lip recognition network, and training the second lip recognition network by adopting a supervised training set;
and updating the parameters of the lip feature extraction sub-network in the multi-modal voice separation network by using the trained parameters of the lip feature extraction sub-network in the second lip recognition network, and training the multi-modal voice separation network by adopting a separation network training set.
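The three stages of claim 3 pass the lip feature extraction sub-network from the first lip recognition network to the second, and then into the multi-modal voice separation network. A minimal sketch of that parameter hand-off, assuming each network exposes the sub-network as a `lip_encoder` attribute (an assumption of this sketch):

```python
# Sketch of the staged parameter hand-off described in claim 3. The
# `lip_encoder` attribute name and the generic `train_fn` are assumptions.
def train_pipeline(first_lip_net, second_lip_net, separation_net,
                   unsupervised_set, supervised_set, separation_set, train_fn):
    # Stage 1: unsupervised pre-training of the lip feature extraction sub-network.
    train_fn(first_lip_net, unsupervised_set)
    second_lip_net.lip_encoder.load_state_dict(first_lip_net.lip_encoder.state_dict())

    # Stage 2: supervised lip recognition training with the transferred parameters.
    train_fn(second_lip_net, supervised_set)
    separation_net.lip_encoder.load_state_dict(second_lip_net.lip_encoder.state_dict())

    # Stage 3: train the full multi-modal voice separation network.
    train_fn(separation_net, separation_set)
    return separation_net
```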
4. The multi-modal speech separation method according to claim 3, wherein the unsupervised training set comprises a plurality of groups of mutually corresponding single-person voice and single-person lip video; the step of training a first lip recognition network comprising the lip feature extraction sub-network by using an unsupervised training set comprises:
performing feature extraction on the single-person lip video to obtain a lip feature sequence, and performing feature extraction on single-person voice corresponding to the single-person lip video to obtain an audio feature sequence;
obtaining a loss according to the lip feature sequence and the audio feature sequence, and adjusting parameters of the lip feature extraction sub-network in the first lip recognition network according to the loss.
5. The multi-modal speech separation method of claim 4,
the first lip recognition network comprises the lip feature extraction sub-network and a voice feature extraction sub-network; wherein the lip feature extraction sub-network is used for performing feature extraction on the single lip video to obtain a lip feature sequence, and the voice feature extraction sub-network is used for performing feature extraction on the single voice corresponding to the single lip video to obtain an audio feature sequence; and the parameters in the speech feature extraction sub-network are trained and fixed in advance.
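Claims 4 and 5 train the lip feature extraction sub-network by comparing its lip feature sequence with the audio feature sequence produced by a speech feature extraction sub-network whose parameters are fixed. The claims do not fix the loss; the sketch below assumes a simple frame-wise mean-squared distance as one possible choice.

```python
# Illustrative unsupervised stage for claims 4-5: the speech feature extraction
# sub-network is frozen, the lip encoder is trained so that its lip feature
# sequence matches the audio feature sequence. The mean-squared loss is an
# assumption; the claims only require a loss between the two sequences.
import torch
import torch.nn.functional as F

def unsupervised_step(lip_encoder, speech_encoder, optimizer, lip_video, speech):
    speech_encoder.requires_grad_(False)        # parameters trained and fixed in advance
    with torch.no_grad():
        audio_seq = speech_encoder(speech)      # (B, T, D)
    lip_seq = lip_encoder(lip_video)            # (B, T, D); equal frame rates assumed
    loss = F.mse_loss(lip_seq, audio_seq)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```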
6. The multi-modal speech separation method of claim 3, wherein the supervised training set comprises a plurality of single-person lip videos, each video frame in the single-person lip videos being provided with a corresponding acoustic tag, and video frames with lip similarity exceeding a threshold having the same acoustic tag; the step of training the second lip recognition network by adopting a supervised training set comprises:
performing feature extraction on the single-person lip video to obtain lip features of each video frame;
obtaining a corresponding prediction tag according to the lip feature of each video frame;
and obtaining the loss between the prediction tag and the corresponding acoustic tag, and adjusting the parameters of the second lip recognition network according to the loss.
7. The multi-modal speech separation method of claim 6, wherein
the second lip recognition network comprises the lip feature extraction sub-network and a fully connected layer; wherein the lip feature extraction sub-network is used for performing feature extraction on the single-person lip video to obtain the lip feature of each video frame, and the fully connected layer is used for obtaining a corresponding prediction tag according to the lip feature of each video frame; and the total number of acoustic tags, obtained by clustering all phonemes of the language corresponding to the single-person lip video according to lip similarity, is the same as the total number of nodes in the fully connected layer.
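For claims 6 and 7, a per-video-frame acoustic-tag classifier built from the lip feature extraction sub-network and a fully connected layer might look like the following sketch; the number of acoustic tags, the feature dimension and the cross-entropy objective are assumptions, with only the structure (encoder plus a fully connected layer sized to the tag set) taken from the claims.

```python
# Sketch of the second lip recognition network in claims 6-7: lip feature
# extraction sub-network plus a fully connected layer with one node per
# acoustic tag. NUM_ACOUSTIC_TAGS, feat_dim and the shapes are assumptions.
import torch
import torch.nn as nn

NUM_ACOUSTIC_TAGS = 40          # assumed number of lip-similarity clusters

class SecondLipNet(nn.Module):
    def __init__(self, lip_encoder: nn.Module, feat_dim: int = 256):
        super().__init__()
        self.lip_encoder = lip_encoder
        self.fc = nn.Linear(feat_dim, NUM_ACOUSTIC_TAGS)

    def forward(self, lip_video: torch.Tensor) -> torch.Tensor:
        feats = self.lip_encoder(lip_video)     # (B, T, feat_dim)
        return self.fc(feats)                   # (B, T, NUM_ACOUSTIC_TAGS) per-frame logits

def supervised_loss(logits: torch.Tensor, acoustic_tags: torch.Tensor) -> torch.Tensor:
    # acoustic_tags: (B, T) integer tag per video frame
    return nn.functional.cross_entropy(logits.reshape(-1, NUM_ACOUSTIC_TAGS),
                                       acoustic_tags.reshape(-1))
```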
8. The multi-modal speech separation method according to claim 5, wherein each single-person lip video in the supervised training set has a corresponding single-person voice, and the single-person voice has a text annotation; before the step of performing feature extraction on the single-person lip video to obtain lip features of each video frame, the method comprises:
forcibly aligning the single-person lip video with the corresponding text annotation to obtain a phoneme label for each video frame;
and setting a plurality of phoneme labels whose lip similarity exceeds a threshold as the same acoustic tag.
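Claim 8 obtains the frame-level tags by forced alignment to phonemes and then merging phonemes with similar lip shapes into one acoustic tag. The mapping table below is a hypothetical, viseme-style grouping chosen only to make that merging step concrete; it is not taken from the patent.

```python
# Illustrative conversion from forced-alignment phoneme labels to acoustic
# tags (claim 8). The phoneme-to-group table is a hypothetical, viseme-style
# grouping shown only to make "similar lip shapes share one tag" concrete.
LIP_SIMILAR_GROUPS = {
    "p": "bilabial", "b": "bilabial", "m": "bilabial",
    "f": "labiodental", "v": "labiodental",
    "t": "alveolar", "d": "alveolar", "s": "alveolar", "z": "alveolar",
}
TAG_IDS = {}   # group name -> integer acoustic tag, filled on first use

def frame_acoustic_tags(frame_phonemes):
    """frame_phonemes: one phoneme label per video frame (e.g. forced-alignment
    output resampled to the video frame rate). Returns one integer tag per frame."""
    tags = []
    for phoneme in frame_phonemes:
        group = LIP_SIMILAR_GROUPS.get(phoneme, "other")   # unmapped phonemes -> "other"
        tags.append(TAG_IDS.setdefault(group, len(TAG_IDS)))
    return tags
```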
9. The multi-modal speech separation method according to claim 3, wherein the separation network training set comprises a plurality of groups of single-person lip videos and mixed speech feature sequences, the mixed speech feature sequences are obtained by performing feature extraction on mixed speech, the mixed speech contains the single-person voice corresponding to the single-person lip videos, and each mixed speech feature sequence is provided with a speech tag; the step of training the multi-modal speech separation network using a separation network training set comprises:
performing feature extraction on the single-person lip video to obtain a lip feature sequence;
encoding the lip feature sequence, and fusing the encoded lip feature sequence with the mixed speech feature sequence to obtain a fusion feature sequence;
obtaining a second voice existence probability of the same target corresponding to the single-person lip video according to the fusion feature sequence;
and obtaining a loss by utilizing the second voice existence probability and the speech tag, and adjusting parameters in the multi-modal voice separation network according to the loss.
10. The multi-modal speech separation method of claim 9,
the multi-modal voice separation network comprises the lip feature extraction sub-network, an encoding layer, a fusion layer and a separation sub-network;
the lip feature extraction sub-network is used for performing feature extraction on the single-person lip video to obtain a lip feature sequence, the coding layer is used for coding the lip feature sequence, the fusion layer is used for fusing the coded lip feature sequence and the mixed voice feature sequence to obtain a fusion feature sequence, and the separation sub-network is used for obtaining a second voice existence probability of the same target corresponding to the single-person lip video according to the fusion feature sequence.
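Claims 9 and 10 encode the lip feature sequence, fuse it with the mixed speech feature sequence, and predict a second voice existence probability for the target. The sketch below realises the encoding layer as a linear projection, the fusion layer as concatenation and the separation sub-network as a bidirectional GRU with a sigmoid output; all of these concrete choices, and the dimensions, are assumptions.

```python
# Sketch of the fusion and separation stage of claims 9-10. The linear
# encoding layer, concatenation-based fusion, bidirectional GRU separator and
# all dimensions are assumptions; only the order of operations follows the claims.
import torch
import torch.nn as nn

class FusionSeparator(nn.Module):
    def __init__(self, lip_dim: int = 256, mix_dim: int = 257, hidden: int = 256):
        super().__init__()
        self.coder = nn.Linear(lip_dim, hidden)                       # encoding layer
        self.separator = nn.GRU(hidden + mix_dim, hidden,
                                batch_first=True, bidirectional=True) # separation sub-network
        self.out = nn.Linear(2 * hidden, 1)

    def forward(self, lip_seq: torch.Tensor, mix_seq: torch.Tensor) -> torch.Tensor:
        # lip_seq: (B, T, lip_dim), mix_seq: (B, T, mix_dim); equal frame rates assumed
        fused = torch.cat([self.coder(lip_seq), mix_seq], dim=-1)     # fusion layer
        h, _ = self.separator(fused)
        return torch.sigmoid(self.out(h)).squeeze(-1)                 # (B, T) second voice existence probability
```

The predicted probability would then be compared with the speech tag of claim 11, for instance with a mean-squared or binary cross-entropy loss; the choice of loss is likewise an assumption.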
11. The multi-modal speech separation method according to claim 9, wherein, before the step of performing feature extraction on the single-person lip video to obtain a lip feature sequence, the method further comprises:
obtaining a sum of the energy of all single-person voices and the energy of the noise in the mixed speech, and taking the ratio of the energy of the single-person voice corresponding to the current single-person lip video to the sum as the speech tag.
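Claim 11 defines the speech tag numerically: the energy of the target speaker's voice divided by the summed energy of all single-person voices and the noise in the mixture. A direct reading of that ratio follows; the frame-wise granularity, frame length and epsilon guard are assumptions of this sketch.

```python
# Direct numerical reading of the claim-11 speech tag: target energy divided by
# the summed energy of all single-person voices and the noise. The frame-wise
# granularity, frame length and epsilon guard are assumptions.
import numpy as np

def speech_tag(target_speech, other_speeches, noise, frame=400, eps=1e-8):
    """All inputs are time-aligned waveforms of equal length; returns one ratio per frame."""
    def frame_energy(x):
        n = (len(x) // frame) * frame
        return (np.asarray(x[:n]).reshape(-1, frame) ** 2).sum(axis=1)

    target_e = frame_energy(target_speech)
    total_e = target_e + frame_energy(noise)
    for speech in other_speeches:
        total_e = total_e + frame_energy(speech)
    return target_e / (total_e + eps)
```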
12. The method according to claim 9, wherein the mixed speech contains speech from a plurality of channels, and the step of fusing the encoded lip feature sequence with the mixed speech feature sequence to obtain a fusion feature sequence comprises:
fusing the mixed speech feature sequence with phase difference features between the plurality of channels in the mixed speech.
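Claim 12 brings phase difference features between the channels of the mixture into the fusion. A common concrete realisation is the cosine/sine of the inter-channel phase difference of the STFT, shown below; that particular parameterisation is an assumption, as the claim only requires phase difference features.

```python
# Illustrative inter-channel phase difference features for claim 12. Using the
# cosine/sine of the STFT phase difference relative to channel 0 is an
# assumption; the claim only requires phase difference features between channels.
import numpy as np
from scipy.signal import stft

def ipd_features(multichannel_wave, fs=16000, nperseg=512):
    """multichannel_wave: (channels, samples). Returns (frames, 2*(channels-1)*freq_bins)."""
    specs = [stft(ch, fs=fs, nperseg=nperseg)[2] for ch in multichannel_wave]
    assert len(specs) >= 2, "phase differences need at least two channels"
    ref_phase = np.angle(specs[0])
    feats = []
    for spec in specs[1:]:
        diff = np.angle(spec) - ref_phase        # (freq_bins, frames)
        feats.extend([np.cos(diff), np.sin(diff)])
    return np.concatenate(feats, axis=0).T       # stack along frequency, frames first
```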
13. The multi-modal speech separation method of claim 1, wherein
the lip feature extraction sub-network comprises a pseudo-3D convolution and a residual network connected to each other.
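Claim 13 specifies a pseudo-3D convolution connected to a residual network as the lip feature extraction sub-network. The sketch below factorises the 3D convolution into a spatial and a temporal part and applies a 2D ResNet per frame; the kernel sizes, channel counts and the use of torchvision's resnet18 are assumptions of this illustration.

```python
# Sketch of a pseudo-3D convolution + residual network lip front-end for
# claim 13. Kernel sizes, channel counts and the torchvision resnet18 backbone
# are assumptions.
import torch
import torch.nn as nn
from torchvision.models import resnet18

class P3DLipEncoder(nn.Module):
    def __init__(self, feat_dim: int = 256):
        super().__init__()
        # Pseudo-3D front-end: a spatial (1, k, k) convolution followed by a temporal (k, 1, 1) one.
        self.spatial = nn.Conv3d(1, 64, kernel_size=(1, 7, 7), stride=(1, 2, 2),
                                 padding=(0, 3, 3), bias=False)
        self.temporal = nn.Conv3d(64, 64, kernel_size=(5, 1, 1), padding=(2, 0, 0), bias=False)
        self.norm = nn.BatchNorm3d(64)
        self.act = nn.ReLU(inplace=True)
        # Residual network applied to every video frame independently.
        trunk = resnet18(weights=None)
        trunk.conv1 = nn.Conv2d(64, 64, kernel_size=3, padding=1, bias=False)
        trunk.fc = nn.Linear(trunk.fc.in_features, feat_dim)
        self.resnet = trunk

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, 1, T, H, W) grayscale lip crops
        x = self.act(self.norm(self.temporal(self.spatial(x))))   # (B, 64, T, H', W')
        b, c, t, h, w = x.shape
        x = x.transpose(1, 2).reshape(b * t, c, h, w)
        feats = self.resnet(x)                                    # (B*T, feat_dim)
        return feats.reshape(b, t, -1)                            # (B, T, feat_dim) lip feature sequence
```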
14. A multi-modal voice separation network training method is characterized by comprising the following steps:
training a first lip recognition network comprising a lip feature extraction sub-network by adopting an unsupervised training set;
updating parameters of a lip feature extraction sub-network in a second lip recognition network by using the trained parameters of the lip feature extraction sub-network in the first lip recognition network, and training the second lip recognition network by adopting a supervised training set;
and updating the parameters of the lip feature extraction sub-network in the multi-modal voice separation network by using the trained parameters of the lip feature extraction sub-network in the second lip recognition network, and training the multi-modal voice separation network by adopting a separation network training set.
15. A multi-modal voice separation apparatus, comprising:
the first obtaining module is used for obtaining lip video information and audio information containing a target user;
the second obtaining module is used for obtaining the voice of the target user according to the lip video information, the audio information and the trained multi-mode voice separation network; the multi-modal voice separation network comprises a lip feature extraction sub-network, and the lip feature extraction sub-network is pre-trained by adopting an unsupervised training set before the multi-modal voice separation network is trained.
16. An electronic device comprising a memory and a processor coupled to each other, the memory having stored therein program instructions, the processor being configured to execute the program instructions to implement the multimodal speech separation method of any of claims 1 to 13 or the multimodal speech separation network training method of claim 14.
17. A memory device storing program instructions executable by a processor to implement the multi-modal voice separation method of any one of claims 1 to 13 or the multi-modal voice separation network training method of claim 14.
CN202111122074.9A 2021-09-24 Multi-mode voice separation method, training method and related device Active CN113782048B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111122074.9A CN113782048B (en) 2021-09-24 Multi-mode voice separation method, training method and related device


Publications (2)

Publication Number Publication Date
CN113782048A true CN113782048A (en) 2021-12-10
CN113782048B CN113782048B (en) 2024-07-09

Patent Citations (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180374494A1 (en) * 2017-06-23 2018-12-27 Casio Computer Co., Ltd. Sound source separation information detecting device capable of separating signal voice from noise voice, robot, sound source separation information detecting method, and storage medium therefor
WO2019079972A1 (en) * 2017-10-24 2019-05-02 深圳和而泰智能控制股份有限公司 Specific sound recognition method and apparatus, and storage medium
WO2019219968A1 (en) * 2018-05-18 2019-11-21 Deepmind Technologies Limited Visual speech recognition by phoneme prediction
CN109409195A (en) * 2018-08-30 2019-03-01 华侨大学 Neural network-based lip reading recognition method and system
CN110111783A (en) * 2019-04-10 2019-08-09 天津大学 Multi-modal speech recognition method based on deep neural network
WO2020232867A1 (en) * 2019-05-21 2020-11-26 平安科技(深圳)有限公司 Lip-reading recognition method and apparatus, computer device, and storage medium
WO2020252922A1 (en) * 2019-06-21 2020-12-24 平安科技(深圳)有限公司 Deep learning-based lip reading method and apparatus, electronic device, and medium
US20200023856A1 (en) * 2019-08-30 2020-01-23 Lg Electronics Inc. Method for controlling a vehicle using speaker recognition based on artificial intelligent
CN111370020A (en) * 2020-02-04 2020-07-03 清华珠三角研究院 Method, system, device and storage medium for converting voice into lip shape
CN111326143A (en) * 2020-02-28 2020-06-23 科大讯飞股份有限公司 Voice processing method, device, equipment and storage medium
CN111462733A (en) * 2020-03-31 2020-07-28 科大讯飞股份有限公司 Multi-modal speech recognition model training method, device, equipment and storage medium
CN111916054A (en) * 2020-07-08 2020-11-10 标贝(北京)科技有限公司 Lip-based voice generation method, device and system and storage medium
CN113035227A (en) * 2021-03-12 2021-06-25 山东大学 Multi-modal voice separation method and system
CN112951258A (en) * 2021-04-23 2021-06-11 中国科学技术大学 Audio and video voice enhancement processing method and model
CN113314094A (en) * 2021-05-28 2021-08-27 北京达佳互联信息技术有限公司 Lip-shaped model training method and device and voice animation synthesis method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
JIA ZHENTANG: "Research on generating speech directly from lip videos", 计算机应用研究 (Application Research of Computers), no. 06, 31 December 2020 (2020-12-31), pages 296-300 *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114245280A (en) * 2021-12-20 2022-03-25 清华大学深圳国际研究生院 Scene self-adaptive hearing aid audio enhancement system based on neural network
CN114245280B (en) * 2021-12-20 2023-06-23 清华大学深圳国际研究生院 Scene self-adaptive hearing aid audio enhancement system based on neural network
CN114420124A (en) * 2022-03-31 2022-04-29 北京妙医佳健康科技集团有限公司 Speech recognition method
CN114420124B (en) * 2022-03-31 2022-06-24 北京妙医佳健康科技集团有限公司 Speech recognition method
CN114863916A (en) * 2022-04-26 2022-08-05 北京小米移动软件有限公司 Speech recognition model training method, speech recognition device and storage medium
CN117475360A (en) * 2023-12-27 2024-01-30 南京纳实医学科技有限公司 Biological sign extraction and analysis method based on audio and video characteristics of improved MLSTM-FCN
CN117475360B (en) * 2023-12-27 2024-03-26 南京纳实医学科技有限公司 Biological feature extraction and analysis method based on audio and video characteristics of improved MLSTM-FCN

Similar Documents

Publication Publication Date Title
Pepino et al. Emotion recognition from speech using wav2vec 2.0 embeddings
Shillingford et al. Large-scale visual speech recognition
CN110491382B (en) Speech recognition method and device based on artificial intelligence and speech interaction equipment
Zhao et al. Hearing lips: Improving lip reading by distilling speech recognizers
US11508366B2 (en) Whispering voice recovery method, apparatus and device, and readable storage medium
Deng et al. Recognizing emotions from whispered speech based on acoustic feature transfer learning
JP2020112787A (en) Real-time voice recognition method based on cutting attention, device, apparatus and computer readable storage medium
Fenghour et al. Deep learning-based automated lip-reading: A survey
Pandey et al. Liptype: A silent speech recognizer augmented with an independent repair model
Margam et al. LipReading with 3D-2D-CNN BLSTM-HMM and word-CTC models
EP4392972A1 (en) Speaker-turn-based online speaker diarization with constrained spectral clustering
Sarhan et al. HLR-net: a hybrid lip-reading model based on deep convolutional neural networks
CN115762489A (en) Data processing system and method of voice recognition model and voice recognition method
CN112597841A (en) Emotion analysis method based on door mechanism multi-mode fusion
Zhang et al. Cacnet: Cube attentional cnn for automatic speech recognition
Ballard et al. A multimodal learning interface for word acquisition
CN113505611A (en) Training method and system for obtaining better speech translation model in generation of confrontation
Hwang et al. Audio-visual speech recognition based on joint training with audio-visual speech enhancement for robust speech recognition
CN113782048B (en) Multi-mode voice separation method, training method and related device
Seong et al. A review of audio-visual speech recognition
CN113782048A (en) Multi-modal voice separation method, training method and related device
Ivanko et al. A novel task-oriented approach toward automated lip-reading system implementation
CN113889088A (en) Method and device for training speech recognition model, electronic equipment and storage medium
CN114121018A (en) Voice document classification method, system, device and storage medium
Choudhury et al. Review of Various Machine Learning and Deep Learning Techniques for Audio Visual Automatic Speech Recognition

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant