CN116129929A - Audio-visual voice separation method, audio-visual voice separation device, electronic equipment and storage medium - Google Patents


Info

Publication number
CN116129929A
Authority
CN
China
Prior art keywords
auditory, network, visual, features, sub
Prior art date
Legal status
Pending
Application number
CN202211584453.4A
Other languages
Chinese (zh)
Inventor
胡晓林
李凯
苑克鑫
Current Assignee
Tsinghua University
Original Assignee
Tsinghua University
Application filed by Tsinghua University

Classifications

    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G06V40/171 Local features and components; Facial parts; Occluding parts, e.g. glasses; Geometrical relationships
    • G06V40/20 Movements or behaviour, e.g. gesture recognition
    • G10L21/0272 Voice signal separating
    • G10L21/0308 Voice signal separating characterised by the type of parameter measurement, e.g. correlation techniques, zero crossing techniques or predictive techniques
    • G10L25/03 Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L25/30 Speech or voice analysis techniques characterised by the analysis technique using neural networks


Abstract

The present disclosure relates to an audio-visual voice separation method, apparatus, electronic device, and storage medium. Feature extraction is performed on video information containing mixed audio data and image data of at least one object to obtain an auditory feature and a visual feature corresponding to each image data. The visual feature corresponding to a target object and the auditory feature are input together into a voice separation network to obtain the target auditory feature of the target object, from which the target audio data is determined. The voice separation network comprises a visual sub-network for processing visual features, an auditory sub-network for processing auditory features, and a multi-mode fusion sub-network for integrating the output features of the visual sub-network and the auditory sub-network. The method and the device process the visual features and the auditory features separately through the visual sub-network and the auditory sub-network, and transmit and integrate them through the multi-mode fusion sub-network, so as to accurately reconstruct the target audio data based on the visual features and the auditory features.

Description

Audio-visual voice separation method, audio-visual voice separation device, electronic equipment and storage medium
Technical Field
The disclosure relates to the field of computer technology, and in particular, to an audio-visual voice separation method, an audio-visual voice separation device, electronic equipment and a storage medium.
Background
The human auditory system has the innate ability to separate various audio signals, for example distinguishing the voices of multiple speakers or distinguishing voices from background noise. It enhances the sound of interest while suppressing other sounds; this biological phenomenon is known as the "cocktail party effect", and the corresponding computational task is referred to as "speech separation". Communicating effectively with others in a noisy environment is also challenging for hearing-impaired people. Likewise, automatic speech recognition systems are severely affected by noisy environments, so that recognition accuracy drops and it becomes difficult to extract the required information from complex audio.
Disclosure of Invention
In view of this, the present disclosure proposes an audiovisual voice separation method, apparatus, electronic device, and storage medium, which aim to accurately extract desired audio data from audio-video information including a variety of complex information.
According to a first aspect of the present disclosure there is provided a method of audiovisual speech separation, the method comprising:
acquiring video information to be processed, wherein the video information comprises mixed audio data and image data of at least one object;
Extracting mixed audio data in the video information and image data of each object;
extracting features of the mixed audio data to obtain auditory features, and extracting features of each image data to obtain corresponding visual features;
inputting the visual characteristics corresponding to the target object and the auditory characteristics into a trained voice separation network to obtain the target auditory characteristics of the target object, wherein the voice separation network comprises a visual sub-network, an auditory sub-network and a multi-mode fusion sub-network, the visual sub-network is used for processing the input visual characteristics, the auditory sub-network is used for processing the input auditory characteristics, and the multi-mode fusion sub-network is used for integrating the output characteristics of the visual sub-network and the auditory sub-network;
reconstructing target audio data of the target object according to the target auditory feature.
In one possible implementation, extracting the audio data in the video information and the image data of each object includes:
extracting an audio frame sequence in the video information to obtain audio data;
extracting a video frame sequence of the at least one object in the video information;
And determining image data of the corresponding object according to each video frame sequence.
In one possible implementation, extracting the video frame sequence of the at least one object in the video information includes:
identifying at least one object included in each video frame in the video information through a face recognition model obtained through pre-training;
marking a lip region of each of the objects in the video frame;
and determining a corresponding video frame sequence according to the lip areas of the same object in different video frames.
In one possible implementation manner, feature extraction is performed on the audio data to obtain an auditory feature, and feature extraction is performed on each image data to obtain a corresponding visual feature, including:
extracting the characteristics of the audio data through a filter bank to obtain auditory characteristics;
and respectively carrying out feature extraction on each image data according to a pre-trained lip reading model to obtain visual features of the corresponding object.
In one possible implementation, the lip read model includes a backbone network for feature extraction.
In one possible implementation manner, the inputting the visual feature corresponding to the target object and the hearing feature into the trained voice separation network to obtain the target hearing feature of the target object includes:
According to the visual sub-network, the auditory sub-network and the multi-mode fusion sub-network, carrying out multiple feature extraction based on the visual features and the auditory features corresponding to the target object to obtain target auditory processing features;
and processing the target auditory processing characteristics for a plurality of times according to the auditory sub-network to obtain target auditory characteristics of the target object.
In one possible implementation manner, according to the visual sub-network, the auditory sub-network and the multi-mode fusion sub-network, multiple feature extraction is performed based on the visual features and the auditory features corresponding to the target object, so as to obtain target auditory processing features, including:
the following procedure is performed in an iterative manner:
inputting the visual characteristics corresponding to the target object into the visual sub-network, inputting the auditory characteristics into the auditory sub-network, and respectively outputting visual intermediate characteristics and auditory intermediate characteristics;
integrating the visual intermediate features and the auditory intermediate features through the multi-mode fusion sub-network, and outputting the updated visual features and auditory features corresponding to the target object as inputs of the visual sub-network and the auditory sub-network in the next processing process;
And stopping the iterative processing process of the visual intermediate feature, the auditory intermediate feature, the visual feature and the auditory feature in response to the iteration number reaching a first preset iteration number, and determining the current auditory feature as a target auditory processing feature.
In one possible implementation manner, the processing the target auditory processing feature for multiple times according to the auditory sub-network to obtain a target auditory feature of the target object includes:
processing the target auditory processing characteristics for a plurality of times in an iterative manner through the auditory sub-network, wherein a processing result after each processing is used as input of the auditory sub-network in the next processing process;
and stopping the iterative processing process of the target auditory processing characteristics in response to the iteration times reaching a second preset iteration times, and determining that the last output characteristics of the auditory sub-network are the target auditory characteristics of the target object.
In one possible implementation manner, the integration processing is performed on the visual intermediate feature and the auditory intermediate feature through the multi-mode fusion sub-network, and updated visual features and auditory features are output, including:
The sizes of the visual intermediate features and the auditory intermediate features are adjusted through the multi-mode fusion sub-network by a nearest neighbor interpolation method, so that candidate visual features and candidate auditory features are obtained;
performing stitching processing on the visual intermediate features and the candidate auditory features to determine updated visual features;
and performing stitching processing on the auditory intermediate features and the candidate visual features to determine updated auditory features.
In one possible implementation, the visual sub-network and the auditory sub-network include a plurality of processing layers from top to bottom;
in the process of processing the characteristic information by the visual sub-network and/or the auditory sub-network, inputting information into a processing layer which is the lowest, and processing the information layer by layer from bottom to top to obtain multi-scale visual characteristics and/or multi-scale auditory characteristics;
outputting fusion visual features and/or fusion auditory features of the multi-scale visual and/or auditory features of the adjacent layers in a splicing mode;
and splicing the fusion visual features and/or the fusion auditory features from top to bottom, and outputting visual intermediate features and/or auditory intermediate features.
In one possible implementation, reconstructing the target audio data of the target object from the target auditory features includes:
Generating corresponding mask features according to the target auditory features;
obtaining corresponding voice features according to the mask features and the hearing features;
and reconstructing target audio data of the target object by processing the voice characteristics through an inverse filter bank.
According to a second aspect of the present disclosure, there is provided an audiovisual speech separation device, the device comprising:
the information acquisition module is used for acquiring video information to be processed, wherein the video information comprises mixed audio data and image data of at least one object;
the data extraction module is used for extracting mixed audio data in the video information and image data of each object;
the feature extraction module is used for carrying out feature extraction on the mixed audio data to obtain auditory features, and carrying out feature extraction on each image data to obtain corresponding visual features;
the feature separation module is used for inputting the visual features corresponding to the target object and the auditory features into the trained voice separation network to obtain the target auditory features of the target object, the voice separation network comprises a visual sub-network, an auditory sub-network and a multi-mode fusion sub-network, the visual sub-network is used for processing the input visual features, the auditory sub-network is used for processing the input auditory features, and the multi-mode fusion sub-network is used for integrating the output features of the visual sub-network and the auditory sub-network;
And the data conversion module is used for reconstructing target audio data of the target object according to the target auditory characteristics.
In one possible implementation, extracting the audio data in the video information and the image data of each object includes:
extracting an audio frame sequence in the video information to obtain audio data;
extracting a video frame sequence of the at least one object in the video information;
and determining image data of the corresponding object according to each video frame sequence.
In one possible implementation, extracting the video frame sequence of the at least one object in the video information includes:
identifying at least one object included in each video frame in the video information through a face recognition model obtained through pre-training;
marking a lip region of each of the objects in the video frame;
and determining a corresponding video frame sequence according to the lip areas of the same object in different video frames.
In one possible implementation manner, feature extraction is performed on the audio data to obtain an auditory feature, and feature extraction is performed on each image data to obtain a corresponding visual feature, including:
Extracting the characteristics of the audio data through a filter bank to obtain auditory characteristics;
and respectively carrying out feature extraction on each image data according to a pre-trained lip reading model to obtain visual features of the corresponding object.
In one possible implementation, the lip read model includes a backbone network for feature extraction.
In one possible implementation manner, the inputting the visual feature corresponding to the target object and the hearing feature into the trained voice separation network to obtain the target hearing feature of the target object includes:
according to the visual sub-network, the auditory sub-network and the multi-mode fusion sub-network, carrying out multiple feature extraction based on the visual features and the auditory features corresponding to the target object to obtain target auditory processing features;
and processing the target auditory processing characteristics for a plurality of times according to the auditory sub-network to obtain target auditory characteristics of the target object.
In one possible implementation manner, according to the visual sub-network, the auditory sub-network and the multi-mode fusion sub-network, multiple feature extraction is performed based on the visual features and the auditory features corresponding to the target object, so as to obtain target auditory processing features, including:
The following procedure is performed in an iterative manner:
inputting the visual characteristics corresponding to the target object into the visual sub-network, inputting the auditory characteristics into the auditory sub-network, and respectively outputting visual intermediate characteristics and auditory intermediate characteristics;
integrating the visual intermediate features and the auditory intermediate features through the multi-mode fusion sub-network, and outputting the updated visual features and auditory features corresponding to the target object as inputs of the visual sub-network and the auditory sub-network in the next processing process;
and stopping the iterative processing process of the visual intermediate feature, the auditory intermediate feature, the visual feature and the auditory feature in response to the iteration number reaching a first preset iteration number, and determining the current auditory feature as a target auditory processing feature.
In one possible implementation manner, the processing the target auditory processing feature for multiple times according to the auditory sub-network to obtain a target auditory feature of the target object includes:
processing the target auditory processing characteristics for a plurality of times in an iterative manner through the auditory sub-network, wherein a processing result after each processing is used as input of the auditory sub-network in the next processing process;
And stopping the iterative processing process of the target auditory processing characteristics in response to the iteration times reaching a second preset iteration times, and determining that the last output characteristics of the auditory sub-network are the target auditory characteristics of the target object.
In one possible implementation manner, the integration processing is performed on the visual intermediate feature and the auditory intermediate feature through the multi-mode fusion sub-network, and updated visual features and auditory features are output, including:
the sizes of the visual intermediate features and the auditory intermediate features are adjusted through the multi-mode fusion sub-network by a nearest neighbor interpolation method, so that candidate visual features and candidate auditory features are obtained;
performing stitching processing on the visual intermediate features and the candidate auditory features to determine updated visual features;
and performing stitching processing on the auditory intermediate features and the candidate visual features to determine updated auditory features.
In one possible implementation, the visual sub-network and the auditory sub-network include a plurality of processing layers from top to bottom;
in the process of processing the characteristic information by the visual sub-network and/or the auditory sub-network, inputting information into a processing layer which is the lowest, and processing the information layer by layer from bottom to top to obtain multi-scale visual characteristics and/or multi-scale auditory characteristics;
Outputting fusion visual features and/or fusion auditory features of the multi-scale visual and/or auditory features of the adjacent layers in a splicing mode;
and splicing the fusion visual features and/or the fusion auditory features from top to bottom, and outputting visual intermediate features and/or auditory intermediate features.
In one possible implementation, reconstructing the target audio data of the target object from the target auditory features includes:
generating corresponding mask features according to the target auditory features;
obtaining corresponding voice features according to the mask features and the hearing features;
and reconstructing target audio data of the target object by processing the voice characteristics through an inverse filter bank.
According to a third aspect of the present disclosure, there is provided an electronic device comprising: a processor; a memory for storing processor-executable instructions; wherein the processor is configured to implement the above-described method when executing the instructions stored by the memory.
According to a fourth aspect of the present disclosure, there is provided a non-transitory computer readable storage medium having stored thereon computer program instructions, wherein the computer program instructions, when executed by a processor, implement the above-described method.
According to a fifth aspect of the present disclosure, there is provided a computer program product comprising computer readable code, or a non-transitory computer readable storage medium carrying computer readable code, which when run in a processor of an electronic device, performs the above method.
In an embodiment of the disclosure, feature extraction is performed on video information including mixed audio data and image data of at least one object to obtain an auditory feature and a visual feature corresponding to each image data. The visual feature corresponding to the target object and the auditory feature are input together into a voice separation network to obtain the target auditory feature of the target object, from which the target audio data is determined. The voice separation network comprises a visual sub-network for processing visual features, an auditory sub-network for processing auditory features, and a multi-mode fusion sub-network for integrating the output features of the visual sub-network and the auditory sub-network. The method and the device process the visual features and the auditory features separately through the visual sub-network and the auditory sub-network, and transmit and integrate them through the multi-mode fusion sub-network, so as to accurately determine the required target audio data based on the visual features and the complex auditory features.
Other features and aspects of the present disclosure will become apparent from the following detailed description of exemplary embodiments, which proceeds with reference to the accompanying drawings.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate exemplary embodiments, features and aspects of the present disclosure and together with the description, serve to explain the principles of the disclosure.
FIG. 1 illustrates a flow chart of a method of audiovisual speech separation in accordance with an embodiment of the present disclosure;
FIG. 2 shows a schematic diagram of a voice separation network in accordance with an embodiment of the present disclosure;
FIG. 3 shows a schematic diagram of a human brain processing audiovisual information in accordance with an embodiment of the present disclosure;
FIG. 4 illustrates a schematic diagram of the architecture of an auditory sub-network and a visual sub-network according to an embodiment of the present disclosure;
FIG. 5 shows a schematic diagram of an audiovisual speech separation process in accordance with an embodiment of the disclosure;
fig. 6 shows a schematic diagram of an audiovisual speech separation device in accordance with an embodiment of the present disclosure;
FIG. 7 shows a schematic diagram of an electronic device according to an embodiment of the disclosure;
fig. 8 shows a schematic diagram of another electronic device according to an embodiment of the disclosure.
Detailed Description
Various exemplary embodiments, features and aspects of the disclosure will be described in detail below with reference to the drawings. In the drawings, like reference numbers indicate identical or functionally similar elements. Although various aspects of the embodiments are illustrated in the accompanying drawings, the drawings are not necessarily drawn to scale unless specifically indicated.
The word "exemplary" is used herein to mean "serving as an example, embodiment, or illustration. Any embodiment described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments.
In addition, numerous specific details are set forth in the following detailed description in order to provide a better understanding of the present disclosure. It will be understood by those skilled in the art that the present disclosure may be practiced without some of these specific details. In some instances, methods, means, elements, and circuits well known to those skilled in the art have not been described in detail in order not to obscure the present disclosure.
In one possible implementation, the audio-visual speech separation method of the embodiments of the present disclosure may be performed by an electronic device such as a processor, a terminal device, or a server. The terminal device may be a User Equipment (UE), a mobile device, a user terminal, a cellular phone, a cordless phone, a Personal Digital Assistant (PDA), a handheld device, a computing device, an in-vehicle device, a wearable device, or other fixed or mobile terminal. The server may be a single server or a server cluster composed of a plurality of servers. The electronic device may implement the audio-visual speech separation method of the embodiments of the present disclosure by means of the processor invoking computer readable instructions stored in the memory.
Fig. 1 shows a flow chart of a method of audio-visual speech separation according to an embodiment of the present disclosure. As shown in fig. 1, the audio-visual voice separation method of the embodiment of the present disclosure may include the following steps S10 to S50.
Step S10, obtaining video information to be processed.
In one possible implementation, the video information to be processed, i.e. the video information requiring audio-visual speech separation and reconstruction of the required audio data, is acquired by the electronic device. The video information may include mixed audio data and image data of at least one object. The mixed audio data included in the video information may be a mixture of a plurality of audio sources, including audio data characterizing the sound of at least one object and interfering audio data such as background sound. The object in the embodiments of the present disclosure may be a person, i.e., the video information may be a video recording the content of a multi-person conversation. Optionally, the electronic device may acquire the video information to be processed by capturing audio, video and images through a built-in or connected audio-video acquisition device, or it may receive the video information to be processed sent by another device after that device collects it.
Step S20, extracting mixed audio data in the video information and image data of each object.
In one possible implementation, the electronic device, after acquiring the video information, extracts the mixed audio data and the image data of each object included therein, respectively. Alternatively, the video information may include an audio frame sequence for recording mixed audio data and a video frame sequence for recording video data. The electronic device may first extract an audio frame sequence and a video frame sequence in the video information, and determine audio data and image data based on the audio frame sequence and the video frame sequence, respectively.
Alternatively, the electronic device may directly determine the sequence of audio frames as the mixed audio data, i.e. it may extract the sequence of audio frames in the video information to obtain the mixed audio data. Meanwhile, after extracting the video frame sequences, the image data of each object is obtained through additional processing; that is, the electronic device may first extract the video frame sequence of at least one object in the video information and determine the image data of the corresponding object according to each video frame sequence. The extracted audio data includes each object's sound as well as other background noise in the video information. For example, in the case where the video information contains C speaking characters, the mixed audio data in the video can be regarded as linearly mixed, and the mixed audio data x can be expressed as
x = s_1 + s_2 + … + s_C + σ,
where s_i denotes the audio data corresponding to the i-th speaking character and σ denotes the background noise. It is an object of embodiments of the present disclosure to extract the corresponding target speaker audio s_i from the mixed audio data using the image data of the target object.
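For illustration only, the following NumPy sketch shows this signal model; the two speaker waveforms, the noise level, and the sample rate are arbitrary stand-ins, not values from the disclosure.

```python
import numpy as np

# Hypothetical example: C = 2 speakers, 1 second of 16 kHz audio.
sample_rate = 16000
C = 2
rng = np.random.default_rng(0)

# Stand-ins for the individual speaker waveforms s_i and the background noise sigma.
speakers = [rng.standard_normal(sample_rate) for _ in range(C)]  # s_1 ... s_C
background = 0.1 * rng.standard_normal(sample_rate)              # sigma

# Linear mixture observed in the video's audio track: x = s_1 + ... + s_C + sigma.
x = np.sum(speakers, axis=0) + background

# The separation task is to recover one target s_i from x, guided by the
# lip-region image data of the corresponding speaker.
print(x.shape)  # (16000,)
```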
Further, the video frame sequence of each object extracted by the electronic device includes at least a partial image area of the object, and the image area can be determined according to the application scene. For example, in the case where the audio data corresponding to each object needs to be reconstructed, the electronic device may determine the image area to be the lip area, which varies with the object's pronunciation. The electronic device may identify at least one object included in each video frame of the original video frame sequence in the video information through a pre-trained face recognition model, mark the lip area of each object in the video frame, and determine the corresponding video frame sequence according to the lip areas of the same object in different video frames. Optionally, the marked lip areas of each object in the original video frames have the same size, that is, each video frame in the video frame sequence finally determined for an object has the same size. After determining the video frame sequence of each object, the electronic device may directly treat that video frame sequence as the image data of the object.
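As a rough sketch of this step, the function below crops a fixed-size lip region per object from each frame; the detect_faces callable and the 88x88 crop size are hypothetical placeholders for the pre-trained face recognition model and the preset region size, which the disclosure does not fix.

```python
import numpy as np

def extract_lip_sequences(video_frames, detect_faces, crop_size=(88, 88)):
    """Build a fixed-size lip-region frame sequence per detected object.

    video_frames: list of H x W x 3 arrays. detect_faces: stand-in for a
    pre-trained face recognition model returning, per frame, a dict mapping an
    object id to a lip-region bounding box (x, y, w, h).
    """
    sequences = {}  # object id -> list of lip crops of identical size
    ch, cw = crop_size
    for frame in video_frames:
        for obj_id, (x, y, w, h) in detect_faces(frame).items():
            # Center a fixed-size crop on the detected lip region so that every
            # frame in an object's sequence has the same spatial dimensions.
            cx, cy = x + w // 2, y + h // 2
            top = max(0, min(frame.shape[0] - ch, cy - ch // 2))
            left = max(0, min(frame.shape[1] - cw, cx - cw // 2))
            crop = frame[top:top + ch, left:left + cw]
            sequences.setdefault(obj_id, []).append(crop)
    # Stack each object's crops into a (T, ch, cw, 3) array, i.e. its image data.
    return {obj_id: np.stack(frames) for obj_id, frames in sequences.items()}
```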
And step S30, extracting features of the audio data to obtain auditory features, and extracting features of each image data to obtain corresponding visual features.
In one possible implementation, the electronic device may determine the audible and visual features by feature extracting the mixed audio data and the image data of each object after determining the mixed audio data and the image data of each object. Wherein the auditory features are speech features extracted based on audio for characterizing the talking content of the object in the audio data. The visual features are features extracted based on the lip images of the object and are used for representing the talking content of the object in the corresponding image data.
Because the formats and contents of the audio data and the image data are different, the electronic device may extract features from the audio data and the image data in different ways. For example, the electronic device may perform feature extraction on the audio data through a filter bank to obtain the auditory features, and may perform feature extraction on each image data according to a pre-trained lip reading model to obtain the visual features of the corresponding object. The filter bank may be a one-dimensional convolution layer, and the number of channels, convolution kernel length and stride of the convolution layer may be preset.
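A minimal PyTorch sketch of such a filter bank as a one-dimensional convolution layer is given below; the channel count, kernel length and stride are illustrative presets, not values fixed by the disclosure.

```python
import torch
import torch.nn as nn

class FilterBankEncoder(nn.Module):
    """Filter bank realised as a 1-D convolution over the raw waveform."""

    def __init__(self, num_channels=256, kernel_size=16, stride=8):
        super().__init__()
        # Channel count, kernel length and stride are preset hyperparameters.
        self.conv = nn.Conv1d(1, num_channels, kernel_size, stride=stride, bias=False)

    def forward(self, waveform):
        # waveform: (batch, samples) -> auditory features: (batch, channels, frames)
        return self.conv(waveform.unsqueeze(1))

# Example: 1 second of 16 kHz mixed audio -> auditory feature map.
encoder = FilterBankEncoder()
auditory_features = encoder(torch.randn(1, 16000))
print(auditory_features.shape)  # torch.Size([1, 256, 1999])
```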
Alternatively, the lip reading model may be trained directly in the electronic device, or may be deployed in the electronic device after being trained on other devices. The trained lip reading model may include a backbone network for feature extraction. Image data is input into the lip reading model, and the backbone network performs feature extraction to obtain the visual features of the corresponding object. Illustratively, the backbone network may include a three-dimensional convolutional layer and a standard ResNet-18 network. Since the image data of each object is a video frame sequence consisting of a plurality of lip regions, each video frame P_i characterizing the motion of the object's lips is first convolved with a three-dimensional convolution kernel of preset size and stride to obtain the corresponding feature F_v. Further, each frame in F_v is passed through the ResNet-18 network to form a one-dimensional embedding, and the backbone network generates an embedding matrix K_i composed of the one-dimensional embeddings of all video frames in the sequence, which characterizes the conversation content of the object corresponding to that video frame sequence.
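A sketch of a backbone of this kind is shown below, assuming torchvision's ResNet-18 as the frame-wise trunk (with its first convolution adapted to the 3-D front end and its classifier removed); the kernel, stride and embedding sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

class LipReadingBackbone(nn.Module):
    """3-D convolution front end followed by a standard ResNet-18, producing
    one embedding per video frame (the embedding matrix K_i in the text)."""

    def __init__(self, embed_dim=512):
        super().__init__()
        # Front-end 3-D convolution over (time, height, width); sizes are illustrative.
        self.conv3d = nn.Sequential(
            nn.Conv3d(1, 64, kernel_size=(5, 7, 7), stride=(1, 2, 2), padding=(2, 3, 3)),
            nn.BatchNorm3d(64),
            nn.ReLU(),
            nn.MaxPool3d(kernel_size=(1, 3, 3), stride=(1, 2, 2), padding=(0, 1, 1)),
        )
        # ResNet-18 trunk applied frame by frame; its first conv is adapted to
        # the 64-channel front-end output and its classifier is removed.
        trunk = resnet18(weights=None)
        trunk.conv1 = nn.Conv2d(64, 64, kernel_size=7, stride=2, padding=3, bias=False)
        trunk.fc = nn.Identity()
        self.resnet = trunk
        self.proj = nn.Linear(512, embed_dim)

    def forward(self, lips):
        # lips: (batch, frames, H, W) grayscale lip crops.
        b, t = lips.shape[0], lips.shape[1]
        f = self.conv3d(lips.unsqueeze(1))      # (b, 64, t, h', w')
        f = f.transpose(1, 2).flatten(0, 1)     # (b * t, 64, h', w')
        emb = self.resnet(f)                    # (b * t, 512)
        return self.proj(emb).view(b, t, -1)    # embedding matrix: (b, t, embed_dim)

backbone = LipReadingBackbone()
K = backbone(torch.randn(1, 25, 88, 88))   # 25 lip frames of size 88 x 88
print(K.shape)                             # torch.Size([1, 25, 512])
```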
And S40, inputting the visual characteristics corresponding to the target object and the auditory characteristics into a trained voice separation network to obtain the target auditory characteristics of the target object.
In one possible implementation manner, after the visual feature and the auditory feature of each object are obtained through feature extraction, the electronic device determines any object as a target object, so as to reconstruct the target auditory feature corresponding to the target object by using the visual feature and the auditory feature of the target object. Wherein the target object is an object of which a conversation content in the video information needs to be extracted, for example, in the case where the video information is a video of a process of collecting a conversation of a person a and a person B, the electronic device may determine that the target object is the person a in the case where the conversation content of the person a needs to be extracted, and determine that the target person is the person B in the case where the conversation content of the person B needs to be extracted. After the electronic equipment determines the target object, the visual characteristics and the auditory characteristics corresponding to the target object are input into a trained voice separation network, and the target auditory characteristics of the target object are extracted from the mixed auditory characteristics based on the visual characteristics of the target object through the voice separation network.
Optionally, the voice separation network may include a visual sub-network, an auditory sub-network, and a multi-mode fusion sub-network. The multi-mode fusion sub-network is used for integrating the output features of the visual sub-network and the auditory sub-network. The voice separation network structure of the embodiment of the disclosure can be designed by analogy with the structure of the human brain: when the human brain extracts the information it needs from mixed audio, it processes visual features and auditory features separately through two cortical areas, and then integrates the information transmitted by the two cortical areas through the thalamus so as to extract the required information from the mixed audio. Accordingly, the visual sub-network and the auditory sub-network designed on the basis of this analogy play the role of the cerebral cortex, while the multi-mode fusion sub-network plays the role of thalamic information integration.
In one possible implementation, the speech separation network may perform feature extraction multiple times based on the visual features and the auditory features of the input target object to obtain the target auditory features of the target object. Optionally, the feature extraction process may be performing feature extraction multiple times based on the visual features and the auditory features corresponding to the target object according to the visual sub-network, the auditory sub-network, and the multi-modal fusion sub-network, so as to obtain the target auditory processing features. And processing the target auditory processing characteristics for a plurality of times according to the auditory sub-network to obtain target auditory characteristics of the target object. Wherein the number of feature extraction in determining the target auditory processing feature process and the number of processing the target auditory processing feature may be the same or different.
Alternatively, the process of extracting features by using the visual sub-network, the auditory sub-network and the multi-mode fusion sub-network may be an iterative process, in which visual features corresponding to the target object are input to the visual sub-network, auditory features are input to the auditory sub-network, and visual intermediate features and auditory intermediate features are output respectively. And integrating the visual intermediate features and the auditory intermediate features through the multi-mode fusion sub-network, and outputting updated visual features and auditory features. And the updated visual characteristics and the updated auditory characteristics are used as the inputs of the visual sub-network and the auditory sub-network in the next iteration process. And stopping the iterative processing process of the visual intermediate feature, the auditory intermediate feature, the visual feature and the auditory feature in response to the iteration times reaching the first preset times, and determining the current auditory feature as the target auditory processing feature. The first preset times are preset multi-mode processing fusion times.
Further, the process of extracting the target auditory processing features for multiple times by using the auditory sub-network to obtain the target auditory features of the target object may also be an iterative process. The auditory sub-network processes the target auditory processing characteristics in an iterative manner, and the processing result after each processing is used as the input of the auditory sub-network in the next processing process. And stopping the iterative processing process of the target auditory processing characteristics in response to the fact that the number of iterations reaches the second preset number, and determining the characteristics output by the auditory sub-network last time as the target auditory characteristics of the target object. The second preset times are preset processing times of the auditory sub-network.
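The two-stage iteration described above can be summarised by the following sketch, in which the sub-networks are passed in as callables and the values of n and m are placeholders for the first and second preset iteration numbers.

```python
def separate_features(visual_feat, auditory_feat, visual_net, auditory_net,
                      fusion_net, n=3, m=4):
    """Two-stage iterative processing of the speech separation network.

    Stage 1: n cycles through the visual/auditory sub-networks plus the
    multi-modal fusion sub-network; the fused outputs become the next inputs.
    Stage 2: m further cycles through the auditory sub-network alone.
    """
    v, a = visual_feat, auditory_feat
    for _ in range(n):
        v_mid = visual_net(v)            # visual intermediate feature
        a_mid = auditory_net(a)          # auditory intermediate feature
        v, a = fusion_net(v_mid, a_mid)  # updated visual / auditory features
    target_auditory_processing = a       # current auditory feature after n cycles
    for _ in range(m):
        target_auditory_processing = auditory_net(target_auditory_processing)
    return target_auditory_processing    # target auditory feature of the target object
```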
In one possible implementation manner, the integration of the visual intermediate feature and the auditory intermediate feature by the multi-mode fusion sub-network proceeds as follows: the sizes of the visual intermediate feature and the auditory intermediate feature are adjusted by nearest neighbor interpolation to obtain a candidate visual feature and a candidate auditory feature; the visual intermediate feature and the candidate auditory feature are then spliced to determine the updated visual feature, and the auditory intermediate feature and the candidate visual feature are spliced to determine the updated auditory feature. That is, given the current visual intermediate feature V_{i,t} and auditory intermediate feature A_{i,t}, the multi-mode fusion sub-network outputs the updated visual feature
V_{i,t+1} = FC_v(F(V_{i,t}, R(A_{i,t})))
and the updated auditory feature
A_{i,t+1} = FC_a(F(A_{i,t}, R(V_{i,t}))),
where R is the nearest neighbor interpolation function, F denotes a concatenation or summation function, and FC_v and FC_a are two different fully connected layers.
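A minimal sketch of one such fusion step is given below, assuming both modalities are 1-D feature maps of shape (batch, channels, time), concatenation as the combining function F, and 1x1 convolutions standing in for the two fully connected layers; these shapes and layer choices are assumptions made for illustration only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AudioVisualFusion(nn.Module):
    """Multi-modal fusion: resize the other modality by nearest-neighbour
    interpolation, splice by concatenation, and project with a per-modality layer."""

    def __init__(self, audio_channels=256, video_channels=512):
        super().__init__()
        # Two different projection layers (FC_a / FC_v in the text), realised
        # here as 1x1 convolutions over the time axis.
        self.fc_a = nn.Conv1d(audio_channels + video_channels, audio_channels, 1)
        self.fc_v = nn.Conv1d(audio_channels + video_channels, video_channels, 1)

    def forward(self, v_mid, a_mid):
        # v_mid: (batch, video_channels, Tv); a_mid: (batch, audio_channels, Ta)
        # Candidate features: resize each modality to the other's temporal length.
        a_cand = F.interpolate(a_mid, size=v_mid.shape[-1], mode="nearest")
        v_cand = F.interpolate(v_mid, size=a_mid.shape[-1], mode="nearest")
        # Updated visual feature: splice visual intermediate + candidate auditory.
        v_new = self.fc_v(torch.cat([v_mid, a_cand], dim=1))
        # Updated auditory feature: splice auditory intermediate + candidate visual.
        a_new = self.fc_a(torch.cat([a_mid, v_cand], dim=1))
        return v_new, a_new

fusion = AudioVisualFusion()
v_new, a_new = fusion(torch.randn(1, 512, 25), torch.randn(1, 256, 1999))
print(v_new.shape, a_new.shape)  # torch.Size([1, 512, 25]) torch.Size([1, 256, 1999])
```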
Fig. 2 shows a schematic diagram of a voice separation network in accordance with an embodiment of the present disclosure. As shown in fig. 2, after determining the auditory feature and the visual feature corresponding to the target object, the electronic device inputs the auditory feature into the auditory sub-network and the visual feature into the visual sub-network for feature extraction. The auditory sub-network and the visual sub-network output corresponding auditory intermediate features and visual intermediate features, respectively; the auditory intermediate features and the visual intermediate features of the current iteration are fused by the multi-mode fusion sub-network to obtain updated visual features and auditory features, which are then input into the visual sub-network and the auditory sub-network for the next iteration. After n iterations, the iterative process ends and the auditory feature output by the multi-mode fusion sub-network in the last iteration is determined as the target auditory processing feature. After the target auditory processing feature is further processed by the auditory sub-network for m iterations, the iterative process ends and the final output of the auditory sub-network in the last iteration is determined as the target auditory feature. The target auditory feature characterizes, within the mixed auditory feature, the content of the target object's utterance. n and m are the first preset number and the second preset number of iterations, respectively.
Fig. 3 shows a schematic diagram of how the human brain processes audiovisual information in accordance with an embodiment of the present disclosure. As shown in fig. 3, when the human brain processes audiovisual information, it first acquires auditory and visual features through the primary auditory thalamus and the primary visual thalamus, and then processes the auditory and visual features sequentially from low-level cortex to high-level cortex through multiple levels of auditory cortex and visual cortex. The processing results are further transmitted to the higher-order auditory/visual thalamus for feature fusion.
According to the embodiment of the disclosure, the auditory sub-network structure and the visual sub-network structure can be designed based on the cortical structure of the human brain, so that the information extraction accuracy of the auditory sub-network and the visual sub-network is improved in a manner of simulating the cerebral cortical structure, namely, multiple processing layers can be arranged in the auditory sub-network and the visual sub-network, the feature extraction is sequentially carried out layer by layer after the corresponding auditory features and visual features are input, and the processed multiple layers of results are output.
Fig. 4 illustrates a schematic diagram of the architecture of an auditory sub-network and a visual sub-network according to an embodiment of the present disclosure. As shown in fig. 4, the visual sub-network and the auditory sub-network each include a plurality of processing layers from top to bottom. When the visual sub-network and/or the auditory sub-network processes feature information, the input information is fed into the lowest processing layer and processed layer by layer from bottom to top to obtain multi-scale visual features and/or multi-scale auditory features. The multi-scale visual and/or auditory features of adjacent layers are combined by splicing to output fused visual features and/or fused auditory features. The fused visual features and/or fused auditory features are then spliced from top to bottom, and the visual intermediate features and/or auditory intermediate features are output. Each layer in the visual sub-network and the auditory sub-network may include a plurality of processing units, and each processing unit has three connection modes: bottom-up, top-down and lateral. The bottom-up connection is realized by max pooling or by convolution with a stride greater than 1, and the top-down connection is realized by upsampling such as interpolation or transposed convolution. Both the upsampling and downsampling connections may occur between adjacent stages and between non-adjacent stages; in the latter case the connection is referred to as a cross-layer connection. In addition, the lateral connection between two processing units in the same processing layer is used to mimic the synaptic connections between different neurons in the same sensory region, and can be realized by convolution.
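The following is a deliberately simplified sketch of such a sub-network for 1-D auditory features (the visual counterpart would be analogous): one processing unit per layer, bottom-up connections via strided convolution, top-down connections via nearest-neighbour upsampling, and adjacent-scale fusion by splicing. Lateral and cross-layer connections are omitted, and the channel width and number of scales are arbitrary choices, not values from the disclosure.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleSubNetwork(nn.Module):
    """Bottom-up / top-down sub-network over 1-D features (batch, C, T)."""

    def __init__(self, channels=256, num_scales=4):
        super().__init__()
        self.num_scales = num_scales
        # Bottom-up connections: strided convolutions halve the temporal length.
        self.down = nn.ModuleList(
            nn.Conv1d(channels, channels, kernel_size=5, stride=2, padding=2)
            for _ in range(num_scales - 1)
        )
        # Adjacent-scale fusion by splicing, followed by a 1x1 convolution.
        self.fuse = nn.ModuleList(
            nn.Conv1d(2 * channels, channels, kernel_size=1)
            for _ in range(num_scales)
        )
        self.out = nn.Conv1d(num_scales * channels, channels, kernel_size=1)

    def forward(self, x):
        # Bottom-up pass from the lowest processing layer: multi-scale features.
        scales = [x]
        for down in self.down:
            scales.append(down(scales[-1]))
        # Fuse each scale with its coarser neighbour; the top-down connection is
        # realised by nearest-neighbour upsampling of the coarser scale.
        fused = []
        for i, feat in enumerate(scales):
            neighbour = scales[min(i + 1, self.num_scales - 1)]
            neighbour = F.interpolate(neighbour, size=feat.shape[-1], mode="nearest")
            fused.append(self.fuse[i](torch.cat([feat, neighbour], dim=1)))
        # Splice all fused scales back at the finest resolution and output the
        # intermediate feature.
        fused = [F.interpolate(f, size=x.shape[-1], mode="nearest") for f in fused]
        return self.out(torch.cat(fused, dim=1))

net = MultiScaleSubNetwork()
print(net(torch.randn(1, 256, 1999)).shape)  # torch.Size([1, 256, 1999])
```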
In an alternative implementation, the lip reading model structure of the disclosed embodiments is also the same as the vision subnetwork for improving the accuracy of the extracted vision features.
And S50, reconstructing target audio data of the target object according to the target auditory characteristics.
In one possible implementation, after determining the target auditory feature of the target object through the speech separation network, the electronic device may determine the target audio data of the target object according to the target auditory feature, that is, obtain the audio content of the portion of the video information in which the target object speaks. Alternatively, the audio data extraction process may be to generate a corresponding mask feature from the target auditory feature, obtain the speech feature of the corresponding target object from the mask feature and the auditory feature, and then process the speech feature through an inverse filter bank to reconstruct the target audio data of the target object. The auditory feature here is the one obtained with the filter bank. Specifically, the target auditory feature is processed through a fully connected layer and the nonlinear activation function ReLU to generate the mask feature of the target object, and the mask feature and the auditory feature are multiplied element by element to obtain the speech feature of the target object. Alternatively, in the case where the spoken content of a specific person in a multi-person conversation video needs to be acquired, the target object is that specific person.
Alternatively, the inverse filter bank may be a one-dimensional transposed convolution layer for recovering the input speech features of the target object into a time-domain signal. The hyperparameters of the inverse filter bank are consistent with those of the filter bank used to extract the auditory features.
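A sketch of the masking and reconstruction step is given below; the channel count, kernel length and stride mirror the illustrative filter-bank settings used in the earlier sketch and are assumptions, not values specified by the disclosure.

```python
import torch
import torch.nn as nn

class MaskingDecoder(nn.Module):
    """Generate a mask from the target auditory feature, apply it to the
    mixture's auditory feature, and reconstruct a waveform with an inverse
    filter bank (a 1-D transposed convolution mirroring the encoder settings)."""

    def __init__(self, num_channels=256, kernel_size=16, stride=8):
        super().__init__()
        # Fully connected layer + ReLU produce the mask feature.
        self.mask = nn.Sequential(nn.Linear(num_channels, num_channels), nn.ReLU())
        # Inverse filter bank: hyperparameters consistent with the filter bank.
        self.deconv = nn.ConvTranspose1d(num_channels, 1, kernel_size, stride=stride, bias=False)

    def forward(self, target_auditory_feat, mixture_auditory_feat):
        # Both features: (batch, channels, frames).
        m = self.mask(target_auditory_feat.transpose(1, 2)).transpose(1, 2)
        speech_feat = m * mixture_auditory_feat       # element-wise masking
        return self.deconv(speech_feat).squeeze(1)    # time-domain target audio

decoder = MaskingDecoder()
audio = decoder(torch.randn(1, 256, 1999), torch.randn(1, 256, 1999))
print(audio.shape)  # torch.Size([1, 16000])
```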
Fig. 5 shows a schematic diagram of an audio-visual speech separation process in accordance with an embodiment of the present disclosure. As shown in fig. 5, after acquiring video information including two person objects, the electronic apparatus extracts the image data of each person object and the mixed audio data of the entire video. The mixed audio data then undergoes feature extraction through a filter bank to obtain the auditory feature, and the image data of each person object undergoes feature extraction through a lip reading model (i.e., a lip feature extraction model) to obtain the corresponding visual features. After obtaining the auditory feature and the visual feature of each person object, the electronic device may input them into the trained voice separation network (i.e., the fusion processing component), and the voice separation network extracts the target auditory feature of a person object according to the visual feature of that person object input together with the auditory feature. The target auditory feature is then processed through an inverse filter bank to reconstruct the target audio data of the person object, i.e., the audio of that person object's spoken content in the video information.
Based on the above technical features, the embodiment of the disclosure can process the visual features and the auditory features separately through the visual sub-network and the auditory sub-network, and transmit and integrate them through the multi-mode fusion sub-network, so as to accurately extract the required target audio data from the video information based on the visual features and the mixed auditory features. Meanwhile, the structure of the voice separation network and the structures of the visual sub-network and the auditory sub-network are both brain-inspired, so that the multi-modal integration that widely occurs in the sensory thalamus and the cerebral cortex can be simulated, improving the performance and efficiency of voice separation.
Fig. 6 shows a schematic diagram of an audiovisual speech separation device in accordance with an embodiment of the present disclosure. As shown in fig. 6, an audio-visual voice separation apparatus of an embodiment of the present disclosure includes:
an information acquisition module 60, configured to acquire video information to be processed, where the video information includes mixed audio data and image data of at least one object;
a data extraction module 61 for extracting mixed audio data in the video information, and image data of each of the objects;
the feature extraction module 62 is configured to perform feature extraction on the mixed audio data to obtain an auditory feature, and perform feature extraction on each image data to obtain a corresponding visual feature;
The feature separation module 63 is configured to input a visual feature corresponding to a target object and the auditory feature into a trained voice separation network to obtain a target auditory feature of the target object, where the voice separation network includes a visual sub-network, an auditory sub-network, and a multi-modal fusion sub-network, the visual sub-network is configured to process the input visual feature, the auditory sub-network is configured to process the input auditory feature, and the multi-modal fusion sub-network is configured to integrate output features of the visual sub-network and the auditory sub-network;
a data conversion module 64 for reconstructing target audio data of the target object from the target auditory characteristics.
In one possible implementation, extracting the audio data in the video information and the image data of each object includes:
extracting an audio frame sequence in the video information to obtain audio data;
extracting a video frame sequence of the at least one object in the video information;
and determining image data of the corresponding object according to each video frame sequence.
In one possible implementation, extracting the video frame sequence of the at least one object in the video information includes:
Identifying at least one object included in each video frame in the video information through a face recognition model obtained through pre-training;
marking a lip region of each of the objects in the video frame;
and determining a corresponding video frame sequence according to the lip areas of the same object in different video frames.
In one possible implementation manner, feature extraction is performed on the audio data to obtain an auditory feature, and feature extraction is performed on each image data to obtain a corresponding visual feature, including:
extracting the characteristics of the audio data through a filter bank to obtain auditory characteristics;
and respectively carrying out feature extraction on each image data according to a pre-trained lip reading model to obtain visual features of the corresponding object.
In one possible implementation, the lip read model includes a backbone network for feature extraction.
In one possible implementation manner, the inputting the visual feature corresponding to the target object and the hearing feature into the trained voice separation network to obtain the target hearing feature of the target object includes:
according to the visual sub-network, the auditory sub-network and the multi-mode fusion sub-network, carrying out multiple feature extraction based on the visual features and the auditory features corresponding to the target object to obtain target auditory processing features;
and processing the target auditory processing characteristics for a plurality of times according to the auditory sub-network to obtain target auditory characteristics of the target object.
In one possible implementation manner, performing feature extraction multiple times based on the visual features and the auditory features corresponding to the target object according to the visual sub-network, the auditory sub-network and the multi-mode fusion sub-network to obtain target auditory processing features includes:
the following procedure is performed in an iterative manner:
inputting the visual characteristics corresponding to the target object into the visual sub-network, inputting the auditory characteristics into the auditory sub-network, and respectively outputting visual intermediate characteristics and auditory intermediate characteristics;
integrating the visual intermediate features and the auditory intermediate features through the multi-mode fusion sub-network, and outputting the updated visual features and auditory features corresponding to the target object as inputs of the visual sub-network and the auditory sub-network in the next processing process;
and stopping the iterative processing process of the visual intermediate feature, the auditory intermediate feature, the visual feature and the auditory feature in response to the iteration number reaching a first preset iteration number, and determining the current auditory feature as a target auditory processing feature.
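A sketch of this first, iterative stage, under the assumption that the three sub-networks are ordinary callables and that the first preset iteration number is passed in as num_iters; none of the names below are taken from the disclosure.

def first_stage(visual_feat, auditory_feat, visual_net, auditory_net, fusion_net, num_iters):
    """Iterate the two uni-modal sub-networks and the multi-mode fusion sub-network."""
    for _ in range(num_iters):                             # first preset iteration number
        visual_mid = visual_net(visual_feat)               # visual intermediate features
        auditory_mid = auditory_net(auditory_feat)         # auditory intermediate features
        # the fusion sub-network returns the updated features used as next-iteration inputs
        visual_feat, auditory_feat = fusion_net(visual_mid, auditory_mid)
    return auditory_feat                                   # target auditory processing features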
In one possible implementation manner, processing the target auditory processing features multiple times according to the auditory sub-network to obtain the target auditory features of the target object includes:
processing the target auditory processing characteristics for a plurality of times in an iterative manner through the auditory sub-network, wherein a processing result after each processing is used as input of the auditory sub-network in the next processing process;
and stopping the iterative processing process of the target auditory processing characteristics in response to the number of iterations reaching a second preset number of iterations, and determining that the last output characteristics of the auditory sub-network are the target auditory characteristics of the target object.
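The second stage can be sketched in the same illustrative style; num_iters stands for the second preset iteration number and is an assumption of this sketch rather than a value fixed by the disclosure.

def second_stage(target_auditory_processing_feat, auditory_net, num_iters):
    """Refine the target auditory processing features with the auditory sub-network alone."""
    feat = target_auditory_processing_feat
    for _ in range(num_iters):                  # second preset iteration number
        feat = auditory_net(feat)               # each output is the next pass's input
    return feat                                 # target auditory features of the target object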
In one possible implementation manner, performing integration processing on the visual intermediate features and the auditory intermediate features through the multi-mode fusion sub-network and outputting updated visual features and auditory features includes:
the sizes of the visual intermediate features and the auditory intermediate features are adjusted through the multi-mode fusion sub-network by a nearest neighbor interpolation method, so that candidate visual features and candidate auditory features are obtained;
performing stitching processing on the visual intermediate features and the candidate auditory features to determine updated visual features;
and performing stitching processing on the auditory intermediate features and the candidate visual features to determine updated auditory features.
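A minimal PyTorch sketch of this fusion step: nearest-neighbour interpolation aligns the temporal lengths of the two modalities before cross-modal stitching; the 1x1 convolutions that restore the channel counts after stitching are an assumption of this sketch, not part of the text above.

import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionSubNetwork(nn.Module):
    def __init__(self, visual_channels, auditory_channels):
        super().__init__()
        self.to_visual = nn.Conv1d(visual_channels + auditory_channels, visual_channels, 1)
        self.to_auditory = nn.Conv1d(visual_channels + auditory_channels, auditory_channels, 1)

    def forward(self, visual_mid, auditory_mid):          # (B, Cv, Tv) and (B, Ca, Ta)
        # resize each modality to the other's temporal length (candidate features)
        cand_auditory = F.interpolate(auditory_mid, size=visual_mid.shape[-1], mode="nearest")
        cand_visual = F.interpolate(visual_mid, size=auditory_mid.shape[-1], mode="nearest")
        # stitch and project back to the original channel counts
        updated_visual = self.to_visual(torch.cat([visual_mid, cand_auditory], dim=1))
        updated_auditory = self.to_auditory(torch.cat([auditory_mid, cand_visual], dim=1))
        return updated_visual, updated_auditory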
In one possible implementation, the visual sub-network and the auditory sub-network include a plurality of processing layers from top to bottom;
in the process of processing the feature information by the visual sub-network and/or the auditory sub-network, inputting the information into the lowest processing layer and processing it layer by layer from bottom to top to obtain multi-scale visual features and/or multi-scale auditory features;
splicing the multi-scale visual features and/or auditory features of adjacent layers to output fused visual features and/or fused auditory features;
and splicing the fused visual features and/or fused auditory features from top to bottom to output the visual intermediate features and/or auditory intermediate features.
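The layered structure just described resembles a bottom-up/top-down feature pyramid; the sketch below is one such pyramid under assumed depths and widths and should not be read as the exact sub-network of the disclosure.

import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidSubNetwork(nn.Module):
    def __init__(self, channels=256, num_layers=4):
        super().__init__()
        # bottom-up path: each layer halves the temporal resolution
        self.down = nn.ModuleList(
            [nn.Conv1d(channels, channels, 5, stride=2, padding=2) for _ in range(num_layers)]
        )
        # splicing of adjacent scales on the way back down
        self.fuse = nn.ModuleList(
            [nn.Conv1d(2 * channels, channels, 1) for _ in range(num_layers)]
        )

    def forward(self, x):                                   # x enters the lowest processing layer
        scales = [x]
        for down in self.down:                              # bottom-up, layer by layer
            scales.append(torch.relu(down(scales[-1])))     # multi-scale features
        out = scales[-1]
        for i in range(len(self.down) - 1, -1, -1):         # top-down splicing of adjacent layers
            out = F.interpolate(out, size=scales[i].shape[-1], mode="nearest")
            out = self.fuse[i](torch.cat([scales[i], out], dim=1))
        return out                                          # intermediate features, same length as x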
In one possible implementation, reconstructing the target audio data of the target object from the target auditory features includes:
generating corresponding mask features according to the target auditory features;
obtaining corresponding voice features according to the mask features and the auditory features;
and reconstructing target audio data of the target object by processing the voice characteristics through an inverse filter bank.
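As an illustrative counterpart to the analysis filter bank sketched earlier, the reconstruction step can be written as below; modelling the mask generator as a 1x1 convolution with a sigmoid and the inverse filter bank as a transposed convolution are assumptions of this sketch.

import torch.nn as nn

class Reconstructor(nn.Module):
    def __init__(self, n_filters=256, kernel_size=32, stride=16):
        super().__init__()
        self.mask = nn.Sequential(nn.Conv1d(n_filters, n_filters, 1), nn.Sigmoid())
        self.inverse_filter_bank = nn.ConvTranspose1d(
            n_filters, 1, kernel_size, stride=stride, bias=False
        )

    def forward(self, target_auditory, mixture_auditory):            # both (B, N, T)
        speech_feat = self.mask(target_auditory) * mixture_auditory  # mask features gate the mixture
        return self.inverse_filter_bank(speech_feat).squeeze(1)      # target audio data (B, samples)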
In some embodiments, functions or modules included in an apparatus provided by the embodiments of the present disclosure may be used to perform a method described in the foregoing method embodiments, and specific implementations thereof may refer to descriptions of the foregoing method embodiments, which are not repeated herein for brevity.
The disclosed embodiments also provide a computer readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the above-described method. The computer readable storage medium may be a volatile or nonvolatile computer readable storage medium.
The embodiment of the disclosure also provides an electronic device, which comprises: a processor; a memory for storing processor-executable instructions; wherein the processor is configured to implement the above-described method when executing the instructions stored by the memory.
Embodiments of the present disclosure also provide a computer program product comprising computer readable code, or a non-transitory computer readable storage medium carrying computer readable code, which when run in a processor of an electronic device, performs the above method.
Fig. 7 shows a schematic diagram of an electronic device 800 according to an embodiment of the disclosure. For example, electronic device 800 may be a mobile phone, computer, digital broadcast terminal, messaging device, game console, tablet device, medical device, exercise device, personal digital assistant, or the like.
Referring to fig. 7, an electronic device 800 may include one or more of the following components: a processing component 802, a memory 804, a power component 806, a multimedia component 808, an audio component 810, an input/output (I/O) interface 812, a sensor component 814, and a communication component 816.
The processing component 802 generally controls overall operation of the electronic device 800, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing component 802 may include one or more processors 820 to execute instructions to perform all or part of the steps of the methods described above. Further, the processing component 802 can include one or more modules that facilitate interactions between the processing component 802 and other components. For example, the processing component 802 can include a multimedia module to facilitate interaction between the multimedia component 808 and the processing component 802.
The memory 804 is configured to store various types of data to support operations at the electronic device 800. Examples of such data include instructions for any application or method operating on the electronic device 800, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 804 may be implemented by any type or combination of volatile or nonvolatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disk.
The power supply component 806 provides power to the various components of the electronic device 800. The power components 806 may include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for the electronic device 800.
The multimedia component 808 includes a screen that provides an output interface between the electronic device 800 and the user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from a user. The touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the touch panel. The touch sensor may sense not only the boundary of a touch or swipe action, but also the duration and pressure associated with the touch or swipe operation. In some embodiments, the multimedia component 808 includes a front camera and/or a rear camera. When the electronic device 800 is in an operational mode, such as a shooting mode or a video mode, the front camera and/or the rear camera may receive external multimedia data. Each front camera and rear camera may be a fixed optical lens system or have focal length and optical zoom capabilities.
The audio component 810 is configured to output and/or input audio signals. For example, the audio component 810 includes a Microphone (MIC) configured to receive external audio signals when the electronic device 800 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may be further stored in the memory 804 or transmitted via the communication component 816. In some embodiments, audio component 810 further includes a speaker for outputting audio signals.
The I/O interface 812 provides an interface between the processing component 802 and peripheral interface modules, which may be a keyboard, click wheel, buttons, etc. These buttons may include, but are not limited to: homepage button, volume button, start button, and lock button.
The sensor assembly 814 includes one or more sensors for providing status assessments of various aspects of the electronic device 800. For example, the sensor assembly 814 may detect an on/off state of the electronic device 800 and the relative positioning of components, such as the display and keypad of the electronic device 800. The sensor assembly 814 may also detect a change in position of the electronic device 800 or a component of the electronic device 800, the presence or absence of user contact with the electronic device 800, the orientation or acceleration/deceleration of the electronic device 800, and a change in temperature of the electronic device 800. The sensor assembly 814 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact. The sensor assembly 814 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 814 may also include an acceleration sensor, a gyroscopic sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 816 is configured to facilitate wired or wireless communication between the electronic device 800 and other devices. The electronic device 800 may access a wireless network based on a communication standard, such as WiFi, 2G, or 3G, or a combination thereof. In one exemplary embodiment, the communication component 816 receives broadcast signals or broadcast related information from an external broadcast management system via a broadcast channel. In one exemplary embodiment, the communication component 816 further includes a Near Field Communication (NFC) module to facilitate short range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, Infrared Data Association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the electronic device 800 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic elements for executing the methods described above.
In an exemplary embodiment, a non-transitory computer readable storage medium is also provided, such as memory 804 including computer program instructions executable by processor 820 of electronic device 800 to perform the above-described methods.
Fig. 8 shows a schematic diagram of another electronic device 1900 according to an embodiment of the disclosure. For example, electronic device 1900 may be provided as a server or terminal device. Referring to fig. 8, electronic device 1900 includes a processing component 1922 that further includes one or more processors and memory resources represented by memory 1932 for storing instructions, such as application programs, that can be executed by processing component 1922. The application programs stored in memory 1932 may include one or more modules each corresponding to a set of instructions. Further, processing component 1922 is configured to execute instructions to perform the methods described above.
The electronic device 1900 may also include a power component 1926 configured to perform power management of the electronic device 1900, a wired or wireless network interface 1950 configured to connect the electronic device 1900 to a network, and an input/output (I/O) interface 1958. The electronic device 1900 may operate based on an operating system stored in memory 1932, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, or the like.
In an exemplary embodiment, a non-transitory computer readable storage medium is also provided, such as memory 1932, including computer program instructions executable by processing component 1922 of electronic device 1900 to perform the methods described above.
The present disclosure may be a system, method, and/or computer program product. The computer program product may include a computer readable storage medium having computer readable program instructions embodied thereon for causing a processor to implement aspects of the present disclosure.
The computer readable storage medium may be a tangible device that can hold and store instructions for use by an instruction execution device. The computer readable storage medium may be, for example, but not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer-readable storage medium would include the following: a portable computer disk, a hard disk, Random Access Memory (RAM), Read-Only Memory (ROM), Erasable Programmable Read-Only Memory (EPROM or flash memory), Static Random Access Memory (SRAM), portable Compact Disk Read-Only Memory (CD-ROM), Digital Versatile Disks (DVD), memory sticks, floppy disks, mechanical encoding devices such as punch cards or raised structures in grooves having instructions stored thereon, and any suitable combination of the foregoing. Computer-readable storage media, as used herein, are not to be construed as transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through waveguides or other transmission media (e.g., optical pulses through fiber optic cables), or electrical signals transmitted through wires.
The computer readable program instructions described herein may be downloaded from a computer readable storage medium to a respective computing/processing device or to an external computer or external storage device over a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmissions, wireless transmissions, routers, firewalls, switches, gateway computers and/or edge servers. The network interface card or network interface in each computing/processing device receives computer readable program instructions from the network and forwards the computer readable program instructions for storage in a computer readable storage medium in the respective computing/processing device.
Computer program instructions for performing the operations of the present disclosure can be assembly instructions, Instruction Set Architecture (ISA) instructions, machine-related instructions, microcode, firmware instructions, state setting data, or source or object code written in any combination of one or more programming languages, including an object oriented programming language such as Smalltalk, C++ or the like, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The computer readable program instructions may be executed entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, aspects of the present disclosure are implemented by personalizing electronic circuitry, such as programmable logic circuitry, Field Programmable Gate Arrays (FPGAs), or Programmable Logic Arrays (PLAs), with state information of the computer readable program instructions, and the electronic circuitry can execute the computer readable program instructions.
Various aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.
These computer readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer readable medium having the instructions stored therein includes an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The foregoing description of the embodiments of the present disclosure has been presented for purposes of illustration and description, and is not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen to best explain the principles of the embodiments, the practical application, or the technical improvement over technologies found in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (14)

1. A method of audio-visual speech separation, the method comprising:
acquiring video information to be processed, wherein the video information comprises mixed audio data and image data of at least one object;
extracting mixed audio data in the video information and image data of each object;
extracting features of the mixed audio data to obtain auditory features, and extracting features of each image data to obtain corresponding visual features;
inputting the visual characteristics corresponding to the target object and the auditory characteristics into a trained voice separation network to obtain the target auditory characteristics of the target object, wherein the voice separation network comprises a visual sub-network, an auditory sub-network and a multi-mode fusion sub-network, the visual sub-network is used for processing the input visual characteristics, the auditory sub-network is used for processing the input auditory characteristics, and the multi-mode fusion sub-network is used for integrating the output characteristics of the visual sub-network and the auditory sub-network;
and reconstructing target audio data of the target object according to the target auditory feature.
2. The method of claim 1, wherein extracting audio data in the video information and image data for each of the objects comprises:
extracting an audio frame sequence in the video information to obtain audio data;
extracting a video frame sequence of the at least one object in the video information;
and determining image data of the corresponding object according to each video frame sequence.
3. The method of claim 2, wherein extracting the sequence of video frames of the at least one object in the video information comprises:
identifying at least one object included in each video frame in the video information through a face recognition model obtained through pre-training;
marking a lip region of each of the objects in the video frame;
and determining a corresponding video frame sequence according to the lip areas of the same object in different video frames.
4. The method of claim 1, wherein the extracting the audio data to obtain the auditory feature, and the extracting each image data to obtain the corresponding visual feature, respectively, comprises:
Extracting the characteristics of the audio data through a filter bank to obtain auditory characteristics;
and respectively carrying out feature extraction on each image data according to a pre-trained lip reading model to obtain visual features of the corresponding object.
5. The method of claim 4, wherein the lip reading model includes a backbone network for feature extraction.
6. The method of claim 1, wherein inputting the visual features corresponding to the target object and the auditory features into the trained voice separation network to obtain the target auditory features of the target object comprises:
according to the visual sub-network, the auditory sub-network and the multi-mode fusion sub-network, carrying out multiple feature extraction based on the visual features and the auditory features corresponding to the target object to obtain target auditory processing features;
and processing the target auditory processing characteristics for a plurality of times according to the auditory sub-network to obtain target auditory characteristics of the target object.
7. The method of claim 6, wherein performing feature extraction multiple times based on the visual features and the auditory features corresponding to the target object according to the visual sub-network, the auditory sub-network, and the multi-modal fusion sub-network to obtain target auditory processing features, comprises:
The following procedure is performed in an iterative manner:
inputting the visual characteristics corresponding to the target object into the visual sub-network, inputting the auditory characteristics into the auditory sub-network, and respectively outputting visual intermediate characteristics and auditory intermediate characteristics;
integrating the visual intermediate features and the auditory intermediate features through the multi-mode fusion sub-network, and outputting the updated visual features and auditory features corresponding to the target object as inputs of the visual sub-network and the auditory sub-network in the next processing process;
and stopping the iterative processing process of the visual intermediate feature, the auditory intermediate feature, the visual feature and the auditory feature in response to the iteration number reaching a first preset iteration number, and determining the current auditory feature as a target auditory processing feature.
8. The method according to claim 6 or 7, wherein the processing the target auditory processing characteristics multiple times according to the auditory sub-network to obtain target auditory characteristics of the target object comprises:
processing the target auditory processing characteristics for a plurality of times in an iterative manner through the auditory sub-network, wherein a processing result after each processing is used as input of the auditory sub-network in the next processing process;
and stopping the iterative processing process of the target auditory processing characteristics in response to the number of iterations reaching a second preset number of iterations, and determining that the last output characteristics of the auditory sub-network are the target auditory characteristics of the target object.
9. The method of claim 7, wherein integrating the visual intermediate features and the auditory intermediate features through the multimodal fusion sub-network outputs updated visual features and auditory features, comprising:
the sizes of the visual intermediate features and the auditory intermediate features are adjusted through the multi-mode fusion sub-network by a nearest neighbor interpolation method, so that candidate visual features and candidate auditory features are obtained;
performing stitching processing on the visual intermediate features and the candidate auditory features to determine updated visual features;
and performing stitching processing on the auditory intermediate features and the candidate visual features to determine updated auditory features.
10. The method of claim 1, wherein the visual sub-network and the auditory sub-network comprise a plurality of processing layers from top to bottom;
in the process of processing the feature information by the visual sub-network and/or the auditory sub-network, inputting the information into the lowest processing layer and processing it layer by layer from bottom to top to obtain multi-scale visual features and/or multi-scale auditory features;
splicing the multi-scale visual features and/or auditory features of adjacent layers to output fused visual features and/or fused auditory features;
and splicing the fused visual features and/or fused auditory features from top to bottom to output the visual intermediate features and/or auditory intermediate features.
11. The method of claim 1, wherein reconstructing the target audio data of the target object from the target auditory features comprises:
generating corresponding mask features according to the target auditory features;
obtaining corresponding voice features according to the mask features and the auditory features;
and reconstructing target audio data of the target object by processing the voice characteristics through an inverse filter bank.
12. An audiovisual speech separation device, the device comprising:
the information acquisition module is used for acquiring video information to be processed, wherein the video information comprises mixed audio data and image data of at least one object;
the data extraction module is used for extracting mixed audio data in the video information and image data of each object;
the feature extraction module is used for carrying out feature extraction on the mixed audio data to obtain auditory features, and carrying out feature extraction on each image data to obtain corresponding visual features;
The feature separation module is used for inputting the visual features corresponding to the target object and the auditory features into the trained voice separation network to obtain the target auditory features of the target object, the voice separation network comprises a visual sub-network, an auditory sub-network and a multi-mode fusion sub-network, the visual sub-network is used for processing the input visual features, the auditory sub-network is used for processing the input auditory features, and the multi-mode fusion sub-network is used for integrating the output features of the visual sub-network and the auditory sub-network;
and the data conversion module is used for reconstructing target audio data of the target object according to the target auditory characteristics.
13. An electronic device, comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to implement the method of any one of claims 1 to 11 when executing the instructions stored by the memory.
14. A non-transitory computer readable storage medium having stored thereon computer program instructions, which when executed by a processor, implement the method of any of claims 1 to 11.
CN202211584453.4A 2022-12-09 2022-12-09 Audio-visual voice separation method, audio-visual voice separation device, electronic equipment and storage medium Pending CN116129929A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211584453.4A CN116129929A (en) 2022-12-09 2022-12-09 Audio-visual voice separation method, audio-visual voice separation device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211584453.4A CN116129929A (en) 2022-12-09 2022-12-09 Audio-visual voice separation method, audio-visual voice separation device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN116129929A true CN116129929A (en) 2023-05-16

Family

ID=86301925

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211584453.4A Pending CN116129929A (en) 2022-12-09 2022-12-09 Audio-visual voice separation method, audio-visual voice separation device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116129929A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination