CN118101988A - Video processing method, system and electronic equipment - Google Patents

Video processing method, system and electronic equipment

Info

Publication number
CN118101988A
Authority
CN
China
Prior art keywords
audio
image
video
features
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410508447.3A
Other languages
Chinese (zh)
Inventor
吴彪
夏日升
唐巍
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Honor Device Co Ltd
Original Assignee
Honor Device Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Honor Device Co Ltd
Priority to CN202410508447.3A
Publication of CN118101988A
Legal status: Pending


Abstract

The application relates to the technical field of video processing and provides a video processing method, a system, and an electronic device, where the method includes: acquiring image features and voiceprint features of a target user; acquiring, based on the image features and the voiceprint features, a target image and target audio of the target user from a first media stream using a diffusion network; and generating a second media stream based on the target image and the target audio. In this way, the diffusion network can be used to generate images and audio that match the image features and voiceprint features of the target user and to remove content in the media stream that does not match these features, so that the electronic device acquires and plays video and audio containing the target user, reducing interference during audio and video playback and improving the user's audiovisual experience.

Description

Video processing method, system and electronic equipment
Technical Field
The present application relates to the field of video processing technologies, and in particular, to a video processing method, a system, and an electronic device.
Background
When an electronic device is used for video calls and conferences, if a person other than the participants is present within the range of the images and sound collected by the electronic device, the images and sounds produced by that interfering person are also collected by the electronic device and turned into corresponding signals, which affects the video quality of the participants and degrades the experience of video calls and conferences.
In scenarios such as video calls and conferences, the interference of such a person with the images generated by the electronic device can be reduced by blurring the background, filling in a virtual background, or the like. However, blurring the background, filling in a virtual background, and similar approaches only process the image signal in the video; the audio signals produced by the interfering person's actions and collected by the electronic device still remain in the video, which affects the playback quality of the video and reduces the user's audiovisual experience.
Disclosure of Invention
The embodiments of the present application provide a video processing method, a system, and an electronic device, which are used to solve the problem that signals generated by an interfering person in the media signal disturb video playback.
In a first aspect, an embodiment of the present application provides a video processing method applied to an electronic device, where the method includes: acquiring image features and voiceprint features of a target user; acquiring, based on the image features and the voiceprint features, a target image and target audio of the target user from a first media stream using a diffusion network; and generating a second media stream based on the target image and the target audio.
In this way, the diffusion network can be used to generate images and audio that match the image features and voiceprint features of the target user and to remove content in the media stream that does not match these features, so that the electronic device acquires and plays video and audio containing the target user, reducing interference during audio and video playback and improving the user's audiovisual experience.
In one possible implementation, acquiring the image features and the voiceprint features of the target user includes: acquiring a sample image and sample speech of the target user; and encoding the sample image and the sample speech separately to obtain the image features and the voiceprint features. In this way, the image features and the voiceprint features can be obtained by encoding the sample image and sample speech of the target user, so that the first media stream can subsequently be processed using the image features and the voiceprint features, improving the processing efficiency of the media stream.
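As a concrete reference, the following is a minimal sketch of such an enrollment step, assuming PyTorch and two small CNN-based encoders; the patent does not specify the encoder architectures or feature dimensions, so the networks, layer sizes, and the 128-dimensional feature length below are illustrative assumptions only.

```python
import torch

# Hypothetical encoders; the patent does not name concrete networks, so any
# CNN-based image encoder and voiceprint encoder could fill these roles.
image_encoder = torch.nn.Sequential(
    torch.nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1),
    torch.nn.ReLU(),
    torch.nn.AdaptiveAvgPool2d(1),
    torch.nn.Flatten(),
    torch.nn.Linear(16, 128),   # 128-dim image feature (assumed size)
)
voice_encoder = torch.nn.Sequential(
    torch.nn.Conv1d(1, 16, kernel_size=5, stride=2, padding=2),
    torch.nn.ReLU(),
    torch.nn.AdaptiveAvgPool1d(1),
    torch.nn.Flatten(),
    torch.nn.Linear(16, 128),   # 128-dim voiceprint feature (assumed size)
)

def enroll(sample_image: torch.Tensor, sample_speech: torch.Tensor):
    """Encode one sample image (1 x 3 x H x W) and one speech clip (1 x 1 x T)
    into an image feature and a voiceprint feature."""
    with torch.no_grad():
        image_feature = image_encoder(sample_image)
        voiceprint_feature = voice_encoder(sample_speech)
    return image_feature, voiceprint_feature
```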
In one possible implementation, the sample image and sample speech are an image and audio stored in the electronic device for identifying the user. In this way, the sample image and sample speech of the target user can be obtained by directly calling the image and audio already stored in the electronic device, making it convenient for the electronic device to obtain the image features and the voiceprint features.
In one possible implementation, obtaining the sample image and sample speech of the target user includes: acquiring a first video frame from the first media stream, where the first video frame includes an image of at least one user; determining the target user in the first video frame; extracting an image of the target user from the first video frame as the sample image; and extracting a speech segment of the target user from the first media stream as the sample speech. In this way, the target user can be selected from the received first media stream, improving the flexibility of target user selection.
In one possible implementation, the first media stream includes a first video and a first audio acquired in real time, and acquiring, based on the image features and the voiceprint features, the target image and target audio of the target user from the first media stream using the diffusion network includes: extracting features from the first video to obtain first video features, and extracting features from the first audio to obtain first audio features; and acquiring, based on the image features and the voiceprint features, the target image from the first video features and the target audio from the first audio features using the diffusion network. In this way, when the target image and the target audio are generated using the diffusion network, extracting the features of the first video and the first audio improves the speed at which the diffusion network processes the first media stream.
In one possible implementation, the first video features include a plurality of video frame features, and extracting features from the first video to obtain the first video features includes: splitting the first video into a plurality of video frames; and extracting features from each video frame separately to obtain the video frame feature corresponding to each video frame. This realizes the feature extraction of the first video and provides one of the inputs of the diffusion network, so that the diffusion network can process the first video.
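For illustration only, a minimal frame-splitting sketch is given below, assuming PyTorch and an arbitrary per-frame encoder; the tensor layout and the encoder are assumptions, not details specified by the patent.

```python
import torch

def video_frame_features(first_video: torch.Tensor, frame_encoder) -> list:
    """Split a video tensor (T x 3 x H x W) into individual frames and encode
    each frame separately; `frame_encoder` is any per-frame feature extractor,
    for example a CNN such as the image_encoder sketched earlier."""
    features = []
    for frame in first_video:                                # framing
        features.append(frame_encoder(frame.unsqueeze(0)))   # per-frame feature
    return features
```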
In one possible implementation, the first audio features include a plurality of audio frame features, and extracting features from the first audio to obtain the first audio features includes: splitting the first audio into a plurality of first audio frames; applying a time-domain window to each first audio frame to obtain a plurality of second audio frames; and performing a short-time Fourier transform on each second audio frame to obtain the audio frame feature corresponding to each second audio frame. This realizes the feature extraction of the first audio and provides one of the inputs of the diffusion network, so that the diffusion network can process the first audio.
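A minimal sketch of this framing/windowing/STFT pipeline is given below using NumPy; the sampling rate, 25 ms frame length, 10 ms hop, and Hann window are common speech-processing defaults assumed here, not values stated in the patent.

```python
import numpy as np

def audio_frame_features(first_audio: np.ndarray, sr: int = 16000,
                         frame_ms: float = 25.0, hop_ms: float = 10.0) -> np.ndarray:
    """Split first_audio into frames, window each frame in the time domain,
    and take a short-time Fourier transform of each windowed frame."""
    frame_len = int(sr * frame_ms / 1000)      # e.g. 400 samples at 16 kHz
    hop_len = int(sr * hop_ms / 1000)          # e.g. 160 samples at 16 kHz
    window = np.hanning(frame_len)             # Hann window (assumed choice)

    features = []
    for start in range(0, len(first_audio) - frame_len + 1, hop_len):
        first_audio_frame = first_audio[start:start + frame_len]   # framing
        second_audio_frame = first_audio_frame * window            # time-domain windowing
        spectrum = np.fft.rfft(second_audio_frame)                 # one STFT column
        features.append(np.abs(spectrum))                          # audio frame feature
    return np.stack(features)                  # shape: (num_frames, frame_len // 2 + 1)
```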
In one possible implementation, acquiring, based on the image features and the voiceprint features, the target image from the first video features and the target audio from the first audio features using the diffusion network includes: fusing the image features and the voiceprint features to obtain a fused feature; and taking the fused feature, the first video features, and the first audio features as inputs of the diffusion network, using the diffusion network to extract the target image from each video frame feature and the target audio from each audio frame feature. In this way, the image features and the voiceprint features can be combined, and the diffusion network can use the fused feature to obtain a target image and target audio that conform to the fused feature, yielding video and audio that include only the target user.
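How the fused feature might condition the network is sketched below; `diffusion_net` is a hypothetical callable standing in for the diffusion network, concatenation is only one plausible fusion choice, and the `modality` argument is an assumption used here to distinguish the video and audio branches.

```python
import torch

def build_condition(image_feature: torch.Tensor,
                    voiceprint_feature: torch.Tensor) -> torch.Tensor:
    """Fuse the image feature and the voiceprint feature into one conditioning vector."""
    return torch.cat([image_feature, voiceprint_feature], dim=-1)

def extract_targets(diffusion_net, video_frame_features, audio_frame_features,
                    image_feature, voiceprint_feature):
    """Run the (hypothetical) diffusion network over each video frame feature and
    each audio frame feature, conditioned on the fused feature."""
    cond = build_condition(image_feature, voiceprint_feature)
    target_images = [diffusion_net(f, cond, modality="video") for f in video_frame_features]
    target_audio = [diffusion_net(f, cond, modality="audio") for f in audio_frame_features]
    return target_images, target_audio
```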
In a second aspect, an embodiment of the present application provides a video processing system, including: the acquisition module is configured to acquire image features and voiceprint features of a target user; a processing module configured to: based on the image characteristics and the voiceprint characteristics, acquiring a target image and target audio of a target user from a first media stream by using a diffusion network; and generating a second media stream based on the target image and the target audio.
In a third aspect, an embodiment of the present application provides an electronic device, including: a display screen, a memory, and one or more processors; the display screen and the memory are coupled with the processor; wherein the memory has stored therein computer program code comprising computer instructions which, when executed by the processor, cause the electronic device to perform the video processing method as in any of the first aspects.
In a fourth aspect, embodiments of the present application provide a computer readable storage medium comprising computer instructions which, when run on an electronic device, cause the electronic device to perform a video processing method as in any of the first aspects.
It will be appreciated that the advantages achieved by the technical solutions provided in the second aspect to the fourth aspect may refer to the advantages in any feasible implementation manner of the first aspect, and are not described herein.
Drawings
In order to more clearly illustrate the technical solution of the present application, the drawings that are needed in the embodiments will be briefly described below, and it will be obvious to those skilled in the art that other drawings can be obtained from these drawings without inventive effort.
FIG. 1 is a schematic diagram of a diffusion model;
FIG. 2 is a schematic diagram of the working principle of a diffusion model;
FIG. 3 is a schematic diagram of video playback;
FIG. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present application;
FIG. 5 is a schematic diagram of the layered architecture of the software system of an electronic device according to an embodiment of the present application;
FIG. 6 is a schematic diagram of a video processing method according to an embodiment of the present application;
FIG. 7 is a schematic flowchart of a video processing method according to an embodiment of the present application;
FIG. 8 is a schematic diagram of a target image according to an embodiment of the present application;
FIG. 9 is a schematic diagram of extracting target audio according to an embodiment of the present application;
FIG. 10 is a schematic diagram of an image feature and voiceprint feature acquisition method according to an embodiment of the present application;
FIG. 11 is a flowchart of a sample image and sample speech acquisition method according to an embodiment of the present application;
FIG. 12 is a schematic diagram of a network structure for sample image encoding according to an embodiment of the present application;
FIG. 13 is a schematic diagram of a network structure for sample speech encoding according to an embodiment of the present application;
FIG. 14 is a flowchart of a first media stream processing method according to an embodiment of the present application;
FIG. 15 is a schematic diagram of a video frame according to an embodiment of the present application;
FIG. 16 is a schematic waveform diagram of a window function according to an embodiment of the present application;
FIG. 17 is a schematic diagram of audio frame windowing according to an embodiment of the present application;
FIG. 18 is a flowchart of a method for extracting a target image and target audio according to an embodiment of the present application;
FIG. 19 is a schematic structural diagram of a diffusion network according to an embodiment of the present application;
FIG. 20 is a schematic diagram of a video processing system according to an embodiment of the present application;
FIG. 21 is a schematic diagram of a system-on-chip according to an embodiment of the present application;
FIG. 22 is a schematic structural diagram of a video processing apparatus according to an embodiment of the present application.
Detailed Description
The technical solutions of the embodiments of the present application will be clearly described below with reference to the drawings in the embodiments of the present application.
In the description of the present application, "/" means "or" unless otherwise indicated; for example, A/B may mean A or B. "And/or" herein merely describes an association between associated objects and indicates that three relationships may exist; for example, A and/or B may mean: A alone, both A and B, or B alone. Furthermore, "at least one" means one or more, and "a plurality" means two or more. The terms "first," "second," and the like do not limit the number or order of execution, and objects described as "first" and "second" are not necessarily different.
In the present application, the words "exemplary" or "such as" are used to mean serving as an example, instance, or illustration. Any embodiment or design described herein as "exemplary" or "for example" should not be construed as preferred or advantageous over other embodiments or designs. Rather, the use of words such as "exemplary" or "such as" is intended to present related concepts in a concrete fashion.
Some terms in the embodiments are first described below.
A diffusion model, also known as a diffusion network or a generative model, is a mathematical model used to describe diffusion processes in a system; it can describe the diffusive motion of molecules or, in sociology, the propagation of information, ideas, or behaviors through a population. In physics, the phenomenon of gas molecules diffusing from a high-concentration region to a low-concentration region is analogous to the loss of information that occurs when a signal is disturbed by noise. Therefore, in the specific field of image generation, a diffusion model can generate an image by learning how information is attenuated by noise and then making use of the learned pattern.
An attention mechanism (Attention) is a method that allows a model to selectively focus on important information. Its principle is to obtain an attention distribution by computing the correlation between the input and the output, and then to weight the input information with this attention distribution so as to locate features in the input information.
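For reference, a widely used concrete form of this weighting, not spelled out in the description itself, is scaled dot-product attention:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$

where $Q$, $K$, and $V$ are query, key, and value projections of the input, and $d_k$ is the key dimension.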
A convolutional neural network (CNN) is a deep learning model based on modeling the structure and function of biological neural networks. CNNs can be used to recognize features of images and/or speech, thereby enabling feature extraction from images and/or speech.
A fully connected (FC) layer is a layer structure in a CNN used to classify the features extracted by the CNN. Illustratively, the FC layer can classify a feature by applying weights to it.
Embedding refers to the process of mapping data into a low-dimensional vector space. Embedding can be used to reduce the dimensionality of data and map it to a representation vector that characterizes the embedded data, so that the electronic device can analyze and process the data's features by operating on that representation vector.
Fig. 1 is a schematic diagram of a diffusion model.
As shown in fig. 1, during the training phase the diffusion model includes a forward diffusion process (Forward Diffusion Process), which adds noise, and a reverse diffusion process (Reverse Diffusion Process), which removes noise. For example, the process from $x_0$ to $x_T$ is the forward diffusion process, and the process from $x_T$ to $x_0$ is the reverse diffusion process.
It should be appreciated that $p_\theta(x_{t-1} \mid x_t)$ shown in fig. 1 is the denoising step from $x_t$ to $x_{t-1}$ in the reverse diffusion process, and $q(x_t \mid x_{t-1})$ is the noise-adding step from $x_{t-1}$ to $x_t$ in the forward process.
In the forward diffusion process, a pure-noise signal/image is finally obtained by continuously adding noise to the clean signal/image. Specifically, during the training of each sample, Gaussian noise is randomly added over T steps to obtain a Gaussian-noise signal/image.
In the reverse diffusion process, the diffusion model predicts the amount of noise in the current sample and optimizes the model loss through a Bayesian approach; as the model loss gradually decreases, the model acquires the ability to infer $x_0$ from $x_t$ over N steps. During inference, the model predicts the noise at each step so as to estimate the current mean and variance, adds random Gaussian noise to introduce diversity, and finally obtains the desired result.
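For reference, the standard Gaussian forms of these two processes, which the description above paraphrases but does not write out, are:

$$q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t I\right), \qquad p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\!\left(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\right)$$

where $\beta_t$ is the noise schedule at step $t$, and $\mu_\theta$ and $\Sigma_\theta$ are the mean and variance predicted by the model.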
Further, in the reverse diffusion process, the diffusion model can be conditioned on feature information during denoising so as to generate an image that conforms to that feature information. The feature information is typically recognized text information.
FIG. 2 is a schematic diagram of the operation of a diffusion model.
The purpose of diffusion model training is to learn the probability distribution of the target (clean) signal by continuously adding interference, so that the diffusion model can denoise a noise image into an image with certain characteristics. In one possible implementation, referring to fig. 2, the diffusion model may process a preset or randomly generated noise image 102 step by step (each step may include denoising and generating an image) based on an externally input text prompt 101, to obtain a final image 103 corresponding to the text prompt 101.
Wherein the text prompt 101 may be a word combination or phrase/sentence composed of a plurality of keywords, such as: a smiling figure.
The randomly generated noise image 102 may be the result of the diffusion model repeatedly adding some type of noise (e.g., Gaussian noise) to an arbitrary image. The noise type is determined during training of the diffusion model and is kept consistent whenever the diffusion model is used later, so that at inference time the model can smoothly exploit the information-loss effect it learned for that noise to process the image.
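A schematic sketch of this step-by-step inference loop is given below, assuming PyTorch; `diffusion_step` is a hypothetical callable that predicts the noise at step t, and the update rule is deliberately simplified — a real implementation would use the step-dependent mean and variance described above.

```python
import torch

def generate(diffusion_step, noise_image: torch.Tensor,
             prompt_embedding: torch.Tensor, num_steps: int = 50) -> torch.Tensor:
    """Iteratively denoise a random noise image, conditioned on a prompt embedding."""
    x = noise_image
    for t in reversed(range(num_steps)):
        predicted_noise = diffusion_step(x, t, prompt_embedding)  # predict noise at step t
        x = x - predicted_noise                                   # simplified denoising update
        if t > 0:
            x = x + 0.01 * torch.randn_like(x)   # small random Gaussian term for diversity
    return x
```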
The application scenario of the embodiment of the present application is described below with reference to the accompanying drawings.
Fig. 3 is a schematic diagram of playing a video.
As shown in fig. 3, in scenarios such as video calls and conferences, other people 20 may be present around the play target 10, and the actions and/or sounds produced by the other people 20 are converted into video signals and/or audio signals by the electronic device 1, affecting the quality of the video and audio signals generated for the play target 10. Illustratively, the video signals of both the play target 10 and the other people 20 are displayed on the electronic device 1, and the sound-emitting device of the electronic device 1 plays the audio signals produced by both the play target 10 and the other people 20.
In some embodiments, the background around the play target 10 may be blurred or a virtual background may be filled in, so that the video signal generated by the other people 20 is masked and their interference with the play target 10 is reduced. However, blurring the background, filling in a virtual background, and similar approaches only process the image signal in the video; the audio signals produced by the other people 20 and collected by the electronic device 1 still remain in the video, reducing the playback quality of the video.
Therefore, the present application provides an audio and video processing method, system, and electronic device for processing the audio and video received and played by the electronic device, so as to generate and play audio and video that include the image and voice of the target user. In this way, the user can control the played audio and video through the electronic device, and the interference caused by non-target users in the audio and video is reduced.
The technical solution provided by the present application can be applied to electronic devices with an audio and video playback function. In some embodiments, the electronic device may be a mobile phone, a tablet computer, a handheld computer, a personal computer (PC), an ultra-mobile personal computer (UMPC), a netbook, a cellular phone, a personal digital assistant (PDA), an augmented reality (AR) device, a virtual reality (VR) device, an artificial intelligence (AI) device, a wearable device, a vehicle-mounted device, a smart home device, and/or a smart city device; the specific type of the electronic device is not particularly limited in the embodiments of the present application.
For example, taking an electronic device as a mobile phone as an example, fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
As shown in fig. 4, the electronic device may include a processor 110, an external memory interface 120, an internal memory 121, a universal serial bus (universal serial bus, USB) interface 130, a charge management module 140, a power management module 141, a battery 142, an antenna 1, an antenna 2, a mobile communication module 150, a wireless communication module 160, an audio module 170, a speaker 170A, a receiver 170B, a microphone 170C, an earphone interface 170D, a sensor module 180, keys 190, a motor 191, an indicator 192, a display 193, a subscriber identity module (subscriber identification module, SIM) card interface 194, a camera 195, and the like. The sensor module 180 may include, among other things, a pressure sensor, a gyroscope sensor, a barometric sensor, a magnetic sensor, an acceleration sensor, a distance sensor, a proximity light sensor, a fingerprint sensor, a temperature sensor, a touch sensor, an ambient light sensor, a bone conduction sensor, etc.
The processor 110 may include one or more processing units. For example, the processor 110 may include an application processor (AP), a modem processor, a graphics processing unit (GPU), an image signal processor (ISP), a controller, a memory, a video codec, a digital signal processor (DSP), a baseband processor, and/or a neural-network processor (NPU), etc. The different processing units may be separate devices or may be integrated in one or more processors.
The controller may be the neural hub and command center of the electronic device. The controller can generate operation control signals according to instruction operation codes and timing signals to complete the control of instruction fetching and instruction execution.
A memory may also be provided in the processor 110 for storing instructions and data. In some embodiments, the memory in the processor 110 is a cache memory. The memory may hold instructions or data that the processor 110 has just used or recycled. If the processor 110 needs to reuse the instruction or data, it can be called directly from the memory. Repeated accesses are avoided and the latency of the processor 110 is reduced, thereby improving the efficiency of the system.
In some embodiments, the processor 110 may include one or more interfaces. The interfaces may include an inter-integrated circuit (I2C) interface, an inter-integrated circuit sound (I2S) interface, a pulse code modulation (PCM) interface, a universal asynchronous receiver/transmitter (UART) interface, a mobile industry processor interface (MIPI), a general-purpose input/output (GPIO) interface, a subscriber identity module (SIM) interface, and/or a universal serial bus (USB) interface, among others.
The external memory interface 120 may be used to store computer-executable program code, which includes instructions. The external memory interface 120 may include a program storage area and a data storage area. The program storage area may store an operating system and application programs required for at least one function (such as a sound playback function or an image playback function). The data storage area may store data created during use of the electronic device (such as audio data and a phonebook). In addition, the external memory interface 120 may include one or more storage units, for example volatile memory such as dynamic random access memory (DRAM) and static random access memory (SRAM), and/or non-volatile memory (NVM) such as read-only memory (ROM) and flash memory. The processor 110 performs the various functional applications and data processing of the electronic device by executing instructions stored in the external memory interface 120 and/or instructions stored in the memory provided in the processor.
The charge management module 140 is configured to receive a charge input from a power supply device (e.g., a charger, notebook power, etc.). The charger can be a wireless charger or a wired charger. In some wired charging embodiments, the charge management module 140 may receive a charging input of a wired charger through the USB interface 130. In some wireless charging embodiments, the charge management module 140 may receive wireless charging input through a wireless charging coil of the electronic device.
The charging management module 140 may also supply power to the electronic device through the power management module 141 while charging the battery 142. The battery 142 may specifically be a plurality of batteries connected in series.
The power management module 141 is used to connect the battery 142, the charge management module 140, and the processor 110. The power management module 141 receives input from the battery 142 and/or the charge management module 140 and supplies power to the processor 110, the internal memory 121, the display 193, the camera 195, the wireless communication module 160, and the like. The power management module 141 may also be configured to monitor parameters such as battery voltage, current, battery cycle count, and battery state of health (leakage, impedance). In other embodiments, the power management module 141 may also be provided in the processor 110.
The wireless communication function of the electronic device may be implemented by the antenna 1, the antenna 2, the mobile communication module 150, the wireless communication module 160, a modem, a baseband processor, and the like.
The antennas 1 and 2 are used for transmitting and receiving electromagnetic wave signals. Each antenna in the electronic device may be used to cover a single or multiple communication bands. In some embodiments, the antenna may be used in conjunction with a tuning switch, and different antennas may also be multiplexed to increase the utilization of the antenna.
The mobile communication module 150 may provide a solution for wireless communication including 2G/3G/4G/5G, etc. applied on an electronic device. The mobile communication module 150 may receive electromagnetic waves from the antenna 1, perform processes such as filtering, amplifying, and the like on the received electromagnetic waves, and transmit the processed electromagnetic waves to the modem processor for demodulation. The mobile communication module 150 can amplify the signal modulated by the modem processor, and convert the signal into electromagnetic waves through the antenna 1 to radiate. In some embodiments, at least some of the functional modules of the mobile communication module 150 may be disposed in the processor 110. In some embodiments, at least some of the functional modules of the mobile communication module 150 may be provided in the same device as at least some of the modules of the processor 110.
The modem processor may include a modulator and a demodulator. The modulator is used for modulating the low-frequency baseband signal to be transmitted into a medium-high frequency signal. The demodulator is used for demodulating the received electromagnetic wave signal into a low-frequency baseband signal. The demodulator then transmits the demodulated low frequency baseband signal to the baseband processor for processing. The low frequency baseband signal is processed by the baseband processor and then transferred to the application processor. The application processor outputs sound signals through an audio device (not limited to the speaker 170A, the receiver 170B, etc.), or displays images or videos through the display screen 193. In some embodiments, the modem processor may be a stand-alone device. In other embodiments, the modem processor may be provided in the same device as the mobile communication module 150 or other functional module, independent of the processor 110.
The wireless communication module 160 may include a wireless fidelity (Wi-Fi) module, a Bluetooth (BT) module, a global navigation satellite system (GNSS) module, a near field communication (NFC) module, an infrared (IR) module, and the like. The wireless communication module 160 may be one or more devices integrating at least one of the above modules. The wireless communication module 160 receives electromagnetic waves via the antenna 2, performs frequency modulation and filtering on the electromagnetic wave signals, and sends the processed signals to the processor 110. The wireless communication module 160 may also receive a signal to be transmitted from the processor 110, frequency-modulate and amplify it, and convert it into electromagnetic waves for radiation via the antenna 2.
Touch sensors, also known as "touch devices". The touch sensor may be coupled to a display screen 193 such that the touch sensor and the display screen 193 form a touch screen, also referred to as a "touch screen". The touch sensor is used to monitor touch operations acting on or near it. The touch sensor may communicate the monitored touch operation to the application processor to determine the touch event type. Visual output related to the touch operation may be provided through the display 193. In other embodiments, the touch sensor may also be disposed on a surface of the electronic device other than where the display 193 is located.
The pressure sensor is used for sensing a pressure signal and can convert the pressure signal into an electric signal. In some embodiments, the pressure sensor may also be coupled to a display screen 193. Pressure sensors are of many kinds, such as resistive pressure sensors, inductive pressure sensors, capacitive pressure sensors, etc. When a touch operation is applied to the display screen 193, the electronic apparatus monitors the intensity of the touch operation according to the pressure sensor. The electronic device may also calculate the location of the touch based on the monitoring signal of the pressure sensor. In some embodiments, touch operations that act on the same touch location, but at different touch operation strengths, may correspond to different operation instructions.
In some embodiments, the electronic device may include 1 or N cameras 195, N being a positive integer greater than 1. In an embodiment of the present application, the type of camera 195 may be differentiated according to hardware configuration and physical location. For example, a camera provided on the side of the display screen 193 of the electronic device may be referred to as a front camera, and a camera provided on the side of the rear cover of the electronic device may be referred to as a rear camera; for another example, a camera with a short focal length and a large view angle may be referred to as a wide-angle camera, and a camera with a long focal length and a small view angle may be referred to as a normal camera. The focal length and the visual angle are relative concepts, and are not limited by specific parameters, so that the wide-angle camera and the common camera are also relative concepts, and can be distinguished according to physical parameters such as the focal length, the visual angle and the like.
The electronic device implements the display function through the GPU, the display screen 193, the application processor, and the like. The GPU is a microprocessor for image processing and connects the display screen 193 and the application processor. The GPU is used to perform mathematical and geometric calculations for graphics rendering. The processor 110 may include one or more GPUs that execute program instructions to generate or change display information.
The electronic device may implement the photographing function through the ISP, the camera 195, the video codec, the GPU, the display screen 193, the application processor, and the like. In the embodiment of the present application, the GPU is used in the frame-drawing process of each image frame, so that the finally displayed picture achieves a better display effect and performance.
The ISP is used to process the data fed back by the camera 195. For example, when photographing, the shutter is opened, light is transmitted to the camera's photosensitive element through the lens, and the optical signal is converted into an electrical signal; the photosensitive element transmits the electrical signal to the ISP for processing, where it is converted into an image visible to the naked eye. The ISP can also perform algorithm optimization on the noise and brightness of the image, and can optimize parameters such as the exposure and color temperature of the shooting scene. In some embodiments, the ISP may be located in the camera 195. The camera 195 is used to capture still images or video.
The digital signal processor is used to process digital signals; in addition to digital image signals, it can process other digital signals. For example, when the electronic device selects a frequency point, the digital signal processor is used to perform a Fourier transform on the frequency point energy, and so on.
The display 193 is used to display images, videos, and the like. The display 193 includes a display panel. The display panel may employ a liquid crystal display (LCD), an organic light-emitting diode (OLED), an active-matrix organic light-emitting diode (AMOLED), a flexible light-emitting diode (FLED), a Mini-LED, a Micro-LED, a Micro-OLED, a quantum dot light-emitting diode (QLED), or the like. In some embodiments, the electronic device may include 1 or N display screens 193, N being a positive integer greater than 1.
In embodiments of the application, the display 193 may be used to display an interface (e.g., desktop, lock screen interface, etc.) of the electronic device and display images in the interface from images stored in the electronic device (e.g., wallpaper, photographs, etc.), or images captured by any one or more of the cameras 195.
The SIM card interface 194 is used to connect to a SIM card. The SIM card may be inserted into the SIM card interface 194, or removed from the SIM card interface 194 to effect contact and separation with the electronic device. The electronic device may support one or more SIM card interfaces. The SIM card interface 194 may support a Nano SIM card, micro SIM card, etc. The same SIM card interface 194 may be used to insert multiple cards simultaneously. The SIM card interface 194 may also be compatible with external memory cards. The electronic equipment interacts with the network through the SIM card, so that the functions of communication, data communication and the like are realized. One SIM card corresponds to one subscriber number.
It should be understood that the connection relationship between the modules illustrated in the embodiments of the present application is only illustrative, and does not limit the structure of the electronic device. In other embodiments of the present application, the electronic device may also use different interfacing manners, or a combination of multiple interfacing manners in the foregoing embodiments.
The above-described fig. 4 is merely an exemplary illustration of the case where the electronic device is a mobile phone. If the electronic device is a tablet computer, a handheld computer, a PC, a PDA, a wearable device (e.g., a smart watch, a smart bracelet), etc., the electronic device may include fewer structures than those shown in fig. 4, or may include more structures than those shown in fig. 4, which is not a limitation of the present application.
It will be appreciated that, in general, implementing the functions of an electronic device requires software support in addition to hardware. The software system of the electronic device may employ a layered architecture, an event-driven architecture, a microkernel architecture, a microservice architecture, or a cloud architecture. In the embodiment of the present application, an Android system with a layered architecture is taken as an example to illustrate the software structure of the electronic device.
Fig. 5 is a schematic diagram of a layered architecture of a software system of an electronic device according to an embodiment of the present application. The layered architecture divides the software into several layers, each with distinct roles and branches. The layers communicate with each other through a software interface (e.g., API).
In some examples, as shown in fig. 5, in the embodiment of the present application the software of the electronic device is divided into five layers, from top to bottom: an application layer, a framework layer (or application framework layer), a system library and Android runtime, a hardware abstraction layer (HAL), and a driver layer (or kernel layer). The system library and Android runtime may also be referred to as the native framework layer or native layer.
The application layer may include a series of applications, among others. As shown in fig. 5, the application layer may include Applications (APP) such as camera, gallery, calendar, map, WLAN, bluetooth, music, video, short message, talk, navigation, instant messaging, wallpaper, etc.
In the embodiment of the present application, the application layer may also include an audio and video processing application. Based on the user's operations on the video and the image features and voiceprint features of the selected target user, the audio and video processing application can process the video and audio being played, so that the audio and video meet the user's requirements and noise interference is reduced.
In some embodiments, the audio video processing application may be a video application.
The framework layer provides an application programming interface (application programming interface, API) and programming framework for the application programs of the application layer. The application framework layer includes a number of predefined functions or services. For example, the application framework layer may include an activity manager, a window manager, a content provider, an audio service, a view system, a telephony manager, a resource manager, a notification manager, a package manager, etc., to which embodiments of the application are not limited in any way.
The window manager is used for managing window programs. The window manager can acquire the size of the display screen, judge whether a status bar exists, lock the screen, intercept the screen and the like.
The content provider is used to store and retrieve data and make such data accessible to applications. Such data may include video, images, audio, calls made and received, browsing history and bookmarks, phonebooks, etc.
The view system includes visual controls, such as controls to display text, controls to display pictures, and the like. The view system may be used to build applications. The display interface may be composed of one or more views. For example, a display interface including a text message notification icon may include a view displaying text and a view displaying a picture.
The telephony manager is for providing communication functions of the electronic device. For example, the telephony manager may manage the call state (including initiate, connect, hang-up, etc.) of the call application.
The resource manager provides various resources for the application program, such as localization strings, icons, pictures, layout files, video files, and the like.
The notification manager allows an application to display notification information in the status bar. It can be used to convey notification-type messages, which can disappear automatically after a short stay without user interaction. For example, the notification manager is used to notify that a download is complete, to give message reminders, and so on. The notification manager may also present notifications in the form of a chart or scroll-bar text in the status bar at the top of the system, such as notifications of applications running in the background, or notifications that appear on the screen in the form of a dialog window. For example, text information is prompted in the status bar, a prompt tone is emitted, the electronic device vibrates, or an indicator light blinks.
The package manager is used in the Android system to manage application packages. It allows applications to obtain detailed information about installed applications, their services, permissions, and so on. The package manager is also used to manage events such as the installation, uninstallation, and upgrade of applications.
In the embodiment of the present application, the framework layer may also include an audio/video processing service with the same functionality as the audio/video processing application. When the audio/video processing application is not present in the application layer, cannot be used, or cannot operate on the audio and video the user wants to process, the audio/video processing service performs the same actions on the corresponding audio and video as the audio/video processing application, so as to realize playback processing of the audio and video.
The system library may include a plurality of functional modules, for example: a surface manager, media libraries, OpenGL ES, SGL, etc. The surface manager is used to manage the display subsystem and provides fusion of 2D and 3D layers for multiple applications. The media libraries support playback and recording of many common audio and video formats, as well as still image files, and can support multiple audio and video encoding formats such as MPEG4, H.264, MP3, AAC, AMR, JPG, and PNG. OpenGL ES is used to implement three-dimensional graphics drawing, image rendering, compositing, layer processing, and the like. SGL is the drawing engine for 2D drawing.
The Android runtime includes a core library and the ART virtual machine, and is responsible for scheduling and management of the Android system. The core library consists of two parts: the functions that the Java language needs to call, and the core library of Android. The application layer and the application framework layer run in the ART virtual machine. The ART virtual machine executes the Java files of the application layer and the application framework layer as binary files, and is used for functions such as object lifecycle management, stack management, thread management, security and exception management, and garbage collection.
The HAL layer is an interface layer between the operating system kernel and the hardware circuitry, and its purpose is to abstract the hardware. It hides the hardware interface details of a specific platform and provides a virtual hardware platform for the operating system, making the operating system hardware-independent so that it can be ported to various platforms. The HAL layer provides standard interfaces that expose device hardware functionality to the higher-level Java API framework (i.e., the framework layer). The HAL layer contains a plurality of library modules, each of which implements an interface for a particular type of hardware component, such as an Audio HAL module, a Bluetooth HAL module, a Camera HAL module (which may also be referred to as the camera hardware abstraction module), and a Sensors HAL module (or an ISensor service).
The kernel layer is the layer between the hardware and the software. The kernel layer contains at least a display driver, a camera driver, an audio driver, a sensor driver, a battery driver, and the like; the application is not limited in this respect. The sensor driver may specifically include a driver for each sensor included in the electronic device, for example an ambient light sensor driver. For example, in response to an indication or instruction from the sensing module to obtain detection data, the ambient light sensor driver may promptly send the ambient light sensor's detection data to the sensing module.
The technical scheme provided by the embodiment of the application can be realized in the electronic equipment with the hardware architecture or the software architecture.
The following describes a video processing method according to an embodiment of the present application with reference to fig. 6 and fig. 7. Fig. 6 is a schematic diagram of a video processing method according to an embodiment of the present application, and fig. 7 is a flowchart of a video processing method according to an embodiment of the present application. As shown in fig. 6 and fig. 7, taking an electronic device as a smart phone as an example, the video processing method provided by the embodiment of the application may include:
s100: and acquiring the image characteristics and the voiceprint characteristics of the target user.
The target user is the person that a user watching the video expects to see; the image features of the target user are information characterizing the target user's visual appearance, and the voiceprint features of the target user are information characterizing the target user's voice. For example, the image features of the target user may be extracted from a person image corresponding to the target user, and the voiceprint features of the target user may be extracted from a speech segment corresponding to the target user.
Furthermore, the person image used to acquire the image features should contain a frontal view of the person with the face not obviously occluded; this facilitates extracting the target user's features and the subsequent feature recognition and matching, and reduces recognition errors caused by unclear image features. The speech segment used to acquire the voiceprint features needs to be a segment of the user's speech; to improve the precision of feature extraction and recognition, in practice the segment should contain the speech of only one user, which reduces interference during feature extraction and improves the accuracy of the voiceprint features.
In the embodiment of the present application, since a user watching the video may expect to see multiple targets, the number of target users may also be one or more, meeting the need to watch multiple targets simultaneously when the user watches the video.
It should be noted that when the video played by the electronic device comes from a media stream, if the media stream contains data generated by users other than the target user, the electronic device also plays the data generated by those other users, thereby causing interference with the target user.
By acquiring the image features and voiceprint features of the target user, the media stream received by the electronic device can be processed in the subsequent steps, reducing the interference generated by other users and improving the video viewing experience.
It should be understood that when multiple users carry out video services such as video calls and conferences through a first electronic device and a second electronic device, the first electronic device and the second electronic device may be connected through wireless communication so as to transmit media streams to each other, and each plays the media stream it receives using its display screen and audio module.
The first electronic device and the second electronic device are electronic devices for performing interaction in video services such as video call and conference, and the first electronic device and the second electronic device can generate, send and receive media streams. Correspondingly, the first electronic device can receive and play the media stream generated by the second electronic device, and the second electronic device can receive and play the media stream generated by the first electronic device.
In some embodiments, the first electronic device may also play the media stream collected by itself, and the second electronic device may also play the media stream collected by itself. The application is not limited to the source of the media stream in the electronic device.
In the video playback scenario shown in fig. 3, the media stream played by the electronic device 1 may include the play target 10 and other users 20. In some scenarios, to reduce interference, a user viewing the media stream on the electronic device 1 may want to retain only the images and voice of the play target 10 they wish to watch and remove the images and voice generated by the other users 20, so that the media stream played by the electronic device 1 contains only the play target 10. In the embodiment of the present application, the play target 10 may be the target user, and the electronic device can carry out the subsequent media stream processing by acquiring the image features and voiceprint features of the play target 10.
S200: based on the image features and the voiceprint features, a target image and target audio of a target user are acquired from the first media stream using the diffusion network.
In the process of receiving and/or playing the first media stream, the electronic equipment can separately process the audio and the video in the first media stream so as to improve the processing efficiency of the electronic equipment on different kinds of signals. In an embodiment of the present application, the first media stream may include a first video and a first audio collected in real time. The second electronic device may collect the first video and the first audio as the first media stream, and send the first media stream to the first electronic device, and after the first electronic device receives the first media stream, the first electronic device may process the first media stream to improve the playing quality of the video.
In the embodiment of the present application, the first video may be a video signal collected or received by the current electronic device, and the first audio may be an audio signal collected or received by the current electronic device. Illustratively, the first video and the first audio have the same duration, and the start timestamp of the first video is the same as that of the first audio. In this way, the first video and first audio that are received or generated simultaneously can be processed to obtain the video features of the first video and the audio features of the first audio, after which the diffusion network can perform feature processing with these features to obtain the target image according to the image features and the target audio according to the voiceprint features.
In the process of acquiring the target image and the target audio, the diffusion network can use the image features to extract images conforming to the image features from the first video of the first media stream and remove, from the extracted images, the features of other users that do not conform to the image features, so that the generated target image includes only users conforming to the image features. Similarly, the diffusion network can, based on the voiceprint features, extract audio matching the voiceprint features from the first audio of the first media stream and remove, from the extracted audio, the features of other users that do not match the voiceprint features, so that the generated target audio includes only users matching the voiceprint features.
It should be understood that the first video and the first audio carry time stamps, and the diffusion network retains these time stamps while extracting the target image and the target audio, so that after sorting by time stamp the obtained target image and target audio remain synchronized in the user's audio-visual experience.
It should be understood that when the diffusion network extracts the target image and the target audio corresponding to the target user, features in the first video and features in the first audio are deleted or retained according to how well they match the image feature and the voiceprint feature, so as to obtain the target image and the target audio of the target user.
Fig. 8 is a schematic diagram of a target image according to an embodiment of the present application.
As shown in fig. 3 and 8, taking the processing of the media stream played in the electronic device 1 as an example, when the target user is the play target 10, the electronic device 1 may extract the target image corresponding to the play target 10 from the media stream by using the image features of the play target 10. As shown in fig. 8, other users 20 in the target image are removed, and only the picture of the play target 10 is reserved, thereby achieving the purpose of acquiring the target image corresponding to the play target 10.
Fig. 9 is a schematic diagram of extracting target audio according to an embodiment of the present application.
As shown in fig. 9, when audio signals generated by a plurality of users are present in the first media stream, a user viewing the media stream through the electronic device is affected by the plurality of different audio signals. Because of differences in speech content and vocalization among users, there are certain differences among the audio signals generated by the plurality of users, and the audio signals in the first media stream become cluttered, which is not conducive to improving the audio-visual experience of the user. In the embodiment of the application, the voiceprint feature of the target user is determined, and extraction is performed on the first media stream using the voiceprint feature, so as to obtain target audio that only contains the voiceprint feature, thereby reducing interference in the audio signal and improving the clarity of the audio.
It should be understood that the audio signal generated by the actual user is usually an aperiodic signal, and the frequency and amplitude of the audio signal are different from those of the waveform shown in fig. 9, and the audio signal shown in fig. 9 is only an example, and the waveform of the audio signal generated by the user is not particularly limited in the embodiment of the present application.
In some embodiments of the present application, in addition to the audio signal generated by the user, there may be an audio signal corresponding to the environmental sound and noise generated during the acquisition or even transmission process in the first media stream, where the frequency, amplitude, etc. of the waveform of the audio signal are different from those of the audio signal generated by the user. Therefore, in the process of extracting the target audio, the audio signals corresponding to the environmental sound, the noise and the like can be removed together by extracting the target audio, so that the definition of the target audio is improved.
S300: a second media stream is generated based on the target image and the target audio.
After the target image and the target audio are obtained, the corresponding time stamp information is retained in the target image and the target audio, and the target image and the target audio can be sorted using the time stamps, so that the picture of the target image and the sound of the target audio are synchronized, and the second media stream is generated. In the embodiment of the application, the electronic device can play the second media stream through components such as the display screen and the loudspeaker while generating the second media stream, thereby completing the processing and playing of the first media stream.
In this way, the first media stream received or generated by the electronic device can be processed, audio and video generated by other users except the target user in the first media stream are filtered, the influence of the audio and video generated by other users on the playing of the first media stream is reduced, and the audio-visual experience of the user is improved.
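As a non-authoritative illustration of step S300, the following Python sketch pairs the generated target images and target audio frames by their retained time stamps into a second media stream; the (timestamp, payload) tuple layout and the dictionary container are assumptions made only for illustration, not the format actually used by the electronic device.

    def build_second_media_stream(target_images, target_audio_frames):
        """Minimal sketch: each element is assumed to be a (timestamp_ms, payload) tuple."""
        video_track = sorted(target_images, key=lambda item: item[0])
        audio_track = sorted(target_audio_frames, key=lambda item: item[0])
        # Sorting both tracks by the retained time stamps keeps picture and sound aligned.
        return {"video": video_track, "audio": audio_track}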
Fig. 10 is a schematic diagram of an image feature and voiceprint feature acquiring manner according to an embodiment of the present application.
In some embodiments of the present application, the electronic device may obtain the image feature and the voiceprint feature of the target user through the image and the voice of the user stored in the electronic device, or may acquire the image feature and the voiceprint feature in other manners. As illustrated in fig. 7 and 10, the process of acquiring the image feature and the voiceprint feature of the target user in step S100 may include:
S110: and acquiring a sample image and sample voice of the target user.
Wherein the sample image and sample voice are images and voices for identifying the user. Specifically, the sample image may be a front image of a user, and the sample image has a facial image which is relatively complete and can embody facial features of the user; the sample speech may be speech uttered by a user speaking normally. In the embodiment of the application, each sample image has a corresponding sample voice, and the sample image and the sample voice which correspond to each other correspond to the same user.
It should be appreciated that the sample image and sample speech are used to extract image features and voiceprint features of the target user, thereby enabling content extraction of the first media stream in a subsequent step.
In the embodiment of the application, the process of selecting the sample image and the sample voice is the process of selecting the target user. For example, the sample image and sample voice may be images and audio stored in an electronic device for identifying a user. When the target user is selected, the electronic device can respond to the selection of the sample image by the user to determine the sample image and the corresponding sample voice, so that the determination process of the target user is completed.
Fig. 11 is a flowchart of a sample image and sample voice acquiring method according to an embodiment of the present application.
In some embodiments, the user may select the target user by clicking on the content displayed on the display while the electronic device plays the first media stream. As shown in fig. 11, the process of acquiring the sample image and the sample voice may further include:
s111: a first video frame is acquired from a first media stream.
It should be appreciated that the first media stream is a media stream generated or received by an electronic device, which may play the first media stream; the first video frame may be a frame of an image in the first media stream, the first video frame including an image of at least one user.
After obtaining the first media stream, the electronic device may obtain a first video frame in the first media stream. In the process of acquiring the first video frame, it is required to determine whether the user is included in the video frame, and only the video frame including the image of at least one user may be selected as the first video frame.
The first video frame may be a video frame randomly selected by the electronic device in the first media stream, or may be a video frame including at least one user in the first media stream, and the method for acquiring the first video frame is not limited in the present application.
S112: a target user in a first video frame is determined.
After the first video frame is acquired, the target user may be determined by the user in the first video frame. For example, after the first video frame is acquired, the electronic device may display the first video frame on the display screen, and further select the target user by receiving an instruction from the user.
In some embodiments, the electronic device may further determine whether the user speaks through facial features of the user in the first video frame, thereby selecting the user who is speaking as the target user. The method and the device for determining the target user are not limited in the embodiment of the application.
S113: an image of a target user is extracted from a first video frame as a sample image.
After the target user is determined, an image of the target user may be extracted in the first video frame as a sample image. In the process of extracting the image of the target user, the image of the target user can be obtained as a sample image by extracting the image in the outline after the outline of the target user is obtained. It should be understood that the above method is only one possible method for selecting a sample image according to the present application, and the present application is not limited to the manner in which the sample image is extracted from the first video frame.
S114: a speech segment of a target user is extracted from a first media stream as a sample speech.
After the target user is determined, a speech segment in which the target user opens his or her mouth and speaks can be selected from the first media stream, so as to obtain the sample voice. It should be appreciated that the speech segment has a certain length, e.g. the speech segment may comprise a complete sentence spoken by the target user, so that in a subsequent step the voiceprint feature of the target user can be derived from the sample voice.
It should be noted that the execution sequence of step S113 and step S114 shown in fig. 11 is only one possible implementation: step S114 may be performed before step S113, or step S113 and step S114 may be performed simultaneously after step S112. The execution sequence of step S113 and step S114 is not limited in the present application.
S120: the sample image and sample speech are encoded separately to obtain image features and voiceprint features.
After the sample image and the sample voice are obtained, as shown in fig. 7, the sample image and the sample voice may be encoded, so as to obtain the image feature and the voiceprint feature corresponding to the target user, respectively.
Fig. 12 is a schematic diagram of a network structure of sample image encoding according to an embodiment of the present application.
As shown in fig. 12, the process of encoding the sample image to obtain the image feature may be performed using a CNN. The CNN is provided with two layers of Attention between the CNN2d used for convolution and the FC used for feature classification, so that the extracted features are located and screened through the attention mechanism, improving the accuracy of feature extraction and recognition.
Illustratively, after the CNN receives the sample image, CNN2d may convolve the sample image, thereby extracting features in the sample image. After CNN2d extracts the features, the two Attention layers sequentially locate and screen the features, retaining those that reflect the characteristics of the target user and improving the recognition probability of the image feature. After the extracted features are screened, the FC is used to classify the features and obtain the weights of the screened features, so that the features are weighted and the image feature vector generated in the subsequent step better represents the characteristics of the target user.
After the weighted features are obtained, the weighted features are mapped to a token vector through embedding, thereby obtaining an image feature vector as the image feature of the target user.
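As a hedged sketch of the encoder just described, the following PyTorch code stacks a 2-D CNN, two attention layers, an FC weighting step and an embedding projection; the layer sizes, the number of attention heads and the 128-dimensional output are assumptions of this sketch, not values specified in the present application.

    import torch
    import torch.nn as nn

    class ImageFeatureEncoder(nn.Module):
        """Sketch: CNN2d -> Attention x2 -> FC weighting -> embedding."""
        def __init__(self, embed_dim=128):
            super().__init__()
            # CNN2d: convolve the sample image to extract local features
            self.cnn = nn.Sequential(
                nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
                nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d((16, 16)),
            )
            # Two Attention layers locate and screen the extracted features
            self.attn1 = nn.MultiheadAttention(64, num_heads=4, batch_first=True)
            self.attn2 = nn.MultiheadAttention(64, num_heads=4, batch_first=True)
            # FC produces per-feature weights; embedding maps to the feature vector
            self.fc = nn.Linear(64, 64)
            self.embedding = nn.Linear(64, embed_dim)

        def forward(self, image):                        # image: (B, 3, H, W)
            x = self.cnn(image)                          # (B, 64, 16, 16)
            x = x.flatten(2).transpose(1, 2)             # (B, 256, 64) feature tokens
            x, _ = self.attn1(x, x, x)
            x, _ = self.attn2(x, x, x)
            weights = torch.softmax(self.fc(x), dim=1)   # weight the screened features
            x = (x * weights).sum(dim=1)                 # (B, 64)
            return self.embedding(x)                     # image feature vector (B, embed_dim)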
Fig. 13 is a schematic diagram of a network structure of a sample speech coding according to an embodiment of the present application.
As shown in fig. 13, the process of encoding the sample speech to obtain the voiceprint feature may also be performed using CNN, but there is a certain difference in structure between the CNN for obtaining the voiceprint feature and the CNN for obtaining the image feature.
In an embodiment of the application, the CNN used for voiceprint feature extraction includes CNN2d for convolution, Attention, FC, and a pooling layer. In this way, the features can be selected using the Attention and the pooling layer, so as to improve the accuracy with which the electronic device recognizes the voiceprint feature.
Illustratively, after receiving the sample speech, CNN2d may convolve the sample speech to extract features in the sample speech. After CNN2d extracts the features, the Attention mechanism locates and screens the extracted features, so as to obtain features that embody the sound characteristics of the target user and improve the recognition probability of the voiceprint feature. After the extracted features are screened, the FC is used to classify the features and obtain the weights of the screened features, so that the features are weighted and the voiceprint feature vector generated in the subsequent step better represents the characteristics of the target user. The pooling layer then reduces the dimensionality of the features while keeping them essentially unchanged, so as to compress the size of the obtained features.
The feature output by the pooling layer is input into embedding for processing, and the pooled feature can be mapped to a characterization vector through embedding, so that a voiceprint feature vector is obtained to serve as the voiceprint feature of the target user.
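A corresponding hedged sketch of the voiceprint encoder variant is given below: the same CNN2d/Attention/FC front end, followed by a pooling layer that compresses the features before the embedding. Treating the sample speech as a single-channel spectrogram input, and the specific layer sizes, are assumptions made for illustration.

    import torch
    import torch.nn as nn

    class VoiceprintEncoder(nn.Module):
        """Sketch: CNN2d -> Attention -> FC weighting -> pooling -> embedding."""
        def __init__(self, embed_dim=128):
            super().__init__()
            self.cnn = nn.Sequential(                    # convolve the speech spectrogram
                nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
                nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            )
            self.attn = nn.MultiheadAttention(64, num_heads=4, batch_first=True)
            self.fc = nn.Linear(64, 64)                  # per-feature weights
            self.pool = nn.AdaptiveAvgPool1d(1)          # pooling layer: compress the features
            self.embedding = nn.Linear(64, embed_dim)

        def forward(self, spec):                         # spec: (B, 1, freq, time)
            x = self.cnn(spec).flatten(2).transpose(1, 2)      # (B, N, 64) feature tokens
            x, _ = self.attn(x, x, x)                          # locate and screen
            x = x * torch.softmax(self.fc(x), dim=1)           # weight the features
            x = self.pool(x.transpose(1, 2)).squeeze(-1)       # (B, 64) after pooling
            return self.embedding(x)                           # voiceprint feature vector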
It should be understood that the foregoing extraction manner of the image feature and the voiceprint feature is only one possible embodiment of the present application, and the extraction manner of the image feature and the voiceprint feature may also be other manners, which are not limited in the embodiment of the present application.
Fig. 14 is a flowchart of a first media stream processing method according to an embodiment of the present application.
In the embodiment of the present application, as shown in fig. 7 and fig. 14 (a), the first video and the first audio may be processed separately, and the processed first video and first audio may then be sent to the diffusion network to obtain the target image and the target audio.
Illustratively, the step of obtaining the target image and the target audio of the target user from the first media stream using the diffusion network based on the image features and the voiceprint features may comprise:
S210: and extracting the characteristics of the first video to obtain first video characteristics, and extracting the characteristics of the first audio to obtain first audio characteristics.
After the first media stream is obtained, the first video and the first audio may be processed separately to facilitate separate extraction of the respective features. Thus, the first video feature may be acquired through the first video and the first audio feature may be acquired through the first audio.
It should be appreciated that when performing feature extraction on the first video and the first audio, the content needs to be processed frame by frame to obtain the features of each frame of the first video and of the first audio. In some embodiments of the present application, the first video feature may comprise a plurality of video frame features and the first audio feature may comprise a plurality of audio frame features.
As shown in fig. 14 (b), the step of extracting features of the first video to obtain features of the first video includes:
s211: and carrying out framing processing on the first video to obtain a plurality of video frames.
In embodiments of the present application, the first video may be composed of a plurality of video frames, and a single video frame may be regarded as one still image. In video, when successive images change at 24 frames per second or more, according to the principle of persistence of vision, the human eye cannot distinguish individual still images, forming a smooth, continuous visual effect.
For example, when the framing operation is performed, the fact that the first video is composed of a plurality of video frames may be used to acquire and frame the first video. The number of video frames obtained per second by framing may be less than or equal to the frame rate of the first video, which reduces the time spent computing and splitting video frames and makes it convenient for the electronic device to obtain the plurality of video frames corresponding to the first video.
S212: and respectively extracting the characteristics of each video frame to obtain the video frame characteristics corresponding to each video frame.
In the embodiment of the application, CNN can be utilized to sequentially extract the characteristics of each video frame, so as to obtain the video frame characteristics corresponding to each video frame. It should be understood that the method of extracting the video frame features in the present application is only one possible implementation, and the manner of extracting the video frame features in the present application is not limited.
After the video frame features corresponding to each video frame are obtained, the video frame features can be arranged according to the timestamp data corresponding to each video frame, so that the first video features corresponding to the first video are obtained.
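As a hedged sketch of steps S211 and S212, the following Python code decodes the first video frame by frame with OpenCV, runs an arbitrary frame-level CNN on each frame, and keeps the per-frame timestamps so the frame features can be arranged into the first video feature; the use of OpenCV and the normalization details are assumptions for illustration, not requirements of the present application.

    import cv2
    import torch

    def extract_first_video_features(video_path, frame_cnn):
        """Frame the first video and extract a feature for each frame, keeping timestamps."""
        cap = cv2.VideoCapture(video_path)
        frame_features = []
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            ts_ms = cap.get(cv2.CAP_PROP_POS_MSEC)          # timestamp of this video frame
            x = torch.from_numpy(frame).permute(2, 0, 1).float().unsqueeze(0) / 255.0
            with torch.no_grad():
                feat = frame_cnn(x)                         # video frame feature
            frame_features.append((ts_ms, feat))
        cap.release()
        frame_features.sort(key=lambda item: item[0])       # arrange by timestamp
        return frame_features                               # first video feature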
Fig. 15 is a schematic diagram of a video frame according to an embodiment of the present application.
In the embodiment of the present application, the image feature is information that represents the person characteristics of the target user, so the image feature is generally a person feature of the target user. In the process of acquiring the target image, the person features of different users in the first video are processed. Therefore, when extracting the features of each video frame, the features extracted are mainly the feature information of the persons in the video frame. For example, which features are extracted from the video frame may be adjusted by adjusting the weights that the CNN assigns to person features.
Taking the video in fig. 3 as an example, fig. 15 may be one frame image in the video shown in fig. 3. As shown in fig. 15, the video frame includes a playing target 10, other users 20 and a background (not shown in the figure), and in the process of extracting the features, the weights of the features in the corresponding areas of the playing target 10 and the other users 20 can be increased, so that the obtained features of the video frame mainly represent the features of the users, thereby facilitating the subsequent training process.
In some embodiments of the present application, features corresponding to the background in the video frame may be retained, and in the process of generating the target image through the diffusion network, the background of the generated target image may be constructed by using the features corresponding to the background, so as to enrich the picture of the target image and improve the viewing experience of the user.
Accordingly, as shown in fig. 14 (c), the step of extracting features of the first audio to obtain first audio features includes:
S213: and framing the first audio to obtain a plurality of first audio frames.
It should be appreciated that the audio signal from the user in the first audio is not stationary over a longer period of time; the audio signal changes continuously as the user speaks, so the first audio changes as the audio signal changes. However, when the user speaks, the frequency of the user's mouth movements is low relative to the frequency of the generated audio signal, so the characteristics of the audio signal remain essentially unchanged, i.e. relatively stable, over a short time, and the audio signal can be regarded as a quasi-stationary process. The audio signal in the first audio and the audio signal in the first audio frame refer to signals formed by speech uttered by the user. The present application does not limit the type of audio signal in the first audio.
In the embodiment of the application, the first audio is subjected to framing processing to obtain a plurality of first audio frames, so that the characteristics of the first audio can be conveniently extracted in the subsequent steps.
Unlike the principle of forming a video by continuously playing a plurality of video frames according to the persistence of vision, the length of an audio frame in audio is determined by the number of sampling points and the sampling frequency of the audio, and the number of sampling points of the audio depends on the encoding mode of the audio. The length of the first audio frame in the first audio is thus influenced by the coding scheme of the first audio and the sampling frequency. In an embodiment of the present application, the length of the first audio frame may be a quotient between the number of sampling points and the sampling frequency.
Illustratively, if the first audio is encoded in advanced audio coding (Advanced Audio Coding, AAC), the number of sampling points is typically 1024, calculated at the sampling frequency of 48kHz common to audio, the length of the audio frame being about 21.3ms. It should be understood that the encoding manner and sampling frequency variation of the first audio may have an effect on the length of the audio frame, and thus the length of the audio frame is not limited in the present application.
Therefore, the first audio can be framed according to the length of the first audio frame, so that each first audio frame can include the feature information of the user, avoiding incomplete feature extraction caused by first audio frames that are too short.
In the embodiment of the application, a certain overlap exists between adjacent first audio frames, and the overlap may range from 1/2 to 2/3 of the length of a first audio frame, so as to increase the continuity between adjacent first audio frames, avoid abrupt changes in the characteristic parameters after frequency-domain conversion caused by poor continuity, and improve the efficiency of subsequent feature extraction. The above overlap range between adjacent first audio frames is only one possible implementation of the present application, and the overlap range between adjacent first audio frames is not limited in the present application.
It should be appreciated that syllable transition regions, where the amplitude of the audio signal jumps, are often present in the audio signal generated by a user speaking. In the process of framing, if the jump region is located inside a first audio frame, feature extraction and subsequent operations can be performed on it directly. However, if two adjacent first audio frames do not overlap and the amplitude jump region of the audio signal falls between the two first audio frames, the continuity of the adjacent first audio frames is poor, which increases the possibility of abrupt changes in the characteristic parameters.
Therefore, in the embodiment of the application, adjacent first audio frames are made to overlap when dividing the audio signal. Taking a first audio frame length of 20 ms as an example, the displacement between the next first audio frame and the previous one, that is, the frame shift, may be 10 ms. In this way, two adjacent first audio frames have half a frame of overlap, improving the continuity between first audio frames. It should be noted that the frame shift in the above embodiment is only one possible implementation of the present application; in the embodiment of the present application, the frame shift may also take other values, which is not limited herein.
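A minimal sketch of this framing scheme is given below, assuming a mono signal at 48 kHz, 20 ms frames and a 10 ms frame shift; all of these values are only examples.

    import numpy as np

    def frame_first_audio(signal, sample_rate=48000, frame_ms=20, shift_ms=10):
        """Split the first audio into overlapping first audio frames."""
        frame_len = int(sample_rate * frame_ms / 1000)      # samples per frame (960 at 48 kHz)
        frame_shift = int(sample_rate * shift_ms / 1000)    # 10 ms shift -> 50% overlap
        frames = [signal[start:start + frame_len]
                  for start in range(0, len(signal) - frame_len + 1, frame_shift)]
        return np.stack(frames)                             # (num_frames, frame_len)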
S214: and performing time domain windowing on each first audio frame to obtain a plurality of second audio frames.
After the framing operation, a plurality of first audio frames may be obtained, and since the audio signal in the first audio is usually an aperiodic signal, the waveform of the audio in the first audio frame obtained by clipping is also usually an aperiodic signal. After the first audio frame is obtained, a window function can be added to the first audio frame in the time domain, so that the first audio frame meets the requirement of Fourier transform, frequency spectrum leakage caused by direct Fourier transform of the first audio frame is reduced, and the recognition rate of audio signals in the first audio frame is improved.
It should be appreciated that the windowing operation multiplies the first audio frame by a window function in the time domain to reduce the truncation effect of the first audio frame and increase the continuity at its left and right ends, so that the two ends do not change abruptly; instead, the waveform of the first audio frame falls smoothly to 0 at both ends.
The purpose of the windowing function in the time domain is to respectively perform specific unequal weighting on a plurality of first audio frames after framing, highlight the central part of the time domain waveform of the first audio frames, and suppress the fluctuation at the two ends of the time domain waveform of the first audio frames, so that the frequency spectrum leakage at the two ends of the first audio frames is reduced.
Fig. 16 is a waveform schematic diagram of a window function according to an embodiment of the present application, and fig. 17 is a schematic diagram of audio frame windowing according to an embodiment of the present application.
By way of example, common window functions may include rectangular windows, hamming (hamming) windows, hanning (hanning) windows, and the like. Because a window function is essentially a time-domain truncation function applied to a signal, and framing the audio is performed in the time domain of the signal, the process of obtaining the first audio frames from the first audio can also be regarded as a windowing operation.
It should be noted that, in the framing process, the signal is only truncated and no specific unequal weighting is applied to the first audio frame, so the obtained first audio frame is still identical to the original signal at the corresponding position in the first audio; at this time, the weight of the first audio frame may be regarded as 1.
In a common window function, the weight of the rectangular window may be regarded as 1, so that when the first audio is framed to obtain a plurality of first audio frames, each first audio frame may be regarded as having a rectangular window function. Since the value of the rectangular window is 1, the waveform of the first audio frame in the time domain will not change, and when the first audio frame is an aperiodic signal, fourier transforming the first audio frame still has the risk of spectrum leakage.
In the embodiment of the present application, the first audio frame is windowed, and any one of window functions such as hamming window and hanning window may be used. And after the window functions are respectively added to the plurality of first audio frames, a plurality of second audio frames are obtained, and compared with the first audio frames, the second audio frames have smaller amplitudes at two ends of the time domain waveform so as to reduce frequency spectrum leakage.
Illustratively, the expression for the hanning window may be:
w(t) = 0.5 × (1 − cos(2πt / T)), 0 ≤ t ≤ T
where t is time and T is the window width of the window function.
As shown in fig. 16, a waveform diagram of a hanning window is shown, in which the horizontal axis represents time and the vertical axis represents amplitude. From the above expression of the hanning window, the amplitude at both ends of the hanning window function waveform is close to 0, while the waveform amplitude in the middle region is close to 1. In the embodiment of the application, the waveforms of other window functions, such as a hamming window function, are similar to those of a hanning window function, and are functions with the amplitude of a middle area close to 1 and the amplitude of two end areas close to 0 in the window range, so that the frequency spectrum leakage at the two ends of an audio frame can be reduced after the windowing operation is completed.
As shown in fig. 17, the second audio frame corresponding to the first audio frame may be obtained by multiplying the waveform of the first audio frame by the window function of the hanning window. Fig. 17 (a) shows a waveform diagram of a first audio frame, in which the horizontal axis represents time and the vertical axis represents amplitude. It should be understood that the waveform of the audio signal in the first audio frame shown in fig. 17 (a) is only an example, and the actual waveform may be different from fig. 17 (a); the present application is not limited herein.
As shown in fig. 17 (b), the waveform of the second audio frame obtained after the first audio frame corresponding to fig. 17 (a) is windowed is shown. Wherein the window function applied in the windowing operation is the hanning window function shown in fig. 16. After windowing, the amplitude of the second audio frame is gradually reduced from the middle to the two ends, so that when features in the second audio frame are extracted through Fourier transformation, frequency spectrum leakage is reduced, and the accuracy of feature extraction is improved. It should be understood that the window function waveforms in fig. 16 and 17 and the waveforms of the first audio frame and the second audio frame are all exemplary descriptions in the present application, and the waveforms in practical application may be different from those shown in the present application, which is not limited thereto.
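A minimal sketch of step S214, assuming the framed signal from the previous sketch and a Hanning window (a Hamming or other window could be substituted), is shown below.

    import numpy as np

    def window_frames(frames):
        """Multiply each first audio frame by a Hanning window to get second audio frames."""
        # np.hanning samples w(t) = 0.5 * (1 - cos(2*pi*t / T)) over the frame length.
        window = np.hanning(frames.shape[1])
        return frames * window          # amplitudes taper to 0 at both frame ends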
It should be noted that the window function may also be added to the frequency domain of the first audio frame to reduce the spectrum leakage of the first audio frame on the frequency domain waveform. The kind and the setting area of the window function are not limited in the present application.
S215: and respectively carrying out short-time Fourier transform on each second audio frame to obtain the audio frame characteristics corresponding to each second audio frame.
After the second audio frames are obtained, the audio frame characteristics corresponding to each second audio frame can be obtained through characteristic extraction. In the embodiment of the application, short-time Fourier transform (Short Time Fourier Transform, STFT) is applied to analyze and transform the time domain and the frequency domain of each second audio frame, so as to obtain the audio frame characteristics of the first audio.
It will be appreciated that the process of short time fourier transform is to divide a longer time signal into shorter segments of the same length, and to calculate the fourier transform on each of the shorter segments. After the framing and windowing operations are performed on the first audio in step S213 and step S214, the obtained multiple second audio frames may use a short-time fourier transform manner to extract the features therein as the audio frame features.
After the second audio frames are obtained through windowing, fourier transform of each second audio frame can be calculated respectively, and the obtained result is data of frequency change along with time in the second audio frames. Therefore, the audio frame characteristics corresponding to the second audio frame can be obtained through the data with the frequency changing along with time, and the process of extracting the characteristics of the first audio is completed.
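The per-frame Fourier transform of step S215 can be sketched as follows; taking the magnitude spectrum as the audio frame feature is an assumption made for illustration, and other spectral representations could be used instead.

    import numpy as np

    def stft_features(windowed_frames):
        """Fourier-transform each second audio frame and stack the spectra over time."""
        spectra = np.fft.rfft(windowed_frames, axis=1)   # per-frame frequency content
        return np.abs(spectra)                           # (num_frames, frame_len // 2 + 1)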
It should be understood that, in the embodiment of the present application, step S211 is performed before step S212, and steps S213 to S215 are performed in the order of steps S213, S214, and S215. Steps S211 to S212 may be performed before steps S213 to S215, may be performed after them, or may be performed simultaneously with them; the execution order between the two groups of steps S211 to S212 and S213 to S215 is not limited. Referring to fig. 7, in some embodiments of the present application, the two groups of steps, step S211 to step S212 and step S213 to step S215, may be performed simultaneously, and the obtained results are output to the diffusion network.
S220: based on the image features and the voiceprint features, a target image is acquired from the first video features and target audio is acquired from the first audio features using a diffusion network.
After the first video feature and the first audio feature are obtained, the first audio feature and the first video feature can be trained and denoised by utilizing the diffusion network with reference to the image feature and the voiceprint feature, so that a target image corresponding to the image feature and a target audio corresponding to the voiceprint feature are obtained.
It should be appreciated that the purpose of the diffusion network to utilize the image features and the voiceprint features to denoise is to extract the corresponding target image and target audio in the first video and first audio. The diffusion network removes portions of the first video and the first audio that do not match the image features and the voiceprint features when denoising using the image features and the voiceprint features, to obtain the target image and the target audio.
In some embodiments of the present application, the first video feature may include, in addition to the person features of the users, features of the users' surroundings. When the target image is acquired from the first video feature, the content that does not match the image feature can be removed directly, and a target image with an entirely new background can then be obtained through diffusion network processing. Alternatively, the person features of the users can be identified while extracting the first video feature, so that person features and background features are distinguished; after the features of other users in the video frame are removed using the image feature, the background features in the first video are used to fill in the background, so that a target image with the same background as the first video is obtained.
The foregoing is an exemplary way of generating the background of the target image in the present application; the background may also be obtained by the electronic device adding a virtual background to the target image after the diffusion network outputs the target image.
Fig. 18 is a flow chart of a method for extracting a target image and a target audio according to an embodiment of the present application, and fig. 19 is a structural diagram of a diffusion network according to an embodiment of the present application.
In some embodiments of the present application, the image features and the voiceprint features may be further feature fused prior to input to the diffusion network, thereby enabling the diffusion network to process the first video and the first audio. As shown in fig. 18, the process of acquiring the target image and the target audio using the diffusion network may include:
S221: and fusing the image features and the voiceprint features to obtain fusion features.
First, after the image feature and the voiceprint feature are acquired in step S120, the image feature and the voiceprint feature may be processed and then input into the diffusion network.
As shown in fig. 7 and 19, a transformer is further provided before the input of the diffusion network, and the image feature and the voiceprint feature can be input into the transformer respectively, so that feature fusion is achieved through the transformer and the fusion feature is obtained. Illustratively, a plurality of encoding modules (encoders) and decoding modules (decoders) may be provided in the transformer to process the input features.
Further, as shown in fig. 14, the transformer may further include an Attention mechanism to allocate weights to the fused features, so as to locate the features and improve the recognition efficiency of the features. Therefore, the feature recognition efficiency of the fusion features can be improved, and the speed of the diffusion network for extracting images and audios according to the image features and the voiceprint features can be improved.
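As a hedged sketch of step S221, the code below fuses the image feature vector and the voiceprint feature vector with a small encoder-decoder transformer whose attention layers weight the two inputs; treating the two vectors as a two-token sequence and using a learned fusion query are assumptions of this sketch, not details fixed by the present application.

    import torch
    import torch.nn as nn

    class FeatureFusion(nn.Module):
        """Sketch: fuse the image feature and voiceprint feature into one fusion feature."""
        def __init__(self, dim=128):
            super().__init__()
            self.transformer = nn.Transformer(
                d_model=dim, nhead=4,
                num_encoder_layers=2, num_decoder_layers=2,
                batch_first=True,
            )
            self.query = nn.Parameter(torch.zeros(1, 1, dim))   # learned fusion query (assumed)

        def forward(self, image_feat, voice_feat):              # each: (B, dim)
            src = torch.stack([image_feat, voice_feat], dim=1)  # (B, 2, dim) two-token sequence
            tgt = self.query.expand(image_feat.size(0), -1, -1)
            fused = self.transformer(src, tgt)                  # attention weights the two features
            return fused.squeeze(1)                             # fusion feature (B, dim)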
S222: the target image is extracted from each video frame feature separately.
In an embodiment of the present application, the inputs to the diffusion network should include at least a fusion feature, a first video feature, and a first audio feature. Before the target image is extracted by using the diffusion network, the first video feature formed by each video frame feature can be processed by using the diffusion network.
As shown in fig. 19, the diffusion network may include a feature extraction layer, a bottleneck layer, and an upsampling layer. Wherein the feature extraction layer and the upsampling layer are essentially an encoder-decoder structure to implement the processing of video frame features.
For example, the feature extraction layer may include a plurality of convolution operations and a plurality of pooling operations, and the upsampling layer may include a plurality of upsampling operations and a plurality of convolution operations. The feature extraction layer can process the video frame features and reduce the dimension, and the upsampling layer can be used for recovering the dimension of the processed video frame features.
In the embodiment of the application, the characteristic extraction process of the diffusion network can be formed by utilizing two convolution operations and one maximum pooling operation. As shown in fig. 19, the feature extraction layer of the diffusion network may include four feature extraction processes, namely, a feature extraction process a, a feature extraction process b, a feature extraction process c, and a feature extraction process d. The input of the feature extraction process a is the video frame feature, and the input of the subsequent feature extraction process b to the feature extraction process d is the output of the previous feature extraction process.
It should be understood that, in the embodiment of the present application, the pooling operation is a maximum pooling operation, and the maximum pooling operation may select a maximum value within a preset range as a value of the preset range, so as to reduce the data volume and improve the processing efficiency of the diffusion model while maintaining the characteristics unchanged.
In the embodiment of the application, the video frame characteristics processed in the diffusion network can be called a channel characteristic diagram, wherein the channel characteristic diagram is obtained according to the number of filters after the convolution operation is performed on the RGB picture. Further, each pooling operation reduces the length and width of the channel feature map to half of the original length and width, and the number of channels is doubled. For example, in a certain step of the feature extraction layer, the length and width of the channel feature map is 280×280, the number of channels is 128, and after one pooling, the length and width thereof becomes 140×140, and the number of channels becomes 256.
As shown in fig. 19, for example, the feature extraction process a is taken as the first feature extraction process, if the number of channels of the feature channel map in the feature extraction process a is 64, after four pooling operations, the number of channels of the channel map output after pooling in the feature extraction process d is 1024.
It should be appreciated that in some embodiments, each convolution operation may also reduce the length and width of the channel feature map. For example, in a certain step of the feature extraction layer, the length and width of the channel feature map are 282×282 and the number of channels is 128; after one convolution, the length and width become 280×280, and the number of channels remains unchanged at 128.
In some embodiments, the number of channels of the channel feature map may also be controlled during the convolution operation by increasing or decreasing the convolution kernel. For example, in the first convolution performed after the video frame feature is input to the diffusion network, if the number of channels of the video frame feature is 1 at this time, the number of channels of the channel feature map obtained after the convolution is 64.
In the embodiment of the present application, the length, width and number of channels of the channel feature map are only one exemplary implementation of the present application, and the length, width and number of channels of the channel feature map are not limited in the present application.
In the embodiment of the application, the channel feature map obtained through multiple convolution and pooling can represent important features in video frame features, and the number of parameters in the subsequent process is reduced so as to reduce the overall calculation amount of the diffusion network.
As shown in fig. 19, the input of the bottleneck layer is the channel feature map output by the last feature extraction process and the fusion features input to the diffusion network. A smaller convolution operation than the convolution kernel described above may be employed in the bottleneck layer to reduce the total number of parameters in the computation process.
Illustratively, after the video frame features input to the diffusion network are subjected to the feature extraction process for a plurality of times, a channel feature map with reduced length and width and increased channel number can be obtained. And then the channel feature map and the fusion feature can be input into a bottleneck layer together, so that the diffusion network can process the channel feature map by utilizing the fusion feature.
In an up-sampling layer in the diffusion network, dimensions of the channel feature map processed by the bottleneck layer can be restored, so that a target image corresponding to the image features is obtained. In the embodiment of the application, the number of times of execution of the upsampling operation is the same as the number of times of execution of the pooling operation in the previous step, so that the dimension of the second video feature is restored, and a target image corresponding to the image feature is obtained.
In the embodiment of the application, the up-sampling operation can double the length and width of the channel characteristic map and double the channel number, so that the channel characteristic map is converted into the target image with the same dimension as the characteristic dimension of the input video frame after the up-sampling operation and the convolution operation are carried out for a plurality of times.
In the embodiment of the application, the up-sampling process in the up-sampling layer corresponds to the feature extraction process one by one. As shown in fig. 19, an upsampling process a, an upsampling process b, an upsampling process c, and an upsampling process d may be included in the upsampling layer. Illustratively, the feature extraction process a is skip-connected to the upsampling process a, the feature extraction process b is skip-connected to the upsampling process b, the feature extraction process c is skip-connected to the upsampling process c, and the feature extraction process d is skip-connected to the upsampling process d.
In the embodiment of the application, the jump connection represents that the channel characteristic diagram before pooling operation in the characteristic extraction process is subjected to characteristic fusion with the channel characteristic diagram received in the up-sampling process, and then the fused channel characteristic diagram is used for convolution operation to obtain the processed channel characteristic diagram. For example, in the up-sampling process d, after the channel feature map output by the bottleneck layer is received and up-sampled, feature fusion can be performed with the channel feature map which is not subjected to pooling operation in the feature extraction process d, and then convolution operation of the up-sampling process d is performed. It should be appreciated that the operation in other upsampling processes is the same as that described above, and the present application is not described here.
When the skip connection is performed, the length and width of the channel feature map sent by the feature extraction process to the up-sampling process may not match the length and width of the channel feature map obtained by up-sampling in the up-sampling process. In this case, the channel feature map sent by the feature extraction process can be cropped so that its length and width are the same as those of the channel feature map obtained by up-sampling in the up-sampling process.
Therefore, through feature fusion, the features of different users in the video frame features can be distinguished, so that the features which are more similar to the image features of the target user are reserved, and the accuracy of the generated target image is improved.
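A hedged PyTorch sketch of the backbone described above is given below: four feature-extraction stages of two convolutions plus max pooling, a bottleneck that also receives the fusion feature, and four up-sampling stages joined to the extraction stages by skip connections. The 64 → 1024 channel progression follows the example in the text, although in this sketch the convolutions rather than the pooling change the channel count; injecting the fusion feature through a linear projection added at the bottleneck is also an assumption.

    import torch
    import torch.nn as nn

    def double_conv(in_ch, out_ch):
        """Two convolution operations, as in each feature extraction / up-sampling process."""
        return nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(),
        )

    class ConditionedUNet(nn.Module):
        def __init__(self, in_ch=1, cond_dim=128):
            super().__init__()
            chs = [64, 128, 256, 512]
            self.downs = nn.ModuleList()                 # feature extraction processes a..d
            prev = in_ch
            for ch in chs:
                self.downs.append(double_conv(prev, ch))
                prev = ch
            self.pool = nn.MaxPool2d(2)                  # max pooling halves length and width
            self.bottleneck = double_conv(512, 1024)
            self.cond_proj = nn.Linear(cond_dim, 1024)   # inject the fusion feature (assumed)
            self.ups = nn.ModuleList()                   # up-sampling processes d..a
            self.up_convs = nn.ModuleList()
            for ch in reversed(chs):
                self.ups.append(nn.ConvTranspose2d(ch * 2, ch, 2, stride=2))
                self.up_convs.append(double_conv(ch * 2, ch))
            self.out = nn.Conv2d(64, in_ch, 1)

        def forward(self, x, fusion_feat):
            skips = []
            for down in self.downs:
                x = down(x)
                skips.append(x)                          # channel map before pooling
                x = self.pool(x)
            x = self.bottleneck(x)
            x = x + self.cond_proj(fusion_feat)[:, :, None, None]
            for up, conv, skip in zip(self.ups, self.up_convs, reversed(skips)):
                x = up(x)
                x = conv(torch.cat([skip, x], dim=1))    # skip-connection feature fusion
            return self.out(x)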
S223: the target audio is extracted from each audio frame feature separately.
In the embodiment of the present application, the manner of processing the audio frame data to extract the target audio is the same as the foregoing manner of processing the video frame data, which is not described herein.
It should be noted that, in the process of extracting the target image and the target audio using the diffusion network in the embodiment of the present application, the fusion feature composed of the image feature and the voiceprint feature is used to screen the video frame features and the audio frame features: the feature data that does not match the image feature is separated out of the video frame features, and the feature data that does not match the voiceprint feature is separated out of the audio frame features, thereby obtaining a second video feature and a second audio feature that include only the target user; the target image and the target audio are then obtained from the second video feature and the second audio feature.
It should be understood that the target image obtained through step S222 is image data including only the target user, and the target audio obtained through step S223 is audio data including only the target user's voice. In this way, the diffusion network can be utilized to process the first video and the first audio in the first media stream, so as to obtain a target image and a target audio corresponding to the target user.
In some embodiments of the present application, the diffusion network may process the video frame feature and the audio frame feature at the same time, that is, step S222 and step S223 may be performed at the same time, and no sequence exists between the two steps, so that the generating time of the target image and the generating time of the target audio are consistent, and it is convenient to generate the media stream according to the target image and the target audio.
According to the technical scheme provided by the embodiment of the application, the electronic equipment can extract the data which accords with the image characteristics and the voiceprint characteristics of the target user from the first media stream by acquiring the image characteristics and the voiceprint characteristics of the target user, so that the electronic equipment plays the video and the audio which only comprise the target user, interference is reduced, and the audio-visual experience of the user is improved.
It will be appreciated that, in order to achieve the above-mentioned functions, the electronic device includes corresponding hardware structures and/or software modules for performing the respective functions. Those of skill in the art will readily appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as hardware or combinations of hardware and computer software. Whether a function is implemented as hardware or computer software driven hardware depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the embodiments of the present application.
The embodiment of the application can divide the functional modules of the electronic device according to the method example, for example, each functional module can be divided corresponding to each function, or two or more functions can be integrated in one processing module. The integrated modules may be implemented in hardware or in software functional modules. It should be noted that, in the embodiment of the present application, the division of the modules is schematic, which is merely a logic function division, and other division manners may be implemented in actual implementation.
Fig. 20 is a schematic diagram of a video processing system according to an embodiment of the present application.
Based on the video processing method provided by the embodiment of the present application, the embodiment of the present application also provides a video processing system, as shown in fig. 20, where the system includes an acquisition module 201 and a processing module 202.
Wherein the obtaining module 201 may be configured to obtain an image feature and a voiceprint feature of the target user. The processing module 202 may then be configured to obtain a target image and a target audio of the target user from the first media stream using the diffusion network based on the image feature and the voiceprint feature, and generate a second media stream based on the target image and the target audio.
In some embodiments of the present application, the obtaining module 201 and the processing module 202 may be further configured to execute the technical solution in the video processing method. Illustratively, the acquisition module 201 may be configured to perform the foregoing step S110 and the foregoing steps S111 to S114, and the processing module 202 may be further configured to perform the steps S120, S210 to S223.
The specific manner in which the respective modules of the video processing system perform operations has been described in detail in the embodiments of the video processing method above and will not be described again here. For the beneficial effects of the video processing system, reference may likewise be made to the beneficial effects of the video processing method, which are not repeated here.
The embodiment of the application also provides electronic equipment, which comprises: a display screen, a memory, and one or more processors; the display screen and the memory are coupled with the processor; wherein the memory has stored therein computer program code comprising computer instructions which, when executed by the processor, cause the electronic device to perform any of the video processing methods as provided by the foregoing embodiments. In a possible embodiment of the application, the electronic device may be a device having the structure shown in fig. 4.
Embodiments of the present application also provide a computer-readable storage medium comprising computer instructions that, when executed on an electronic device, cause the electronic device to perform any of the video processing methods provided by the foregoing embodiments.
Embodiments of the present application also provide a computer program product containing executable instructions that, when run on an electronic device, cause the electronic device to perform any of the video processing methods as provided in the previous embodiments.
Fig. 21 is a schematic structural diagram of a chip system according to an embodiment of the present application.
Embodiments of the present application also provide a chip system, as shown in fig. 21, the chip system 2100 includes at least one processor 2101 and at least one interface circuit 2102. The processor 2101 and the interface circuit 2102 may be interconnected by wires. For example, the interface circuit 2102 may be used to receive signals from other devices (e.g., a memory of an electronic apparatus). For another example, the interface circuit 2102 may be used to send signals to other devices (e.g., the processor 2101). The interface circuit 2102 may, for example, read instructions stored in a memory and send the instructions to the processor 2101. The instructions, when executed by the processor 2101, may cause the electronic device to perform the various steps described in the embodiments above. Of course, the system-on-chip may also include other discrete devices, which are not particularly limited in accordance with embodiments of the present application.
Fig. 22 is a schematic structural diagram of a video processing apparatus according to an embodiment of the present application.
In addition, embodiments of the present application provide an apparatus, which may be embodied as a chip, component or module, that may include a processor 2201, a memory 2202 and a communication module 2203 coupled by a communication bus 2204. Among other things, the processor 2201 may include one or more processing units, such as: the processor 2201 may include an application processor, a modem processor, a graphics processor, an image signal processor, a controller, a video codec, a digital signal processor, a baseband processor, and/or a neural network processor, etc. Wherein the different processing units may be separate devices or may be integrated in one or more processors. Memory 2202 is coupled to processor 2201 for storing various software programs and/or sets of instructions, and memory 2202 may include volatile memory and/or non-volatile memory. The software programs and/or sets of instructions in the memory 2202, when executed by the processor 2201, enable the apparatus to implement the method steps in the embodiments and implementations thereof described above.
The electronic device, the computer readable storage medium, the computer program product, the system, the apparatus, or the chip system provided in this embodiment are all configured to execute the corresponding method provided above, so that the benefits achieved by the electronic device, the computer readable storage medium, the computer program product, the system, the apparatus, or the chip system can refer to the benefits in the corresponding method provided above, and are not repeated herein.
It will be apparent to those skilled in the art from this description that, for convenience and brevity of description, only the above-described division of the functional modules is illustrated, and in practical application, the above-described functional allocation may be performed by different functional modules according to needs, i.e. the internal structure of the apparatus is divided into different functional modules to perform all or part of the functions described above.
In the several embodiments provided by the present application, it should be understood that the disclosed apparatus and method may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative: the division into modules or units is merely a logical functional division, and other divisions are possible in actual implementation; for example, multiple units or components may be combined or integrated into another apparatus, or some features may be omitted or not performed. In addition, the couplings or direct couplings or communication connections shown or discussed may be indirect couplings or communication connections via some interfaces, devices, or units, and may be in electrical, mechanical, or other forms.
The units described as separate parts may or may not be physically separate, and the parts shown as units may be one physical unit or a plurality of physical units, and may be located in one place or distributed over a plurality of different places. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
If the integrated units are implemented in the form of software functional units and sold or used as stand-alone products, they may be stored in a readable storage medium. Based on such understanding, the technical solution of the embodiments of the present application, in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The software product is stored in a storage medium and includes several instructions for causing a device (which may be a single-chip microcomputer, a chip, or the like) or a processor to perform all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disc, or other media capable of storing program code.
The foregoing is merely a description of specific embodiments of the present application, and the protection scope of the present application is not limited thereto; any changes or substitutions within the technical scope disclosed by the present application shall be covered by the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (11)

1. A video processing method, applied to an electronic device, the method comprising:
acquiring image features and voiceprint features of a target user;
acquiring a target image and target audio of the target user from a first media stream by using a diffusion network based on the image features and the voiceprint features; and
generating a second media stream based on the target image and the target audio.
2. The video processing method according to claim 1, wherein the acquiring image features and voiceprint features of a target user comprises:
acquiring a sample image and a sample voice of the target user; and
encoding the sample image and the sample voice, respectively, to acquire the image features and the voiceprint features.
3. The video processing method according to claim 2, wherein
the sample image and the sample voice are an image and audio, respectively, that are stored in the electronic device for identifying a user.
4. The video processing method according to claim 2, wherein the acquiring a sample image and a sample voice of the target user comprises:
acquiring a first video frame from the first media stream, wherein the first video frame comprises at least one image of a user;
determining the target user in the first video frame;
extracting an image of the target user from the first video frame as the sample image; and
extracting a voice segment of the target user from the first media stream as the sample voice.
5. The video processing method according to claim 1, wherein the first media stream comprises a first video and a first audio acquired in real time, and the acquiring a target image and target audio of the target user from the first media stream by using a diffusion network based on the image features and the voiceprint features comprises:
performing feature extraction on the first video to obtain a first video feature, and performing feature extraction on the first audio to obtain a first audio feature; and
acquiring, based on the image features and the voiceprint features, the target image from the first video feature and the target audio from the first audio feature by using the diffusion network.
6. The video processing method according to claim 5, wherein the first video feature comprises a plurality of video frame features, and the performing feature extraction on the first video to obtain a first video feature comprises:
performing framing processing on the first video to obtain a plurality of video frames; and
performing feature extraction on each of the video frames, respectively, to obtain a video frame feature corresponding to each video frame.
7. The video processing method according to claim 6, wherein the first audio feature comprises a plurality of audio frame features, and the performing feature extraction on the first audio to obtain a first audio feature comprises:
framing the first audio to obtain a plurality of first audio frames;
performing time-domain windowing on each first audio frame, respectively, to obtain a plurality of second audio frames; and
performing a short-time Fourier transform on each second audio frame, respectively, to obtain an audio frame feature corresponding to each second audio frame.
8. The video processing method according to claim 7, wherein the acquiring, based on the image features and the voiceprint features, the target image from the first video feature and the target audio from the first audio feature by using the diffusion network comprises:
fusing the image features and the voiceprint features to obtain a fusion feature; and
taking the fusion feature, the first video feature, and the first audio feature as inputs of the diffusion network, and extracting, by using the diffusion network, the target image from each video frame feature and the target audio from each audio frame feature, respectively.
9. A video processing system, comprising:
an acquisition module, configured to acquire image features and voiceprint features of a target user; and
a processing module, configured to: acquire a target image and target audio of the target user from a first media stream by using a diffusion network based on the image features and the voiceprint features; and generate a second media stream based on the target image and the target audio.
10. An electronic device, comprising: a display screen, a memory, and one or more processors; the display screen and the memory are coupled with the processor; wherein the memory has stored therein computer program code comprising computer instructions which, when executed by the processor, cause the electronic device to perform the video processing method of any of claims 1-8.
11. A computer readable storage medium comprising computer instructions which, when run on an electronic device, cause the electronic device to perform the video processing method of any of claims 1-8.
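For illustration only, the following is a minimal sketch of the per-frame feature extraction described in claims 6 and 7. The frame length, hop length, window function, and the frame_encoder callable are assumptions introduced for the example and are not specified by the claims; all function names are hypothetical.

```python
import numpy as np

def extract_video_frame_features(video_frames, frame_encoder):
    # Claim 6 (sketch): after framing the first video, extract one feature
    # vector per video frame. frame_encoder is a hypothetical callable
    # (e.g., a CNN backbone); the claims do not specify the encoder.
    return np.stack([frame_encoder(frame) for frame in video_frames])

def extract_audio_frame_features(audio, sample_rate, frame_ms=25, hop_ms=10):
    # Claim 7 (sketch): frame the first audio into first audio frames, apply a
    # time-domain window to obtain second audio frames, and take a short-time
    # Fourier transform of each second audio frame.
    frame_len = int(sample_rate * frame_ms / 1000)
    hop_len = int(sample_rate * hop_ms / 1000)
    window = np.hanning(frame_len)                      # time-domain windowing
    features = []
    for start in range(0, len(audio) - frame_len + 1, hop_len):
        first_frame = audio[start:start + frame_len]    # first audio frame
        second_frame = first_frame * window             # second audio frame
        features.append(np.fft.rfft(second_frame))      # per-frame STFT
    return np.stack(features)                           # audio frame features
```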
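Similarly, the following sketch only illustrates one way the fusion feature of claim 8 might condition a generative separation model. The use of concatenation as the fusion operation and the diffusion_network interface (a callable taking features plus a condition argument) are assumptions made for the example, not the claimed implementation.

```python
import numpy as np

def separate_target(image_features, voiceprint_features,
                    video_frame_features, audio_frame_features,
                    diffusion_network):
    # Claim 8 (sketch): fuse the image features and the voiceprint features,
    # then feed the fusion feature together with the first video feature and
    # the first audio feature into the diffusion network, which extracts the
    # target image from each video frame feature and the target audio from
    # each audio frame feature.
    fusion_feature = np.concatenate([image_features, voiceprint_features])
    target_image = diffusion_network(video_frame_features, condition=fusion_feature)
    target_audio = diffusion_network(audio_frame_features, condition=fusion_feature)
    return target_image, target_audio
```

In practice the image branch and the audio branch might be handled by separate conditional models; the single diffusion_network callable here is only a placeholder for whatever conditional generative network is used.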

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410508447.3A CN118101988A (en) 2024-04-26 2024-04-26 Video processing method, system and electronic equipment

Publications (1)

Publication Number Publication Date
CN118101988A true CN118101988A (en) 2024-05-28

Family

ID=91142450

Country Status (1)

Country Link
CN (1) CN118101988A (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110958417A (en) * 2019-12-16 2020-04-03 山东大学 Method for removing compression noise of video call video based on voice clue
CN115914517A (en) * 2021-08-12 2023-04-04 北京荣耀终端有限公司 Sound signal processing method and electronic equipment
US20230118966A1 (en) * 2022-12-16 2023-04-20 Lemon Inc. Generation of story videos corresponding to user input using generative models
CN117152283A (en) * 2023-07-28 2023-12-01 华院计算技术(上海)股份有限公司 Voice-driven face image generation method and system by using diffusion model
CN117238311A (en) * 2023-11-10 2023-12-15 深圳市齐奥通信技术有限公司 Speech separation enhancement method and system in multi-sound source and noise environment
CN117911588A (en) * 2023-12-12 2024-04-19 北京百度网讯科技有限公司 Virtual object face driving and model training method, device, equipment and medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
刘怡然 (Liu Yiran): "Exploring Diffusion Models: A Comprehensive Survey from Theory to Application", 6 January 2024 (2024-01-06) *

Similar Documents

Publication Publication Date Title
CN111179961A (en) Audio signal processing method, audio signal processing device, electronic equipment and storage medium
CN113170037B (en) Method for shooting long exposure image and electronic equipment
CN112839223B (en) Image compression method, image compression device, storage medium and electronic equipment
CN111031386A (en) Video dubbing method and device based on voice synthesis, computer equipment and medium
CN111445901A (en) Audio data acquisition method and device, electronic equipment and storage medium
CN114242037A (en) Virtual character generation method and device
CN113395441A (en) Image color retention method and device
CN117078509A (en) Model training method, photo generation method and related equipment
CN117593473B (en) Method, apparatus and storage medium for generating motion image and video
CN111341307A (en) Voice recognition method and device, electronic equipment and storage medium
CN112528760A (en) Image processing method, image processing apparatus, computer device, and medium
CN115359156B (en) Audio playing method, device, equipment and storage medium
WO2020234939A1 (en) Information processing device, information processing method, and program
CN118101988A (en) Video processing method, system and electronic equipment
WO2022078116A1 (en) Brush effect picture generation method, image editing method and device, and storage medium
CN115734032A (en) Video editing method, electronic device and storage medium
CN115083424A (en) Person analysis system, method and related device
WO2023006001A1 (en) Video processing method and electronic device
CN117170560B (en) Image transformation method, electronic equipment and storage medium
CN116193275B (en) Video processing method and related equipment
CN117764853B (en) Face image enhancement method and electronic equipment
CN116030787A (en) Age-based sound generation method and device
CN115424118B (en) Neural network training method, image processing method and device
CN116664375B (en) Image prediction method, device, equipment and storage medium
CN117499797B (en) Image processing method and related equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination