CN116781846A

CN116781846A - High-definition virtual anchor video generation method and device, electronic equipment and storage medium

Info

Publication number: CN116781846A
Application number: CN202310727610.0A
Authority: CN
Inventors: 魏舒; 聂小哲; 肖京; 周超勇; 陈远旭; 胡晶晶
Original assignee: Ping An Technology Shenzhen Co Ltd
Current assignee: Ping An Technology Shenzhen Co Ltd
Priority date: 2023-06-16
Filing date: 2023-06-16
Publication date: 2023-09-19

Abstract

The application relates to the technical field of digital medical treatment and provides a high-definition virtual anchor video generation method and device, electronic equipment, and a computer-readable storage medium. The method comprises the following steps: acquiring a target audio signal and a target video signal; performing first preprocessing on the target audio signal to obtain a preprocessed audio signal, and performing second preprocessing on the target video signal to obtain a preprocessed video signal; extracting features from the preprocessed audio signal and the preprocessed video signal, respectively, to obtain audio features corresponding to the preprocessed audio signal and video features corresponding to the preprocessed video signal; performing channel dimension fusion processing on the audio features and the video features to obtain multi-modal features; decoding the multi-modal features to obtain an initial virtual anchor video; and performing resolution improvement processing on the initial anchor video based on a pre-trained deconvolution module to obtain a high-definition virtual anchor video. With this technical scheme, the resolution of the virtual anchor video can be improved conveniently, quickly and efficiently.

Description

High-definition virtual anchor video generation method and device, electronic equipment and storage medium

Technical Field

The embodiments of the application relate to, but are not limited to, the technical field of digital medical treatment, and in particular to a high-definition virtual anchor video generation method and device, electronic equipment, and a computer-readable storage medium.

Background

At present, virtual anchors are increasingly used in the digital medical industry because of their low cost and stable working quality. A virtual anchor can promote general health knowledge and answer medical and health questions raised by users, which improves customer satisfaction. A virtual anchor is an anchor or customer service agent that uses an avatar to interact with customers through video, based on technologies such as audio processing, natural language processing and video generation. The virtual anchor can solve problems of traditional customer service agents such as high cost and unstable working quality, reduce a company's customer service cost, improve customer satisfaction, reduce the complaint rate and ensure stable working quality.

The existing virtual anchor video generation technology is mainly based on deep neural network learning: the deep neural network learns from paired audio samples and video samples, and finally generates a speaker video under the control of audio. Compared with other methods, deep-learning-based methods can generate more realistic character videos but require more training data and computing resources. In particular, when the output resolution needs to be increased, existing methods require the resolution of the training samples to be increased accordingly. This creates a demand for a large number of high-definition training samples, greatly increases the number of model parameters, makes the training cost grow exponentially, and slows down high-definition virtual anchor generation.

Disclosure of Invention

The following is a summary of the subject matter described in detail herein. This summary is not intended to limit the scope of the claims.

In order to solve the problems mentioned in the background art, the embodiment of the application provides a high-definition virtual anchor video generation method, a device, electronic equipment and a computer readable storage medium, which can conveniently, quickly and efficiently improve the resolution of the virtual anchor video and bring better medical and health experience to people.

In a first aspect, an embodiment of the present application provides a method for generating a high-definition virtual anchor video, including:

acquiring a target audio signal and a target video signal;

performing first preprocessing on the target audio signal to obtain a preprocessed audio signal, and performing second preprocessing on the target video signal to obtain a preprocessed video signal;

respectively extracting the characteristics of the preprocessed audio signal and the preprocessed video signal to obtain audio characteristics corresponding to the preprocessed audio signal and video characteristics corresponding to the preprocessed video signal;

carrying out channel dimension fusion processing on the audio features and the video features to obtain multi-mode features;

decoding the multi-modal features to obtain an initial virtual anchor video;

and performing resolution improvement processing on the initial anchor video based on a pre-trained deconvolution module to obtain a high-definition virtual anchor video.

According to some embodiments of the application, the first preprocessing the target audio signal to obtain a preprocessed audio signal includes:

performing first time equal-length interception processing on the target audio signal and the target video signal, so that the two signals cover the same time span;

and performing Mel frequency spectrum conversion processing on the target audio signal subjected to the first time equal length interception processing to obtain the preprocessed audio signal.

According to some embodiments of the application, the performing the second preprocessing on the target video signal to obtain a preprocessed video signal includes:

performing second time equal length interception processing on the target video signal and the target audio signal;

performing mask conversion processing on the target video signal subjected to the second time equal-length interception processing to obtain a mask video signal;

and combining the target video signal with the mask video signal to obtain the preprocessed video signal.
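The mask conversion and combination steps above can be illustrated with a minimal sketch. The text does not say which region is masked or how the two signals are combined; the sketch assumes that the lower half of each frame (the mouth area) is zeroed and that the masked copy is stacked with the original frames along the channel axis. The function name `build_video_input` and the frame layout `(T, H, W, 3)` are hypothetical.

```python
import numpy as np

def build_video_input(frames):
    """Mask each frame and combine it with the unmasked frames.

    frames: float array of shape (T, H, W, 3).
    Returns an array of shape (T, H, W, 6).
    """
    masked = frames.copy()
    # Assumption: the mask hides the lower half of the frame (mouth region).
    masked[:, frames.shape[1] // 2:, :, :] = 0.0
    # Assumption: "combining" means stacking along the channel axis.
    return np.concatenate([masked, frames], axis=-1)

if __name__ == "__main__":
    clip = np.random.default_rng(0).random((5, 96, 96, 3))
    print(build_video_input(clip).shape)
```

Stacking along the channel axis keeps the spatial size unchanged, so a video encoder only needs a wider first layer to accept the combined input.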

According to some embodiments of the present application, the deconvolution module includes a deconvolution layer and a plurality of stacked convolution layers, and the performing resolution improvement processing on the initial anchor video based on the pre-trained deconvolution module to obtain a high-definition virtual anchor video includes:

amplifying the initial anchor video based on the deconvolution layer to obtain a first anchor video;

and adjusting the first anchor video based on the plurality of convolution layers to obtain the high-definition virtual anchor video.
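As a rough illustration of the two steps above, the following sketch implements one deconvolution (transposed convolution) layer followed by a stack of convolution layers in plain NumPy. Kernel sizes, the stride of 2, the ReLU between convolution layers, and all function names are assumptions made for the example; the application does not specify them, and in practice the weights would come from pre-training rather than random initialization.

```python
import numpy as np

def conv_transpose2d(x, w, stride=2):
    """x: (C_in, H, W); w: (C_in, C_out, k, k).
    Output: (C_out, (H - 1) * stride + k, (W - 1) * stride + k)."""
    c_in, h, wd = x.shape
    _, c_out, k, _ = w.shape
    out = np.zeros((c_out, (h - 1) * stride + k, (wd - 1) * stride + k))
    for i in range(h):
        for j in range(wd):
            # Each input pixel contributes a k x k patch to the output.
            patch = np.einsum("c,cokl->okl", x[:, i, j], w)
            out[:, i * stride:i * stride + k, j * stride:j * stride + k] += patch
    return out

def conv2d_same(x, w):
    """x: (C_in, H, W); w: (C_out, C_in, k, k) with odd k; zero padding keeps H, W."""
    c_out, _, k, _ = w.shape
    p = k // 2
    xp = np.pad(x, ((0, 0), (p, p), (p, p)))
    h, wd = x.shape[1:]
    out = np.zeros((c_out, h, wd))
    for i in range(h):
        for j in range(wd):
            out[:, i, j] = np.einsum("ckl,ockl->o", xp[:, i:i + k, j:j + k], w)
    return out

def deconv_module(frame, w_up, conv_weights):
    """Amplify with the deconvolution layer, then adjust with the convolution layers."""
    out = conv_transpose2d(frame, w_up, stride=2)   # "first anchor video" (2x size here)
    for n, w in enumerate(conv_weights):
        out = conv2d_same(out, w)
        if n < len(conv_weights) - 1:
            out = np.maximum(out, 0.0)              # ReLU between layers (assumption)
    return out

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    frame = rng.standard_normal((3, 8, 8))
    w_up = rng.standard_normal((3, 3, 2, 2))
    convs = [rng.standard_normal((3, 3, 3, 3)) for _ in range(2)]
    print(deconv_module(frame, w_up, convs).shape)
```

With a 2x2 kernel and stride 2 the transposed convolution exactly doubles height and width, which is why the added parameter count stays small compared with retraining the whole generator at a higher resolution.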

According to some embodiments of the present application, after the performing resolution improvement processing on the initial anchor video based on the pre-trained deconvolution module to obtain a high-definition virtual anchor video, the method further includes:

performing loss calculation on the high-definition virtual anchor video based on the target video signal to obtain a video loss value;

performing parameter adjustment processing on a preset video and audio synchronous detection network based on the video loss value;

and performing fidelity detection processing on the high-definition virtual anchor video based on the video and audio synchronous detection network after parameter adjustment processing to obtain a detection result.

According to some embodiments of the present application, the performing channel dimension fusion processing on the audio feature and the video feature to obtain a multi-modal feature includes:

performing first analysis processing on the audio features to obtain a first channel number, and performing second analysis processing on the video features to obtain a second channel number;

and performing channel splicing processing on the audio features and the video features based on the first channel number and the second channel number to obtain the multi-modal features.
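A minimal sketch of the channel splicing step follows. It assumes channel-first feature maps and, since the text does not say how spatial sizes are matched, it assumes the audio feature is either already the same spatial size as the video feature or is a `(C_a, 1, 1)` vector that is broadcast over the video feature's grid. The function name `fuse_channels` is hypothetical.

```python
import numpy as np

def fuse_channels(audio_feat, video_feat):
    """audio_feat: (C_a, 1, 1) or (C_a, H, W); video_feat: (C_v, H, W).
    Returns the fused feature of shape (C_a + C_v, H, W)."""
    c_a = audio_feat.shape[0]   # first channel number
    c_v = video_feat.shape[0]   # second channel number
    # Assumption: the audio feature is broadcast over the spatial grid if needed.
    audio_map = np.broadcast_to(audio_feat, (c_a,) + video_feat.shape[1:])
    fused = np.concatenate([audio_map, video_feat], axis=0)
    assert fused.shape[0] == c_a + c_v
    return fused

if __name__ == "__main__":
    audio = np.zeros((512, 1, 1))
    video = np.zeros((512, 6, 6))
    print(fuse_channels(audio, video).shape)
```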

According to some embodiments of the present application, the performing mel-frequency spectrum conversion on the aligned target audio signal to obtain the preprocessed audio signal includes:

preprocessing the target audio signal to obtain a first audio signal;

performing Fourier transform processing on the first audio signal to obtain a target amplitude spectrum signal;

and filtering the target amplitude spectrum signal to obtain the preprocessed audio signal.
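The three steps above (preprocessing, Fourier transform, filtering) correspond to a conventional Mel spectrogram computation, sketched below in NumPy. The concrete choices here are assumptions and are not stated in the application: pre-emphasis with coefficient 0.97, Hann-windowed frames, a 16 kHz sample rate, an 800-point FFT with hop 200, and 80 triangular Mel filters. All function names are hypothetical.

```python
import numpy as np

def hz_to_mel(hz):
    return 2595.0 * np.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sample_rate):
    """Triangular filters, shape (n_mels, n_fft // 2 + 1)."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):        # rising edge
            fb[m - 1, k] = (k - left) / (center - left)
        for k in range(center, right):       # falling edge
            fb[m - 1, k] = (right - k) / (right - center)
    return fb

def mel_spectrogram(signal, sample_rate=16000, n_fft=800, hop=200, n_mels=80):
    """signal: 1-D float array with at least n_fft samples. Returns (n_mels, n_frames)."""
    # Step 1: preprocessing (pre-emphasis, framing, windowing) -> "first audio signal".
    emphasized = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    n_frames = 1 + (len(emphasized) - n_fft) // hop
    idx = np.arange(n_fft)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = emphasized[idx] * np.hanning(n_fft)
    # Step 2: Fourier transform -> amplitude spectrum, shape (n_frames, n_fft // 2 + 1).
    magnitude = np.abs(np.fft.rfft(frames, n=n_fft, axis=1))
    # Step 3: Mel filtering of the amplitude spectrum.
    return (magnitude @ mel_filterbank(n_mels, n_fft, sample_rate).T).T

if __name__ == "__main__":
    t = np.arange(16000) / 16000.0
    print(mel_spectrogram(np.sin(2 * np.pi * 440.0 * t)).shape)
```

Many pipelines additionally take a logarithm of the filtered spectrum; that step is left out here because the text above does not mention it.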

In a second aspect, an embodiment of the present application further provides a high-definition virtual anchor video generating apparatus, where the apparatus includes:

the first processing module is used for acquiring a target audio signal and a target video signal;

the second processing module is used for performing first preprocessing on the target audio signal to obtain a preprocessed audio signal and performing second preprocessing on the target video signal to obtain a preprocessed video signal;

the third processing module is used for respectively extracting the characteristics of the preprocessed audio signal and the preprocessed video signal to obtain the audio characteristics corresponding to the preprocessed audio signal and the video characteristics corresponding to the preprocessed video signal;

the fourth processing module is used for carrying out channel dimension fusion processing on the audio features and the video features to obtain multi-modal features;

the fifth processing module is used for decoding the multi-mode features to obtain an initial virtual anchor video;

and the sixth processing module is used for carrying out resolution improvement processing on the initial anchor video based on the pre-trained deconvolution module to obtain the high-definition virtual anchor video.

In a third aspect, an embodiment of the present application further provides an electronic device, including: a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the high definition virtual anchor video generation method as described in the first aspect above when executing the computer program.

In a fourth aspect, an embodiment of the present application further provides a computer readable storage medium storing computer executable instructions for performing the high definition virtual anchor video generating method according to the first aspect above.

The high-definition virtual anchor video generation method provided by the embodiment of the application has at least the following beneficial effects. In the process of generating the high-definition virtual medical anchor video, a target audio signal and a target video signal are first acquired; first preprocessing is then performed on the target audio signal to obtain a preprocessed audio signal, and second preprocessing is performed on the target video signal to obtain a preprocessed video signal; features are then extracted from the preprocessed audio signal and the preprocessed video signal, respectively, to obtain audio features corresponding to the preprocessed audio signal and video features corresponding to the preprocessed video signal; channel dimension fusion processing is then performed on the audio features and the video features to obtain multi-modal features; the multi-modal features are then decoded to obtain an initial virtual anchor video; finally, resolution improvement processing is performed on the initial anchor video based on the pre-trained deconvolution module to obtain the high-definition virtual anchor video. With this technical scheme, the deconvolution module enlarges the feature layer, so that the resolution of the virtual anchor video can be improved conveniently and efficiently with only a small number of added parameters, bringing people a better medical and health experience.

Drawings

The accompanying drawings are included to provide a further understanding of the application and constitute a part of this specification; they illustrate the application and do not limit it.

FIG. 1 is a flowchart of a high-definition virtual anchor video generation method according to one embodiment of the present application;

FIG. 2 is a flowchart of performing first preprocessing on a target audio signal in the high-definition virtual anchor video generation method according to an embodiment of the present application;

FIG. 3 is a flowchart of performing second preprocessing on a target video signal in the high-definition virtual anchor video generation method according to an embodiment of the present application;

FIG. 4 is a flowchart of performing resolution improvement processing on an initial anchor video in the high-definition virtual anchor video generation method according to an embodiment of the present application;

FIG. 5 is a flowchart of a high-definition virtual anchor video generation method according to another embodiment of the present application;

FIG. 6 is a flowchart of performing channel dimension fusion processing on audio features and video features in the high-definition virtual anchor video generation method according to an embodiment of the present application;

FIG. 7 is a flowchart of performing Mel spectrum conversion processing on a target audio signal in the high-definition virtual anchor video generation method according to an embodiment of the present application;

FIG. 8 is a schematic diagram of a high-definition virtual anchor video generating apparatus according to an embodiment of the present application;

FIG. 9 is a schematic diagram of an electronic device according to an embodiment of the present application.

Detailed Description

The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.

It should be noted that although functional block division is performed in the apparatus schematic and logical order is shown in the flowchart, in some cases, the steps shown or described may be performed in a different order than block division in the apparatus or in the flowchart. The terms first, second and the like in the description and in the claims and in the above-described figures, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order.

It is to be noted that all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs unless defined otherwise. The terminology used herein is for the purpose of describing embodiments of the application only and is not intended to be limiting of the application.

The embodiment of the application can acquire and process the related data based on artificial intelligence technology. Artificial intelligence (AI) is the theory, method, technique and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use knowledge to obtain optimal results.

AI is a new technical science that studies and develops theories, methods, techniques and application systems for simulating, extending and expanding human intelligence. It is a branch of computer science that attempts to understand the nature of intelligence and to produce new intelligent machines that can react in a manner similar to human intelligence; research in this field includes robotics, language recognition, image recognition, natural language processing and expert systems. Artificial intelligence can simulate the information processes of human consciousness and thinking.

Artificial intelligence infrastructure technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a robot technology, a biological recognition technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and other directions.

The server involved in the artificial intelligence technology may be an independent server, or a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, content delivery networks (CDN), and big data and artificial intelligence platforms.

The application provides a high-definition virtual anchor video generation method and device, electronic equipment, and a computer-readable storage medium. In the process of generating a high-definition virtual anchor video, a target audio signal and a target video signal are first acquired; first preprocessing is then performed on the target audio signal to obtain a preprocessed audio signal, and second preprocessing is performed on the target video signal to obtain a preprocessed video signal; features are then extracted from the preprocessed audio signal and the preprocessed video signal, respectively, to obtain audio features corresponding to the preprocessed audio signal and video features corresponding to the preprocessed video signal; channel dimension fusion processing is then performed on the audio features and the video features to obtain multi-modal features; the multi-modal features are then decoded to obtain an initial virtual anchor video; finally, resolution improvement processing is performed on the initial anchor video based on the pre-trained deconvolution module to obtain the high-definition virtual anchor video. With this technical scheme, the deconvolution module enlarges the feature layer, so that the resolution of the virtual anchor video can be improved conveniently and efficiently with only a small number of added parameters, bringing people a better medical and health experience.

The embodiment of the application provides a high-definition virtual anchor video generation method, which relates to the technical field of digital medical treatment. The method can be applied to a terminal, to a server, or to software running in the terminal or the server. In some embodiments, the terminal may be a smartphone, a tablet, a notebook computer, a desktop computer, or the like; the server may be configured as an independent physical server, as a server cluster or distributed system formed by a plurality of physical servers, or as a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN, and big data and artificial intelligence platforms; the software may be an application or the like that implements the high-definition virtual anchor video generation method, but is not limited to the above forms.

The application is operational with numerous general purpose or special purpose computer system environments or configurations. For example: personal computers, server computers, hand-held or portable devices, tablet devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like. The application may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The application may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.

It should be noted that, in each specific embodiment of the present application, when related processing is required according to user information, user behavior data, user history data, user location information, and other data related to user identity or characteristics, permission or consent of the user is obtained first, and the collection, use, processing, and the like of the data comply with related laws and regulations and standards. In addition, when the embodiment of the application needs to acquire the sensitive personal information of the user, the independent permission or independent consent of the user is acquired through popup or jump to a confirmation page and the like, and after the independent permission or independent consent of the user is definitely acquired, the necessary relevant data of the user for enabling the embodiment of the application to normally operate is acquired.

Embodiments of the present application will be further described below with reference to the accompanying drawings.

As shown in fig. 1, fig. 1 is a flowchart of a high-definition virtual anchor video generation method according to an embodiment of the present application, including but not limited to steps S100 to S600.

Step S100, acquiring a target audio signal and a target video signal;

Step S200, performing first preprocessing on the target audio signal to obtain a preprocessed audio signal, and performing second preprocessing on the target video signal to obtain a preprocessed video signal;

Step S300, extracting features from the preprocessed audio signal and the preprocessed video signal, respectively, to obtain audio features corresponding to the preprocessed audio signal and video features corresponding to the preprocessed video signal;

Step S400, performing channel dimension fusion processing on the audio features and the video features to obtain multi-modal features;

Step S500, decoding the multi-modal features to obtain an initial virtual anchor video;

Step S600, performing resolution improvement processing on the initial anchor video based on the pre-trained deconvolution module to obtain the high-definition virtual anchor video.

In the process of generating the high-definition virtual medical anchor video, a target audio signal and a target video signal are first acquired; first preprocessing is then performed on the target audio signal to obtain a preprocessed audio signal, and second preprocessing is performed on the target video signal to obtain a preprocessed video signal; features are then extracted from the preprocessed audio signal and the preprocessed video signal, respectively, to obtain audio features corresponding to the preprocessed audio signal and video features corresponding to the preprocessed video signal; channel dimension fusion processing is then performed on the audio features and the video features to obtain multi-modal features; the multi-modal features are then decoded to obtain an initial virtual anchor video; finally, resolution improvement processing is performed on the initial anchor video based on the pre-trained deconvolution module to obtain the high-definition virtual anchor video. With this technical scheme, the deconvolution module enlarges the feature layer, so that the resolution of the virtual anchor video can be improved conveniently and efficiently with only a small number of added parameters, bringing people a better medical and health experience.

It is worth noting that, when a virtual anchor is currently used to answer medical questions raised by users or to deliver medical and health information to them, the resolution of the virtual anchor is low and cannot meet users' demand for high-resolution video. The embodiment of the application therefore first acquires a target audio signal and a target video signal; then performs first preprocessing on the target audio signal to obtain a preprocessed audio signal, and second preprocessing on the target video signal to obtain a preprocessed video signal; then extracts features from the preprocessed audio signal and the preprocessed video signal, respectively, to obtain audio features corresponding to the preprocessed audio signal and video features corresponding to the preprocessed video signal; then performs channel dimension fusion processing on the audio features and the video features to obtain multi-modal features; then decodes the multi-modal features to obtain an initial virtual anchor video; and finally performs resolution improvement processing on the initial anchor video based on a pre-trained deconvolution module to obtain a high-definition virtual anchor video. Because the deconvolution module enlarges the feature layer, the resolution of the virtual anchor video can be improved conveniently, quickly and efficiently with only a small number of added parameters, which meets users' demand for high-resolution virtual anchors in the digital medical industry. For example, in an online consultation scenario, the virtual anchor can answer medical and health questions raised by users, popularize general medical and health knowledge, or broadcast recent medical and health events, which brings great convenience and a good experience to users.

It will be appreciated that a virtual anchor is an anchor or customer service agent that uses an avatar to interact with customers through video, based on technologies such as audio processing, natural language processing and video generation. The virtual anchor can solve problems of traditional customer service such as high management and training costs, long training periods, unstable working quality, and susceptibility to emotion and fatigue; it can also reduce a company's customer service cost, improve customer satisfaction, reduce the complaint rate and ensure stable working quality.

For example, virtual anchors in the current digital medical industry give people a poor experience because their video resolution is low. The high-definition virtual anchor video generation method in the embodiment of the application therefore uses a pre-trained deconvolution module to perform resolution improvement processing on the low-resolution initial anchor video and obtain a high-definition virtual anchor video. Because the deconvolution module enlarges the feature layer, the resolution of the virtual anchor video can be improved conveniently, quickly and efficiently with only a small number of added parameters, which meets users' demand for high-resolution virtual anchors in the digital medical industry and gives users a better experience when the high-resolution virtual anchor delivers health information to them.

It can be understood that the target audio signal and the target video signal are the corresponding audio signal and the corresponding video signal to be synthesized; and the corresponding audio signal and video signal are processed through relevant synthesis operation to obtain the initial virtual anchor video. The method comprises the steps of performing first preprocessing on a target audio signal to obtain a preprocessed audio signal, and performing second preprocessing on a target video signal to obtain a preprocessed video signal; the preprocessing audio signal and the preprocessing video signal are respectively obtained through the preprocessing operation, so that the precondition preparation is made for the subsequent virtual anchor video synthesis.

It is noted that feature extraction is performed on the preprocessed audio signal and the preprocessed video signal, respectively, to obtain the audio features corresponding to the preprocessed audio signal and the video features corresponding to the preprocessed video signal; this prepares for the subsequent feature fusion. During feature extraction, the relevant information or data may be encoded by an encoder; feature extraction can thus be regarded as a more complex encoding process applied to that information or data.

It should be noted that, after feature extraction is performed on the preprocessed audio signal to obtain the audio features and on the preprocessed video signal to obtain the video features, channel dimension fusion processing is further performed on the audio features and the video features; that is, the audio features and the video features are spliced and fused along the channel dimension. This prepares for the subsequent feature decoding, in which the features are decoded into a task-related output. In the embodiment of the application, the initial virtual anchor video is obtained by decoding the multi-modal features.

It can be appreciated that, in order to generate a virtual anchor video with higher resolution and more natural face and lip movements, the embodiment of the application adds an additional deconvolution module to the decoder of the generator of a generative adversarial network based on a mouth-shape synchronization (lip-sync) model, so that a high-resolution video with a more natural face and lips can be generated without changing the input size. By adding only a small number of parameters, the embodiment of the application can increase the resolution of the generated virtual anchor video several times, with more natural lip movements and clearer image quality; this better serves the task of interacting with users and brings them an interactive experience and satisfaction closer to that of a real person. The input of the mouth-shape synchronization model is divided into an audio part and a video part, which need to be aligned before training. The preprocessed audio part is in Mel spectrum format; the preprocessed video part is masked, and the masked video together with another randomly sampled video segment forms a video sample. Audio features are then extracted from the audio by a speech encoder, and visual features are extracted from the video by a video encoder. The extracted audio features and visual features are then concatenated and fused along the channel dimension to obtain multi-modal features, which are input to the decoder to obtain an initial virtual anchor video. Finally, an additional deconvolution module is added to the decoder; the deconvolution module comprises a deconvolution layer and a plurality of convolution layers. The added deconvolution layer enlarges the generated video to several times the size of the input video, and the subsequent convolution layers give the result more detail and more realistic lip movements.

The speech encoder, the video encoder and the decoder together form a generator with residual connections. The video output by the generator first undergoes loss calculation against the target video input to the mouth-shape synchronization model; a pre-trained mouth-shape synchronization detection network then checks whether the movements of the lips and the surrounding area match the audio data, and a discriminator checks the overall fidelity of the video.

Notably, in the process of testing and inference with the mouth-shape synchronization (lip-sync) model, the mouth-shape synchronization detection network and the discriminator are not required, and only the generator is used. The video input is the video of the target task; because the original video encoder takes a normal video and another masked video as input, a masked copy of the input video is made here, and the audio input is the target audio. As in the forward pass during training, audio features are extracted by the speech encoder and video features are extracted by the video encoder; the extracted audio features and video features are then fused along the channel dimension and input to the decoder as multi-mode features, and finally the high-definition virtual anchor video, whose resolution is several times that of the original input video, is obtained through the additional deconvolution module.

In some embodiments, as shown in fig. 2, the step S200 may include, but is not limited to, steps S210 to S220.

Step S210, performing first equal-length interception processing on the target audio signal and the target video signal, so that both cover the same time span;

Step S220, performing Mel spectrum conversion processing on the target audio signal after the first equal-length interception processing to obtain a preprocessed audio signal.

In the process of performing the first preprocessing on the target audio signal, the target audio signal and the target video signal need to be aligned, that is, the first equal-length interception processing needs to be performed on the target audio signal and the target video signal; Mel spectrum conversion processing is then performed on the aligned target audio signal to obtain a preprocessed audio signal.

It is noted that the first equal-length interception processing of the target audio signal and the target video signal prepares for the subsequent synthesis of video and audio. The Mel spectrum conversion processing of the aligned target audio signal yields the preprocessed audio signal, which prepares for the subsequent feature extraction. Aligning the target audio signal and the target video signal also facilitates the subsequent feature fusion processing. The sound to be produced by the high-definition virtual anchor in the subsequent digital medical application can be obtained from the preprocessed audio signal.
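
The equal-length interception of steps S210 and S230 can be illustrated with a minimal sketch. The application does not specify how the common length is chosen; the sketch below assumes that both signals are cut to the shorter of the two durations and that a whole number of video frames is kept. The function name `trim_to_equal_length`, the sample rate and the frame rate are hypothetical values chosen for illustration.

```python
from typing import List, Sequence, Tuple


def trim_to_equal_length(
    audio: Sequence[float],
    frames: Sequence[object],
    sample_rate: int,
    fps: float,
) -> Tuple[List[float], List[object]]:
    """Cut audio samples and video frames so that both cover the same time span.

    Assumption (not stated in the application): the common span is the
    shorter of the two durations, rounded down to a whole video frame.
    """
    audio_seconds = len(audio) / sample_rate
    video_seconds = len(frames) / fps
    duration = min(audio_seconds, video_seconds)

    # Keep a whole number of frames; the small epsilon guards against
    # floating-point results such as 49.999999 for an exact 50 frames.
    n_frames = int(duration * fps + 1e-9)
    # Keep exactly the audio samples that cover those frames.
    n_samples = int(round(n_frames / fps * sample_rate))
    return list(audio[:n_samples]), list(frames[:n_frames])


# Hypothetical input: 3 s of audio at 16 kHz and 2 s of video at 25 fps.
audio = [0.0] * 48000
frames = list(range(50))
aligned_audio, aligned_frames = trim_to_equal_length(audio, frames, 16000, 25)
print(len(aligned_audio), len(aligned_frames))
```

Rounding down to whole frames is only one possible design choice; it keeps every retained frame fully covered by audio.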

In some embodiments, as shown in fig. 3, the step S200 may further include, but is not limited to, step S230 to step S250.

Step S230, performing second equal-length interception processing on the target video signal and the target audio signal, so that both cover the same time span;

Step S240, performing mask conversion processing on the target video signal after the second equal-length interception processing to obtain a mask video signal;

Step S250, combining the target video signal and the mask video signal to obtain a preprocessed video signal.

In the process of performing the second preprocessing on the target video signal, the second equal-length interception processing is first performed on the target video signal and the target audio signal; mask conversion processing is then performed on the target video signal after the second equal-length interception processing to obtain a mask video signal; finally, the target video signal and the mask video signal are combined to obtain a preprocessed video signal.

It is noted that the second equal-length interception processing of the target video signal and the target audio signal prepares for the subsequent feature fusion. Combining the target video signal and the mask video signal means splicing and fusing the target video signal and the mask video signal.

It should be noted that the terms first equal-length interception and second equal-length interception in the embodiments of the present application only distinguish different embodiments and do not indicate any difference in the manner or timing of the interception; the distinction is made only to describe and illustrate the embodiments of the present application more clearly. The preprocessed video signal obtained by combining the target video signal and the mask video signal prepares for generating the high-definition virtual anchor video for the subsequent digital medical application.
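
Steps S240 and S250 can be sketched as follows. The application does not say which region is masked or along which axis the two signals are spliced; the sketch assumes, as is common in lip-sync models, that the lower half of each frame is blanked and that the original and masked frames are concatenated along the channel axis. The function name `preprocess_video` and the array layout `(T, H, W, C)` are hypothetical.

```python
import numpy as np


def preprocess_video(frames: np.ndarray) -> np.ndarray:
    """Build a preprocessed video sample from a (T, H, W, C) frame array.

    Assumptions (not stated in the application): the mask blanks the lower
    half of every frame, and "combining" means channel-wise concatenation.
    """
    masked = frames.copy()
    height = frames.shape[1]
    masked[:, height // 2:, :, :] = 0  # S240: mask conversion (lower half set to zero)

    # S250: splice the target frames and the masked frames along the channel axis.
    return np.concatenate([frames, masked], axis=-1)


# Hypothetical input: 5 RGB frames of 96x96 pixels.
frames = np.ones((5, 96, 96, 3), dtype=np.float32)
sample = preprocess_video(frames)
print(sample.shape)  # channel count doubles from 3 to 6
```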

In some embodiments, as shown in fig. 4, the deconvolution module includes a deconvolution layer and a plurality of superimposed convolution layers, and the step S600 may include, but is not limited to, steps S610 to S620.

Step S610, amplifying the initial anchor video based on the deconvolution layer to obtain a first anchor video;

and step S620, adjusting the first anchor video based on the plurality of convolution layers to obtain a high-definition virtual anchor video.

In the process of performing resolution enhancement processing on the initial anchor video based on the pre-trained deconvolution module, the initial anchor video can first be enlarged by the deconvolution layer to obtain a first anchor video; the first anchor video is then adjusted by the plurality of convolution layers to obtain the high-definition virtual anchor video. The deconvolution module comprises a deconvolution layer and a plurality of convolution layers. The deconvolution layer in the deconvolution module is of the same kind as the deconvolution layers in the decoder; the additionally added deconvolution layer can enlarge the generated video to several times the size of the input video, and the subsequent convolution layers ensure that the result has more detail and more realistic lip actions. For example, when the resolution of the input video is 96×96, the original video synthesis method can only generate a result with the same resolution as the input video, so a video with 96×96 resolution is output; the deconvolution module of the embodiment of the present application, however, can raise the resolution of the synthesized video and output a high-definition video with 192×192 resolution. At the same time, because the semantic segmentation backbone network fully extracts the high-level semantic information of the input picture through the bottleneck layer and retains the detailed texture information of the input picture through the residual connections, the 192×192 result output by the embodiment of the present application is not a simple super-resolution reconstruction but a high-definition picture generated under the guidance of high-level semantic information.
Through this technical scheme, the generated video has higher resolution, more natural lip action and a more realistic visual effect with only a small increase in parameters and computation; the high-definition virtual anchor video reaches a resolution of 192×192, that is, the resolution is doubled in each dimension in the manner of an improvement from 2K resolution to 4K resolution, so that the user has an experience closer to a real person and higher satisfaction.
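
The 96×96 to 192×192 enlargement can be checked with the standard output-size formula of a transposed convolution (deconvolution) layer. The application does not give the layer hyperparameters; the kernel size 4, stride 2 and padding 1 used below are assumptions that happen to double the size exactly, and the one-dimensional helper `transposed_conv1d` is a hypothetical illustration of the operation, not the patented module.

```python
from typing import List, Sequence


def transposed_conv_output_size(
    size: int, kernel: int, stride: int, padding: int = 0, output_padding: int = 0
) -> int:
    """Output length of a transposed convolution (dilation 1)."""
    return (size - 1) * stride - 2 * padding + kernel + output_padding


def transposed_conv1d(
    signal: Sequence[float], kernel: Sequence[float], stride: int, padding: int = 0
) -> List[float]:
    """Naive 1-D transposed convolution: scatter each input value times the kernel."""
    full_len = (len(signal) - 1) * stride + len(kernel)
    full = [0.0] * full_len
    for i, value in enumerate(signal):
        for k, weight in enumerate(kernel):
            full[i * stride + k] += value * weight
    # Padding in a transposed convolution crops the border of the full result.
    return full[padding: full_len - padding]


# Assumed hyperparameters: kernel 4, stride 2, padding 1.
print(transposed_conv_output_size(96, kernel=4, stride=2, padding=1))
row = transposed_conv1d([1.0] * 96, [0.25, 0.75, 0.75, 0.25], stride=2, padding=1)
print(len(row))
```

Applying the same arithmetic to both spatial axes of a 96×96 feature map gives 192×192; the convolution layers that follow would keep this size if they use stride 1 with matching padding.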

Notably, the first anchor video can be obtained by enlarging the initial anchor video with the deconvolution layer, and the first anchor video is then adjusted by the plurality of convolution layers to obtain the high-definition virtual anchor video. The resolution of virtual anchor videos currently used in the digital medical industry is low; the deconvolution module can substantially improve this resolution and bring a good experience to users, so that the virtual anchor can present medical and health information to users, popularize relevant health knowledge, receive and recognize users' speech, and answer the questions that users raise.

In some embodiments, as shown in fig. 5, steps S710 to S730 may be included, but are not limited to, after step S600 is performed.

Step S710, performing loss calculation on the high-definition virtual anchor video based on the target video signal to obtain a video loss value;

step S720, carrying out parameter adjustment processing on a preset video and audio synchronous detection network based on the video loss value;

and step S730, performing fidelity detection processing on the high-definition virtual anchor video based on the video and audio synchronous detection network after the parameter adjustment processing to obtain a detection result.

After the high-definition virtual anchor video is obtained, loss calculation can be performed on the high-definition virtual anchor video based on the target video signal to obtain a video loss value; and then carrying out parameter adjustment processing on a preset video and audio synchronous detection network based on the video loss value, and finally carrying out fidelity detection processing on the high-definition virtual anchor video based on the video and audio synchronous detection network subjected to the parameter adjustment processing to obtain a corresponding detection result so as to verify whether the obtained high-definition virtual anchor video has a situation that the sound is not matched with the mouth shape.

It can be appreciated that the video loss value is obtained by performing loss calculation on the high-definition virtual anchor video based on the target video signal, and that the video loss value represents the difference between the target video signal and the high-definition virtual anchor video. Parameter adjustment processing is performed on the video and audio synchronous detection network based on the video loss value; after the parameter adjustment processing, the video and audio synchronous detection network can perform fidelity detection processing on the obtained high-definition virtual anchor video, and a corresponding detection result is finally obtained. By this technical means, the high-definition virtual anchor video used in the medical industry can be detected and verified.
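
The loss calculation of step S710 can be sketched as follows. The application does not name the loss function; a mean absolute (L1) difference is assumed here purely for illustration, and the sketch also assumes that the target video has already been brought to the same shape as the generated video. The function name `video_loss` is hypothetical.

```python
import numpy as np


def video_loss(generated: np.ndarray, target: np.ndarray) -> float:
    """Mean absolute difference between generated and target frames.

    Assumption: an L1 loss stands in for the unspecified "loss calculation".
    """
    if generated.shape != target.shape:
        raise ValueError("generated and target videos must have the same shape")
    diff = generated.astype(np.float64) - target.astype(np.float64)
    return float(np.mean(np.abs(diff)))


# Hypothetical data: 4 RGB frames at 192x192.
target = np.ones((4, 192, 192, 3))
generated = np.zeros((4, 192, 192, 3))
print(video_loss(generated, target))
print(video_loss(target, target))
```

A larger value indicates a larger difference between the target video signal and the generated video; a value of zero means the two are identical.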

In some embodiments, as shown in fig. 6, the step S400 may include, but is not limited to, steps S410 to S420.

Step S410, performing a first analysis processing on the audio feature to obtain a first channel number, and performing a second analysis processing on the video feature to obtain a second channel number;

step S420, channel splicing processing is carried out on the audio features and the video features based on the first channel number and the second channel number to obtain multi-mode features.

In the process of fusion processing of the audio features and the video features, the first analysis processing is performed on the audio features to obtain the first channel number, and the second analysis processing is performed on the video features to obtain the second channel number; and finally, carrying out channel splicing processing on the audio features and the video features based on the first channel number and the second channel number to obtain the multi-mode features.

It can be understood that firstly, the audio feature and the video feature are analyzed and processed to obtain the corresponding channel number respectively; and then carrying out channel splicing processing on the audio features and the video features according to the corresponding channel number to obtain the multi-mode features.
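
Steps S410 and S420 can be sketched with arrays laid out as `(channels, height, width)`. The layout, the feature sizes and the function name `fuse_along_channels` are assumptions for illustration; the application states only that the channel numbers are determined and the features are spliced along the channel dimension.

```python
import numpy as np


def fuse_along_channels(audio_feat: np.ndarray, video_feat: np.ndarray) -> np.ndarray:
    """Concatenate (C, H, W) audio and video features along the channel axis."""
    # S410: analyse each feature to obtain its channel number.
    first_channels = audio_feat.shape[0]
    second_channels = video_feat.shape[0]

    # Channel splicing requires the remaining dimensions to match.
    if audio_feat.shape[1:] != video_feat.shape[1:]:
        raise ValueError("spatial dimensions of the two features must match")

    # S420: channel splicing; the result has first + second channels.
    fused = np.concatenate([audio_feat, video_feat], axis=0)
    assert fused.shape[0] == first_channels + second_channels
    return fused


# Hypothetical feature sizes.
audio_feat = np.zeros((512, 6, 6))
video_feat = np.ones((256, 6, 6))
multimodal = fuse_along_channels(audio_feat, video_feat)
print(multimodal.shape)
```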

In some embodiments, as shown in fig. 7, the step S220 may include, but is not limited to, step S221 to step S223.

Step S221, preprocessing the target audio signal to obtain a first audio signal;

step S222, carrying out Fourier transform processing on the first audio signal to obtain a target amplitude spectrum signal;

step S223, filtering the target amplitude spectrum signal to obtain a preprocessed audio signal.

In the process of performing mel spectrum conversion processing on a target audio signal, firstly, preprocessing the target audio signal to obtain a first audio signal; then, carrying out Fourier transform processing on the first audio signal to obtain a target amplitude spectrum signal; finally, the target amplitude spectrum signal is filtered to obtain a preprocessed audio signal. The sound played by the high-definition virtual anchor video in the digital medical industry is obtained based on the preprocessed audio signal.
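
Steps S221 to S223 can be sketched with NumPy. The application does not specify the preprocessing or the filter; the sketch assumes that the preprocessing consists of pre-emphasis, framing and Hann windowing, and that the filtering is a triangular Mel filter bank. The sample rate, FFT size, hop length and number of Mel bands are hypothetical values.

```python
import numpy as np


def hz_to_mel(hz):
    return 2595.0 * np.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_filterbank(n_mels: int, n_fft: int, sample_rate: int) -> np.ndarray:
    """Triangular filters whose centres are equally spaced on the Mel scale."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2), n_mels + 2)
    hz_points = mel_to_hz(mel_points)
    fft_freqs = np.linspace(0.0, sample_rate / 2, n_fft // 2 + 1)
    bank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(n_mels):
        left, centre, right = hz_points[m], hz_points[m + 1], hz_points[m + 2]
        rising = (fft_freqs - left) / (centre - left)
        falling = (right - fft_freqs) / (right - centre)
        bank[m] = np.maximum(0.0, np.minimum(rising, falling))
    return bank


def mel_spectrogram(signal, sample_rate=16000, n_fft=800, hop=200, n_mels=80):
    # S221 (assumed preprocessing): pre-emphasis, framing, Hann window.
    emphasized = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    n_frames = 1 + (len(emphasized) - n_fft) // hop
    index = np.arange(n_fft)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = emphasized[index] * np.hanning(n_fft)

    # S222: Fourier transform of each frame, keeping the amplitude spectrum.
    magnitude = np.abs(np.fft.rfft(frames, n=n_fft, axis=1))

    # S223: filtering with the Mel filter bank.
    return magnitude @ mel_filterbank(n_mels, n_fft, sample_rate).T


# Hypothetical input: one second of a 440 Hz tone at 16 kHz.
t = np.arange(16000) / 16000.0
mel = mel_spectrogram(np.sin(2 * np.pi * 440.0 * t))
print(mel.shape)  # (number of frames, number of Mel bands)
```

In practice a logarithm is often applied to the result before it is fed to an encoder; the application does not mention this step, so it is left out here.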

In addition, as shown in fig. 8, an embodiment of the present application further provides a high-definition virtual anchor video generating apparatus 10, including:

a first processing module 100, configured to acquire a target audio signal and a target video signal;

the second processing module 200 is configured to perform a first preprocessing on the target audio signal to obtain a preprocessed audio signal, and perform a second preprocessing on the target video signal to obtain a preprocessed video signal;

A third processing module 300, configured to perform feature extraction on the preprocessed audio signal and the preprocessed video signal, respectively, to obtain an audio feature corresponding to the preprocessed audio signal and a video feature corresponding to the preprocessed video signal;

a fourth processing module 400, configured to perform channel dimension fusion processing on the audio feature and the video feature to obtain a multi-mode feature;

a fifth processing module 500, configured to decode the multi-mode feature to obtain an initial virtual anchor video;

and a sixth processing module 600, configured to perform resolution enhancement processing on the initial anchor video based on the pre-trained deconvolution module to obtain a high-definition virtual anchor video.

In the process of generating the high-definition virtual anchor video, a target audio signal and a target video signal are first acquired; first preprocessing is then performed on the target audio signal to obtain a preprocessed audio signal, and second preprocessing is performed on the target video signal to obtain a preprocessed video signal; feature extraction is then performed on the preprocessed audio signal and the preprocessed video signal respectively to obtain audio features corresponding to the preprocessed audio signal and video features corresponding to the preprocessed video signal; channel dimension fusion processing is then performed on the audio features and the video features to obtain multi-mode features; the multi-mode features are then decoded to obtain an initial virtual anchor video; finally, resolution enhancement processing is performed on the initial anchor video based on the pre-trained deconvolution module to obtain the high-definition virtual anchor video. Through this technical scheme, the feature layer is enlarged by the deconvolution module, so that the resolution of the virtual anchor video can be improved conveniently and efficiently with only a small number of added parameters, bringing people a better medical health experience.

It is worth noting that, when a virtual anchor is currently used to answer medical questions raised by a user or to convey medical and health information to the user, the resolution of the virtual anchor is low and cannot adequately meet the user's demand for high-resolution video. The embodiment of the application therefore first acquires the target audio signal and the target video signal, then performs first preprocessing on the target audio signal to obtain a preprocessed audio signal and second preprocessing on the target video signal to obtain a preprocessed video signal; performs feature extraction on the preprocessed audio signal and the preprocessed video signal respectively to obtain audio features corresponding to the preprocessed audio signal and video features corresponding to the preprocessed video signal; performs channel dimension fusion processing on the audio features and the video features to obtain multi-mode features; decodes the multi-mode features to obtain an initial virtual anchor video; and finally performs resolution enhancement processing on the initial anchor video based on the pre-trained deconvolution module to obtain a high-definition virtual anchor video. Because the feature layer is enlarged by the deconvolution module, the resolution of the virtual anchor video can be improved conveniently, quickly and efficiently with only a small number of added parameters, which meets the user's demand for a high-resolution virtual anchor during online medical consultation.

The specific implementation of the high-definition virtual anchor video generating apparatus 10 is substantially the same as the specific embodiment of the high-definition virtual anchor video generating method described above, and will not be described herein.

In addition, as shown in fig. 9, an embodiment of the present application further provides an electronic device 700, including: memory 720, processor 710, and computer programs stored on memory 720 and executable on processor 710.

Processor 710 and memory 720 may be connected by a bus or other means.

The non-transitory software program and instructions required to implement the high definition virtual anchor video generating method of the above embodiments are stored in the memory 720, and when executed by the processor 710, the high definition virtual anchor video generating method of the above embodiments is performed, for example, the method steps S100 to S600 in fig. 1, the method steps S210 to S220 in fig. 2, the method steps S230 to S250 in fig. 3, the method steps S610 to S620 in fig. 4, the method steps S710 to S730 in fig. 5, the method steps S410 to S420 in fig. 6, and the method steps S221 to S223 in fig. 7 described above are performed.

The above described apparatus embodiments are merely illustrative, wherein the units illustrated as separate components may or may not be physically separate, i.e. may be located in one place, or may be distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

Furthermore, an embodiment of the present application provides a computer-readable storage medium storing computer-executable instructions that are executed by a processor 710 or a controller, for example, by the processor 710 in the above-described device embodiment, which may cause the processor 710 to perform the high-definition virtual anchor video generation method in the above-described embodiment, for example, the method steps S100 to S600 in fig. 1, the method steps S210 to S220 in fig. 2, the method steps S230 to S250 in fig. 3, the method steps S610 to S620 in fig. 4, the method steps S710 to S730 in fig. 5, the method steps S410 to S420 in fig. 6, and the method steps S221 to S223 in fig. 7 described above.

The embodiments described above may be combined, and modules with the same names may be the same or different between different embodiments.

The foregoing describes certain embodiments of the application, other embodiments being within the scope of the following claims. In some cases, the actions or steps recited in the claims can be performed in a different order than in the embodiments and still achieve desirable results. Furthermore, the processes depicted in the accompanying drawings do not necessarily have to be in the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.

The embodiments of the present application are described in a progressive manner; for the same or similar parts, the embodiments may be referred to one another, and each embodiment focuses on its differences from the other embodiments. In particular, because the apparatus, device and computer-readable storage medium embodiments are substantially similar to the method embodiments, they are described relatively briefly; for relevant details, refer to the corresponding parts of the description of the method embodiments.

The apparatus, the device, the computer readable storage medium and the method provided by the embodiments of the present application correspond to each other, and therefore, the apparatus, the device, the non-volatile computer storage medium also have similar beneficial technical effects as those of the corresponding method, and since the beneficial technical effects of the method have been described in detail above, the beneficial technical effects of the corresponding apparatus, device, and computer storage medium are not described here again.

In the 1990s, an improvement to a technology could be clearly distinguished as an improvement in hardware (for example, an improvement to a circuit structure such as a diode, a transistor or a switch) or an improvement in software (an improvement to a method flow). However, with the development of technology, many improvements to current method flows can be regarded as direct improvements to hardware circuit structures. Designers almost always obtain the corresponding hardware circuit structure by programming the improved method flow into a hardware circuit. Therefore, it cannot be said that an improvement to a method flow cannot be realized by a hardware entity module. For example, a programmable logic device (Programmable Logic Device, PLD), such as a field programmable gate array (Field Programmable Gate Array, FPGA), is an integrated circuit whose logic function is determined by the user's programming of the device. A designer programs to "integrate" a digital system onto a single PLD without requiring a chip manufacturer to design and fabricate an application-specific integrated circuit chip. Moreover, nowadays, instead of manually fabricating integrated circuit chips, such programming is mostly implemented with "logic compiler" software, which is similar to the software compiler used in program development; the source code to be compiled must also be written in a specific programming language, called a hardware description language (Hardware Description Language, HDL). There is not just one HDL but many, such as ABEL (Advanced Boolean Expression Language), AHDL (Altera Hardware Description Language), Confluence, CUPL (Cornell University Programming Language), HDCal, JHDL (Java Hardware Description Language), Lava, Lola, MyHDL, PALASM and RHDL (Ruby Hardware Description Language); VHDL (Very-High-Speed Integrated Circuit Hardware Description Language) and Verilog are currently the most commonly used.
It will also be apparent to those skilled in the art that a hardware circuit implementing the logic method flow can be readily obtained merely by performing a little logic programming of the method flow in one of the hardware description languages described above and programming it into an integrated circuit.

The controller may be implemented in any suitable manner. For example, the controller may take the form of a microprocessor or processor together with a computer-readable medium storing computer-readable program code (e.g., software or firmware) executable by the (micro)processor, logic gates, switches, an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a programmable logic controller, or an embedded microcontroller. Examples of controllers include, but are not limited to, the following microcontrollers: ARC 625D, Atmel AT91SAM, Microchip PIC18F26K20, and Silicon Labs C8051F320. A memory controller may also be implemented as part of the control logic of the memory. Those skilled in the art will also appreciate that, in addition to implementing the controller purely in computer-readable program code, it is entirely possible to logically program the method steps so that the controller achieves the same functionality in the form of logic gates, switches, application-specific integrated circuits, programmable logic controllers, embedded microcontrollers and the like. Such a controller may thus be regarded as a hardware component, and the means included therein for performing various functions may also be regarded as structures within the hardware component. Or even, the means for achieving the various functions may be regarded both as software modules implementing the method and as structures within the hardware component.

The system, apparatus, module or unit set forth in the above embodiments may be implemented in particular by a computer chip or entity, or by a product having a certain function. One typical implementation is a computer. In particular, the computer may be, for example, a personal computer, a laptop computer, a cellular telephone, a camera phone, a smart phone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.

For convenience of description, the above devices are described as being functionally divided into various units, respectively. Of course, the functions of each unit may be implemented in the same piece or pieces of software and/or hardware when implementing the embodiments of the present application.

It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, embodiments of the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, embodiments of the application may take the form of a computer program product on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.

The present description is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

In one typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.

The memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.

Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of storage media for a computer include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. Computer-readable media, as defined herein, do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.

It should also be noted that the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in the process, method, article or apparatus that comprises the element.

In the embodiments of the present application, "at least one" means one or more, and "a plurality" means two or more. "And/or" describes an association between associated objects and indicates that three kinds of relationship are possible; for example, "A and/or B" may indicate that A exists alone, that A and B exist together, or that B exists alone, where A and B may each be singular or plural. The character "/" generally indicates that the objects before and after it are in an "or" relationship. "At least one of the following" and similar expressions mean any combination of the listed items, including any combination of single or plural items. For example, at least one of a, b and c may represent: a; b; c; a and b; a and c; b and c; or a and b and c, where a, b and c may each be single or multiple.

Embodiments of the application may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. Embodiments of the application may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.

The embodiments of the present application are described in a progressive manner; for the same or similar parts, the embodiments may be referred to one another, and each embodiment focuses on its differences from the other embodiments. In particular, because the system embodiments are substantially similar to the method embodiments, they are described relatively briefly; for relevant details, refer to the corresponding parts of the description of the method embodiments.

The foregoing describes only exemplary embodiments of the application and is not intended to limit the application. Various modifications and variations of the present application will be apparent to those skilled in the art. Any modification, equivalent replacement, improvement, etc. that comes within the spirit and principles of the application shall be included in the scope of the claims of the present application.

Claims

1. A method for generating a high definition virtual anchor video, the method comprising:

acquiring a target audio signal and a target video signal;

decoding the multi-modal features to obtain an initial virtual anchor video;

2. The method for generating high-definition virtual anchor video according to claim 1, wherein the performing the first preprocessing on the target audio signal to obtain a preprocessed audio signal comprises:

3. The method for generating high-definition virtual anchor video according to claim 1, wherein the performing the second preprocessing on the target video signal to obtain a preprocessed video signal comprises:

4. The method for generating the high-definition virtual anchor video according to claim 1, wherein the deconvolution module comprises a deconvolution layer and a plurality of superimposed convolution layers, and the performing resolution enhancement processing on the initial anchor video by the deconvolution module based on pre-training to obtain the high-definition virtual anchor video comprises:

5. The method for generating high-definition virtual anchor video according to claim 1, wherein after the pre-training-based deconvolution module performs resolution enhancement processing on the initial anchor video to obtain the high-definition virtual anchor video, the method further comprises:

6. The method for generating the high-definition virtual anchor video according to claim 1, wherein the performing channel dimension fusion processing on the audio feature and the video feature to obtain the multi-modal feature comprises:

7. The method for generating high-definition virtual anchor video according to claim 2, wherein the performing mel-frequency spectrum conversion on the aligned target audio signal to obtain the preprocessed audio signal includes:

preprocessing the target audio signal to obtain a first audio signal;

8. A high definition virtual anchor video generating apparatus, the apparatus comprising:

9. An electronic device, comprising: a memory, a processor, and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the high-definition virtual anchor video generation method according to any one of claims 1 to 7 when executing the computer program.

10. A computer-readable storage medium storing computer-executable instructions for performing the high-definition virtual anchor video generation method according to any one of claims 1 to 7.