CN113709401A

CN113709401A - Video call method, device, storage medium, and program product

Info

Publication number: CN113709401A
Application number: CN202111057422.9A
Authority: CN
Inventors: 黄文赞; 黄启军; 黄铭毅; 林江淼; 陈振南
Original assignee: WeBank Co Ltd
Current assignee: WeBank Co Ltd
Priority date: 2021-09-09
Filing date: 2021-09-09
Publication date: 2021-11-26

Abstract

The invention discloses a video call method, a device, a storage medium, and a program product. The method comprises: acquiring a low-resolution video image and hyper-resolution model information sent by a second device participating in a video call, wherein the second device processes collected call video data of the call into the low-resolution video image and sends it to a first device, and, after determining from the call video data a target speaker who uses the second device to participate in the call, acquires the hyper-resolution model information according to the target speaker and sends it to the first device; obtaining a target hyper-resolution model for the target speaker based on the hyper-resolution model information; and performing super-resolution reconstruction on the low-resolution video image with the target hyper-resolution model to obtain and output a first high-resolution video image. The invention enables real-time, smooth playback while restoring video quality well.

Description

Video call method, device, storage medium, and program product

Technical Field

The present invention relates to the field of image processing technologies, and in particular, to a video call method, a device, a storage medium, and a program product.

Background

At present, owing to limited transmission bandwidth and real-time requirements, the video transmitted to the other party during a video call generally has a low resolution; when the other party's terminal display has a high resolution, the resolution of the transmitted video does not match that of the display. To address this, a deep-learning-based super-resolution model can be used to perform super-resolution reconstruction on the low-resolution images, effectively restoring image detail and producing sharp high-resolution images. However, super-resolution models with good reconstruction quality are often large and slow to execute, so that when they are applied to a video call the processing time for a single frame is too long, and it is difficult to achieve real-time, smooth playback while restoring video quality well.

Disclosure of Invention

The main purpose of the invention is to provide a video call method, a device, a storage medium, and a program product, aiming to solve the technical problem that, when a deep-learning-based super-resolution model is applied to video calls, it is currently difficult to achieve real-time, smooth playback while restoring video quality well.

In order to achieve the above object, the present invention provides a video call method, which is applied to a first device participating in a video call, and comprises the following steps:

acquiring a low-resolution video image and hyper-resolution model information sent by a second device participating in the video call, wherein the second device processes collected call video data of the call into the low-resolution video image and sends the low-resolution video image to the first device, and, after determining from the call video data a target speaker who uses the second device to participate in the call, acquires the hyper-resolution model information according to the target speaker and sends the hyper-resolution model information to the first device;

obtaining a target hyper-resolution model for the target speaker based on the hyper-resolution model information;

and performing super-resolution reconstruction on the low-resolution video image by using the target hyper-resolution model to obtain and output a first high-resolution video image, wherein the resolution of the first high-resolution video image is higher than that of the low-resolution video image.

Optionally, when the hyper-resolution model information is target model parameters, the step of obtaining a target hyper-resolution model for the target speaker based on the hyper-resolution model information includes:

replacing model parameters at a preset position in a first general hyper-resolution model with the target model parameters to obtain the target hyper-resolution model for the target speaker;

wherein the target model parameters are the model parameters at the preset position in a target speaker model, extracted by the second device; the target speaker model is a hyper-resolution model obtained in advance by the second device by training a second general hyper-resolution model with video images corresponding to the target speaker; and the first general hyper-resolution model and the second general hyper-resolution model are hyper-resolution models trained on preset general image training data and preset in the first device and the second device, respectively.
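The parameter replacement described above can be illustrated with a minimal sketch. Everything concrete in it is an assumption made for illustration: models are represented as plain dictionaries from layer name to weight list, and the layer name `tail.conv` standing in for the "preset position" is hypothetical. The source does not say which layers the preset position covers.

```python
from typing import Dict, List

# Hypothetical representation: a model maps layer names to flat weight lists.
# Real frameworks use tensors; plain lists keep the sketch runnable.
Params = Dict[str, List[float]]

# Assumed "preset position": layer names both devices agree on in advance.
PRESET_LAYERS = ("tail.conv",)


def extract_target_params(speaker_model: Params) -> Params:
    """Second device: copy only the parameters at the preset position."""
    return {name: list(speaker_model[name]) for name in PRESET_LAYERS}


def build_target_model(general_model: Params, target_params: Params) -> Params:
    """First device: overwrite the preset position of its general model."""
    missing = [name for name in PRESET_LAYERS if name not in target_params]
    if missing:
        raise KeyError(f"target parameters missing layers: {missing}")
    # Copy so the preset general model stays available for other speakers.
    target_model = {name: list(w) for name, w in general_model.items()}
    for name in PRESET_LAYERS:
        if len(target_params[name]) != len(general_model[name]):
            raise ValueError(f"size mismatch at {name}")
        target_model[name] = list(target_params[name])
    return target_model


general = {"head.conv": [0.1, 0.2], "body.conv": [0.3, 0.4], "tail.conv": [0.5, 0.6]}
speaker = {"head.conv": [0.1, 0.2], "body.conv": [0.3, 0.4], "tail.conv": [0.9, 0.8]}
target = build_target_model(general, extract_target_params(speaker))
```

Only the parameters at the preset position travel over the network, which is the point of this variant. The result matches the target speaker model exactly only if the remaining layers of the two general models are identical, which the sketch assumes.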

Optionally, when the hyper-resolution model information is a second high-resolution video image, the step of obtaining a target hyper-resolution model for the target speaker based on the hyper-resolution model information includes:

preprocessing the second high-resolution video image to obtain target training data;

wherein the second high-resolution video image is a high-resolution video image that the second device extracts from the call video data when it does not find a target speaker model corresponding to the target speaker in a preset model library; the resolution of the second high-resolution video image is higher than that of the low-resolution video image; the preset model library stores speaker models corresponding to historical speakers; a historical speaker is a speaker who used the second device to participate in a video call, as determined by the second device from historical call video data; and the speaker model corresponding to a historical speaker is obtained by the second device by training the second general hyper-resolution model with video images corresponding to that historical speaker;

and training the first general hyper-resolution model with the target training data to obtain the target hyper-resolution model for the target speaker.

Optionally, the step of training the first general hyper-resolution model with the target training data to obtain the target hyper-resolution model for the target speaker includes:

training the first general hyper-resolution model with the target training data as the training set and validating it with the general image training data as the validation set; and, upon detecting that the super-resolution reconstruction accuracy of the first general hyper-resolution model on the training set exceeds its accuracy on the validation set, continuing training for a preset time and then stopping, so as to obtain the target hyper-resolution model for the target speaker.
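One possible reading of this stopping rule is sketched below. It assumes the "preset time" is measured in training steps and that an accuracy value (for example PSNR in dB) is available for both sets after each step. The function name, the history values, and the step-based timing are all hypothetical.

```python
from typing import Iterable, Optional, Tuple


def stopping_step(
    history: Iterable[Tuple[float, float]],
    extra_steps: int,
) -> Optional[int]:
    """Return the step index at which training stops, or None if it never does.

    history yields (training_accuracy, validation_accuracy) after each step.
    extra_steps stands in for the preset time, counted here in steps.
    """
    trigger: Optional[int] = None
    for step, (train_acc, val_acc) in enumerate(history):
        # Detect the first step where the model fits the speaker's images
        # better than the general images.
        if trigger is None and train_acc > val_acc:
            trigger = step
        # Keep training for the preset number of steps, then stop.
        if trigger is not None and step - trigger >= extra_steps:
            return step
    return None


# Hypothetical accuracy curves: training set is the target speaker's images,
# validation set is the general image data.
curves = [(28.0, 30.0), (29.5, 30.1), (30.4, 30.2), (30.9, 30.2), (31.3, 30.1), (31.6, 30.0)]
stop_at = stopping_step(curves, extra_steps=2)
```

With these curves the condition is first met at step 2 and training stops at step 4. The rule deliberately lets the model specialize toward the target speaker, since generalization to other images is not required of the target model.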

In order to achieve the above object, the present invention further provides a video call method, which is applied to a second device participating in a video call, and the method includes the following steps:

collecting call video data of the call, and processing the call video data into a low-resolution video image;

determining, according to the call video data, a target speaker who uses the second device to participate in the call, and acquiring hyper-resolution model information according to the target speaker;

and sending the low-resolution video image and the hyper-resolution model information to a first device, so that the first device obtains a target hyper-resolution model for the target speaker based on the hyper-resolution model information and performs super-resolution reconstruction on the low-resolution video image by using the target hyper-resolution model to obtain and output a first high-resolution video image, wherein the resolution of the first high-resolution video image is higher than that of the low-resolution video image.

Optionally, the step of obtaining the hyper-resolution model information according to the target speaker includes:

searching a preset model library for a target speaker model corresponding to the target speaker, wherein the preset model library stores speaker models corresponding to historical speakers; a historical speaker is a speaker who used the second device to participate in a video call, as determined by the second device from historical call video data; the speaker model corresponding to a historical speaker is obtained by the second device by training a second general hyper-resolution model with video images corresponding to that historical speaker; and the second general hyper-resolution model is a hyper-resolution model preset in the second device and trained on preset general image training data;

if the target speaker model is not found, extracting a second high-resolution video image from the call video data as the hyper-resolution model information, so that the first device preprocesses the second high-resolution video image to obtain target training data and trains a first general hyper-resolution model with the target training data to obtain the target hyper-resolution model for the target speaker, wherein the resolution of the second high-resolution video image is higher than that of the low-resolution video image.

Optionally, after the step of searching the preset model library for the target speaker model corresponding to the target speaker, the method further includes:

if the target speaker model is found, using a model file of the target speaker model as the hyper-resolution model information, so that the first device uses the target speaker model in the model file as the target hyper-resolution model for the target speaker; or,

if the target speaker model is found, extracting target model parameters at a preset position in the target speaker model as the hyper-resolution model information, so that the first device replaces the model parameters at the preset position in a first general hyper-resolution model with the target model parameters to obtain the target hyper-resolution model for the target speaker; wherein the target speaker model is a hyper-resolution model obtained in advance by the second device by training the second general hyper-resolution model with video images corresponding to the target speaker, and the first general hyper-resolution model is a hyper-resolution model preset in the first device and trained on the preset general image training data.
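The three branches on the second device (library miss, whole model file, parameters at the preset position) can be summarized in a short sketch. The `ModelInfo` container, the `kind` strings, the `send_whole_model` flag, and the `tail.conv` layer name are hypothetical names introduced only for this illustration.

```python
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical stand-in: a speaker model maps layer names to weight lists.
Params = Dict[str, List[float]]
PRESET_LAYERS = ("tail.conv",)  # assumed preset position


@dataclass
class ModelInfo:
    kind: str        # "model_file", "target_params", or "high_res_images"
    payload: object  # what is actually sent to the first device


def get_model_info(
    speaker_id: str,
    model_library: Dict[str, Params],
    high_res_frames: List[bytes],
    send_whole_model: bool = False,
) -> ModelInfo:
    """Second device: choose what to send for the identified target speaker."""
    speaker_model = model_library.get(speaker_id)
    if speaker_model is None:
        # No model for this speaker yet: send high-resolution frames so the
        # first device can train a target model itself.
        return ModelInfo("high_res_images", list(high_res_frames))
    if send_whole_model:
        # Model found: send the complete speaker model.
        return ModelInfo("model_file", dict(speaker_model))
    # Model found: send only the parameters at the preset position.
    subset = {name: speaker_model[name] for name in PRESET_LAYERS}
    return ModelInfo("target_params", subset)


library = {"alice": {"body.conv": [0.3], "tail.conv": [0.9]}}
known = get_model_info("alice", library, [])
unknown = get_model_info("bob", library, [b"frame-1"])
```

Here `known.kind` is `"target_params"` and `unknown.kind` is `"high_res_images"`. Sending only the preset-position parameters keeps the payload small, at the cost of requiring matching general models on both devices.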

Optionally, the step of determining, according to the call video data, a target speaker who uses the second device to participate in the call includes:

extracting a video image from the call video data, and performing face recognition on the video image to obtain the target speaker who uses the second device to participate in the call.

In order to achieve the above object, the present invention further provides a video call device, where the video call device is disposed in a first device, and the video call device includes:

a first acquisition module, configured to acquire a low-resolution video image and hyper-resolution model information sent by a second device participating in the video call, wherein the second device processes collected call video data of the call into the low-resolution video image and sends the low-resolution video image to the first device, and, after determining from the call video data a target speaker who uses the second device to participate in the call, acquires the hyper-resolution model information according to the target speaker and sends the hyper-resolution model information to the first device;

a second acquisition module, configured to obtain a target hyper-resolution model for the target speaker based on the hyper-resolution model information;

and an output module, configured to perform super-resolution reconstruction on the low-resolution video image by using the target hyper-resolution model to obtain and output a first high-resolution video image, wherein the resolution of the first high-resolution video image is higher than that of the low-resolution video image.

In order to achieve the above object, the present invention further provides a video call device, where the video call device is disposed in a second device, and the video call device includes:

an acquisition module, configured to collect call video data of the call and process the call video data into a low-resolution video image;

a determining module, configured to determine, according to the call video data, a target speaker who uses the second device to participate in the call, and to acquire hyper-resolution model information according to the target speaker;

and a sending module, configured to send the low-resolution video image and the hyper-resolution model information to a first device, so that the first device obtains a target hyper-resolution model for the target speaker based on the hyper-resolution model information and performs super-resolution reconstruction on the low-resolution video image by using the target hyper-resolution model to obtain and output a first high-resolution video image, wherein the resolution of the first high-resolution video image is higher than that of the low-resolution video image.

To achieve the above object, the present invention further provides a video call device, including: the device comprises a memory, a processor and a video call program stored on the memory and capable of running on the processor, wherein the video call program realizes the steps of the video call method when being executed by the processor.

In addition, to achieve the above object, the present invention further provides a computer readable storage medium, which stores a video call program, and when the video call program is executed by a processor, the video call program implements the steps of the video call method as described above.

Furthermore, to achieve the above object, the present invention also provides a computer program product, which includes a computer program, and the computer program implements the steps of the video call method as described above when being executed by a processor.

In the invention, a first device acquires a low-resolution video image and hyper-resolution model information sent by a second device, wherein the second device processes collected call video data of the call into the low-resolution video image and sends it to the first device, and, after determining from the call video data a target speaker who uses the second device to participate in the call, acquires the hyper-resolution model information according to the target speaker and sends it to the first device; the first device obtains a target hyper-resolution model for the target speaker based on the hyper-resolution model information, performs super-resolution reconstruction on the low-resolution video image with the target hyper-resolution model to obtain a first high-resolution video image, and outputs it. Because the target hyper-resolution model is acquired dynamically during the video call and is specific to the speaker of that call, the first device can achieve high reconstruction accuracy on low-resolution video images of the target speaker even when a smaller-scale model structure is used for the target hyper-resolution model, so that real-time, smooth playback is achieved while video quality is restored well.

Drawings

FIG. 1 is a schematic diagram of a hardware operating environment according to an embodiment of the present invention;

FIG. 2 is a flowchart illustrating a video call method according to a first embodiment of the present invention;

FIG. 3 is a schematic flow chart of a video call between two devices according to an embodiment of the present invention.

The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.

Detailed Description

It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

As shown in fig. 1, fig. 1 is a schematic device structure diagram of a hardware operating environment according to an embodiment of the present invention.

It should be noted that, the video call device in the embodiment of the present invention may be a smart phone, a personal computer, a server, and the like, and is not limited herein.

As shown in fig. 1, the video call device may include: a processor 1001 (such as a CPU), a network interface 1004, a user interface 1003, a memory 1005, and a communication bus 1002. The communication bus 1002 enables communication among these components. The user interface 1003 may include a display screen (Display) and an input unit such as a keyboard (Keyboard); optionally, the user interface 1003 may also include a standard wired interface and a wireless interface. The network interface 1004 may optionally include a standard wired interface and a wireless interface (e.g., a Wi-Fi interface). The memory 1005 may be a high-speed RAM memory or a non-volatile memory (e.g., a magnetic disk memory). Alternatively, the memory 1005 may be a storage device separate from the processor 1001.

Those skilled in the art will appreciate that the device structure shown in fig. 1 does not limit the video call device, which may include more or fewer components than shown, combine some components, or arrange the components differently.

As shown in fig. 1, the memory 1005, which is a kind of computer storage medium, may include an operating system, a network communication module, a user interface module, and a video call program. The operating system is a program that manages and controls the hardware and software resources of the device and supports the running of the video call program and other software or programs. In the device shown in fig. 1, the user interface 1003 is mainly used for data communication with a client; the network interface 1004 is mainly used for establishing a communication connection with a server; and the processor 1001 may be configured to call the video call program stored in the memory 1005 and perform the operations of the embodiments of the video call method described below.

Based on the above structure, various embodiments of a video call method are provided.

Referring to fig. 2, fig. 2 is a flowchart illustrating a video call method according to a first embodiment of the present invention.

While a logical order is shown in the flow chart, in some cases, the steps shown or described may be performed in an order different than that shown or described herein. In this embodiment, the video call method is applied to a first device participating in a video call, where the first device is in communication connection with a second device participating in the video call, and the first device and the second device may be devices such as a smart phone, a personal computer, and a server, which are not limited in this embodiment. In this embodiment, the video call method includes:

Step S10, acquiring a low-resolution video image and hyper-resolution model information sent by a second device participating in the video call, wherein the second device processes collected call video data of the call into the low-resolution video image and sends the low-resolution video image to the first device, and, after determining from the call video data a target speaker who uses the second device to participate in the call, acquires the hyper-resolution model information according to the target speaker and sends the hyper-resolution model information to the first device;

In this embodiment, among the two or more devices participating in a video call, a device that receives and displays video data sent by another device is referred to as the first device, and a device that collects video data of a speaker and sends it to another device is referred to as the second device. It can be understood that if a device both receives video data from another device and sends video data to another device, it may act as the first device and perform the first-device operations of the video call method of this embodiment, and may also act as the second device and perform the second-device operations. The operations performed in the two roles do not conflict; they may be performed in parallel or serially, as set according to actual requirements. For example, in an embodiment, the flow of a video call between a first device and a second device may refer to fig. 3. In fig. 3, the left and right modules each represent a device participating in the video call. In both the upper and lower flow lines, a device collects video data with its camera and sends low-resolution video data to the other party through a network interface; after receiving the low-resolution video data, the other party performs super-resolution reconstruction on it with a neural network model (i.e., a super-resolution model) to obtain high-resolution video data, and outputs the high-resolution video data to a screen for playing. In the upper flow line, the left device serves as the second device and the right device serves as the first device; in the lower flow line, the left device serves as the first device and the right device serves as the second device.

After the first device and the second device establish a video call connection, the second device can collect call video data of the call through a camera built into the second device or a camera externally connected to it. The call video data comprises the video frames (referred to as video images in this embodiment) that the camera captures at a certain capture frequency during the video call. The second device processes the call video data into low-resolution video images and sends them to the first device. It should be noted that the resolution of video images captured by a typical camera is higher than the resolution of the video images transmitted between devices during a video call, so in this embodiment the video images in the call video data are referred to as high-resolution video images. In application scenarios, owing to limited network bandwidth, real-time requirements, and the like, the second device needs to process the high-resolution video images into low-resolution video images before transmitting them to the first device. The method for processing a high-resolution video image into a low-resolution video image may refer to existing video call methods and is not described in detail here. In some embodiments, to achieve a real-time call effect, the second device may process each frame of high-resolution video image captured by the camera into a low-resolution video image in real time and send it to the first device in real time.
In some embodiments, when the real-time requirement is less strict, the second device may instead process multiple consecutive frames of high-resolution video images captured by the camera together and then send the processed images to the first device.
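The source leaves the downscaling method to existing video call techniques. As one illustrative possibility only, the sketch below averages each block of pixels in a grayscale frame; the box-average choice, the `downscale` name, and the list-of-lists frame format are assumptions, and a real system would more likely rely on its video codec or an image library.

```python
from typing import List

Image = List[List[float]]  # grayscale frame: rows of pixel intensities


def downscale(frame: Image, factor: int) -> Image:
    """Average each factor x factor block into one low-resolution pixel."""
    height, width = len(frame), len(frame[0])
    if height % factor or width % factor:
        raise ValueError("frame size must be a multiple of the factor")
    out: Image = []
    for top in range(0, height, factor):
        row = []
        for left in range(0, width, factor):
            block = [
                frame[y][x]
                for y in range(top, top + factor)
                for x in range(left, left + factor)
            ]
            row.append(sum(block) / len(block))
        out.append(row)
    return out


high_res = [
    [0.0, 2.0, 4.0, 6.0],
    [2.0, 4.0, 6.0, 8.0],
    [1.0, 1.0, 3.0, 3.0],
    [1.0, 1.0, 3.0, 3.0],
]
low_res = downscale(high_res, 2)  # the 4x4 frame becomes 2x2
```

Each low-resolution pixel carries a quarter of the data of the block it replaces, which is the bandwidth saving that motivates sending low-resolution frames.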

The second device may determine, according to the call video data, the speaker who uses the second device to participate in the call (hereinafter referred to as the target speaker for distinction). Specifically, the second device may perform face recognition on a video image in the call video data to obtain the target speaker. The face recognition may use an existing face recognition method and is not limited in this embodiment. After identifying the target speaker, the second device can acquire the hyper-resolution model information according to the target speaker and send it to the first device. The hyper-resolution model information may be a model file of a hyper-resolution model (short for super-resolution model) that the second device trained in advance with video images of the target speaker, or a part of the model parameters of that hyper-resolution model, or a high-resolution video image extracted from the call video data, or a high-resolution video image of the target speaker collected in advance. In this embodiment, the hyper-resolution model information is not limited, as long as the first device can obtain the target hyper-resolution model for the target speaker from it. For example, when the hyper-resolution model information is a model file of a hyper-resolution model trained in advance with video images of the target speaker, the first device may directly use the hyper-resolution model in the model file as the target hyper-resolution model; when the hyper-resolution model information is a high-resolution video image of the target speaker, the first device may use the high-resolution video image as training data to train and obtain the target hyper-resolution model.
The hyper-resolution model is a neural network model that performs super-resolution reconstruction on a low-resolution image to obtain a high-resolution image. In some embodiments, to improve the efficiency of super-resolution reconstruction, a model structure with a smaller model scale may be adopted. For example, the EDSR structure is a model structure whose size can be adjusted by changing the numbers of ResBlocks and Filters, so an EDSR model structure with fewer ResBlocks and Filters may be adopted as the model structure of the hyper-resolution model in this embodiment.
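To make the effect of the two knobs concrete, the following sketch estimates how many parameters the residual body of an EDSR-style network holds. It assumes each ResBlock contains two 3x3 convolutions with the same number of input and output filters, and it ignores the head, tail, and upsampling layers. The two configurations are illustrative values, not settings taken from the source.

```python
def residual_body_params(num_resblocks: int, num_filters: int, kernel: int = 3) -> int:
    """Estimate the parameter count of the residual body only.

    Assumes two kernel x kernel convolutions per ResBlock, each mapping
    num_filters channels to num_filters channels, with biases.
    """
    conv = kernel * kernel * num_filters * num_filters + num_filters
    return num_resblocks * 2 * conv


# Illustrative configurations (assumed, not from the source).
small = residual_body_params(num_resblocks=4, num_filters=32)
large = residual_body_params(num_resblocks=16, num_filters=64)
ratio = large / small
```

Under these assumptions the smaller configuration has 73,984 body parameters against 1,181,696 for the larger one, roughly a sixteenfold reduction. The count grows linearly with the number of ResBlocks and roughly quadratically with the number of Filters, so reducing Filters shrinks the model fastest.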

In an embodiment, after the video call is established, the second device may perform face recognition on the first frame of video image captured by the camera to determine a speaker, take that speaker as the target speaker, acquire the hyper-resolution model information according to the target speaker, and send it to the first device. The second device then performs face recognition on each frame, or on every several frames, of the subsequently captured video images. If the identified speaker differs from the current target speaker, the second device updates the target speaker to the newly identified speaker, acquires new hyper-resolution model information according to the updated target speaker, and sends it to the first device; if the identified speaker is the same as the current target speaker, the target speaker is not updated and no new hyper-resolution model information is sent to the first device. Further, if face recognition fails on the first frame of video image, that is, no face is recognized in the image, face recognition can be performed on the second frame of video image, and so on, until a face is recognized; the person so recognized is taken as the first identified speaker in the call.
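The tracking logic in this embodiment can be sketched as a small loop. The `recognize` callable stands in for whatever face recognition method is used and is hypothetical, as is `check_every`. The sketch also assumes that a later frame with no recognizable face leaves the current target speaker unchanged, a case the text does not address.

```python
from typing import Callable, Iterable, List, Optional, Tuple


def track_target_speaker(
    frames: Iterable[object],
    recognize: Callable[[object], Optional[str]],
    check_every: int = 1,
) -> List[Tuple[int, str]]:
    """Return (frame_index, speaker_id) for each time new model info is sent."""
    updates: List[Tuple[int, str]] = []
    current: Optional[str] = None
    for index, frame in enumerate(frames):
        # Until a first speaker is found, check every frame;
        # afterwards only every check_every-th frame.
        if current is not None and index % check_every != 0:
            continue
        speaker = recognize(frame)
        if speaker is None or speaker == current:
            continue  # no face, or same speaker: nothing new to send
        current = speaker
        updates.append((index, speaker))
    return updates


# Each fake frame is simply the label a recognizer would return for it.
fake_frames = [None, None, "alice", "alice", "bob", "bob", "alice"]
sent = track_target_speaker(fake_frames, recognize=lambda frame: frame)
```

Here model information is sent three times, at frames 2, 4, and 6, and never for a frame that repeats the current speaker.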

The first device receives the low-resolution video image and the hyper-resolution model information sent by the second device. It should be noted that the second device may send the low-resolution video image and the hyper-resolution model information to the first device either at the same time or at different times, which is not limited in this embodiment.

Step S20, obtaining a target hyper-resolution model for the target speaker based on the hyper-resolution model information;

The first device obtains a hyper-resolution model for the target speaker (hereinafter referred to as the target hyper-resolution model for distinction) based on the hyper-resolution model information. How the first device obtains the target hyper-resolution model depends on the kind of hyper-resolution model information, so this embodiment does not limit the specific implementation; in every case, however, the target hyper-resolution model obtained by the first device is specific to the target speaker. Because the target hyper-resolution model is specific to the target speaker, the images it reconstructs are video images of one particular person, the target speaker, and these images resemble one another. The target hyper-resolution model therefore only needs high reconstruction accuracy on images of that particular person and does not necessarily need strong generalization ability, so in this embodiment a smaller-scale model structure can be used for the target hyper-resolution model. Moreover, because the first device obtains the target hyper-resolution model from hyper-resolution model information that the second device sends dynamically according to the speaker of the call, high reconstruction accuracy can still be obtained when a smaller-scale model structure is used and the target hyper-resolution model performs super-resolution reconstruction on low-resolution video images of the target speaker.

Step S30, performing super-resolution reconstruction on the low-resolution video image by using the target hyper-resolution model to obtain a first high-resolution video image, and outputting the first high-resolution video image, wherein the resolution of the first high-resolution video image is higher than that of the low-resolution video image.

The first device may perform super-resolution reconstruction on the received low-resolution video image by using the target hyper-resolution model to obtain a high-resolution video image (hereinafter referred to as the first high-resolution video image, to distinguish it from the high-resolution video images in the call video data of the second device). Since the first high-resolution video image is obtained by performing super-resolution reconstruction on the low-resolution video image, its resolution is higher than that of the low-resolution video image. The specific process of reconstructing the first high-resolution video image from the low-resolution video image with the target hyper-resolution model may refer to the super-resolution reconstruction process of existing super-resolution models and is not described in detail here.

In an embodiment, after receiving each frame of low-resolution video image, the first device may perform super-resolution reconstruction by using the target hyper-resolution model to obtain a first high-resolution video image and output it. In this case, since obtaining the target hyper-resolution model takes a certain time, the output of the first several frames of low-resolution video images will be delayed to some extent, so this approach suits application scenarios with less strict real-time requirements, for example where a short delay is acceptable at the start of a video call. In another embodiment, when the real-time requirement of the video call is high, the second device processes each frame of high-resolution video image acquired by the camera device into a low-resolution video image in real time and sends it to the first device in real time. For each frame of low-resolution video image, if the first device has not yet obtained the target hyper-resolution model when the frame is received, the first device can output the low-resolution video image directly to preserve the real-time performance of the video call; if the target hyper-resolution model has been obtained when the frame is received, the first device can perform super-resolution reconstruction on the low-resolution video image by using the target hyper-resolution model and then output the resulting first high-resolution video image, so as to balance call real-time performance and video call quality.
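The second handling strategy above can be sketched as follows. This is only an illustrative sketch: the class `FirstDeviceReceiver`, the `Frame` representation, and the nearest-neighbour function standing in for a real hyper-resolution model are all assumptions and are not taken from the specification.

```python
from typing import Callable, List, Optional

# Assumed representation: a grayscale frame as rows of pixel values.
Frame = List[List[int]]


def upscale_nearest_2x(frame: Frame) -> Frame:
    """Hypothetical stand-in for a hyper-resolution model: doubles width and height."""
    out: Frame = []
    for row in frame:
        wide = [pixel for pixel in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


class FirstDeviceReceiver:
    """Outputs every frame immediately; reconstructs only once a model is available."""

    def __init__(self) -> None:
        self.target_model: Optional[Callable[[Frame], Frame]] = None

    def on_model_ready(self, model: Callable[[Frame], Frame]) -> None:
        # Called when the target hyper-resolution model has been obtained.
        self.target_model = model

    def on_frame(self, low_res: Frame) -> Frame:
        if self.target_model is None:
            # Model not obtained yet: output the low-resolution frame in real time.
            return low_res
        # Model obtained: reconstruct, then output the high-resolution frame.
        return self.target_model(low_res)
```

With this arrangement the first frames of a call are shown at low resolution, and the output switches to reconstructed frames as soon as `on_model_ready` has been called.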

In this embodiment, the first device acquires a low-resolution video image and hyper-resolution model information sent by the second device, wherein the second device processes the collected call video data of the current call into the low-resolution video image and sends it to the first device, and, after determining from the call video data the target speaker who uses the second device to participate in the call, acquires the hyper-resolution model information according to the target speaker and sends it to the first device; the first device obtains a target hyper-resolution model for the target speaker based on the hyper-resolution model information, performs super-resolution reconstruction on the low-resolution video image by using the target hyper-resolution model to obtain a first high-resolution video image, and outputs the first high-resolution video image. The target hyper-resolution model is thus acquired dynamically, during the video call, for the speaker of that call. Because the target hyper-resolution model is specific to the speaker, high reconstruction accuracy can be obtained even when a smaller-scale model structure is adopted as its model structure and the first device uses it to perform super-resolution reconstruction on a low-resolution video image of the target speaker, so that real-time smooth playing is achieved while video quality is well restored.

Further, based on the first embodiment, a second embodiment of the video call method of the present invention is provided, in this embodiment, when the hyper-resolution model information is a target model parameter, the step S20 includes:

step S201, replacing model parameters at a preset position in a first general hyper-resolution model with the target model parameters to obtain a target hyper-resolution model for the target speaker; the target model parameters are the model parameters at the preset position that the second device extracts from a target speaker model, the target speaker model is a hyper-resolution model obtained by the second device through training a second general hyper-resolution model in advance with a video image corresponding to the target speaker, and the first general hyper-resolution model and the second general hyper-resolution model are hyper-resolution models that are obtained through training with preset general image training data and are preset in the first device and the second device respectively.

In this embodiment, a general hyper-resolution model may be preset in each of the first device and the second device. The one in the first device is referred to as the first general hyper-resolution model and the one in the second device as the second general hyper-resolution model; the two have the same model structure, or partially the same model structure. A general hyper-resolution model is obtained by training on preset general image training data, and its model structure can be a smaller-scale hyper-resolution model structure. The images in the general image training data are general images, i.e., images that are not restricted to a particular speaker. The general image training data used for training the first and the second general hyper-resolution models may be the same or different, so the model parameters of the two models may also be the same or different. The two models can be trained locally by the first device and the second device respectively, or trained by another device and then sent to the first device and the second device; for example, when the first device and the second device are both clients of a video call, a server of the video call can train a single unified general hyper-resolution model and distribute it to the clients.
It can be understood that when a model structure with a smaller scale is selected as the model structure of the universal hyper-resolution model in order to improve the fluency of the video call, the super-resolution reconstruction accuracy of the trained universal hyper-resolution model is generally not as high as that when a model structure with a larger scale is adopted.

The second device may further train the second general hyper-resolution model in advance by using the video image corresponding to the target speaker, and obtain a hyper-resolution model for the target speaker (hereinafter referred to as the target speaker model, for distinction). The video image corresponding to the target speaker can come from the call video data of the target speaker in a historical video call; that is, the target speaker used the second device for a video call before the current one, and the second device can extract video images from the call video data acquired during that earlier call as training data for the second general hyper-resolution model. Alternatively, the video image corresponding to the target speaker may be acquired by the second device through the camera device before the video call; for example, before the target speaker starts the video call, the second device may output a prompt informing the target speaker that a short video of the speaker needs to be recorded, start the camera to acquire video images of the target speaker after receiving an instruction agreeing to the recording, and train the second general hyper-resolution model with these video images, thereby obtaining the target speaker model. It can be understood that, because the target speaker model is obtained by training on the video image corresponding to the target speaker on the basis of the second general hyper-resolution model, its generalization capability may not be as good as that of the second general hyper-resolution model, but its accuracy in super-resolution reconstruction of the video image of the target speaker is higher than that of the second general hyper-resolution model.

The second device may extract the model parameters at a preset position in the target speaker model as target model parameters, and send the target model parameters to the first device as the hyper-resolution model information. The preset position may be set as needed; for example, it may be set to all parameter positions in the target speaker model, or to the third layer of the target speaker model. When the parameters at certain positions are kept fixed while the second general hyper-resolution model is further trained with the video image corresponding to the target speaker, the positions of the remaining, non-fixed parameters can be used as the preset position.

After receiving the target model parameters sent by the second device, the first device may replace the model parameters at the preset position in the first general hyper-resolution model with the target model parameters to obtain the target hyper-resolution model. It should be noted that, when only part of the model structure of the first general hyper-resolution model and the second general hyper-resolution model is the same, the preset position is located within that shared part of the model structure. Because what the second device sends to the first device are model parameters of the target speaker model for the target speaker, the target hyper-resolution model that the first device obtains by replacing the model parameters in the first general hyper-resolution model is also a hyper-resolution model for the target speaker, and its accuracy in super-resolution reconstruction of the video image of the target speaker is the same as that of the target speaker model and higher than that of the second general hyper-resolution model. Therefore, when the first device uses the target hyper-resolution model to perform super-resolution reconstruction on the low-resolution video image sent by the second device, a high-resolution video image of better quality can be restored, and a smaller-scale model structure can be selected as the model structure of the general hyper-resolution model, so that a more real-time and smooth video call effect can be achieved.
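The extraction on the second device and the replacement on the first device can be illustrated with a minimal sketch. The representation of a model as a dictionary mapping parameter names to flat weight lists, and the function names `extract_target_params` and `apply_target_params`, are assumptions made for illustration only.

```python
from typing import Dict, Iterable, List

# Assumed layout: parameter name -> flat list of weights.
Params = Dict[str, List[float]]


def extract_target_params(speaker_model: Params, preset_positions: Iterable[str]) -> Params:
    """Second device: copy out the parameters at the preset positions."""
    return {name: list(speaker_model[name]) for name in preset_positions}


def apply_target_params(general_model: Params, target_params: Params) -> Params:
    """First device: return a copy of the general model with preset positions replaced."""
    merged = {name: list(weights) for name, weights in general_model.items()}
    for name, weights in target_params.items():
        if name not in merged:
            raise KeyError(f"position {name!r} is not part of the general model")
        if len(weights) != len(merged[name]):
            raise ValueError(f"shape mismatch at position {name!r}")
        merged[name] = list(weights)
    return merged
```

The two checks reflect the requirement that the preset position lie in the part of the model structure shared by both general models; parameters outside the preset position keep the values of the first general hyper-resolution model.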

Further, in an embodiment, the second device may also directly send the model file of the target speaker model to the first device, and the first device directly performs super-resolution reconstruction on the low-resolution video image by using the target speaker model in the model file to obtain a first high-resolution video image and outputs the first high-resolution video image. In this case, the first general hyper-resolution model does not need to be preset in the first device.

Further, in an embodiment, when the hyper-resolution model information is a second high-resolution video image, the step S20 includes:

step S202, preprocessing the second high-resolution video image to obtain target training data; wherein the second high-resolution video image is a high-resolution video image that the second device extracts from the call video data when it does not find a target speaker model corresponding to the target speaker in a preset model library, the resolution of the second high-resolution video image is higher than that of the low-resolution video image, the preset model library is used for storing speaker models corresponding to historical speakers, a historical speaker is a speaker who used the second device to participate in a video call, as determined by the second device according to historical call video data, and the speaker model corresponding to a historical speaker is obtained by the second device through training the second general hyper-resolution model with the video image corresponding to that historical speaker;

the second device may be preset with a model library for storing the speaker model of each historical speaker. Specifically, the second device may determine, according to the call video data of the video call, the target speaker who uses the second device to participate in the video call, and search the model library for a speaker model of the target speaker. If one exists, it is used as the target speaker model. If none exists, the target speaker is treated as a historical speaker: the second device trains the second general hyper-resolution model with the video image corresponding to that historical speaker to obtain a speaker model for the historical speaker and adds the speaker model to the model library, so that when the historical speaker makes a video call again, the second device can obtain the speaker model of that speaker directly from the model library. The video image corresponding to the historical speaker may be a high-resolution video image extracted from the call video data collected in the video call.
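The look-up behaviour of the model library can be sketched as a simple cache keyed by speaker. The class name `SpeakerModelLibrary`, the string speaker identifier, and the injected `train_fn` callable are hypothetical; the specification does not prescribe how the library is stored or keyed.

```python
from typing import Callable, Dict, Optional, Sequence, Tuple


class SpeakerModelLibrary:
    """Stores one speaker model per historical speaker (assumed in-memory storage)."""

    def __init__(self, train_fn: Callable[[Sequence[object]], object]) -> None:
        # train_fn is assumed to fine-tune the second general model on the given frames.
        self._train_fn = train_fn
        self._models: Dict[str, object] = {}

    def find(self, speaker_id: str) -> Optional[object]:
        """Return the stored speaker model, or None if the speaker is unknown."""
        return self._models.get(speaker_id)

    def get_or_train(self, speaker_id: str, frames: Sequence[object]) -> Tuple[object, bool]:
        """Return (model, found); train and store a new model when none is found."""
        model = self._models.get(speaker_id)
        if model is not None:
            return model, True
        model = self._train_fn(frames)
        self._models[speaker_id] = model
        return model, False
```

The `found` flag lets the caller decide what to send for the current call, since a model trained only now was not available when the call started.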

If the second device does not find the speaker model corresponding to the target speaker in the model library, the second device may extract a high-resolution video image (hereinafter, referred to as a second high-resolution video image) from the call video data of the call as hyper-resolution model information and send the hyper-resolution model information to the first device.

After receiving the second high-resolution video image, the first device may pre-process the second high-resolution video image to obtain training data (hereinafter referred to as target training data). The preprocessing may include processing operations such as clipping, enlarging or reducing the image to a fixed size, and rotating, and may specifically refer to the existing preprocessing method for the image training data, which is not described in detail herein.
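As an illustration of the kinds of preprocessing mentioned above, the following sketch crops each image to a fixed size and adds a rotated copy. The nested-list image representation, the centre-crop choice, and the function names are assumptions; an actual implementation would follow whatever existing preprocessing method is adopted.

```python
from typing import List

# Assumed representation: an image as rows of pixel values.
Image = List[List[int]]


def center_crop(image: Image, size: int) -> Image:
    """Cut a size x size patch from the centre of the image."""
    height, width = len(image), len(image[0])
    if size > height or size > width:
        raise ValueError("crop size exceeds image dimensions")
    top = (height - size) // 2
    left = (width - size) // 2
    return [row[left:left + size] for row in image[top:top + size]]


def rotate_90_clockwise(image: Image) -> Image:
    """Rotate the image by 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def preprocess(images: List[Image], size: int) -> List[Image]:
    """Produce fixed-size training patches, each with one rotated copy."""
    samples: List[Image] = []
    for image in images:
        patch = center_crop(image, size)
        samples.append(patch)
        samples.append(rotate_90_clockwise(patch))
    return samples
```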

Step S203, the first general hyper-score model is trained by the target training data to obtain a target hyper-score model for the target speaker.

And the first equipment trains the first general hyper-score model by adopting target training data to obtain a target hyper-score model. Because the target hyper-resolution model is obtained by training the high-resolution video image corresponding to the target speaker on the basis of the first general hyper-resolution model, the accuracy of super-resolution reconstruction of the video image of the target speaker by the target hyper-resolution model is higher than that of the first general hyper-resolution model although the generalization capability of the target hyper-resolution model may not be as good as that of the first general hyper-resolution model. Therefore, when the first device adopts the target hyper-resolution model to perform super-resolution reconstruction on the low-resolution video image sent by the second device, the high-resolution video image with better quality can be restored, and a model structure with smaller scale can be selected as the model structure of the general hyper-resolution model, so that a more real-time and smooth video call effect can be realized.

Further, in an embodiment, the step S203 includes:

step S2031, training the first general hyper-resolution model by using the target training data as a training set, verifying by using the general image training data as a verification set, and stopping training after training for a preset time when detecting that the super-resolution reconstruction accuracy of the first general hyper-resolution model on the training set is higher than that of the first general hyper-resolution model on the verification set, so as to obtain the target hyper-resolution model for the target speaker.

In order to further improve the super-resolution reconstruction accuracy of the target hyper-resolution model on the video image of the target speaker, the first device may train the first general hyper-resolution model by using the target training data as a training set, and verify it by using the general image training data that was used to train the first general hyper-resolution model as a verification set. During training, the first device may calculate the super-resolution reconstruction accuracy of the first general hyper-resolution model on the training set and on the verification set after each round of training, and detect whether the accuracy on the training set is greater than the accuracy on the verification set. If not, the next round of training is performed, followed by the next accuracy calculation and detection. If so, training is stopped after it has continued for a preset duration, and the first general hyper-resolution model finally obtained is taken as the target hyper-resolution model. For the specific calculation of super-resolution reconstruction accuracy, reference may be made to existing accuracy calculation methods, which are not described in detail herein; the preset duration may be set according to specific needs and is not limited in this embodiment.
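The specification leaves the accuracy measure open. One commonly used existing measure for super-resolution quality is the peak signal-to-noise ratio (PSNR); the sketch below shows it purely as an assumed example and not as the measure required by this embodiment.

```python
import math
from typing import List

# Assumed representation: an image as rows of pixel values in [0, max_value].
Image = List[List[float]]


def psnr(reference: Image, reconstructed: Image, max_value: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB between a reference and a reconstruction."""
    if len(reference) != len(reconstructed):
        raise ValueError("images must have the same height")
    total = 0.0
    count = 0
    for ref_row, rec_row in zip(reference, reconstructed):
        if len(ref_row) != len(rec_row):
            raise ValueError("images must have the same width")
        for ref_px, rec_px in zip(ref_row, rec_row):
            total += (ref_px - rec_px) ** 2
            count += 1
    mse = total / count
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)
```

A higher value indicates a reconstruction closer to the reference high-resolution image, so averaging this value over the training set and over the verification set would give two numbers that can be compared after each round.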

It should be noted that the point at which the super-resolution reconstruction accuracy of the first general hyper-resolution model on the training set exceeds its accuracy on the verification set indicates that the generalization capability of the model may have reached its peak. If training continues after that point, the model fits the training set more and more closely; that is, the accuracy on the training set keeps rising while the accuracy on the verification set may fall. In this embodiment, training nevertheless continues for a preset duration after that point, so that the accuracy of the model on the video image of the target speaker keeps increasing. A certain degree of over-fitting is acceptable here, because the video call scheme of this embodiment only requires the target hyper-resolution model to have high super-resolution reconstruction accuracy on the video image of the target speaker and does not require it to generalize well.
Precisely because training continues for a preset duration after the accuracy on the training set exceeds the accuracy on the verification set, the super-resolution reconstruction accuracy of the model on the video image of the target speaker is further improved, which further improves video call quality while keeping the video call real-time and smooth.
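The stopping rule of step S2031 can be sketched as a training loop. The callables `train_one_round`, `train_accuracy`, and `val_accuracy`, the safety cap `max_rounds`, and the injectable `clock` are assumptions introduced so that the sketch is self-contained; they are not named in the specification.

```python
import time
from typing import Callable, Optional


def train_until_specialised(
    train_one_round: Callable[[], None],
    train_accuracy: Callable[[], float],
    val_accuracy: Callable[[], float],
    extra_seconds: float,
    max_rounds: int = 1000,
    clock: Callable[[], float] = time.monotonic,
) -> int:
    """Train until training accuracy exceeds verification accuracy, then keep
    training for extra_seconds before stopping. Returns the number of rounds run."""
    rounds = 0
    deadline: Optional[float] = None
    while rounds < max_rounds:
        train_one_round()
        rounds += 1
        # Detect the crossover once; from then on only the deadline matters.
        if deadline is None and train_accuracy() > val_accuracy():
            deadline = clock() + extra_seconds
        if deadline is not None and clock() >= deadline:
            break
    return rounds
```

Unlike conventional early stopping, which halts at the crossover to protect generalization, this loop deliberately trains past it, which matches the reasoning that only accuracy on the target speaker matters here.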

Further, based on the first and/or second embodiments, a third embodiment of the video call method of the present invention is provided, where in this embodiment, the method is applied to a second device participating in a video call, and the method includes:

step A10, collecting the call video data of the call, and processing the call video data into a low-resolution video image;

step A20, determining a target speaker who uses the second device to participate in the current call according to the call video data, and acquiring hyper-resolution model information according to the target speaker;

further, in an embodiment, the step of determining, in step A20, a target speaker who uses the second device to participate in the current call according to the call video data includes:

step A201, extracting a video image from the call video data, and performing face recognition on the video image to obtain the target speaker who uses the second device to participate in the video call.

Step A30, sending the low-resolution video image and the hyper-resolution model information to the first device, so that the first device obtains a target hyper-resolution model for the target speaker based on the hyper-resolution model information, and performing super-resolution reconstruction on the low-resolution video image by using the target hyper-resolution model to obtain and output a first high-resolution video image, wherein the resolution of the first high-resolution video image is higher than that of the low-resolution video image.

For the detailed implementation of steps A10 to A30, and of step A201 within step A20, in this embodiment, reference may be made to the detailed implementation of steps S10 to S30 in the first embodiment, which is not described in detail herein.

Further, in an embodiment, the step of obtaining the hyper-resolution model information according to the target speaker in step A20 includes:
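Step A10 does not fix how the call video data is processed into a low-resolution video image. As one possible illustration, the sketch below reduces resolution by averaging non-overlapping pixel blocks; the function name, the nested-list frame representation, and the averaging method itself are assumptions.

```python
from typing import List

# Assumed representation: a grayscale frame as rows of integer pixel values.
Frame = List[List[int]]


def downscale_by_average(frame: Frame, factor: int) -> Frame:
    """Shrink a frame by replacing each factor x factor block with its mean."""
    height, width = len(frame), len(frame[0])
    if factor <= 0 or height % factor or width % factor:
        raise ValueError("frame dimensions must be divisible by a positive factor")
    out: Frame = []
    for top in range(0, height, factor):
        row: List[int] = []
        for left in range(0, width, factor):
            block = [
                frame[y][x]
                for y in range(top, top + factor)
                for x in range(left, left + factor)
            ]
            row.append(sum(block) // len(block))  # integer mean of the block
        out.append(row)
    return out
```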

step A202, searching whether a target speaker model corresponding to the target speaker exists in a preset model base, wherein the preset model base is used for storing a speaker model corresponding to a historical speaker, the historical speaker is a speaker who uses the second device to participate in video call and is determined by the second device according to historical call video data, the speaker model corresponding to the historical speaker is obtained by the second device through training a second common hyperscoring model by using a video image corresponding to the historical speaker, and the second common hyperscoring model is a hyperscoring model which is obtained through training by using preset common image training data and is preset in the second device;

step A203, if the target speaker model is not found, extracting a second high-resolution video image from the call video data as the hyper-resolution model information so that the first device can preprocess the second high-resolution video image to obtain target training data, and training the first general hyper-resolution model by using the target training data to obtain a target hyper-resolution model for the target speaker, wherein the resolution of the second high-resolution video image is higher than that of the low-resolution video image.

In this embodiment, for the specific implementation of steps A202 to A203, reference may be made to the specific implementation of steps S202 to S203, and of the refinement step S2031 of step S203, in the foregoing embodiments, which is not described in detail herein.

Further, in an embodiment, the step of obtaining the hyper-resolution model information according to the target speaker in step A20 further includes:

step A204, if the target speaker model is found, taking a model file of the target speaker model as the hyper-resolution model information, so that the first device takes the target speaker model in the model file as a target hyper-resolution model for the target speaker; or,

step A205, if the target speaker model is found, extracting a target model parameter at a preset position in the target speaker model as hyper-resolution model information, so that the first device replaces the model parameter at the preset position in a first general hyper-resolution model with the target model parameter to obtain a target hyper-resolution model for the target speaker; the target speaker model is a hyper-resolution model obtained by the second device through training the second general hyper-resolution model by using a video image corresponding to the target speaker in advance, and the first general hyper-resolution model is a hyper-resolution model obtained by the first device through training by using the general image training data.

In this embodiment, for the specific implementation of steps A204 to A205, reference may be made to the specific implementation of step S201 in the second embodiment, and details are not described herein.
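The choice made by the second device in steps A202 to A205 can be summarised in a short sketch. The three message classes, the use of JSON as a stand-in for a model file, and the convention that passing no preset positions selects the model-file option are all assumptions made for illustration.

```python
import json
from dataclasses import dataclass
from typing import Dict, List, Optional, Sequence, Union

Params = Dict[str, List[float]]  # assumed layout: parameter name -> flat weights
Frame = List[List[int]]          # assumed layout: rows of pixel values


@dataclass
class ModelFileInfo:             # step A204: whole model file
    model_file: bytes


@dataclass
class TargetParamsInfo:          # step A205: parameters at the preset positions
    params: Params


@dataclass
class HighResImagesInfo:         # step A203: second high-resolution video images
    images: List[Frame]


ModelInfo = Union[ModelFileInfo, TargetParamsInfo, HighResImagesInfo]


def select_model_info(
    speaker_model: Optional[Params],
    high_res_frames: List[Frame],
    preset_positions: Optional[Sequence[str]] = None,
) -> ModelInfo:
    """Pick the hyper-resolution model information to send to the first device."""
    if speaker_model is None:
        # No target speaker model in the library: send high-resolution frames.
        return HighResImagesInfo(images=list(high_res_frames))
    if preset_positions is None:
        # Hypothetical serialisation of the model file as JSON bytes.
        return ModelFileInfo(model_file=json.dumps(speaker_model).encode("utf-8"))
    return TargetParamsInfo(
        params={name: list(speaker_model[name]) for name in preset_positions}
    )
```

Sending only the parameters at the preset positions keeps the message smaller than a full model file, at the cost of requiring the first general hyper-resolution model to be present on the first device.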

In addition, an embodiment of the present invention further provides a video call device, where the video call device is deployed in a first device, and the video call device includes:

Further, when the hyper-resolution model information is a target model parameter, the second obtaining module is further configured to:

Further, when the hyper-resolution model information is a second high-resolution video image, the second obtaining module is further configured to:

Further, the second obtaining module is further configured to:

In addition, an embodiment of the present invention further provides a video call device, where the video call device is deployed in a second device, and the video call device includes:

Further, the determining module is further configured to:

The specific embodiments of the video call device of the present invention are basically the same as the embodiments of the video call method described above, and are not described herein again.

In addition, an embodiment of the present invention further provides a computer-readable storage medium, where a video call program is stored on the storage medium, and the video call program, when executed by a processor, implements the steps of the video call method as described above.

The invention also proposes a computer program product comprising a computer program which, when executed by a processor, implements the steps of the video call method as described above.

The embodiments of the video call device, the computer-readable storage medium, and the computer program product of the present invention can refer to the embodiments of the video call method of the present invention, and are not described herein again.

It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.

The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.

Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (such as a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present invention.

The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims

1. A video call method applied to a first device participating in a video call, the method comprising:

2. The video call method of claim 1, wherein when the hyper-score model information is a target model parameter, the step of obtaining a target hyper-score model for the target speaker based on the hyper-score model information comprises:

3. The video call method of claim 2, wherein when the hyper-resolution model information is a second high-resolution video image, the step of obtaining a target hyper-resolution model for the target speaker based on the hyper-resolution model information comprises:

4. The video call method of claim 3, wherein said step of training said first universal hyper-score model with said target training data to obtain a target hyper-score model for said target speaker comprises:

5. A video call method applied to a second device participating in a video call, the method comprising:

6. The video call method of claim 5, wherein the step of obtaining the hyper-resolution model information according to the target speaker comprises:

7. The video call method according to claim 6, wherein after the step of searching whether the preset model library has the target speaker model corresponding to the target speaker, the method further comprises:

8. The video call method of any of claims 5 to 7, wherein the step of determining a target speaker who uses the second device to participate in a video call based on the call video data comprises:

9. A video call device, characterized in that the video call device comprises: a memory, a processor and a video call program stored on the memory and executable on the processor, the video call program when executed by the processor implementing the steps of the video call method of any one of claims 1-4 or the steps of the video call method of any one of claims 5-8.

10. A computer-readable storage medium, having a video call program stored thereon, which when executed by a processor, performs the steps of the video call method of any one of claims 1-4, or the steps of the video call method of any one of claims 5-8.

11. A computer program product comprising a computer program, characterized in that the computer program realizes the steps of the video call method according to any one of claims 1-4 or the steps of the video call method according to any one of claims 5-8 when executed by a processor.