CN113259707A - Virtual human image processing method and device, electronic equipment and storage medium


Info

Publication number
CN113259707A
CN113259707A (application CN202110661903.4A)
Authority
CN
China
Prior art keywords
image group
image
video frame
group
preset
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110661903.4A
Other languages
Chinese (zh)
Other versions
CN113259707B (en)
Inventor
杨国基
常向月
王鑫宇
刘炫鹏
刘致远
刘云峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Zhuiyi Technology Co Ltd
Original Assignee
Shenzhen Zhuiyi Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Zhuiyi Technology Co Ltd filed Critical Shenzhen Zhuiyi Technology Co Ltd
Priority to CN202110661903.4A priority Critical patent/CN113259707B/en
Publication of CN113259707A publication Critical patent/CN113259707A/en
Application granted granted Critical
Publication of CN113259707B publication Critical patent/CN113259707B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/231Content storage operation, e.g. caching movies for short term storage, replicating data over plural servers, prioritizing data for deletion
    • H04N21/23106Content storage operation, e.g. caching movies for short term storage, replicating data over plural servers, prioritizing data for deletion involving caching operations
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/60Network structure or processes for video distribution between server and client or between remote clients; Control signalling between clients, server and network components; Transmission of management data between server and client, e.g. sending from server to client commands for recording incoming content stream; Communication details between server and client 
    • H04N21/63Control signaling related to video distribution between client, server and network components; Network processes for video distribution between server and clients or between remote clients, e.g. transmitting basic layer and enhancement layers over different transmission paths, setting up a peer-to-peer communication via Internet between remote STB's; Communication protocols; Addressing
    • H04N21/643Communication protocols
    • H04N21/6437Real-time Transport Protocol [RTP]

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

The application discloses a virtual human image processing method and device, an electronic device, and a storage medium, relating to the technical field of image processing. The method includes: acquiring a plurality of video frames based on action parameters of a virtual human; dividing the video frames into a plurality of image groups according to a preset rule; and acquiring, as a first image group, a preset image group hit by one of the image groups in a cached image group set comprising a plurality of preset image groups, and transmitting the first image group. The method and device can effectively improve the cache hit rate and thereby the efficiency of transmitting and processing virtual human images.

Description

Virtual human image processing method and device, electronic equipment and storage medium
Technical Field
The present application relates to the field of image processing technologies, and in particular, to a method and an apparatus for processing a virtual human image, an electronic device, and a storage medium.
Background
With the rapid development of science and technology, human-computer interaction has penetrated many aspects of daily life, and digital virtual humans have become increasingly important within it. A digital virtual human is a virtual three-dimensional figure (hereinafter, a virtual human) produced with technologies such as virtual reality, human-computer interaction, high-precision three-dimensional human image simulation, artificial intelligence (AI), motion capture, and facial expression capture.
At present, virtual human images are usually served with an image cache: the required image is looked up (hit) in the cache, and the hit image is transmitted.
However, current image caching caches only complete segments (usually a paragraph of speech lasting several seconds to tens of seconds), which greatly reduces the cache hit rate and in turn the efficiency of transmitting and processing virtual human images.
Disclosure of Invention
In view of the above problems, the present application provides a method and an apparatus for processing a virtual human image, an electronic device, and a storage medium.
In a first aspect, an embodiment of the present application provides a method for processing an image of a virtual human, where the method includes: acquiring a plurality of video frames based on the action parameters of the virtual human; dividing the video frames into a plurality of image groups according to a preset rule; acquiring a preset image group hit by the image group in the plurality of image groups in the cache image group as a first image group; the cache image group comprises a plurality of preset image groups; and transmitting the first image group.
Further, the method further comprises: acquiring an image group which misses the cache image group from the plurality of image groups as an image group to be encoded, and encoding a video frame in the image group to be encoded to obtain a second image group; and transmitting the second image group.
Further, the transmitting the first image group includes:
and if the playing sequence of the first image group is behind the playing sequence of the second image group, transmitting the first image group after the transmission of the second image group is finished.
Further, the transmitting the second image group includes: and transmitting the second image group through an RTP protocol.
Further, the dividing the plurality of video frames into a plurality of image groups according to a preset rule includes: acquiring action parameters corresponding to each video frame in a plurality of video frames; if the action parameter corresponding to the video frame is an effective action parameter, taking the video frame as a tangent point video frame; dividing the plurality of video frames into a plurality of image groups based on the tangent point video frames.
Further, if the motion parameter corresponding to the video frame is an effective motion parameter, before taking the video frame as a tangent point video frame, the method further includes: acquiring a standard image corresponding to the action parameter; acquiring the probability of generating the standard image based on the action parameters; and if the probability is not lower than the probability threshold, determining the action parameter as an effective action parameter.
Further, the acquiring a preset image group hit by an image group in the plurality of image groups in the cached image group as a first image group includes: acquiring a first action parameter corresponding to each image group in the plurality of image groups and a second action parameter corresponding to each preset image group in the plurality of preset image groups; and if the second motion parameters include target second motion parameters matched with the first motion parameters, determining a preset image group corresponding to the target second motion parameters as a preset image group hit by an image group in the plurality of image groups.
Further, before determining, if there is a target second motion parameter matching the first motion parameter in the second motion parameters, a preset image group corresponding to the target second motion parameter as a preset image group hit by an image group in the plurality of image groups, the method further includes: and if a target second action parameter with the similarity larger than a similarity threshold exists in the second action parameters, determining that a target second action parameter matched with the first action parameter exists in the second action parameters.
Further, the acquiring a preset image group hit by an image group in the plurality of image groups in the cached image group as a first image group includes: acquiring a preset image group hit by the image group in the plurality of image groups in the cache image group as an initial image group; acquiring a third action parameter corresponding to a first frame picture in the initial image group and a fourth action parameter corresponding to a last frame picture in an image group with a transmission sequence before the initial image group; and if the third motion parameter is matched with the fourth motion parameter, determining the initial image group as the first image group.
Further, the number of the video frames in the first image group does not exceed a specified number, wherein each video frame in the first image group corresponds to one playing time, and the interval duration between the playing times corresponding to two adjacent video frames does not exceed a specified duration.
Further, the transmitting the first image group includes: and transmitting the first image group through a TCP protocol.
Further, the acquiring a plurality of video frames based on the action parameters of the virtual human includes: acquiring action parameters of the virtual human; determining whether a preset video frame corresponding to the action parameter exists in a cached video frame according to the action parameter, wherein the cached video frame comprises a plurality of preset video frames; if the action parameter exists, acquiring a preset video frame corresponding to the action parameter as a first video frame; if the motion parameters do not exist, inputting the motion parameters into a machine learning model trained in advance, and acquiring a second video frame output by the machine learning model; obtaining the plurality of video frames based on the first video frame and the second video frame.
Further, the deriving the plurality of video frames based on the first video frame and the second video frame comprises: acquiring a specified number of video frames to be transmitted in the transmission sequence before the first video frame; acquiring pixel similarity between the first video frame and each video frame to be transmitted in the specified number of video frames to be transmitted; if the pixel similarity exceeds a pixel similarity threshold, obtaining the plurality of video frames based on the first video frame and the second video frame; and if the pixel similarity does not exceed the pixel similarity threshold, performing the step of inputting the action parameters into the pre-trained machine learning model and acquiring a second video frame output by the machine learning model.
Further, the acquiring the action parameters of the virtual human includes: acquiring input information acting on the virtual human, wherein the input information comprises one or more combinations of voice information, text information and image information; and inputting the input information into a pre-trained virtual human action model, and acquiring action parameters output by the virtual human action model.
In a second aspect, an embodiment of the present application provides a method for processing an image of a virtual human, where the method is applied to an interactive system, where the interactive system includes a server and a terminal device, and the method includes: the server acquires a plurality of video frames based on the action parameters of the virtual human and divides the video frames into a plurality of image groups according to a preset rule; the server acquires a preset image group hit by an image group in the plurality of image groups in a cache image group as a first image group and transmits the first image group to the terminal equipment, wherein the cache image group comprises the plurality of preset image groups; the server acquires an image group which misses the cache image group from the plurality of image groups and takes the image group as an image group to be encoded, encodes a video frame in the image group to be encoded to obtain a second image group, and transmits the second image group to the terminal equipment; and the terminal equipment plays the first image group and the second image group according to the playing sequence of the first image group and the second image group.
In a third aspect, an embodiment of the present application provides a virtual human image processing apparatus, including:
and the video frame acquisition module is used for acquiring a plurality of video frames based on the action parameters of the virtual human.
And the image group dividing module is used for dividing the video frames into a plurality of image groups according to a preset rule.
The first image group processing module is used for acquiring a preset image group hit by an image group in the plurality of image groups in a cache image group as a first image group, wherein the cache image group comprises a plurality of preset image groups.
And the first transmission module is used for transmitting the first image group.
Further, the virtual human image processing device further comprises:
and the second image group processing module is used for acquiring an image group which does not hit the cache image group in the plurality of image groups, using the image group as an image group to be encoded, encoding a video frame in the image group to be encoded to obtain a second image group, and transmitting the second image group.
And the second transmission module is used for transmitting the second image group.
Further, the first transmission module is specifically configured to transmit the first image group after the transmission of the second image group is completed if the playing order of the first image group is after the playing order of the second image group.
Further, the second transmission module is specifically configured to transmit the second group of pictures through an RTP protocol.
Further, the image group dividing module comprises:
and the action parameter acquiring unit is used for acquiring the action parameter corresponding to each video frame in the plurality of video frames.
And the tangent point video frame determining unit is used for taking the video frame as the tangent point video frame if the action parameter corresponding to the video frame is the effective action parameter.
And the dividing unit is used for dividing the plurality of video frames into a plurality of image groups based on the tangent point video frames.
Further, the image group dividing module further comprises:
and the standard image acquisition unit is used for acquiring a standard image corresponding to the action parameter.
And a probability generating unit for acquiring a probability of generating the standard image based on the motion parameter.
And the effective action parameter determining unit is used for determining the action parameter as the effective action parameter if the probability is not lower than the probability threshold.
Further, the first image group processing module is specifically configured to: acquiring a first action parameter corresponding to each image group in a plurality of image groups and a second action parameter corresponding to each preset image group in a plurality of preset image groups; and if the second motion parameter has a target second motion parameter matched with the first motion parameter, determining a preset image group corresponding to the target second motion parameter as a preset image group hit by an image group in the plurality of image groups.
Further, the first image group processing module is specifically configured to: and if the target second action parameter with the similarity larger than the similarity threshold exists in the second action parameters, determining that the target second action parameter matched with the first action parameter exists in the second action parameters.
Further, the first image group processing module is further configured to: acquiring a preset image group hit by the image group in the plurality of image groups in the cache image group as an initial image group; acquiring a third action parameter corresponding to a first frame picture in the initial image group and a fourth action parameter corresponding to a last frame picture in an image group with a transmission sequence before the initial image group; and if the third motion parameter is matched with the fourth motion parameter, determining the initial image group as the first image group.
Further, the number of the video frames in the first image group does not exceed a specified number, wherein each video frame in the first image group corresponds to one playing time, and the interval duration between the playing times corresponding to two adjacent video frames does not exceed a specified duration.
Further, the first group of images processing module is specifically configured to transmit the first group of images via a TCP protocol.
Further, the video frame acquisition module is specifically configured to: acquiring action parameters of a virtual human; determining whether a preset video frame corresponding to the action parameter exists in the cached video frame according to the action parameter, wherein the cached video frame comprises a plurality of preset video frames; if the motion parameter exists, acquiring a preset video frame corresponding to the motion parameter as a first video frame; if the motion parameters do not exist, inputting the motion parameters into a machine learning model trained in advance, and acquiring a second video frame output by the machine learning model; a plurality of video frames is derived based on the first video frame and the second video frame.
Further, the video frame acquisition module is specifically configured to: acquiring a specified number of video frames to be transmitted in a transmission sequence before a first video frame; acquiring pixel similarity between a first video frame and each video frame to be transmitted in a specified number of video frames to be transmitted; if the pixel similarity exceeds a pixel similarity threshold, obtaining a plurality of video frames based on the first video frame and the second video frame; and if the pixel similarity does not exceed the pixel similarity threshold, inputting the action parameters into a machine learning model trained in advance, and acquiring a second video frame output by the machine learning model.
Further, the video frame acquisition module is also used for acquiring input information acting on the virtual human, wherein the input information comprises one or more combinations of voice information, text information and image information; and inputting the input information into a pre-trained virtual human action model, and acquiring action parameters output by the virtual human action model.
Further, the device also comprises a transmission module which is used for transmitting the first image group and the second image group to the client so as to instruct the client to assemble the first image group and the second image group into the virtual human video and play the virtual human video.
In a fourth aspect, an embodiment of the present application provides an electronic device, which includes a memory, one or more processors, and one or more application programs, wherein the one or more processors are coupled with the memory, and the one or more application programs are stored in the memory and configured to be executed by the one or more processors to perform the method of the first aspect.
In a fifth aspect, embodiments of the present application provide a computer-readable storage medium, in which program code is stored, and the program code can be called by a processor to execute the method according to the first aspect.
With the virtual human image processing method and device, electronic device, and storage medium of the present application, a plurality of video frames are acquired based on the action parameters of a virtual human; the video frames are divided into a plurality of image groups according to a preset rule; and a preset image group hit by one of the image groups in the cached image groups (the cache comprising a plurality of preset image groups) is acquired as a first image group and transmitted. Hits are thus performed in units of image groups rather than requiring the plurality of video frames to hit as one whole video, which greatly improves the hit rate and in turn the transmission efficiency of the virtual human image.
Drawings
In order to illustrate the technical solutions in the embodiments of the present application more clearly, the drawings needed for describing the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present application, and those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 shows a schematic application environment provided by an embodiment of the present application.
Fig. 2 shows a flowchart of a method for processing an image of a virtual human provided by a first embodiment of the present application.
Fig. 3 shows a flowchart of a method for processing an image of a virtual human provided by a second embodiment of the present application.
Fig. 4 shows a flowchart of a method for processing an image of a virtual human provided by a third embodiment of the present application.
Fig. 5 shows a flowchart of a method for processing an image of a virtual human provided by a fourth embodiment of the present application.
Fig. 6 shows a flowchart of a method for processing an image of a virtual human provided by a fifth embodiment of the present application.
Fig. 7 shows a flowchart of a method for processing an image of a virtual human provided by a sixth embodiment of the present application.
Fig. 8 shows a flowchart of S610 of a method for processing an image of a virtual human provided by a sixth embodiment of the present application.
Fig. 9 shows a flowchart of S650 of a method for processing an image of a virtual human provided by a sixth embodiment of the present application.
Fig. 10 shows a flowchart of a method for processing an image of a virtual human provided by a seventh embodiment of the present application.
Fig. 11 shows a flowchart of a specific implementation of a method for processing an image of a virtual human provided by a seventh embodiment of the present application.
Fig. 12 shows a block diagram of a virtual human image processing apparatus according to a ninth embodiment of the present application.
Fig. 13 is a block diagram of an electronic device for executing a virtual human image processing method according to an eleventh embodiment of the present application.
Fig. 14 is a storage unit for saving or carrying program code implementing the virtual human image processing method according to the twelfth embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application are described below clearly and completely with reference to the drawings in the embodiments. Obviously, the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by a person skilled in the art based on the embodiments given herein without creative effort shall fall within the protection scope of the present application.
With the rapid development of science and technology, human-computer interaction has penetrated many aspects of daily life, and digital virtual humans have become increasingly important within it. A virtual digital human (hereinafter, a digital human) is a software robot that has a 2D or 3D virtual figure and can interact naturally with a user; it is generally presented to the user as animation on a screen, a projection, or the like. When a digital human is produced, training data is obtained by recording videos of real actors, and a model of the digital human is then trained on this data so that the digital human can move like a real person.
At present, when virtual human images are transmitted for display at a client, an image cache is usually used: the required virtual human image is looked up (hit) in the cache, and the hit image is transmitted. For example, several frames of standard virtual human images may be stored in the cache in advance; a target virtual human image is then generated according to actual requirements, and it is checked whether the target image hits the cached standard frames; if so, the hit standard frames are transmitted.
However, current image caching caches only the video corresponding to a complete segment (usually a paragraph of speech lasting several seconds to tens of seconds). Because it is difficult for many frames of such a long video to hit simultaneously, the cache hit rate is greatly reduced, which in turn reduces the efficiency of transmitting and processing virtual human images.
To solve the above problems, the inventors propose, in the embodiments of the present application, a virtual human image processing method and device, an electronic device, and a storage medium: the video frames of the virtual human video to be transmitted are divided into a plurality of image groups, and cache hits are performed in units of image groups, which effectively increases the cache hit rate and thus the transmission efficiency of virtual human images.
The virtual human image processing method, device, electronic device, and storage medium provided by the embodiments of the present application are described in detail below through specific embodiments.
Referring to fig. 1, fig. 1 is a schematic diagram illustrating an application environment suitable for the embodiment of the present application. The method for processing the virtual human image provided by the embodiment of the application can be applied to the interactive system 100 shown in fig. 1. The interactive system 100 comprises a terminal device 101 and a server 102, wherein the server 102 is in communication connection with the terminal device 101.
The server 102 may be a conventional server or a cloud server, which is not limited here. The server 102 may run a system for processing virtual human image data; for example, the system may generate image frames (also called video frames) from action parameters, perform image cache hits, group multiple image frames, and encode and decode them.
The terminal device 101 may be any of various electronic devices that have a display screen, a data processing module, a camera, audio input/output, and the like, and support data input, including but not limited to smartphones, tablet computers, laptop computers, desktop computers, self-service terminals, and wearable electronic devices. Specifically, data input may be voice input via a voice module provided on the electronic device, text input via a character input module, and so on.
A client application may be installed on the terminal device 101, and the user may interact with the server through it (for example, an APP or a WeChat applet); the conversation robot in this embodiment is likewise a client application configured on the terminal device 101. A user may register a user account with the server 102 through the client application and communicate with the server 102 based on that account: for example, the user logs in to the account in the application and inputs text or voice information through it. After receiving the user's input, the client application sends the information to the server 102, which receives, processes, and stores it, and may also return corresponding output information to the terminal device 101.
First embodiment
Referring to fig. 2, fig. 2 is a schematic flowchart illustrating a method for processing an image of a virtual human provided by an embodiment of the present application. The method may be applied to the server in fig. 1, and as shown in fig. 2, the method may include:
and S110, acquiring a plurality of video frames based on the action parameters of the virtual human.
The action parameters are parameters characterizing the action features of the virtual human; optionally, they may be a multidimensional vector composed of parameters such as facial features, head orientation, body posture, and contour lines.
In some embodiments, the server may input the action parameters of the virtual human into a pre-trained image model and obtain the video frames the model outputs for those parameters; the model may be trained on a number of sample action parameters and sample virtual human images, so that a video frame can be obtained for each of a plurality of action parameters. Optionally, the image model may be trained offline, and different virtual figures may be trained from different source material (e.g., pictures or videos). A video of a real person can then be taken as input to obtain action parameters, from which the virtual human image is generated.
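A minimal sketch of this generation step follows (illustrative only; the `image_model` callable and its signature are assumptions, as the application does not fix a model interface):

```python
import numpy as np

def generate_video_frames(action_params, image_model):
    """Render one video frame per action-parameter vector.

    action_params: iterable of N-dimensional parameter vectors.
    image_model:   a pre-trained callable mapping one parameter vector
                   to an image array (interface assumed, not specified).
    """
    return [image_model(np.asarray(p, dtype=np.float32))
            for p in action_params]
```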
And S120, dividing the plurality of video frames into a plurality of image groups according to a preset rule.
Group of Pictures (GOP) is a term used in video coding, and a GOP is a Group of consecutive Pictures.
Illustratively, M frames of pictures may be sliced according to a preset rule into segments of M1 + M2 + M3 + … frames, where M1, M2, M3, … are the numbers of frames in the successive segments, each segment forming one image group.
In some embodiments, the preset rule may be a preset number, and the server may divide the plurality of video frames into the plurality of image groups according to the preset number, so that the number of video frames included in each of the plurality of image groups is the preset number, for example, each image group includes N video frames.
In other embodiments, the preset rule may be a preset time duration, and the server may divide the plurality of video frames into the plurality of image groups according to the preset time duration, so that the time duration of each image group in the plurality of image groups is less than or equal to the preset time duration, for example, the time duration of the image group does not exceed 1 second.
In still other embodiments, the preset rule may be a target number of image groups, and the server may divide the plurality of video frames into that target number of image groups.
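The three variants of the preset rule can be sketched as follows (an illustrative sketch; the function names are not from the application):

```python
def split_by_count(frames, n):
    """Preset rule 1: image groups of at most n frames each."""
    return [frames[i:i + n] for i in range(0, len(frames), n)]

def split_by_duration(frames, fps, max_seconds):
    """Preset rule 2: each image group plays for at most max_seconds."""
    return split_by_count(frames, max(1, int(fps * max_seconds)))

def split_into_k_groups(frames, k):
    """Preset rule 3: a target number k of (nearly) equal image groups."""
    return split_by_count(frames, -(-len(frames) // k))  # ceiling division
```

For example, `split_by_duration(frames, fps=25, max_seconds=1.0)` yields groups no longer than 1 second, matching the example above.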
S130, acquiring a preset image group hit by the image group in the plurality of image groups in the cache image group as a first image group, wherein the cache image group comprises a plurality of preset image groups.
The number of video frames included in each preset image group in the plurality of preset image groups may be the same as the number of image frames included in the image group.
In some embodiments, the server may compare the action parameters corresponding to an image group with those corresponding to a preset image group to determine whether the image group hits it: for example, if the similarity between the two exceeds a similarity threshold, the image group is determined to hit the preset image group, and the hit preset image group is determined as the first image group. The action parameters corresponding to an image group may include the action parameters of each video frame in the group.
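A sketch of this hit test, assuming per-frame parameter vectors and using mean cosine similarity as one plausible similarity measure (the application leaves the measure open):

```python
import numpy as np

def group_similarity(params_a, params_b):
    """Mean cosine similarity between the per-frame action-parameter
    vectors of two equally sized image groups."""
    sims = []
    for a, b in zip(params_a, params_b):
        a = np.asarray(a, dtype=np.float64)
        b = np.asarray(b, dtype=np.float64)
        sims.append(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
    return float(np.mean(sims))

def first_hit(group_params, cached_groups, threshold=0.95):
    """Return the first preset image group whose parameters are similar
    enough to the query group's parameters, or None on a cache miss."""
    for preset_params, preset_group in cached_groups:
        if group_similarity(group_params, preset_params) >= threshold:
            return preset_group
    return None
```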
And S140, transmitting the first image group.
The server can transmit the first image group to the terminal equipment, so that the terminal can perform video display on the first image group.
In this embodiment, a plurality of video frames are acquired based on the action parameters of the virtual human and divided into a plurality of image groups according to a preset rule; a preset image group hit by one of the image groups in the cached image groups is acquired as a first image group and transmitted, the cache comprising a plurality of preset image groups. Hits are thus performed in units of image groups rather than requiring the whole video to hit at once, which greatly improves the hit rate and in turn the transmission efficiency of the virtual human image.
Second embodiment
Referring to fig. 3, fig. 3 is a schematic flowchart illustrating a method for processing an image of a virtual human provided by an embodiment of the present application. The method may be applied to the server in fig. 1, and as shown in fig. 3, the method may include:
and S210, acquiring a plurality of video frames based on the action parameters of the virtual human.
S220, dividing the video frames into a plurality of image groups according to a preset rule.
And S230, acquiring a preset image group hit by the image group in the plurality of image groups in the cache image group as a first image group, wherein the cache image group comprises the plurality of preset image groups.
The specific implementation of S210 to S230 can refer to S110 to S130, and therefore is not described herein.
S240, the first image group is transmitted.
S250, acquiring an image group of the cache image group missed in the plurality of image groups as an image group to be encoded, and encoding a video frame in the image group to be encoded to obtain a second image group.
Following the above example: if the similarity between the action parameters corresponding to an image group and those corresponding to every preset image group does not exceed the similarity threshold, the image group is determined to miss the cached image groups; it is then taken as a group to be encoded, and encoding it yields the second image group. Encoding the group to be encoded means taking its corresponding picture segment (such as M1, M2, or M3), i.e. multiple frames of pictures, and outputting them as one encoded image group.
Optionally, algorithms usable for image encoding include, but are not limited to, H.264, H.265, VP8, and VP9.
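As an illustrative sketch of encoding one image group, using OpenCV's VideoWriter (one of several possible encoder front ends; whether the "avc1" H.264 fourcc is available depends on the local OpenCV/FFmpeg build):

```python
import cv2

def encode_group(frames, out_path, fps=25):
    """Encode one image group (a list of HxWx3 uint8 BGR arrays) into a
    short H.264 segment; 'avc1' is an assumption about the local build."""
    height, width = frames[0].shape[:2]
    writer = cv2.VideoWriter(out_path, cv2.VideoWriter_fourcc(*"avc1"),
                             fps, (width, height))
    for frame in frames:
        writer.write(frame)
    writer.release()
```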
And S260, transmitting the second image group.
The server can transmit the second image group to the terminal equipment, so that the terminal can perform video display on the second image group.
In some embodiments, when transmitting the first image group, if its playing order is after that of the second image group, the first image group is transmitted after the transmission of the second image group is completed. If its playing order is before that of the second image group, the first image group may be transmitted directly, and the second image group is transmitted after it.
The playing order of the image groups refers to a preset order in which the plurality of image groups are played in sequence. For example, if the preset order plays the second image group first and then the first image group, the playing order of the first image group is after that of the second. Alternatively, the playing order may be determined from playing time nodes: taking the playing time node of the first video frame of image group 1 as 0 s and the duration of each image group as 10 s, image group 1 occupies 0-10 s, image group 2 occupies 10-20 s, image group 3 occupies 20-30 s, and so on, so the playing order of image groups 1 to 3 is image group 1, image group 2, image group 3.
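The ordering constraint amounts to transmitting groups sorted by their play index, as in this sketch (names illustrative):

```python
def transmit_in_play_order(groups, send):
    """Send image groups in their preset playing order, so a cache-hit
    (first) group whose playing order falls after an encoded (second)
    group is transmitted only once that group has gone out.

    groups: iterable of (play_index, payload) pairs, in any order.
    send:   transport callback (e.g. a TCP or RTP sender).
    """
    for _, payload in sorted(groups, key=lambda item: item[0]):
        send(payload)
```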
In this embodiment, a plurality of video frames are acquired based on the action parameters of the virtual human and divided into a plurality of image groups according to a preset rule. A preset image group hit by one of the image groups in the cached image groups is acquired as a first image group and transmitted; an image group that misses the cache is taken as a group to be encoded, its video frames are encoded into a second image group, and the second image group is transmitted. Hits are thus performed in units of image groups, which greatly improves the hit rate.
It should be noted that the server may first perform the step of acquiring and transmitting the first image group and then the step of acquiring, encoding, and transmitting the second image group; alternatively, it may first acquire the first image group and obtain the second image group by encoding the missed groups, and then, while the digital human interaction proceeds, output the first image group and the second image group in sequence.
Third embodiment
Referring to fig. 4, fig. 4 is a schematic flowchart illustrating a method for processing an image of a virtual human provided by an embodiment of the present application. The method may be applied to the server in fig. 1, and as shown in fig. 4, the method may include:
and S310, acquiring a plurality of video frames based on the action parameters of the virtual human.
The detailed implementation of S310 can refer to S210, and therefore is not described herein.
S320, acquiring action parameters corresponding to each video frame in the plurality of video frames.
Since each of the plurality of video frames is generated from an action parameter, each video frame corresponds to the action parameter used to generate it; once a video frame is determined, its corresponding action parameter is therefore directly known. Specifically, the action parameters may be N-dimensional, with one N-dimensional vector per video frame: a segment of M video frames is generated from M such vectors, each vector generating one frame.
And S330, if the action parameter corresponding to the video frame is the effective action parameter, taking the video frame as the tangent point video frame.
Here, a tangent point video frame is a video frame used as a cut point when dividing the video. For example, if the video frames of a segment are divided into M1 and M2, the last frame of M1, or equivalently the first frame of M2, is a tangent point video frame.
Here, an effective action parameter is one from which a standard virtual human image can reliably be generated; for example, if the virtual human image generated from action parameter A is non-standard or cannot be generated at all, action parameter A is not an effective action parameter.
In some embodiments, before S330, the method may further include: acquiring a standard image corresponding to the action parameter; acquiring the probability of generating the standard image based on the action parameter; and, if the probability is not lower than a probability threshold, determining the action parameter as an effective action parameter.
The standard image may be the image an operator has selected in advance as the best of several images, where the several images are different images generated from one and the same action parameter. It will be appreciated that, because machine learning models produce diverse outputs, inconsistent images may be generated from the same action parameter.
As an example, the server may store in advance a number of action parameters, a number of standard images, and a mapping between them. The standard image corresponding to an action parameter can then be determined from the mapping, and the history of generation records can be queried for the probability of generating that standard image from the parameter. For example, if the probability of generating the corresponding standard image B from action parameter B is 90% and the probability threshold is 80%, action parameter B is determined to be an effective action parameter.
Continuing the example, if the action parameter corresponding to video frame b among the plurality of video frames is action parameter B, video frame b may be used as a tangent point video frame.
In some embodiments, after determining a tangent point video frame, the continuity of the tangent point video frame with a video frame before or after the tangent point video frame may be further determined, and if the continuity exceeds a continuity threshold, the tangent point video frame may be determined to be available, otherwise, the tangent point video frame may not be available. Wherein the coherence of two video frames can be positively correlated with the similarity of the motion parameters corresponding to the two video frames.
In other embodiments, before S330, the method may further include: determining the action parameter corresponding to a specified action, such as a person standing still, as an effective action parameter.
Consider that the image generation algorithm used is a Generative Adversarial Network (GAN), a deep learning model. For the same input, a GAN's output varies. This has the benefit of producing diverse video, but makes caching harder for this model. In this embodiment, cut points are placed at effective action parameters, for which the output pictures are almost identical across runs, so good segmentation points are selected.
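A sketch of selecting tangent points from effective action parameters (the `standard_image_prob` lookup, e.g. over historical generation records, is a hypothetical helper):

```python
def is_effective(param, standard_image_prob, threshold=0.8):
    """An action parameter is effective if the model reproduces its
    standard image with probability >= threshold, i.e. the GAN output
    for it is practically deterministic."""
    return standard_image_prob(param) >= threshold

def find_cut_points(params, standard_image_prob):
    """Indices of frames whose action parameters are effective; these
    are candidate tangent-point frames for splitting the video."""
    return [i for i, p in enumerate(params)
            if is_effective(p, standard_image_prob)]
```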
S340, dividing the plurality of video frames into a plurality of image groups based on the tangent point video frames.
Following the above example: if the plurality of video frames need to be divided into two image groups, they can be divided by taking video frame b as the last frame of the first image group, or as the first frame of the second image group.
And S350, acquiring a preset image group hit by the image group in the plurality of image groups in the cache image group as a first image group, and transmitting the first image group, wherein the cache image group comprises the plurality of preset image groups.
S360, acquiring the image group of the cache image group missed in the plurality of image groups as an image group to be encoded, encoding the video frame in the image group to be encoded to obtain a second image group, and transmitting the second image group.
The specific implementation of S350 to S360 can refer to S230 to S260, and therefore, is not described herein.
In this embodiment, the action parameter corresponding to each of the plurality of video frames is obtained. And if the action parameter corresponding to the video frame is the effective action parameter, taking the video frame as the tangent point video frame. The plurality of video frames are divided into the plurality of image groups based on the tangent point video frames, so that the hit rate of the divided image groups can be effectively improved.
Fourth embodiment
Referring to fig. 5, fig. 5 is a schematic flowchart illustrating a method for processing an image of a virtual human provided by an embodiment of the present application. The method may be applied to the server in fig. 1, and as shown in fig. 5, the method may include:
and S410, acquiring a plurality of video frames based on the action parameters of the virtual human.
And S420, dividing the plurality of video frames into a plurality of image groups according to a preset rule.
S430, acquiring a first motion parameter corresponding to each image group in the plurality of image groups and a second motion parameter corresponding to each preset image group in the plurality of preset image groups.
The first motion parameter may include a motion parameter corresponding to each video frame in the image group, such as a set of motion parameters corresponding to each video frame in the image group. The second motion parameter may include a motion parameter corresponding to each video frame in the preset group of images, such as a set of motion parameters corresponding to each video frame in the preset group of images.
As an example, suppose the plurality of image groups includes image group a1 and image group a2, and the action parameters corresponding to the two video frames in image group a1 are action parameter 1 and action parameter 2; then action parameters 1 and 2 serve as the first action parameter corresponding to image group a1. Suppose the plurality of preset image groups includes preset image group b1 and preset image group b2, and the action parameters corresponding to the two video frames in preset image group b2 are action parameters 3 and 4; then action parameters 3 and 4 serve as the second action parameter corresponding to preset image group b2.
And S440, if the second motion parameter has the target second motion parameter matched with the first motion parameter, determining a preset image group corresponding to the target second motion parameter as a preset image group hit by an image group in the plurality of image groups, using the preset image group as a first image group, and transmitting the first image group.
In some embodiments, when two motion parameters are consistent, it may be determined that the two motion parameters match. As an example, for example, if the first motion parameter a1 of the first motion parameters and the second motion parameter b1 of the second motion parameters are consistent, the first motion parameter a1 can be determined to match the second motion parameter b1, and the preset image group corresponding to the second motion parameter b1 can be determined as the preset image group hit by the image group of the image groups.
In some embodiments, before S440, the method may further include: if a target second action parameter whose similarity to the first action parameter is greater than a similarity threshold exists among the second action parameters, determining that a target second action parameter matching the first action parameter exists among the second action parameters.
As an example, if the similarity between the first motion parameter a1 and the second motion parameter b1 is X, the similarity threshold is X1, and if X is greater than X1, it is determined that the first motion parameter a1 matches the second motion parameter b 1. The number of the same action parameters in the first action parameter a1 and the second action parameter b1 can be positively correlated with the similarity X.
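A sketch of this matching rule, taking the fraction of coinciding per-frame vectors as the similarity X (one reading of the positive correlation described above):

```python
import numpy as np

def parameter_set_similarity(first_params, second_params):
    """Fraction of per-frame parameter vectors that coincide between two
    equally long sequences; more agreeing frames give a higher X."""
    matches = sum(np.array_equal(a, b)
                  for a, b in zip(first_params, second_params))
    return matches / max(len(first_params), 1)

def find_target_second_params(first_params, presets, threshold=0.9):
    """Return the preset image group whose second action parameters
    exceed the similarity threshold X1, or None if none does."""
    for second_params, preset_group in presets:
        if parameter_set_similarity(first_params, second_params) > threshold:
            return preset_group
    return None
```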
S450, acquiring an image group of the missed cache image group in the plurality of image groups as an image group to be encoded, encoding a video frame in the image group to be encoded to obtain a second image group, and transmitting the second image group.
In this embodiment, a first action parameter corresponding to each of the plurality of image groups and a second action parameter corresponding to each of the plurality of preset image groups are acquired; if a target second action parameter matching the first action parameter exists among the second action parameters, the preset image group corresponding to it is determined as the preset image group hit by an image group among the plurality of image groups. Whether an image group hits can thus be judged accurately from the action parameters.
Fifth embodiment
Referring to fig. 6, fig. 6 is a schematic flowchart illustrating a method for processing an image of a virtual human provided by an embodiment of the present application. The method may be applied to the server in fig. 1, and as shown in fig. 6, the method may include:
and S510, acquiring a plurality of video frames based on the action parameters of the virtual human.
S520, dividing the video frames into a plurality of image groups according to a preset rule.
The specific implementation of S510 to S520 may refer to S210 to S220, and therefore, will not be described herein.
S530, acquiring a preset image group hit by the image group in the plurality of image groups in the cache image group as an initial image group.
S540, acquiring a third action parameter corresponding to a first frame picture in the initial image group and a fourth action parameter corresponding to a last frame picture in an image group with a transmission sequence before the initial image group.
As an example, if the transmission order of the image groups is image group A then image group B, and image group B is the initial image group, the fourth action parameter corresponding to the last frame picture of image group A and the third action parameter corresponding to the first frame picture of image group B can be acquired. The action parameters corresponding to a picture are obtained in the same way as those corresponding to a video frame in the foregoing embodiments, which is not repeated here.
And S550, if the third motion parameter is matched with the fourth motion parameter, determining the initial image group as a first image group, and transmitting the first image group.
In some embodiments, if the similarity between the third action parameter and the fourth action parameter is greater than a specified similarity threshold, the third and fourth action parameters are determined to match, and the initial image group may be determined as the first image group.
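A sketch of this boundary check (the `params_of` and `matches` helpers are assumptions; `matches` could be the thresholded similarity described above):

```python
def confirm_initial_group(initial_group, prev_group, params_of, matches):
    """Promote a cache-hit (initial) image group to the first image group
    only if the action parameters of its first frame match those of the
    last frame of the group transmitted just before it."""
    third = params_of(initial_group[0])   # first frame of the hit group
    fourth = params_of(prev_group[-1])    # last frame of the prior group
    return initial_group if matches(third, fourth) else None
```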
In some embodiments, the number of the video frames in the first image group does not exceed a specified number, wherein each video frame in the first image group corresponds to one play time, and the interval duration between the play times corresponding to two adjacent video frames respectively does not exceed a specified duration.
And S560, acquiring the image group of the plurality of image groups which does not hit the cache image group as an image group to be encoded, encoding the video frame in the image group to be encoded to obtain a second image group, and transmitting the second image group.
In some embodiments, when the first image group and the second image group are transmitted, a specific embodiment may include: transmitting the first image group through a TCP (transmission control protocol); and transmitting the second image group through an RTP protocol.
In this embodiment, a preset image group hit by one of the plurality of image groups in the cached image groups is acquired as an initial image group; the third action parameter of the first frame of the initial image group and the fourth action parameter of the last frame of the image group transmitted before it are acquired; and if they match, the initial image group is determined as the first image group. This ensures continuity between the first image group and the previously transmitted image group, and thus the transmission quality of the virtual human image.
Sixth embodiment
Referring to fig. 7, fig. 7 is a schematic flowchart illustrating a method for processing an image of a virtual human provided by an embodiment of the present application. The method may be applied to the server in fig. 1, and as shown in fig. 7, the method may include:
and S610, acquiring the action parameters of the virtual human.
In some embodiments, as shown in fig. 8, S610 may include:
and S611, acquiring input information acting on the virtual human, wherein the input information comprises one or more combinations of voice information, text information and image information.
The input information is information the virtual human should respond to. For example, when the virtual human acts as customer service and the input information is a question posed by the user, the customer-service robot can produce a corresponding answer and actions for that question, where the actions include facial expressions, body movements, and the like.
And S612, inputting the input information into a pre-trained virtual human action model, and acquiring action parameters output by the virtual human action model.
The virtual human action model may be trained in advance on a plurality of sample input information and a plurality of sample action parameters; inputting the input information into the pre-trained virtual human action model yields the action parameters output by the model that correspond to the input information.
Optionally, the virtual human action model may include a GAN model.
In this embodiment, the action parameters of the virtual human are acquired through the pre-trained virtual human action model, so that they can be obtained quickly and effectively.
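As a non-limiting sketch of this step (the callables below are assumptions standing in for the deployment's preprocessing and trained network; the patent does not fix their interfaces):

```python
from typing import Callable, Sequence

def get_action_parameters(
    input_info: str,
    featurize: Callable[[str], Sequence[float]],
    action_model: Callable[[Sequence[float]], Sequence[float]],
) -> Sequence[float]:
    """S611/S612 as a thin wrapper: preprocess the user input and run the
    pre-trained virtual human action model (e.g. the generator of a GAN).
    Both callables are assumed interfaces, not defined by the patent."""
    return action_model(featurize(input_info))
```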
S620, determining whether a preset video frame corresponding to the action parameter exists in the cached video frame according to the action parameter, wherein the cached video frame comprises a plurality of preset video frames.
As an example, as shown in Table 1, a mapping relation table between a plurality of action parameters and the plurality of preset video frames in the cached video frames may be established in advance. If a corresponding preset video frame can be found in the cached video frames according to an action parameter, it may be determined that a preset video frame corresponding to that action parameter exists. For example, if the preset video frame a3 can be found from the mapping table for the action parameter a3, it is determined that the preset video frame a3 corresponding to the action parameter a3 exists. If no corresponding preset video frame can be found in the cached video frames for the action parameter a5, the corresponding preset video frame does not exist.
TABLE 1
Action parameter | Preset video frame
Action parameter a1 | Preset video frame a1
Action parameter a2 | Preset video frame a2
Action parameter a3 | Preset video frame a3
And S630, if so, acquiring a preset video frame corresponding to the action parameter as a first video frame.
In connection with the above example, if it is determined that the preset video frame a3 corresponding to the action parameter a3 exists in the cached video frames, the preset video frame a3 may be taken as the first video frame.
And S640, if the motion parameters do not exist, inputting the motion parameters into the machine learning model trained in advance, and acquiring a second video frame output by the machine learning model.
In connection with the above example, if it is determined that no preset video frame corresponding to the action parameter a5 exists, i.e., the action parameter a5 misses the cached video frames, the action parameter a5 may be input into the pre-trained machine learning model, and a second video frame output by the machine learning model may be obtained. The pre-trained machine learning model may be obtained by training on a plurality of sample action parameters and a plurality of sample video frames.
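Taken together, S620 to S640 amount to a cache lookup with a model fallback. A minimal sketch, assuming the mapping table of Table 1 is held as a dictionary and the machine learning model is exposed as a callable (both interfaces are assumptions):

```python
from typing import Callable, Hashable, Mapping

def get_video_frame(
    action_param: Hashable,
    frame_cache: Mapping[Hashable, bytes],
    frame_model: Callable[[Hashable], bytes],
) -> bytes:
    """S620-S640: look the action parameter up in the cached video frames
    (the Table 1 mapping); on a miss, fall back to the pre-trained model."""
    cached = frame_cache.get(action_param)
    if cached is not None:
        return cached                 # S630: hit, reuse the preset video frame
    return frame_model(action_param)  # S640: miss, generate a second video frame
```

For the example above, frame_cache would hold entries a1 to a3, so action parameter a3 returns preset video frame a3 directly, while a5 falls through to frame_model.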
S650, obtaining a plurality of video frames based on the first video frame and the second video frame.
In some embodiments, as shown in fig. 9, S650 may include:
and S651, acquiring a specified number of video frames to be transmitted before the first video frame in the transmission sequence.
The video frames to be transmitted and the transmission sequence may be preset in the server, and specifically, the transmission sequence may be a sequence in which the video frames are sent from the server to the terminal device.
As an example, suppose the transmission sequence is, in order: video frame to be transmitted 1, video frame to be transmitted 2, video frame to be transmitted 3, and video frame to be transmitted 4, where video frame to be transmitted 4 is the first video frame. If the specified number is 2, video frame to be transmitted 2 and video frame to be transmitted 3 may be acquired.
And S652, acquiring the pixel similarity between the first video frame and each video frame to be transmitted in the specified number of video frames to be transmitted.
As an example, the server may compare the pixel similarity between the video frame 4 to be transmitted and the video frame 2 to be transmitted, and the pixel similarity between the video frame 4 to be transmitted and the video frame 3 to be transmitted. Optionally, when comparing the pixel similarity, the similarity of all the pixel points of the two video frames may be compared, or the similarity of only a specific pixel point may be compared.
S653, if the pixel similarity exceeds the pixel similarity threshold, a plurality of video frames are obtained based on the first video frame and the second video frame.
As an example, if the pixel similarity exceeds the pixel similarity threshold, it indicates that the first video frame is of the same type as the preceding video frames to be transmitted and has a certain consistency with them, so the first video frame hit by the action parameter is available, and the step of obtaining a plurality of video frames based on the first video frame and the second video frame may be performed.
And S654, if the pixel similarity does not exceed the pixel similarity threshold, inputting the motion parameters into the pre-trained machine learning model, and acquiring a second video frame output by the machine learning model.
As an example, if the pixel similarity does not exceed the pixel similarity threshold, it indicates that the first video frame is unavailable and needs to be regenerated. The action parameter corresponding to the unavailable first video frame may be input into the pre-trained machine learning model to obtain a second video frame output by the machine learning model and corresponding to that action parameter, which replaces the original unavailable first video frame. In this way, both the first video frame and the second video frame are available video frames, ensuring the transmission quality of the virtual human image.
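S651 to S654 can be summarized in the following sketch, assuming frames are 8-bit image arrays and using mean absolute pixel difference as one possible similarity measure (the patent fixes neither the formula nor the threshold value):

```python
import numpy as np

PIXEL_SIMILARITY_THRESHOLD = 0.95  # assumed value

def pixel_similarity(frame_a: np.ndarray, frame_b: np.ndarray) -> float:
    """Similarity in [0, 1] derived from mean absolute pixel difference."""
    diff = np.abs(frame_a.astype(np.float32) - frame_b.astype(np.float32))
    return 1.0 - float(diff.mean()) / 255.0

def first_frame_available(first_frame: np.ndarray,
                          preceding: list) -> bool:
    """S652-S654: the hit frame is usable only if it is sufficiently similar
    to every one of the specified number of preceding frames to be
    transmitted; otherwise it must be regenerated by the model."""
    return all(pixel_similarity(first_frame, f) >= PIXEL_SIMILARITY_THRESHOLD
               for f in preceding)
```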
And S660, dividing the plurality of video frames into a plurality of image groups according to a preset rule.
S670, acquiring a preset image group hit by an image group in a plurality of image groups in the cache image group as a first image group, and transmitting the first image group, wherein the cache image group comprises the plurality of preset image groups.
And S680, acquiring an image group which does not hit the cache image group from the plurality of image groups as an image group to be encoded, encoding a video frame in the image group to be encoded to obtain a second image group, and transmitting the second image group.
In this embodiment, whether a preset video frame corresponding to the action parameter exists in the cached video frames is determined according to the action parameter, where the cached video frames include a plurality of preset video frames. If it exists, the preset video frame corresponding to the action parameter is acquired as a first video frame. If it does not exist, the action parameter is input into a pre-trained machine learning model, and a second video frame output by the machine learning model is acquired. A plurality of video frames is obtained based on the first video frame and the second video frame, so that the corresponding video frames can be effectively obtained regardless of whether the action parameter hits or misses the cached video frames.
Seventh embodiment
Referring to fig. 10, fig. 10 is a schematic flowchart illustrating a virtual human image processing method according to an embodiment of the present application. The method may be applied to the server in fig. 1, and as shown in fig. 10, the method may include:
and S710, acquiring a plurality of video frames based on the action parameters of the virtual human.
S720, dividing the plurality of video frames into a plurality of image groups according to a preset rule.
S730, a preset image group hit by the image group in the plurality of image groups in the cache image group is obtained as a first image group, wherein the cache image group comprises the plurality of preset image groups.
And S740, acquiring the image group of the missed cache image group in the plurality of image groups as an image group to be encoded, and encoding the video frame in the image group to be encoded to obtain a second image group.
And S750, transmitting the first image group and the second image group to the client to instruct the client to assemble the first image group and the second image group into the virtual human video, and playing the virtual human video.
The terminal device in fig. 1 may be used as a client.
In some embodiments, the server may transmit the first image group and the second image group to the client, and the client may splice the first image group and the second image group to assemble the first image group and the second image group into the virtual human video, and then play the virtual human video.
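A minimal client-side sketch of this assembly step, assuming each received group carries its play order and that rendering is delegated to a callback (both assumptions):

```python
from typing import Callable, Iterable, List, Tuple

def assemble_and_play(groups: Iterable[Tuple[int, List[bytes]]],
                      play_frame: Callable[[bytes], None]) -> None:
    """Splice received image groups back into one virtual human video and
    play it frame by frame, in play order."""
    for _, frames in sorted(groups, key=lambda g: g[0]):
        for frame in frames:
            play_frame(frame)
```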
In practical application, the implementation flow of the virtual human image processing method of this embodiment may be as shown in fig. 11. Information may be input to the virtual robot of the server, and the action parameters corresponding to the input information are then obtained through the virtual human action model, where the specific manner of obtaining the action parameters may refer to the sixth embodiment. It is then determined whether the action parameters hit the picture cache (i.e., the cached video frames): if so, the hit picture frames may be used directly; if not, the picture frames corresponding to the action parameters may be generated by the virtual human generation model. The resulting plurality of picture frames is then divided through an optimal image group segmentation search to obtain a plurality of image groups, where the specific dividing manner may refer to the second embodiment. Image group cache lookup is then performed on these image groups: a hit image group, i.e., the first image group, is transmitted directly to the client (i.e., the terminal device) through the TCP protocol, while a missed image group, i.e., the second image group, is encoded and transmitted to the client through the RTP protocol. Finally, the first image group and the second image group are assembled and played at the client.
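The fig. 11 flow can be condensed into the following orchestration sketch, in which every callable is an assumed stand-in for a component described in the earlier embodiments rather than an API defined by the patent:

```python
def process_and_transmit(input_info, action_model, make_frame, split_groups,
                         lookup_cached_group, encode_group,
                         send_tcp, send_rtp):
    """Condensed fig. 11 flow; every callable is an assumed stand-in."""
    action_params = action_model(input_info)          # virtual human action model
    frames = [make_frame(p) for p in action_params]   # picture cache or generator
    for group in split_groups(frames):                # optimal GOP segmentation
        cached = lookup_cached_group(group)           # image group cache lookup
        if cached is not None:
            send_tcp(cached)                          # hit: first image group, TCP
        else:
            send_rtp(encode_group(group))             # miss: second image group, RTP
```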
Eighth embodiment
The embodiment of the application provides a virtual human image processing method, which is applied to an interactive system shown in fig. 1 and comprises the following steps:
the server acquires a plurality of video frames based on the action parameters of the virtual human, and divides the plurality of video frames into a plurality of image groups according to a preset rule.
The server acquires a preset image group hit by an image group in the plurality of image groups in the cache image group as a first image group and transmits the first image group to the terminal device, wherein the cache image group comprises the plurality of preset image groups.
The server acquires an image group which does not hit the cache image group in the plurality of image groups as an image group to be encoded, encodes a video frame in the image group to be encoded to obtain a second image group, and transmits the second image group to the terminal equipment.
And the terminal equipment plays the first image group and the second image group according to the playing sequence of the first image group and the second image group.
In some embodiments, when the terminal device receives the first image group and the second image group at the same time, the terminal device may play the first image group and the second image group according to the playing order of the first image group and the second image group. Optionally, the terminal device may also assemble the first image group and the second image group, for example, shorten a time interval between the first image group and the second image group to within a specified time length, so as to better play the first image group and the second image group.
In some embodiments, when the terminal device receives the first image group first and has not yet received the second image group, the terminal device may determine whether to play the first image group directly according to the preset playing order of the image groups. If the playing order of the first image group precedes that of the second image group, the first image group may be played directly. If the playing order of the first image group follows that of the second image group, the first image group may be played after the second image group is received and played. In this way, the plurality of image groups can be played in the normal playing order.
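A sketch of this ordering behaviour on the terminal, assuming image groups carry unique consecutive play-order numbers starting from 0 (an assumption for illustration):

```python
import heapq
from typing import Callable, List

class OrderedPlayer:
    """Terminal-side sketch: buffer image groups that arrive out of order
    and release them strictly in play order."""

    def __init__(self, play_group: Callable[[list], None]):
        self.play_group = play_group    # rendering callback (assumed interface)
        self.next_order = 0             # play orders assumed to start at 0
        self.pending: List[tuple] = []  # min-heap of (play_order, frames)

    def on_group_received(self, play_order: int, frames: list) -> None:
        heapq.heappush(self.pending, (play_order, frames))
        # Play every group whose turn has come; hold any group (e.g. a first
        # image group) that arrived before an earlier group in play order.
        while self.pending and self.pending[0][0] == self.next_order:
            _, ready = heapq.heappop(self.pending)
            self.play_group(ready)
            self.next_order += 1
```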
For the specific working processes of the server and the terminal device described above, reference may be made to the corresponding processes of the method in the embodiments of the present application, and details are not described herein again.
Ninth embodiment
Referring to fig. 12, fig. 12 is a block diagram illustrating a virtual human image processing apparatus according to an embodiment of the present application. As shown in fig. 12, the virtual human image processing apparatus 900 includes: a video frame acquisition module 910, an image group dividing module 920, a first image group processing module 930, and a first transmission module 940. Wherein:
a video frame acquiring module 910, configured to acquire a plurality of video frames based on the motion parameters of the virtual human.
The image group dividing module 920 is configured to divide the plurality of video frames into a plurality of image groups according to a preset rule.
A first image group processing module 930, configured to acquire, as the first image group, a preset image group hit by an image group in the multiple image groups in a cached image group, where the cached image group includes multiple preset image groups.
A first transmission module 940, configured to transmit the first image group.
Tenth embodiment
In this embodiment, referring to fig. 12 again, the virtual human image processing apparatus 900 further includes:
the second group of pictures processing module 950 is configured to obtain a group of pictures that misses the cached group of pictures in the plurality of groups of pictures, as a group of pictures to be encoded, encode a video frame in the group of pictures to be encoded, obtain a second group of pictures, and transmit the second group of pictures.
A second transmission module 960, configured to transmit the second group of images.
Optionally, the first transmission module 940 is specifically configured to transmit the first image group after the transmission of the second image group is completed, if the playing order of the first image group is later than the playing order of the second image group.
Optionally, the second transmission module 960 is specifically configured to transmit the second image group through the RTP protocol.
Optionally, the image group dividing module 920 includes:
and the action parameter acquiring unit is used for acquiring the action parameter corresponding to each video frame in the plurality of video frames.
And the tangent point video frame determining unit is used for taking the video frame as the tangent point video frame if the action parameter corresponding to the video frame is the effective action parameter.
And the dividing unit is used for dividing the plurality of video frames into a plurality of image groups based on the tangent point video frames.
Optionally, the image group dividing module 920 further includes:
and the standard image acquisition unit is used for acquiring a standard image corresponding to the action parameter.
And a probability generating unit for acquiring a probability of generating the standard image based on the motion parameter.
And the effective action parameter determining unit is used for determining the action parameter as the effective action parameter if the probability is not lower than the probability threshold.
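Taken together, the units above describe the following dividing procedure. A minimal sketch, assuming accessor callables for a frame's action parameter and for the probability of generating its standard image, and an assumed probability threshold; the exact split convention at a tangent point is also an assumption:

```python
from typing import Callable, List

def divide_into_groups(frames: List[object],
                       param_of: Callable[[object], object],
                       standard_image_prob: Callable[[object], float],
                       prob_threshold: float = 0.8) -> List[List[object]]:
    """Mark a frame as a tangent-point video frame when its action parameter
    is effective (standard-image probability not below the threshold), then
    split the frame sequence at tangent points. Splitting *before* each
    tangent point is one possible convention; the patent does not fix it."""
    groups: List[List[object]] = []
    current: List[object] = []
    for frame in frames:
        is_tangent = standard_image_prob(param_of(frame)) >= prob_threshold
        if is_tangent and current:
            groups.append(current)  # close the running group at a tangent point
            current = []
        current.append(frame)
    if current:
        groups.append(current)
    return groups
```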
Optionally, the first image group processing module 930 is specifically configured to: acquiring a first action parameter corresponding to each image group in a plurality of image groups and a second action parameter corresponding to each preset image group in a plurality of preset image groups; and if the second motion parameter has a target second motion parameter matched with the first motion parameter, determining a preset image group corresponding to the target second motion parameter as a preset image group hit by an image group in the plurality of image groups.
Optionally, the first image group processing module 930 is specifically configured to: and if the target second action parameter with the similarity larger than the similarity threshold exists in the second action parameters, determining that the target second action parameter matched with the first action parameter exists in the second action parameters.
Optionally, the first image group processing module 930 is further configured to: acquire a preset image group hit in the cache image group by the image group among the plurality of image groups as an initial image group; acquire a third action parameter corresponding to the first frame picture in the initial image group and a fourth action parameter corresponding to the last frame picture in the image group whose transmission order precedes the initial image group; and if the third action parameter matches the fourth action parameter, determine the initial image group as the first image group.
Optionally, the number of the video frames in the first image group does not exceed a specified number, wherein each video frame in the first image group corresponds to one playing time, and the interval duration between the playing times corresponding to two adjacent video frames does not exceed a specified duration.
Optionally, the first transmission module 940 is specifically configured to transmit the first image group through the TCP protocol.
Optionally, the video frame acquiring module 910 is specifically configured to: acquiring action parameters of a virtual human; determining whether a preset video frame corresponding to the action parameter exists in the cached video frame according to the action parameter, wherein the cached video frame comprises a plurality of preset video frames; if the motion parameter exists, acquiring a preset video frame corresponding to the motion parameter as a first video frame; if the motion parameters do not exist, inputting the motion parameters into a machine learning model trained in advance, and acquiring a second video frame output by the machine learning model; a plurality of video frames is derived based on the first video frame and the second video frame.
Optionally, the video frame acquiring module 910 is specifically configured to: acquiring a specified number of video frames to be transmitted in a transmission sequence before a first video frame; acquiring pixel similarity between a first video frame and each video frame to be transmitted in a specified number of video frames to be transmitted; if the pixel similarity exceeds a pixel similarity threshold, obtaining a plurality of video frames based on the first video frame and the second video frame; and if the pixel similarity does not exceed the pixel similarity threshold, inputting the action parameters into a machine learning model trained in advance, and acquiring a second video frame output by the machine learning model.
Optionally, the video frame acquiring module 910 is further configured to acquire input information acting on the virtual human, where the input information includes one or more combinations of voice information, text information, and image information; and inputting the input information into a pre-trained virtual human action model, and acquiring action parameters output by the virtual human action model.
Optionally, the apparatus further includes a transmission module configured to transmit the first image group and the second image group to the client, so as to instruct the client to assemble the first image group and the second image group into the virtual human video, and play the virtual human video.
It can be clearly understood by those skilled in the art that the above devices provided by the embodiments of the present application can implement the methods provided by the embodiments of the present application. The specific working processes of the devices and modules described above may refer to the corresponding processes of the method in the embodiments of the present application, and are not described herein again.
In the embodiments provided in this application, the coupling, direct coupling, or communication connection between the modules shown or discussed may be an indirect coupling or communication connection through some interfaces, devices, or modules, and may be in electrical, mechanical, or other forms; the embodiments of this application are not specifically limited in this respect.
In addition, each functional module in the embodiments of the present application may be integrated into one processing module, or each module may exist alone physically, or two or more modules may be integrated into one module. The integrated module can be realized in a form of hardware, and can also be realized in a form of a functional module of software.
Eleventh embodiment
Referring to fig. 13, a block diagram of an electronic device 1000 according to an embodiment of the present disclosure is shown. The electronic device 1000 may be a personal computer, a tablet computer, a server, an industrial computer, or the like capable of running an application. The electronic device 1000 in the present application may include one or more of the following components: a processor 1010, a memory 1020, and one or more applications, wherein the one or more applications may be stored in the memory 1020 and configured to be executed by the one or more processors 1010, the one or more programs configured to perform a method as described in the aforementioned method embodiments.
Processor 1010 may include one or more processing cores. The processor 1010 connects the various components throughout the electronic device 1000 using various interfaces and lines, and performs the various functions of the electronic device 1000 and processes data by running or executing instructions, programs, code sets, or instruction sets stored in the memory 1020 and invoking data stored in the memory 1020. Optionally, the processor 1010 may be implemented in hardware using at least one of Digital Signal Processing (DSP), Field-Programmable Gate Array (FPGA), and Programmable Logic Array (PLA). The processor 1010 may integrate one or more of a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a modem, and the like, where the CPU mainly handles the operating system, the user interface, application programs, and the like; the GPU is used for rendering and drawing display content; and the modem is used to handle wireless communication. It is understood that the modem may also not be integrated into the processor 1010 but be implemented by a separate communication chip.
The memory 1020 may include a Random Access Memory (RAM) or a Read-Only Memory (ROM). The memory 1020 may be used to store instructions, programs, code, code sets, or instruction sets. The memory 1020 may include a program storage area and a data storage area, where the program storage area may store instructions for implementing an operating system, instructions for implementing at least one function (such as a touch function, a sound playing function, or an image playing function), instructions for implementing the foregoing method embodiments, and the like. The data storage area may store data created by the electronic device 1000 during use (e.g., phone book, audio and video data, chat log data), and the like.
Twelfth embodiment
Referring to fig. 14, a block diagram of a computer-readable storage medium according to an embodiment of the present application is shown. The computer-readable storage medium 1100 has stored therein program code that can be invoked by a processor to perform the methods described in the method embodiments above.
The computer-readable storage medium 1100 may be an electronic memory such as a flash memory, an EEPROM (electrically erasable programmable read only memory), an EPROM, a hard disk, or a ROM. Alternatively, the computer-readable storage medium 1100 includes a non-volatile computer-readable storage medium. The computer readable storage medium 1100 has storage space for program code 1110 for performing any of the method steps of the method described above. The program code can be read from or written to one or more computer program products. The program code 1110 may be compressed, for example, in a suitable form.
To sum up, the virtual human image processing method and apparatus, electronic device, and storage medium provided by the embodiments of the present application acquire a plurality of video frames based on the action parameters of the virtual human; divide the plurality of video frames into a plurality of image groups according to a preset rule; acquire a preset image group hit in the cache image group by an image group among the plurality of image groups as a first image group and transmit the first image group, where the cache image group includes a plurality of preset image groups; and acquire an image group that misses the cache image group as an image group to be encoded, encode the video frames therein to obtain a second image group, and transmit the second image group. Because the plurality of video frames is divided into image groups and matched against the cache with the image group as the unit, the hit rate is greatly improved compared with matching the plurality of video frames as one whole video. Meanwhile, for hit image groups, the terminal device can concatenate the plurality of image groups and play them as a complete video without the video encoder re-encoding them, so the transmission efficiency of the virtual human image is improved.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solutions of the present application, and not to limit the same; although the present application has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not necessarily depart from the spirit and scope of the corresponding technical solutions in the embodiments of the present application.

Claims (18)

1. A virtual human image processing method is characterized by comprising the following steps:
acquiring a plurality of video frames based on the action parameters of the virtual human;
dividing the video frames into a plurality of image groups according to a preset rule;
acquiring a preset image group hit by the image group in the plurality of image groups in the cache image group as a first image group; the cache image group comprises a plurality of preset image groups;
and transmitting the first image group.
2. The method of claim 1, further comprising:
acquiring an image group which misses the cache image group from the plurality of image groups as an image group to be encoded, and encoding a video frame in the image group to be encoded to obtain a second image group;
and transmitting the second image group.
3. The method of claim 2, wherein said transmitting the first group of images comprises:
and if the playing sequence of the first image group is behind the playing sequence of the second image group, transmitting the first image group after the transmission of the second image group is finished.
4. The method of claim 2, wherein transmitting the second group of images comprises:
and transmitting the second image group through an RTP protocol.
5. The method according to claim 1, wherein the dividing the plurality of video frames into a plurality of groups of pictures according to a preset rule comprises:
acquiring action parameters corresponding to each video frame in a plurality of video frames;
if the action parameter corresponding to the video frame is an effective action parameter, taking the video frame as a tangent point video frame;
dividing the plurality of video frames into a plurality of image groups based on the tangent point video frames.
6. The method according to claim 5, wherein before the video frame is taken as the tangent point video frame if the motion parameter corresponding to the video frame is the valid motion parameter, further comprising:
acquiring a standard image corresponding to the action parameter;
acquiring the probability of generating the standard image based on the action parameters;
and if the probability is not lower than the probability threshold, determining the action parameter as an effective action parameter.
7. The method according to claim 1, wherein the obtaining a preset image group hit by an image group in the plurality of image groups in the cached image group as the first image group comprises:
acquiring a first action parameter corresponding to each image group in the plurality of image groups and a second action parameter corresponding to each preset image group in the plurality of preset image groups;
and if the second motion parameters include target second motion parameters matched with the first motion parameters, determining a preset image group corresponding to the target second motion parameters as a preset image group hit by an image group in the plurality of image groups.
8. The method according to claim 7, wherein before determining a preset image group corresponding to the target second motion parameter as a preset image group hit by an image group of the plurality of image groups if the target second motion parameter matching the first motion parameter exists in the second motion parameters, the method further comprises:
and if a target second action parameter with the similarity larger than a similarity threshold exists in the second action parameters, determining that a target second action parameter matched with the first action parameter exists in the second action parameters.
9. The method according to claim 1, wherein the obtaining a preset image group hit by an image group in the plurality of image groups in the cached image group as the first image group comprises:
acquiring a preset image group hit by the image group in the plurality of image groups in the cache image group as an initial image group;
acquiring a third action parameter corresponding to a first frame picture in the initial image group and a fourth action parameter corresponding to a last frame picture in an image group with a transmission sequence before the initial image group;
and if the third motion parameter is matched with the fourth motion parameter, determining the initial image group as the first image group.
10. The method according to claim 1, wherein the number of the video frames in the first image group does not exceed a specified number, wherein each video frame in the first image group corresponds to a playing time, and a time interval between the playing times corresponding to two adjacent video frames respectively does not exceed a specified time interval.
11. The method of claim 1, wherein transmitting the first group of images comprises:
and transmitting the first image group through a TCP protocol.
12. The method according to any one of claims 1 to 11, wherein the obtaining a plurality of video frames based on the action parameters of the avatar comprises:
acquiring action parameters of the virtual human;
determining whether a preset video frame corresponding to the action parameter exists in a cached video frame according to the action parameter, wherein the cached video frame comprises a plurality of preset video frames;
if the action parameter exists, acquiring a preset video frame corresponding to the action parameter as a first video frame;
if the motion parameters do not exist, inputting the motion parameters into a machine learning model trained in advance, and acquiring a second video frame output by the machine learning model;
obtaining the plurality of video frames based on the first video frame and the second video frame.
13. The method of claim 12, wherein deriving the plurality of video frames based on the first video frame and the second video frame comprises:
acquiring a specified number of video frames to be transmitted in the transmission sequence before the first video frame;
acquiring pixel similarity between the first video frame and each video frame to be transmitted in the specified number of video frames to be transmitted;
if the pixel similarity exceeds a pixel similarity threshold, obtaining the plurality of video frames based on the first video frame and the second video frame;
and if the pixel similarity does not exceed a pixel similarity threshold, performing the step of inputting the action parameters into a pre-trained machine learning model and acquiring a second video frame output by the machine learning model.
14. The method of claim 12, wherein the obtaining of the action parameters of the avatar comprises:
acquiring input information acting on the virtual human, wherein the input information comprises one or more combinations of voice information, text information and image information;
and inputting the input information into a pre-trained virtual human action model, and acquiring action parameters output by the virtual human action model.
15. A virtual human image processing method is applied to an interactive system, the interactive system comprises a server and a terminal device, and the method comprises the following steps:
the server acquires a plurality of video frames based on the action parameters of the virtual human and divides the video frames into a plurality of image groups according to a preset rule;
the server acquires a preset image group hit by an image group in the plurality of image groups in a cache image group as a first image group and transmits the first image group to the terminal equipment, wherein the cache image group comprises the plurality of preset image groups;
the server acquires an image group which misses the cache image group from the plurality of image groups and takes the image group as an image group to be encoded, encodes a video frame in the image group to be encoded to obtain a second image group, and transmits the second image group to the terminal equipment;
and the terminal equipment plays the first image group and the second image group according to the playing sequence of the first image group and the second image group.
16. A virtual human image processing apparatus, comprising:
the video frame acquisition module is used for acquiring a plurality of video frames based on the action parameters of the virtual human;
the image group dividing module is used for dividing the video frames into a plurality of image groups according to a preset rule;
the first image group processing module is used for acquiring a preset image group hit by an image group in the plurality of image groups in a cache image group as a first image group, wherein the cache image group comprises a plurality of preset image groups;
and the first transmission module is used for transmitting the first image group.
17. An electronic device, comprising:
a memory;
one or more processors coupled with the memory;
one or more applications, wherein the one or more applications are stored in the memory and configured to be executed by the one or more processors, the one or more applications configured to perform the method of any of claims 1-15.
18. A computer-readable storage medium having program code stored therein, the program code being invoked by a processor to perform the method of any one of claims 1 to 15.
CN202110661903.4A 2021-06-15 2021-06-15 Virtual human image processing method and device, electronic equipment and storage medium Active CN113259707B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110661903.4A CN113259707B (en) 2021-06-15 2021-06-15 Virtual human image processing method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113259707A true CN113259707A (en) 2021-08-13
CN113259707B CN113259707B (en) 2021-11-02

Family

ID=77187955

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110661903.4A Active CN113259707B (en) 2021-06-15 2021-06-15 Virtual human image processing method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113259707B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7554542B1 (en) * 1999-11-16 2009-06-30 Possible Worlds, Inc. Image manipulation method and system
CN110245638A (en) * 2019-06-20 2019-09-17 北京百度网讯科技有限公司 Video generation method and device
CN110826441A (en) * 2019-10-25 2020-02-21 深圳追一科技有限公司 Interaction method, interaction device, terminal equipment and storage medium
CN112101196A (en) * 2020-09-14 2020-12-18 北京百度网讯科技有限公司 Data processing method, device, equipment and storage medium
US20210158565A1 (en) * 2019-11-22 2021-05-27 Adobe Inc. Pose selection and animation of characters using video data and training techniques

Also Published As

Publication number Publication date
CN113259707B (en) 2021-11-02

Similar Documents

Publication Publication Date Title
CN110889381B (en) Face changing method and device, electronic equipment and storage medium
CN111479112B (en) Video coding method, device, equipment and storage medium
CN110390704B (en) Image processing method, image processing device, terminal equipment and storage medium
CN112543342B (en) Virtual video live broadcast processing method and device, storage medium and electronic equipment
CN110969682B (en) Virtual image switching method and device, electronic equipment and storage medium
CN112333179B (en) Live broadcast method, device and equipment of virtual video and readable storage medium
CN111432267B (en) Video adjusting method and device, electronic equipment and storage medium
CN106937154A (en) Process the method and device of virtual image
CN109472764B (en) Method, apparatus, device and medium for image synthesis and image synthesis model training
CN110969572B (en) Face changing model training method, face exchange device and electronic equipment
US20240212252A1 (en) Method and apparatus for training video generation model, storage medium, and computer device
CN110942501B (en) Virtual image switching method and device, electronic equipment and storage medium
CN111147880A (en) Interaction method, device and system for live video, electronic equipment and storage medium
CN111050023A (en) Video detection method and device, terminal equipment and storage medium
CN113542875B (en) Video processing method, device, electronic equipment and storage medium
CN112750186A (en) Virtual image switching method and device, electronic equipment and storage medium
CN110536095A (en) Call method, device, terminal and storage medium
CN109413152B (en) Image processing method, image processing device, storage medium and electronic equipment
CN113987269A (en) Digital human video generation method and device, electronic equipment and storage medium
CN116033189B (en) Live broadcast interactive video partition intelligent control method and system based on cloud edge cooperation
EP4167188A1 (en) Edge data network for providing 3d character image to terminal and operation method therefor
CN113259707B (en) Virtual human image processing method and device, electronic equipment and storage medium
US11734952B1 (en) Facial image data generation using partial frame data and landmark data
CN113763232A (en) Image processing method, device, equipment and computer readable storage medium
CN114461772A (en) Digital human interaction system, method and device thereof, and computer readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant