CN111614925B - Figure image processing method and device, corresponding terminal and storage medium - Google Patents


Info

Publication number
CN111614925B
CN111614925B (application CN202010432534.7A)
Authority
CN
China
Prior art keywords
cartoon
current
image
speaker
information
Prior art date
Legal status
Active
Application number
CN202010432534.7A
Other languages
Chinese (zh)
Other versions
CN111614925A (en)
Inventor
谢新林 (Xie Xinlin)
邹超洋 (Zou Chaoyang)
张书瑞 (Zhang Shurui)
何卓鹏 (He Zhuopeng)
Current Assignee
Guangzhou Shiyuan Electronics Thecnology Co Ltd
Original Assignee
Guangzhou Shiyuan Electronics Thecnology Co Ltd
Priority date
Filing date
Publication date
Application filed by Guangzhou Shiyuan Electronics Thecnology Co Ltd filed Critical Guangzhou Shiyuan Electronics Thecnology Co Ltd
Priority to CN202010432534.7A
Publication of CN111614925A
Application granted
Publication of CN111614925B
Legal status: Active

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N7/00: Television systems
    • H04N7/14: Systems for two-way working
    • H04N7/141: Systems for two-way working between two video terminals, e.g. videophone
    • H04N7/142: Constructional details of the terminal equipment, e.g. arrangements of the camera and the display
    • H04N7/144: Constructional details of the terminal equipment; camera and display on the same optical axis, e.g. optically multiplexing the camera and display for eye-to-eye contact
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10: Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N7/00: Television systems
    • H04N7/14: Systems for two-way working
    • H04N7/15: Conference systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Image Processing (AREA)

Abstract

The application discloses a person image processing method and apparatus, a corresponding terminal, and a storage medium. The method crops a current person image frame of a speaker from the current live-action image frame captured by a camera; determines, from the current person image frame together with a predetermined reference person image and a reference cartoon image, current state deformation information for displaying the speaker as a cartoon; and feeds the current state deformation information back to the listener terminal, so that the listener terminal determines and displays the current cartoon image frame corresponding to the speaker according to the current state deformation information. With this method, the speaker terminal supplies the state deformation information needed to display the speaker as a cartoon on the listener terminal side. The cartoon display reduces the risk of leaking the speaker's identity information while making the video content livelier; and compared with existing high-dimensional image transmission, only the state deformation information is transmitted, which effectively preserves the image transmission quality of the cloud video when network bandwidth is low.

Description

Figure image processing method and device, corresponding terminal and storage medium
Technical Field
The present application relates to the field of image processing technologies, and in particular, to a method and an apparatus for processing a human image, a corresponding terminal, and a storage medium.
Background
With the development of science and technology, devices such as mobile phones and computers have become essential tools for real-time communication in daily life. With the rollout of high-speed 5G networks, cloud video services such as cloud conferences and cloud classrooms have been widely adopted, letting people enjoy cloud network resources without leaving home.
In current cloud video services, the video terminals involved are a speaker terminal acting as the presenting party and listener terminals acting as the listening party, and there may be multiple listener terminals.
However, in cloud video services such as a cloud classroom or a cloud conference, the speaker (a lecturing teacher or a conference presenter) may worry about leakage of personal identity information and may not want to present his or her real portrait to the listeners. Meanwhile, long lectures or meetings may bore the listeners, greatly reducing the efficiency of the cloud classroom or cloud conference. In addition, most current cloud video services transmit high-definition video image frames, which places very high demands on network bandwidth, and the image transmission quality of the cloud video suffers when bandwidth is limited.
Disclosure of Invention
In view of this, embodiments of the present application provide a person image processing method and apparatus, a corresponding terminal, and a storage medium, so as to display the person image of a speaker in a cloud video service as a cartoon on the listener terminal side.
In a first aspect, an embodiment of the present application provides a person image processing method, applied to a speaker terminal, including:
cropping a current person image frame of a speaker from a current live-action image frame captured by a camera;
determining, from the current person image frame in combination with a predetermined reference person image and a reference cartoon image, current state deformation information for displaying the speaker as a cartoon;
and feeding the current state deformation information back to a listener terminal, so that the listener terminal determines and displays a current cartoon image frame corresponding to the speaker according to the current state deformation information, wherein the listener terminal is in cloud video communication with the speaker terminal.
Further, the determining, from the current person image frame in combination with a predetermined reference person image and a reference cartoon image, current state deformation information for displaying the speaker as a cartoon includes:
detecting key points in the current person image frame to form a corresponding current key point sequence;
and determining, according to the current key point sequence in combination with the person image processing information of the reference person image and the cartoon image processing information of the reference cartoon image, the current state deformation information for displaying the speaker as a cartoon.
Further, the detecting key points in the current person image frame to form a corresponding current key point sequence includes:
performing key point detection on the current person image frame to form a first collected key point sequence containing a first set number of face key points and a second set number of identification key points;
and performing key point smoothing on the first collected key point sequence to obtain a first current key point sequence of the current person image frame.
Further, the person image processing information includes a first person subdivision array formed by triangulating the reference person image on a third set number of key points, and the cartoon image processing information includes a first cartoon subdivision array formed by triangulating the reference cartoon image on the third set number of key points, where the third set number is the sum of the first set number and the second set number;
correspondingly, the determining, according to the current key point sequence in combination with the predetermined person image processing information of the reference person image and the cartoon image processing information of the reference cartoon image, the current state deformation information for displaying the speaker as a cartoon includes:
triangulating the first current key point sequence using a set triangulation rule to obtain a first current subdivision array of the current person image frame;
and determining, from the first current subdivision array in combination with the first person subdivision array and the first cartoon subdivision array, a first target subdivision array required for displaying the speaker as a cartoon through affine transformation, and using the first target subdivision array as the current state deformation information of the speaker.
Further, the detecting key points in the current person image frame to form a corresponding current key point sequence includes:
performing key point detection on the current person image frame to form a second collected key point sequence containing the first set number of face key points;
and performing key point smoothing on the second collected key point sequence to obtain a second current key point sequence of the current person image frame.
Further, the person image processing information includes a second person subdivision array formed by triangulating the reference person image on the first set number of key points, and the cartoon image processing information includes a second cartoon subdivision array formed by triangulating the reference cartoon image on the first set number of key points;
correspondingly, the determining, according to the current key point sequence in combination with the predetermined person image processing information of the reference person image and the cartoon image processing information of the reference cartoon image, the current state deformation information for displaying the speaker as a cartoon includes:
determining current pose estimation information of the speaker according to the second current key point sequence;
triangulating the second current key point sequence using a set triangulation rule to obtain a second current subdivision array of the current person image frame;
determining, from the second current subdivision array in combination with the second person subdivision array and the second cartoon subdivision array, a second target subdivision array required for displaying the speaker as a cartoon through affine transformation;
and using the current pose estimation information and the second target subdivision array as the current state deformation information of the speaker.
Further, the reference person image is a front-face image acquired of the speaker in advance through the camera; alternatively, the reference person image is the person image frame preceding the current person image frame.
In a second aspect, an embodiment of the present application provides a person image processing method, applied to a listener terminal, including:
receiving current state deformation information fed back by a speaker terminal, and acquiring image attribute information of a predetermined reference cartoon image, where the current state deformation information is determined by the method of the first aspect of the embodiments of the present application;
and forming, according to the current state deformation information and the image attribute information, a current cartoon image frame of the speaker corresponding to the speaker terminal, and displaying the current cartoon image frame in real time.
Further, when the current state deformation information is a first target subdivision array required for displaying the speaker as a cartoon, the image attribute information includes cartoon image texture information and a first cartoon subdivision array formed by triangulating the reference cartoon image on a fourth set number of key points;
correspondingly, the forming, according to the current state deformation information and the image attribute information, a current cartoon image frame of the speaker corresponding to the speaker terminal and displaying it in real time includes:
determining, from the first target subdivision array and the first cartoon subdivision array, a first target affine transformation matrix for each triangular patch composing the reference cartoon image;
and forming, based on the first target affine transformation matrices in combination with the cartoon image texture information, the current cartoon image frame of the speaker, and displaying it in real time.
Further, when the current state deformation information is the current pose estimation information of the speaker and a second target subdivision array required for the cartoon display, the image attribute information includes cartoon image texture information and a second cartoon subdivision array formed by triangulating the reference cartoon image on a fifth set number of key points;
correspondingly, the forming, according to the current state deformation information and the image attribute information, a current cartoon image frame of the speaker corresponding to the speaker terminal and displaying it in real time includes:
determining, from the second target subdivision array and the second cartoon subdivision array, a second target affine transformation matrix for each triangular patch composing the reference cartoon image;
forming, based on the second target affine transformation matrices in combination with the cartoon image texture information, a cartoon expression image frame of the speaker;
and forming the current cartoon image frame of the speaker according to the current pose estimation information and the cartoon expression image frame, and displaying it in real time.
Further, the reference cartoon image is selected and sent in advance by the speaker of the speaker terminal, and the image attribute information of the reference cartoon image is prestored in a set attribute information table.
In a third aspect, an embodiment of the present application provides a person image processing apparatus, including:
a person image acquisition module, configured to crop a current person image frame of a speaker from a current live-action image frame captured by a camera;
an image deformation determination module, configured to determine, from the current person image frame in combination with a predetermined reference person image and a reference cartoon image, current state deformation information for displaying the speaker as a cartoon;
and a deformation information transmission module, configured to feed the current state deformation information back to a listener terminal, so that the listener terminal determines and displays a current cartoon image frame corresponding to the speaker according to the current state deformation information, wherein the listener terminal is in cloud video communication with the speaker terminal.
In a fourth aspect, an embodiment of the present application provides a person image processing apparatus, including:
a deformation information receiving module, configured to receive current state deformation information fed back by a speaker terminal and to acquire image attribute information of a predetermined reference cartoon image, where the current state deformation information is determined by the apparatus of the third aspect of the embodiments of the present application;
and a cartoon image conversion module, configured to form, according to the current state deformation information and the image attribute information, a current cartoon image frame of the speaker corresponding to the speaker terminal, and to display the current cartoon image frame in real time.
In a fifth aspect, an embodiment of the present application provides a speaker terminal, including: a processor and a memory; wherein the memory stores a computer program adapted to be loaded by the processor to perform the method steps of the first aspect of the embodiments of the present application.
In a sixth aspect, an embodiment of the present application provides a listener terminal, including: a processor and a memory; wherein the memory stores a computer program adapted to be loaded by the processor to perform the method steps of the second aspect of the embodiments of the present application.
In a seventh aspect, embodiments of the present application further provide a storage medium containing computer-executable instructions, which when executed by a computer processor, are configured to perform the method steps according to the first aspect and/or the second aspect.
The person image processing method and apparatus, corresponding terminal, and storage medium are mainly applied to a speaker terminal and listener terminals performing video interaction in a cloud video service. On the speaker terminal side, the current person image frame of the speaker is first cropped from the current live-action image frame captured by the camera; then, according to the current person image frame in combination with the predetermined reference person image and reference cartoon image, current state deformation information for displaying the speaker as a cartoon is determined; finally, the current state deformation information is fed back to the listener terminal. Correspondingly, on the listener terminal side, the current state deformation information fed back by the speaker terminal is received, and the image attribute information of the predetermined reference cartoon image is acquired; finally, the current cartoon image frame of the speaker corresponding to the speaker terminal side is formed according to the current state deformation information and the image attribute information, and displayed in real time. The technical solution provided by the embodiments of the present application effectively realizes the cartoon display of the speaker on the listener terminal side. Compared with directly transmitting the speaker's person image frames to the listener terminal in real time, the cartoonized person image reduces the risk of leaking the speaker's identity information, and the cartoon display of the speaker in cloud classroom teaching or cloud conferences also makes the video content livelier. Furthermore, the speaker terminal only needs to feed back state deformation information to the listener terminal to realize the cartoon display of the speaker, without transmitting high-dimensional images, which effectively preserves the image transmission quality of the cloud video when network bandwidth is low.
Drawings
Other features, objects and advantages of the present application will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings in which:
FIG. 1 is a schematic flowchart of a person image processing method according to embodiment one of the present application;
FIG. 2 is a schematic flowchart of a person image processing method according to embodiment two of the present application;
FIG. 3 is a diagram of an implementation example of determining the current state deformation information in the person image processing method according to embodiment two;
FIG. 3a is a triangular mesh diagram formed by prior-art triangulation of the face key points identifying the mouth;
FIG. 3b is a triangular mesh diagram formed by triangulation of the face key points identifying the mouth according to the present embodiment;
FIGS. 3c to 3e are example diagrams of the reference person image, the current person image frame, and the reference cartoon image, respectively, in the person image processing method according to embodiment two;
FIGS. 3f to 3h are mesh display diagrams of the triangulations formed for the reference person image, the current person image frame, and the reference cartoon image, respectively;
FIG. 3i is an example diagram of polynomial curve fitting in the person image processing method provided in the present embodiment;
FIG. 3j shows a cartoon image frame formed from the key points without polynomial curve fitting;
FIG. 3k shows a cartoon image frame formed with polynomial curve fitting;
FIG. 4 is a diagram of another implementation example of determining the current state deformation information in the person image processing method according to embodiment two;
FIG. 5 is a schematic flowchart of a person image processing method according to embodiment three of the present application;
FIG. 6 is a flowchart of one implementation of the cartoon display of a person in the person image processing method according to embodiment two of the present application;
FIGS. 6a to 6d are diagrams showing how the current cartoon image frame is obtained from the reference cartoon image in the person image processing method according to the present embodiment;
FIG. 7 is a flowchart of another implementation of the cartoon display of a person in the person image processing method according to embodiment two of the present application;
FIG. 7a shows an overall effect diagram after executing the person image processing method provided by the present application;
FIG. 7b shows an effect diagram of the cartoon display of the speaker based on current state deformation information formed by separating actions and expressions;
FIG. 8 is a block diagram of a person image processing apparatus according to embodiment four of the present application;
FIG. 9 is a block diagram of a person image processing apparatus according to embodiment five of the present application;
FIG. 10 is a schematic diagram of the hardware structure of a speaker terminal according to embodiment six of the present application;
FIG. 11 is a schematic diagram of the hardware structure of a listener terminal according to embodiment seven of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings. It should be understood that the embodiments described are only a few embodiments of the present application, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present application. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the application, as detailed in the appended claims.
In the description of the present application, it is to be understood that the terms "first," "second," "third," and the like are used solely for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order, nor should be construed to indicate or imply relative importance. The specific meaning of the above terms in the present application can be understood by those of ordinary skill in the art as appropriate. Further, in the description of the present application, "a plurality" means two or more unless otherwise specified. "and/or" describes the association relationship of the associated objects, meaning that there may be three relationships, e.g., a and/or B, which may mean: a exists alone, A and B exist simultaneously, and B exists alone. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship.
Example one
Fig. 1 is a schematic flowchart of a person image processing method provided in embodiment one of the present application. The method is suitable for realizing, on the listener terminal side, the cartoon display of the speaker on the speaker terminal side in a cloud video service. The method may be executed by a computer device supporting cloud video services; this computer device is specifically regarded as the speaker terminal performing cloud video communication in the cloud video service, and the speaker terminal may be composed of two or more physical entities or of a single physical entity. Generally, the speaker terminal may be a notebook computer, a desktop computer, a smart tablet, a smart mobile terminal, or the like.
As shown in fig. 1, the person image processing method provided in embodiment one specifically includes the following operations.
It should be noted that, in the application scenario of the method provided in this embodiment, the speaker performs cloud video communication with the listener on the listener terminal side through the speaker terminal. When the speaker or the listener wants the speaker's person image to be displayed as a cartoon on the listener terminal side, the speaker terminal side may adopt the method provided in this embodiment to process the person image corresponding to the speaker.
S101, cropping a current person image frame of the speaker from a current live-action image frame captured by a camera.
In this embodiment, the camera may be understood as an image capture device disposed on the speaker terminal (the execution subject), and the current live-action image frame may be understood as the live-action image frame captured by the camera at the current moment during cloud video communication between the speaker terminal and the listener terminal; the current live-action image frame contains the face image information of the speaker.
In this step, the captured live-action image frame may be used as input data to a face detection neural network model, so that the position of the speaker's face region in the current live-action image frame is located through the processing of the model, and the current person image frame of the speaker is finally cropped out based on this position information.
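As an illustrative sketch only (the patent names no specific detector), this cropping step could look like the following Python code, here using OpenCV's bundled Haar cascade as a stand-in for the face detection neural network model mentioned above:

```python
import cv2

# Illustrative substitute for the face detection neural network model:
# OpenCV's bundled Haar cascade frontal-face detector.
detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def crop_person_frame(live_frame):
    """Crop the speaker's person image frame from a live-action frame."""
    gray = cv2.cvtColor(live_frame, cv2.COLOR_BGR2GRAY)
    faces = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None  # no face located in this frame
    x, y, w, h = faces[0]  # assume the first detection is the speaker
    return live_frame[y:y + h, x:x + w]
```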
S102, determining, from the current person image frame in combination with the predetermined reference person image and reference cartoon image, current state deformation information for displaying the speaker as a cartoon.
In this embodiment, the reference person image may be understood as an image containing face information of the speaker under a set condition; it is mainly used for comparison against the face information of the speaker contained in the current person image frame, to determine how the speaker's expression and motion have changed at the current moment. The reference cartoon image may be understood as an image containing the face information of a preselected cartoon character; the selected cartoon character is the cartoon figure to be displayed for the speaker on the listener terminal side. The face information contained in the reference cartoon image is equivalent to the original expression and motion information of the cartoon character, and the image serves as the carrier through which the various expressions and states exhibited by the speaker during video communication are presented.
In this embodiment, the current state deformation information may be understood as the information required to display the speaker as a cartoon figure on the listener terminal side. It specifically captures how the speaker's expression and motion at the current moment have changed relative to the face information in the reference person image, and it is suitable for the listener terminal to deform the reference cartoon image to form the current cartoon image frame corresponding to the current person image frame.
For example, the current state deformation information may be a triangulation array in which each element is represented by the set of vertices of a triangle; the vertex coordinates of each triangle in this array can be obtained by affine transformation processing of the key point sequence representing the expression and action pose in the current person image frame and the key point sequence representing the expression and action pose in the reference person image. The elements of the triangulation array thus represent the change of the speaker's current expression and motion relative to those in the reference person image.
In addition, the current state deformation information may also consist of a triangulation array together with pose estimation information. In that case the triangulation array is still represented by sets of triangle vertices, but it is only used to represent the change of the speaker's current expression relative to the expression in the reference person image; the change of the current motion relative to the motion in the reference person image is represented by the pose estimation information, which may include parameters describing the rigid motion of the speaker's head, for example three Euler angles.
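As a hedged illustration of the Euler-angle estimation mentioned above (the patent specifies no algorithm), a common approach is a PnP solve against a generic 3D face model; the model points, key point ordering, and camera intrinsics below are assumptions of this sketch:

```python
import cv2
import numpy as np

def estimate_head_pose(image_points, frame_h, frame_w):
    """Rough pitch/yaw/roll estimate from six 2D face key points (sketch).

    image_points: (6, 2) float array, assumed ordered as nose tip, chin,
    left/right eye corners, left/right mouth corners.
    """
    # Generic 3D face model points; these coordinates are illustrative.
    model_points = np.array([
        (0.0, 0.0, 0.0),          # nose tip
        (0.0, -330.0, -65.0),     # chin
        (-225.0, 170.0, -135.0), (225.0, 170.0, -135.0),    # eye corners
        (-150.0, -150.0, -125.0), (150.0, -150.0, -125.0),  # mouth corners
    ])
    # Approximate pinhole intrinsics from the frame size (assumption).
    camera = np.array([[frame_w, 0, frame_w / 2],
                       [0, frame_w, frame_h / 2],
                       [0, 0, 1]], dtype=np.float64)
    _, rvec, _ = cv2.solvePnP(model_points, image_points, camera, None)
    rotation, _ = cv2.Rodrigues(rvec)
    euler_angles = cv2.RQDecomp3x3(rotation)[0]  # (pitch, yaw, roll) degrees
    return euler_angles
```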
It follows that the purpose of this step is to transfer the facial expression and motion of the speaker onto the reference cartoon image, thereby realizing the display of a current cartoon image that uses the reference cartoon image as its carrier; determining the current state deformation information is the key to this step.
Specifically, the process of determining the current state deformation information in this step can be summarized as follows. First, key points that identify the face image information and face pose information in the current person image frame are detected by some method; likewise, the key points identifying the corresponding face information in the predetermined reference person image and those identifying the corresponding cartoon face information in the reference cartoon image must also be obtained. Then, a region subdivision of each image (such as a triangulation) can be produced from that image's key points, yielding subdivision region information (representable as a triangulation array) for each image. Next, by transformation processing between the subdivision region information of the current person image frame and that of the reference person image, the transformation information carrying the person's expression and motion in the reference person image over to those in the current person image frame is determined. Finally, from this transformation information together with the subdivision region information of the reference cartoon image, all the region subdivision information required to form the current cartoon image frame can be derived in reverse, and this information can be taken as part or all of the current state deformation information.
It should be noted that, when the derived region subdivision information is only part of the current state deformation information, a pose estimation operation further needs to be performed on the face key points of the speaker in the image frame, so as to obtain pose estimation information (which may include parameters such as pitch, roll, and yaw) as the other part of the current state deformation information.
Further, the reference person image is a front-face image acquired of the speaker in advance through the camera; alternatively, the reference person image is the person image frame preceding the current person image frame.
In this embodiment, a front-face image of the speaker captured by the camera after entering the cloud video communication may be used as the reference person image. For example, after logging into the cloud video application through the speaker terminal, the speaker may pose according to a prompt in the application so that the camera captures one front-face image as the reference person image.
When the speaker's front-face image is used as the reference person image, the speaker terminal side may perform key point detection on the reference person image in advance, subdivide it into regions (triangulate it) based on the detected key points, and store the vertex coordinate information of each subdivision region as the subdivision region information.
In this embodiment, during the cloud video communication stage, the previous person image frame (from the moment before the current one) may instead be used as the reference person image; in that case, the subdivision region information obtained when that previous person image frame was subdivided can be used as the subdivision region information of the reference person image.
It should be noted that, in this embodiment, the reference cartoon image may come from a cartoon image package. At the design stage of the cloud video application product, designers may create several cartoon character images and perform the corresponding region subdivision on each of them to obtain its subdivision region information; each cartoon image and its subdivision region information can then be packaged into a cartoon image package integrated into the cloud video application product, so that they are prestored on every terminal (speaker terminal or listener terminal) on which the product is installed.
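As a purely illustrative picture of such a prestored package (all names and paths are hypothetical, not from the patent):

```python
# Hypothetical shape of the prestored cartoon image package: each designed
# cartoon character carries its texture plus the precomputed subdivision
# (triangulation) region information.
cartoon_image_package = {
    "cartoon_01": {
        "texture": "assets/cartoon_01.png",          # cartoon image texture
        "subdivision": "assets/cartoon_01_tri.npy",  # vertex coords per patch
    },
    # ... one entry per designed cartoon character image
}
```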
In practical application, after the speaker enters the cloud video application through the speaker terminal, an initial reference cartoon image can be selected from the cartoon image package. Alternatively, after a listener enters the cloud video application through a listener terminal, the listener can select the initial reference cartoon image from the package and feed the identification information of that image back to the speaker terminal, so that the speaker terminal knows the relevant information of the reference cartoon image (which cartoon image it is, its subdivision region information, and so on).
The selection of subsequent reference cartoon images follows the selection of the reference person image. If the speaker's front-face image is always used as the reference person image, the initially selected cartoon image is correspondingly always used as the reference cartoon image; if the person image frame preceding the current one is used as the reference person image, then the previous cartoon image frame corresponding to that previous person image frame must correspondingly be used as the reference cartoon image.
S103, feeding the current state deformation information back to a listener terminal, so that the listener terminal determines and displays a current cartoon image frame corresponding to the speaker according to the current state deformation information.
In this embodiment, the speaker terminal (the execution subject of this embodiment) maintains cloud video communication with the listener terminal, and in this step the determined current state deformation information may be fed back to the listener terminal.
It should be noted that there may be multiple listener terminals in the cloud video communication, in which case different listeners might select different cartoon images as reference cartoon images; person image processing on the speaker terminal side would then consume a large amount of computing resources. To reduce this resource consumption, the reference cartoon image in this embodiment is preferably determined by the speaker terminal.
In the person image processing method provided by this embodiment of the application, the speaker terminal side provides the state deformation information for the cartoon display of the speaker on the listener terminal side. The cartoon display of the speaker on the listener terminal side reduces the risk of leaking the speaker's identity information while making the video content livelier; and compared with existing high-dimensional image transmission, this solution transmits only the state deformation information, effectively preserving the image transmission quality of the cloud video when network bandwidth is low.
Example two
Fig. 2 is a schematic flowchart of a person image processing method provided in embodiment two of the present application. This embodiment is optimized on the basis of the above embodiment: the determination, from the current person image frame in combination with a predetermined reference person image and reference cartoon image, of current state deformation information for displaying the speaker as a cartoon is embodied as: detecting key points in the current person image frame to form a corresponding current key point sequence; and determining, according to the current key point sequence in combination with the person image processing information of the reference person image and the cartoon image processing information of the reference cartoon image, the current state deformation information for displaying the speaker as a cartoon.
As shown in fig. 2, the person image processing method provided in embodiment two specifically includes the following operations.
S201, cropping the current person image frame of the speaker from the current live-action image frame captured by the camera.
S202, detecting key points in the current person image frame to form a corresponding current key point sequence.
In this embodiment, in order to determine the current state deformation information corresponding to the current person image frame, key point detection may be performed on the current person image frame, and a corresponding current key point sequence formed from the detected key points. Besides its coordinate position in the image, each detected key point carries semantic information, which indicates what the key point actually represents on the facial features or contour; for example, the semantic information of the x-th key point in a key point sequence might be "position of the chin".
In this embodiment, the number of key points to detect in this step depends on how the current state deformation information is to be determined. If the current state deformation information is determined directly and only from the subdivision regions formed by the key points, the detected key points must include not only the face key points identifying the facial features and face contour but also the identification key points identifying the face position state. If the subdivision regions formed by the key points together with person pose information determined from the key points are to serve as the current state deformation information, the detected key points only need to include the face key points identifying the facial features and contour.
Further, the detecting of key points in the current person image frame to form a corresponding current key point sequence may be refined as: performing key point detection on the current person image frame to form a first collected key point sequence containing a first set number of face key points and a second set number of identification key points; and performing key point smoothing on the first collected key point sequence to obtain a first current key point sequence of the current person image frame.
Specifically, compared with the operation in S202 above, this preferred operation first uses a key point detection algorithm to detect a first set number of face key points in the current person image frame; the first set number is preferably 68. These face key points represent the facial features and contour information of a person (e.g., the speaker) and can be represented by an array containing the coordinates of each key point, e.g., (x1, y1, x2, y2, ..., x68, y68). Each face key point so determined has a fixed serial number; for example, the key point with serial number 9 represents the person's chin, and the key point with serial number 31 represents the tip of the person's nose.
Meanwhile, a second set number of identification key points may also be detected during the key point detection; the second set number is preferably 6, and the 6 identification key points may be the 4 vertices of the current person image frame (upper left, upper right, lower left, and lower right) and 2 coordinate points on the person's cheeks.
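For illustration only, the first collected key point sequence (68 face key points plus 6 identification key points) might be assembled as follows; dlib's 68-landmark predictor, the model file path, and the cheek indices are assumptions, since the patent names no detector:

```python
import dlib
import numpy as np

face_detector = dlib.get_frontal_face_detector()
# The landmark model file is an assumption; any 68-point predictor fits.
landmark_predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def collect_first_keypoint_sequence(person_frame):
    """Return the 74-point first collected key point sequence (68 + 6)."""
    h, w = person_frame.shape[:2]
    face = face_detector(person_frame, 1)[0]
    shape = landmark_predictor(person_frame, face)
    pts = [(p.x, p.y) for p in shape.parts()]                # 68 face key points
    pts += [(0, 0), (w - 1, 0), (0, h - 1), (w - 1, h - 1)]  # 4 frame vertices
    pts += [pts[2], pts[14]]  # 2 cheek points (jaw-contour indices, assumed)
    return np.array(pts, dtype=np.float64)
```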
Then, the key points in the first collected key point sequence are smoothed to avoid jitter when the cartoon portrait is displayed. The reason for smoothing is as follows: the current person image frame is only the single image frame acquired at the current moment of the cloud video communication, and considering the correlation between consecutive image frames, the coordinate positions of key points carrying the same semantic information often differ noticeably between adjacent frames. This difference affects the subsequent cartoon deformation computation, causing jitter in the cartoon figure animation displayed on the listener terminal side.
To solve this problem, this operation additionally takes in the key point sequence of the person image frame preceding the current one, denoted the first previous key point sequence, which contains the first set number of face key points and the second set number of identification key points of that previous frame. In this embodiment, the first previous key point sequence and the newly determined first collected key point sequence are smoothed by Kalman filtering, so that the key point sequences of successive image frames across the whole video sequence of the cloud video communication remain stable and smooth. In addition, this embodiment applies curve smoothing to the smoothed key points once more using a principal component analysis algorithm, to prevent some detected key points from being isolated lattice points; the key point sequence after these smoothing steps is determined as the first current key point sequence.
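A minimal sketch of the per-coordinate Kalman smoothing described above, assuming a constant-position model with hand-picked noise parameters (the patent specifies neither, and the subsequent PCA curve smoothing is omitted here):

```python
import numpy as np

def kalman_smooth_step(prev_state, prev_var, measurement,
                       process_var=1e-3, measurement_var=1e-2):
    """One Kalman predict/update per key point coordinate."""
    # Predict: the state carries over; uncertainty grows by process noise.
    pred_state, pred_var = prev_state, prev_var + process_var
    # Update: blend prediction and detection according to the Kalman gain.
    gain = pred_var / (pred_var + measurement_var)
    state = pred_state + gain * (measurement - pred_state)
    return state, (1.0 - gain) * pred_var

# Usage: smooth the current frame's detected 74-point sequence against the
# previous frame's smoothed sequence (all arrays have shape (74, 2)).
prev_sequence, prev_var = np.zeros((74, 2)), np.ones((74, 2))
detected_sequence = np.random.rand(74, 2) * 100  # stand-in detections
smoothed, var = kalman_smooth_step(prev_sequence, prev_var, detected_sequence)
```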
S203, determining, according to the current key point sequence in combination with the predetermined person image processing information of the reference person image and the cartoon image processing information of the reference cartoon image, the current state deformation information for displaying the speaker as a cartoon.
This step amounts to determining the current state deformation information by combining the current key point sequence with the predetermined person image processing information and cartoon image processing information, where the person image processing information contains a person subdivision array formed by triangulating the corresponding key point sequence, and the cartoon image processing information likewise contains a cartoon subdivision array formed by triangulating the corresponding key point sequence.
In the concrete implementation of this step, the number of key points in the current key point sequence must be considered. If the current key point sequence includes not only the face key points but also the identification key points, then the person subdivision array and the cartoon subdivision array are each formed from key point sequences with the same number of key points as the current key point sequence. A corresponding subdivision array is then obtained by triangulating the current key point sequence. The elements of the person subdivision array and the cartoon subdivision array are the three vertex coordinates of the triangles formed on their respective key point sequences, and the array index of each element can be regarded as the triangle serial number of the corresponding triangle. Finally, based on the subdivision arrays of the several images, a target subdivision array required for displaying the speaker as a cartoon can be determined through affine transformation and used as the current state deformation information.
If instead the current key point sequence includes only face key points, the person subdivision array and the cartoon subdivision array are each formed by triangulating only the key point sequence composed of the corresponding face key points. The current pose information of the speaker is then determined from the current key point sequence, and the subdivision array formed by triangulating the current key point sequence containing only face key points yields the target subdivision array required for displaying the speaker as a cartoon; finally, the obtained target subdivision array and the current pose information together serve as the current state deformation information.
On the basis of the above, the person image processing information is further embodied as including a first person subdivision array formed by triangulating the reference person image on a third set number of key points, and the cartoon image processing information as including a first cartoon subdivision array formed by triangulating the reference cartoon image on the third set number of key points, where the third set number is the sum of the first set number and the second set number.
It should be noted that determining the reference cartoon image and the subdivision arrays of the reference images is not repeated in the cartoon display processing of every current person image frame; these can be obtained in a preprocessing step before the frames of the video are processed. Thus the first person subdivision array and the first cartoon subdivision array are obtained in advance by triangulating the first person key point sequence and the first cartoon key point sequence respectively, where both sequences contain the third set number of key points, i.e., the first set number of face key points plus the second set number of identification key points.
Correspondingly, fig. 3 shows an implementation example of determining the current state deformation information in the person image processing method provided in embodiment two. As shown in fig. 3, this embodiment embodies as the following steps the determination, according to the current key point sequence in combination with the predetermined person image processing information of the reference person image and the cartoon image processing information of the reference cartoon image, of the current state deformation information for displaying the speaker as a cartoon.
The following steps correspond to a preferred operation of S203, on the premise that the current person image frame corresponds to the first current key point sequence, the reference cartoon image corresponds to the first cartoon subdivision array, and the reference person image corresponds to the first person subdivision array.
S2031, triangulating the first current key point sequence using a set triangulation rule to obtain a first current subdivision array of the current person image frame.
Specifically, the first current key point sequence may be triangulated by the Delaunay triangulation method, so that the current person image frame is subdivided into a number of triangular patches according to the key points. During subdivision, each triangular patch can be numbered and the vertex coordinates of the three key points forming it recorded. In this step, a two-dimensional first current subdivision array is formed from the number of each triangular patch and the coordinates of its three key points: the number of rows of the two-dimensional array equals the number of triangular patches formed, each row represents one triangular patch, and each row stores the coordinate values of the three corresponding key points.
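A minimal sketch of this subdivision under the stated row layout, using a plain Delaunay triangulation (the additional semantic rules described next are omitted):

```python
import numpy as np
from scipy.spatial import Delaunay

def build_subdivision_array(keypoints):
    """Triangulate a (74, 2) key point sequence into a subdivision array.

    Row i of the result holds the six coordinates (three vertices) of the
    triangular patch numbered i, matching the layout described above.
    """
    tri = Delaunay(keypoints)           # plain Delaunay triangulation
    patches = keypoints[tri.simplices]  # shape (n_triangles, 3, 2)
    return patches.reshape(-1, 6)
```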
It should be noted that, in this embodiment, the way the key points forming the facial features are subdivided is considered to have a relatively large influence on the person-to-cartoon deformation result. Therefore, some regularization may be imposed on top of Delaunay triangulation to normalize the subdivision of the facial-feature key points. Such regularization can be a semantic rule on the key points; for example, for the mouth, the rule may require that the key points forming the lip contour form adjacent triangular faces.
For example, fig. 3a shows the triangular mesh formed by prior-art triangulation of the face key points identifying the mouth, and fig. 3b shows the triangular mesh formed by the triangulation of this embodiment. Comparing the two, the triangular mesh in fig. 3a cannot fully depict the shape of the whole lip, especially the two middle lip lines, which may cause large lip distortion in the cartoon figure displayed on the listener terminal side after the cartoon deformation; the triangular mesh in fig. 3b depicts the shape of the whole lip completely, ensuring that the lips of the cartoon figure remain natural after deformation.
Based on this step, the first current subdivision array of the current person image frame is obtained; the predetermined first person subdivision array corresponding to the reference person image and the first cartoon subdivision array corresponding to the reference cartoon image are also available. Figs. 3c to 3e show example diagrams of the reference person image, the current person image frame, and the reference cartoon image, respectively, each with its detected key points marked; the method of embodiment two then triangulates each image based on its key points. Figs. 3f to 3h show the mesh displays of the triangulations formed for the reference person image, the current person image frame, and the reference cartoon image, respectively.
S2032, determining, from the first current subdivision array in combination with the first person subdivision array and the first cartoon subdivision array, a first target subdivision array required for displaying the speaker as a cartoon through affine transformation, and using it as the current state deformation information of the speaker.
The analysis underlying this step is as follows.
For each row of triangular patch information in the first current subdivision array (a row contains a triangle number, which equals the row index, and the coordinate values of the triangle's vertices), the triangles with the same number in the first person subdivision array and the first cartoon subdivision array are taken as the corresponding triangular patches; each subdivision array contains the same number of triangular patches. This step processes one triangular patch at a time, treating every patch in the same way.
This embodiment describes the affine transformation analysis taking the triangular patch numbered i as an example.
First, the triangular patches corresponding to the reference character image and the current character image frame satisfy the following relational expression:

$$\tilde{v}_j = A v_j + d, \qquad j = 1, 2, 3$$

where $v_j$ and $\tilde{v}_j$ respectively represent the three vertex coordinates of the triangular patches numbered $i$ in the reference character image and in the current character image frame, $A$ represents a 2×2 affine transformation matrix, and $d$ is a 2×1 translation vector. After the expressions for the three vertexes are expanded and sorted, the relation can be written in vector form as

$$\tilde{V} = A V$$

where $V = [\,v_2 - v_1 \;\; v_3 - v_1\,]$ and $\tilde{V} = [\,\tilde{v}_2 - \tilde{v}_1 \;\; \tilde{v}_3 - \tilde{v}_1\,]$. For each triangle, an affine transformation matrix can therefore be calculated as

$$A_i = \tilde{V} V^{-1}$$
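The per-triangle computation above reduces to one 2×2 matrix inversion per patch. The following is a minimal NumPy sketch of it; the function name and the sample coordinates are illustrative, not taken from the patent:

```python
import numpy as np

def triangle_affine(src_tri, dst_tri):
    """Estimate the 2x2 matrix A mapping a reference triangle onto its
    deformed counterpart, i.e. A = V_tilde @ inv(V)."""
    v1, v2, v3 = np.asarray(src_tri, dtype=float)
    u1, u2, u3 = np.asarray(dst_tri, dtype=float)
    V = np.column_stack([v2 - v1, v3 - v1])        # reference edge matrix
    V_tilde = np.column_stack([u2 - u1, u3 - u1])  # deformed edge matrix
    return V_tilde @ np.linalg.inv(V)

# one matrix per triangular patch, matched by the shared triangle number
src = [[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]]   # reference character image
dst = [[0.0, 0.0], [12.0, 1.0], [-1.0, 11.0]]  # current character image frame
A_i = triangle_affine(src, dst)
```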
Similarly, on the side of the cartoon image obtained by the deformation, an affine transformation matrix $A_i^{dst}$ likewise exists between each triangular patch in the reference cartoon image and the corresponding triangular patch in the current cartoon image frame to be displayed at the listening and speaking terminal. Requiring this transformation to equal the one estimated from the character images gives

$$\tilde{V}_{dst} V_{dst}^{-1} = \tilde{V} V^{-1}$$

where $V$, $\tilde{V}$ and $V_{dst}$ respectively represent the triangle arrays corresponding to any one triangular patch in the reference character image, the current character image frame and the reference cartoon image. In the above relation, only $\tilde{V}_{dst}$, the triangle array of the corresponding triangular patch in the current cartoon image frame, is an unknown quantity to be solved; it is precisely the object sought when determining the current state deformation information in this step.
$\tilde{V}_{dst}$ can be solved by quadratic programming, and the optimization objective set in the quadratic programming can be expressed as follows:

$$\begin{aligned} \min\; & \|Gx - h\|^2 \\ \text{s.t.}\; & x_1 = (0, 0),\; x_2 = (0, w),\; x_3 = (h, 0),\; x_4 = (h, w), \\ & (0, 0) \le x \le (h, w) \end{aligned}$$
where $G$ and $h$ are known quantities assembled from the per-triangle relations above, so that solving for $\tilde{V}_{dst}$ is converted into solving for $x$ in the optimization target. By performing this transformation processing on every triangular patch in the subdivision array, the present embodiment finally obtains all the key point coordinates corresponding to the current cartoon image frame to be displayed on the listening and speaking terminal side.
In addition, equality and inequality constraints are added to the optimization target. Specifically, the equality constraints mainly ensure that the first 4 key points corresponding to the deformed current cartoon image frame always remain at the four corner points of the image, i.e., $x_1$ to $x_4$ respectively represent the upper left, upper right, lower left and lower right corners of the image, and $h$ and $w$ in the constraint conditions respectively represent the height and width of the image. The inequality constraint further ensures that all the key points corresponding to the current cartoon image frame remain within the image range, preventing the deformed key points from exceeding it. These equality and inequality constraints make the solution of the above minimized objective function more accurate.
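A hedged sketch of this constrained least-squares solve is given below, assuming the unknown x is the flattened keypoint coordinate vector with the four pinned corner points stored first, and that G and h have already been assembled from the per-triangle relations; the equality constraints are eliminated by substitution and the bounds are handled by scipy.optimize.lsq_linear. All names and the variable layout are assumptions for illustration:

```python
import numpy as np
from scipy.optimize import lsq_linear

def solve_cartoon_keypoints(G, h, H, W, n_pts):
    """min ||G x - h||^2 with the first 4 keypoints pinned to the image
    corners and all coordinates bounded to the image rectangle."""
    # x1=(0,0), x2=(0,W), x3=(H,0), x4=(H,W): fold these into the target
    fixed = np.array([0, 0, 0, W, H, 0, H, W], dtype=float)
    k = fixed.size                       # first 8 entries of x are fixed
    G_fix, G_free = G[:, :k], G[:, k:]
    rhs = h - G_fix @ fixed
    lo = np.zeros(2 * n_pts - k)
    hi = np.tile([H, W], n_pts - 4).astype(float)  # (0,0) <= x <= (H,W)
    res = lsq_linear(G_free, rhs, bounds=(lo, hi))
    x = np.concatenate([fixed, res.x])
    return x.reshape(n_pts, 2)           # all cartoon keypoint coordinates
```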
Through the above steps, all the key points of the current cartoon image frame to be displayed on the listening and speaking terminal side can be obtained in the present embodiment. However, since the affine transformation directly solves the coordinates of each key point in the current cartoon image frame to be displayed, directly forming the current cartoon image frame from these key points may distort the mouth or eyes of the character. The present embodiment resolves this distortion by performing polynomial curve fitting on the mouth key points and the eye key points (upper and lower eyelids, left and right eyebrows, and the like) among the solved key points, so that the key point distribution is as smooth as possible.
Wherein the polynomial may be defined as:

$$y(x, w) = w_0 + w_1 x + w_2 x^2 + \cdots + w_M x^M = \sum_{j=0}^{M} w_j x^j$$
where $M$ is the order of the polynomial and $w_0, \ldots, w_M$ are the polynomial coefficients. In practical applications, once the solved key point coordinate sequence $(x_1, y_1, x_2, y_2, \ldots)$ is known, the corresponding polynomial coefficients can be solved. Fig. 3i provides an exemplary diagram of polynomial curve fitting in the character image processing method provided in this embodiment; as shown in fig. 3i, with the key points 21 distributed as illustrated, the curve 22 can be obtained by fitting the polynomial to the key points, and with the abscissa x of each key point 21 held fixed, the ordinate y of each key point 21 can be updated according to the fitted curve 22. Fig. 3j shows a cartoon image frame formed from the key points before polynomial curve fitting, and fig. 3k shows the effect of a cartoon image frame formed from the key points after polynomial curve fitting; comparing fig. 3j with fig. 3k, it can be found that the mouth opening of the cartoon image frame in fig. 3k, formed from the polynomial-fitted key points, is closer to the mouth opening of a real person.
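As an illustration of the fitting described above, the sketch below smooths one lip line with numpy.polyfit, keeping each abscissa fixed and updating the ordinate from the fitted curve; the sample coordinates and the cubic order are invented for demonstration:

```python
import numpy as np

def smooth_keypoints_poly(pts, order=3):
    """Polynomial curve fitting over a row of keypoints (e.g. a lip line):
    fit y = w0 + w1*x + ... + wM*x^M, then replace each ordinate with the
    fitted value while keeping the abscissa fixed."""
    x, y = pts[:, 0], pts[:, 1]
    w = np.polyfit(x, y, order)        # least-squares polynomial coefficients
    y_smooth = np.polyval(w, x)        # re-sample the curve at the same x
    return np.column_stack([x, y_smooth])

# e.g. an upper-lip keypoint row solved by the affine step (illustrative values)
upper_lip = np.array([[48, 60], [52, 57], [56, 55], [60, 55],
                      [64, 56], [68, 58], [72, 61]], dtype=float)
upper_lip_smoothed = smooth_keypoints_poly(upper_lip, order=3)
```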
Similarly, in order to ensure that the cartoon display corresponding to the speaker's head is not seriously distorted when the head rotates left and right, the present embodiment further applies Bezier curve fitting to the key points identifying the left and right face contours among the solved key points, so as to ensure the contour consistency of the cartoon character to be displayed at the listening and speaking terminal side during motion. The polynomial curve fitting and Bezier curve fitting applied to the solved key points in this embodiment better ensure the naturalness with which the cartoon image to be displayed on the listening and speaking terminal side follows actions and expressions.
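The contour fitting can likewise be sketched as a least-squares fit of cubic Bezier control points under chord-length parameterization; the patent does not specify the curve degree or the parameterization, so both are assumptions here:

```python
import numpy as np

def bernstein_matrix(t):
    """Cubic Bernstein basis evaluated at parameters t (shape (n,))."""
    t = np.asarray(t, dtype=float)[:, None]
    return np.hstack([(1 - t) ** 3,
                      3 * t * (1 - t) ** 2,
                      3 * t ** 2 * (1 - t),
                      t ** 3])

def fit_cubic_bezier(pts):
    """Least-squares cubic Bezier fit to face-contour keypoints using
    chord-length parameterization; returns control points and fitted points."""
    pts = np.asarray(pts, dtype=float)
    d = np.linalg.norm(np.diff(pts, axis=0), axis=1)
    t = np.concatenate([[0.0], np.cumsum(d)]) / d.sum()
    B = bernstein_matrix(t)                       # (n, 4) design matrix
    ctrl, *_ = np.linalg.lstsq(B, pts, rcond=None)
    return ctrl, B @ ctrl
```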
In this embodiment, the key points after the curve fitting processing may be triangulated again according to the Delaunay triangulation rule, thereby forming the first target subdivision array corresponding to the current cartoon image frame to be displayed at the listening and speaking terminal side. Considering that the key points forming the first target subdivision array include both the face key points and the identification key points, the first target subdivision array carries both the character expression change information and the character posture change information, and can therefore be used directly as the current state deformation information of the speaker.
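The re-triangulation itself can be delegated to an off-the-shelf Delaunay implementation; a brief sketch follows, in which the returned row layout (triangle number plus three vertex coordinates) mirrors the description used earlier but is otherwise an assumed storage format:

```python
import numpy as np
from scipy.spatial import Delaunay

def retriangulate(keypoints):
    """Triangulate curve-fitted keypoints by the Delaunay rule; each row of
    the returned subdivision array is (triangle number, 3x2 vertex coords)."""
    pts = np.asarray(keypoints, dtype=float)
    tri = Delaunay(pts)
    return [(i, pts[s]) for i, s in enumerate(tri.simplices)]
```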
Based on the above description, the specific operation of determining the current state deformation information in S2032 of the present embodiment is equivalent to performing three sub-steps of affine transformation processing, curve fitting processing and triangulation processing.
And S204, feeding back the current state deformation information to a listening and speaking terminal so that the listening and speaking terminal determines and displays a current cartoon image frame corresponding to the speaker according to the current state deformation information.
The person image processing method provided in the second embodiment of the present application describes in detail the determination of the current state deformation information corresponding to the speaker. In the implementation, the Kalman filtering and principal component analysis adopted when determining the current key point sequence of the current character image frame effectively resolve, respectively, the problem of key point jitter between image frames and the problem of irregularity among the obtained key points; the triangulation performed on a current key point sequence comprising 68 face key points and 6 identification key points ensures that both the facial expression information and the head rotation information of the speaker are effectively transmitted to the listening and speaking terminal side; meanwhile, after the key points are solved from the subdivision array formed by triangulation, polynomial and Bezier curve fitting is also performed, ensuring the consistency of the cartoon face contour in the cartoon image to be displayed on the listening and speaking terminal side and the naturalness of expressions and actions between image frames. In addition, compared with the existing high-dimensional image transmission, the present application transmits only the state deformation information, effectively ensuring the image transmission quality of the cloud video when the network bandwidth is low.
As an optional embodiment of the second embodiment, another implementation of step S202 is provided; specifically, S202 may be further optimized as: performing key point detection on the current character image frame to form a second collected key point sequence containing a first set number of face key points; and performing key point smoothing processing on the second collected key point sequence to obtain a second current key point sequence of the current character image frame.
In this implementation, only the first set number of face key points are identified from the current character image frame, and the Kalman filtering processing and principal component analysis processing are then performed in the same manner as before, finally yielding a second current key point sequence containing only the first set number of face key points.
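One simple realization of the inter-frame Kalman filtering mentioned here is a per-coordinate constant-position filter, sketched below; the noise parameters q and r are assumed tuning values, and the filter design actually used by the method may differ:

```python
import numpy as np

class KeypointSmoother:
    """Minimal per-coordinate Kalman filter for inter-frame keypoint
    jitter (constant-position model; q/r are assumed tuning values)."""
    def __init__(self, q=1e-3, r=1e-1):
        self.q, self.r = q, r      # process / measurement noise
        self.x = None              # state estimate, shape (n_pts, 2)
        self.p = None              # estimate variance

    def update(self, z):
        z = np.asarray(z, dtype=float)
        if self.x is None:
            self.x, self.p = z, np.ones_like(z)
            return self.x
        p_pred = self.p + self.q               # predict
        k = p_pred / (p_pred + self.r)         # Kalman gain
        self.x = self.x + k * (z - self.x)     # correct with measurement
        self.p = (1 - k) * p_pred
        return self.x
```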
On the basis of the foregoing optimization in this optional embodiment, the character image processing information is optimized to include a second character subdivision array formed by triangulation of the reference character image based on the first set number of key points, and the cartoon image processing information is optimized to include a second cartoon subdivision array formed by triangulation of the reference cartoon image based on the first set number of key points.
It should also be noted that the subdivision arrays corresponding to the reference cartoon image and the reference character image are likewise predetermined: the second character subdivision array and the second cartoon subdivision array are obtained in advance by triangulating the second character key point sequence and the second cartoon key point sequence respectively, where both sequences contain only the first set number of face key points.
Correspondingly, fig. 4 shows another implementation example of determining the current state deformation information in the person image processing method provided in the second embodiment. As shown in fig. 4, the second embodiment further embodies the following steps for determining, according to the current key point sequence and in combination with the predetermined person image processing information of the reference person image and the cartoon image processing information of the reference cartoon image, the current state deformation information for cartoon display of the speaker.
The following steps in this optional embodiment are equivalent to another preferred operation of S203, based on the premise that the current character image frame corresponds to the second current key point sequence, the reference cartoon image corresponds to the second cartoon subdivision array, and the reference character image corresponds to the second character subdivision array.
S2301, determining current pose estimation information of the speaker according to the second current key point sequence.
It can be known that, once the face key points in the image are determined, the head rigid-motion parameters of the speaker in the current character image frame can be estimated by means of a face pose estimation algorithm; that is, the head rigid motion can be represented by the three Euler angles yaw, pitch and roll, which can be regarded as the current pose estimation information of the speaker.
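A common way to realize such a pose estimation is OpenCV's solvePnP over a handful of face keypoints; in the sketch below the 3D model points, the focal-length guess and the Euler-angle convention are all generic assumptions rather than values from this application:

```python
import numpy as np
import cv2

# Generic 3D reference points (nose tip, chin, eye corners, mouth corners);
# textbook approximations, not values from the patent.
MODEL_3D = np.float32([[0, 0, 0], [0, -63, -12],
                       [-43, 32, -26], [43, 32, -26],
                       [-28, -28, -24], [28, -28, -24]])

def head_pose(pts_2d, frame_w, frame_h):
    """Return (yaw, pitch, roll) in degrees from 6 detected face keypoints."""
    f = float(frame_w)                        # crude focal-length guess
    cam = np.float32([[f, 0, frame_w / 2],
                      [0, f, frame_h / 2],
                      [0, 0, 1]])
    ok, rvec, _ = cv2.solvePnP(MODEL_3D, np.float32(pts_2d), cam, None)
    R, _ = cv2.Rodrigues(rvec)
    # ZYX decomposition; which angle is "yaw" depends on the axis convention
    pitch = np.degrees(np.arctan2(R[2, 1], R[2, 2]))
    yaw = np.degrees(np.arcsin(np.clip(-R[2, 0], -1.0, 1.0)))
    roll = np.degrees(np.arctan2(R[1, 0], R[0, 0]))
    return yaw, pitch, roll
```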
S2302, triangulation is carried out on the second current key point sequence by adopting a set triangulation rule, and a second current subdivision array of the current person image frame is obtained.
This step is implemented in the same way as the triangulation operation performed on the first current key point sequence above, the difference being that the second current key point sequence used for triangulation contains only the first set number of face key points.
S2303, according to the second current subdivision array, in combination with the second character subdivision array and the second cartoon subdivision array, determining through affine transformation a second target subdivision array required for cartoon display of the speaker.
This step is implemented in the same way as S2032, comprising the three parts of affine transformation processing, curve fitting processing and triangulation processing, the difference being that the subdivision arrays used here are obtained based only on the first set number of face key points.
S2304, taking the current pose estimation information and the second target subdivision array as current state deformation information of the speaker.
In this optional embodiment, the above steps amount to determining the pose information and the expression information of the speaker separately, so the current pose estimation information and the second target subdivision array can jointly serve as the current state deformation information.
This optional embodiment thus constitutes another implementation for determining the current state deformation information, in which the pose information and the expression information of the speaker are calculated separately and fed back together as the current state deformation information. The method provided by this optional embodiment likewise ensures the cartoon display of the speaker at the listening and speaking terminal side, thereby reducing the leakage risk of the speaker's identity information and enhancing the liveliness of the video content.
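To make the bandwidth claim concrete, the sketch below packs a hypothetical feedback message for this separated pose-plus-expression scheme; every field name is invented, and the real message format is not disclosed at this level of detail:

```python
import json
import numpy as np

def pack_state_deformation(yaw, pitch, roll, target_subdivision):
    """Serialize pose angles plus the second target subdivision array;
    the payload is on the order of a few kilobytes per frame, far below
    the size of an encoded high-dimensional video frame."""
    msg = {
        "pose": [round(float(a), 2) for a in (yaw, pitch, roll)],
        "subdivision": np.asarray(target_subdivision).round(1).tolist(),
    }
    return json.dumps(msg).encode("utf-8")
```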
EXAMPLE III
Fig. 5 is a schematic flowchart of a character image processing method according to a third embodiment of the present application; the method is suitable for realizing, on the listening and speaking terminal side in a cloud video service, the cartoon display of the speaker at the speaker terminal side. The method can be executed by a computer device supporting the cloud video service; this computer device is regarded as the listening and speaking terminal performing cloud video communication in the cloud video service, and may be composed of two or more physical entities or of a single physical entity. Generally, the listening and speaking terminal may be a notebook, a desktop computer, a smart tablet, a smart mobile terminal, or the like.
As shown in fig. 5, a method for processing a human image provided in the third embodiment specifically includes the following operations:
S301, receiving the current state deformation information fed back by the speaker terminal, and acquiring the image attribute information of a predetermined reference cartoon image.
It should be noted that, in this embodiment, the listening and speaking terminal serving as the execution main body maintains cloud video communication with the main speaking terminal and can receive the current state deformation information fed back by the main speaking terminal in real time; meanwhile, in order to realize the cartoon display of the speaker at the main speaking terminal side on the execution main body, this embodiment also needs to acquire the reference cartoon image serving as the carrier of the speaker's cartoon display, together with its related image attribute information.
The current state deformation information is determined by the character image processing method provided in the first or second embodiment; meanwhile, the reference cartoon image is selected and sent in advance by the speaker of the speaker terminal, and its image attribute information is prestored in the set attribute information table.
In this embodiment, once the cloud video application is installed on the listening and speaking terminal serving as the execution subject, a cartoon image package is in effect already present; the package includes a plurality of pre-designed cartoon character images, together with an attribute information table containing the corresponding image attribute information of each cartoon character image. In practical application, the speaker terminal side may identify in advance the identifier of the reference cartoon image selected by the speaker, and the execution body may determine the matching cartoon character image based on this identifier while acquiring the corresponding image attribute information from the attribute information table as the image attribute information of the reference cartoon image.
In addition, the reference cartoon image may be a previous cartoon image frame corresponding to a current cartoon image frame to be displayed, and the image attribute information of the reference cartoon image is image attribute information of the previous cartoon image frame.
It can be understood that, if the reference character image used by the main speaking terminal side in determining the current state deformation information is the character image frame preceding the current character image frame, then the reference cartoon image used in that determination is likewise the cartoon image frame corresponding to that preceding character image frame, and the image attribute information of the reference cartoon image is then the image attribute information of that previous cartoon image frame. Depending on the specific situation, this embodiment may determine whether the reference cartoon image in this step is a cartoon character image from the cartoon image package or the previous cartoon image frame associated with the previous character image frame, and acquire the corresponding image attribute information accordingly.
In this embodiment, the image attribute information may be specifically understood as an information item including the texture information of the cartoon character image and the subdivision array information formed by triangulation; both the texture information and the subdivision array may serve as precondition information for the cartoon display processing of the speaker.
S302, forming a current cartoon image frame of the speaker corresponding to the main speaking terminal according to the current state deformation information and the image attribute information, and displaying the current cartoon image frame in real time.
According to the description of the current state deformation information in the above embodiments, the current state deformation information includes the change information of the expression and posture of the current character image frame corresponding to the speaker relative to the reference character image. After the current state deformation information is received, given the related image attribute information of the reference cartoon image required for the speaker's cartoon display, the changes of the reference cartoon image in expression and posture can be realized through affine transformation processing, thereby forming a current cartoon image frame corresponding to the current character image frame, which can finally be displayed in real time.
The character image processing method provided by this embodiment of the present application effectively realizes, on the listening and speaking terminal side, the cartoon display of the speaker at the main speaking terminal side. Compared with the existing practice of transmitting the speaker's character image frames directly to the listening and speaking terminal in real time, the cartoonized character image reduces the leakage risk of the speaker's identity information; meanwhile, the cartoon display of the speaker in cloud classroom teaching or cloud conferences also enhances the liveliness of the video content.
The following optional embodiments of this embodiment correspond to different implementations of determining the current cartoon image frame based on the current state deformation information, depending on how the received current state deformation information was determined on the main speaking terminal side.
First, as an optional embodiment of the third embodiment, the current state deformation information is further optimized to be the first target subdivision array required for the speaker's cartoon display; in this case, the acquired image attribute information includes cartoon image texture information and a first cartoon subdivision array formed by triangulation of the reference cartoon image based on a fourth set number of key points.
It can be understood that this optional embodiment corresponds to the case in which the current state deformation information determined on the main speaking terminal side includes only the first target subdivision array, which simultaneously carries the expression information and the posture information. Corresponding to this current state deformation information, the image attribute information obtainable from the determined reference cartoon image should be cartoon image texture information and a first cartoon subdivision array formed by triangulation based on the fourth set number of key points, where the fourth set number equals the sum of the first set number and the second set number used in the character image processing on the main speaking terminal side.
Correspondingly, fig. 6 is a flowchart of an implementation of the cartoon display of a character in the person image processing method according to the third embodiment of the present application; as shown in fig. 6, this optional embodiment further embodies the following steps of forming a current cartoon image frame of the speaker corresponding to the speaker terminal according to the current state deformation information and the image attribute information, and displaying the current cartoon image frame in real time.
The following steps in this embodiment are equivalent to a preferred operation of S302, and the operation is based on the premise that the current state deformation information includes the first target subdivision array, and the image attribute information of the reference cartoon image corresponds to the first cartoon subdivision array and the cartoon image texture information.
S3021, determining a first target affine transformation matrix of each triangular patch forming the reference cartoon image according to the first target subdivision array and the first cartoon subdivision array.
It can be known that the first target subdivision array and the first cartoon subdivision array each comprise a plurality of triangular patches with triangular patch information (specifically including a triangle number and the coordinate values of the vertices of the triangle, where the triangle number is equivalent to the row index value of the row), and the triangles with the same triangle number in the first target subdivision array and the first cartoon subdivision array are taken as corresponding triangular patches. This step takes a single triangular patch as the processing object, and each triangular patch is processed in the same manner.
Specifically, this embodiment describes the analysis process of the affine transformation by taking one triangular patch in the first target subdivision array as an example:
The triangular patches corresponding to the first target subdivision array and the first cartoon subdivision array satisfy the following relational expression:

$$\begin{bmatrix} u \\ v \end{bmatrix} = A \begin{bmatrix} x \\ y \\ 1 \end{bmatrix}$$

where $(u, v)$ represents the coordinate value of one vertex of the currently processed triangular patch in the first target subdivision array, and $(x, y)$ represents the coordinate value of the corresponding vertex of the corresponding triangular patch in the first cartoon subdivision array. The coordinate values of the three vertexes of the currently processed triangular patch therefore yield three such relational expressions, and solving them gives the affine transformation matrix

$$A = \begin{bmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \end{bmatrix}$$

This step records this affine transformation matrix as the first target affine transformation matrix.
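Since the three vertex correspondences determine all six entries of the 2×3 matrix, the solve can also be delegated to OpenCV; a minimal sketch with illustrative coordinates:

```python
import numpy as np
import cv2

# three vertices of corresponding triangular patches, as (x, y) pairs
cartoon_tri = np.float32([[10, 10], [60, 12], [14, 55]])  # first cartoon subdivision array
target_tri = np.float32([[12, 11], [66, 15], [13, 60]])   # first target subdivision array

# 2x3 first target affine transformation matrix for this patch
A = cv2.getAffineTransform(cartoon_tri, target_tri)
```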
S3022, forming a current cartoon image frame of the speaker based on the first target affine transformation matrices and the cartoon image texture information, and displaying the current cartoon image frame in real time.
It can be understood that S3021 above may be used to determine the first target affine transformation matrix corresponding to each triangular patch in the first target subdivision array. In this step, texture filling may then be performed using these first target affine transformation matrices and the cartoon image texture information, to form the current cartoon image frame corresponding to the current character image frame.
For example, fig. 6a to 6d show the implementation displays of obtaining a current cartoon image frame from a reference cartoon image in the character image processing method provided by this embodiment. Fig. 6a is the reference cartoon image carrying the triangular patches of the relevant subdivision array; fig. 6b is a texture map of one triangular patch selected from the reference cartoon image; fig. 6c is the corresponding triangular patch in the first target subdivision array filled from the patch of fig. 6b; and fig. 6d shows the current cartoon image frame formed after all the triangular patches have been texture-filled.
The texture filling from the triangular patch shown in fig. 6b to that shown in fig. 6c can be realized as follows: the triangle number, the three vertex coordinates and the texture information of the patch in fig. 6b are known; given the triangle number, the element information (the three vertex coordinates of the triangle) corresponding to the same triangle number is selected from the first target subdivision array; the affine transformation determined from the two sets of three vertex coordinates then maps the patch of fig. 6b onto the vertices of the patch shown in fig. 6c, and the texture information of the patch in fig. 6b is filled into the region formed by the triangular patch in fig. 6c.
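This per-patch fill follows the standard bounding-rectangle-plus-mask warping pattern; the sketch below is one way to realize it with OpenCV, with all function and variable names illustrative:

```python
import numpy as np
import cv2

def fill_triangle(src_img, dst_img, src_tri, dst_tri):
    """Warp one triangular patch of cartoon texture into its deformed
    position and paste it into the output frame."""
    r1 = cv2.boundingRect(np.float32([src_tri]))
    r2 = cv2.boundingRect(np.float32([dst_tri]))
    src_local = np.float32(src_tri) - np.float32(r1[:2])
    dst_local = np.float32(dst_tri) - np.float32(r2[:2])

    patch = src_img[r1[1]:r1[1] + r1[3], r1[0]:r1[0] + r1[2]]
    A = cv2.getAffineTransform(src_local, dst_local)
    warped = cv2.warpAffine(patch, A, (r2[2], r2[3]),
                            flags=cv2.INTER_LINEAR,
                            borderMode=cv2.BORDER_REFLECT_101)

    mask = np.zeros((r2[3], r2[2], 3), dtype=np.uint8)
    cv2.fillConvexPoly(mask, np.int32(dst_local), (1, 1, 1))
    roi = dst_img[r2[1]:r2[1] + r2[3], r2[0]:r2[0] + r2[2]]
    roi[:] = roi * (1 - mask) + warped * mask   # paste only inside the triangle
```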
It should be noted that the two steps of this optional embodiment may be scheduled in either of two ways: the first target affine transformation matrices of all the triangular patches may first be determined by S3021, after which S3022 performs the texture filling of every triangular patch in a unified manner to obtain the complete current cartoon image frame; alternatively, immediately after the first target affine transformation matrix of one triangular patch is determined by S3021, S3022 may be used to texture-fill that patch, after which the process returns to S3021 for a new triangular patch, the two steps looping in this way until all triangular patches are processed and the fully texture-filled current cartoon image frame is finally obtained.
The present embodiment is not particularly limited with respect to the specific determination manner of the current cartoon image frame.
Meanwhile, as another optional embodiment of the third embodiment, the current state deformation information is further optimized to be the current pose estimation information of the speaker together with the second target subdivision array required for the cartoon display; the correspondingly acquired image attribute information includes cartoon image texture information and a second cartoon subdivision array formed by triangulation of the reference cartoon image based on a fifth set number of key points.
It can be understood that this optional embodiment corresponds to the case in which the current state deformation information determined on the speaker terminal side includes not only the current pose estimation information representing the speaker's current pose but also the second target subdivision array representing the speaker's current expression. Corresponding to this current state deformation information, the image attribute information obtainable from the determined reference cartoon image should be cartoon image texture information and a second cartoon subdivision array formed by triangulation based on the fifth set number of key points, where the fifth set number is equal to the first set number used in the character image processing on the main speaking terminal side.
Correspondingly, fig. 7 is a flowchart of another implementation of the cartoon display of a character in the person image processing method according to the third embodiment of the present application; as shown in fig. 7, this optional embodiment further embodies the following steps of forming a current cartoon image frame of the speaker corresponding to the speaker terminal according to the current state deformation information and the image attribute information, and displaying the current cartoon image frame in real time.
The following steps in this embodiment are equivalent to another preferred operation of S302, and the operation is based on the premise that the current state deformation information includes current pose estimation information and a second target subdivision array, and the image attribute information of the reference cartoon image corresponds to the second cartoon subdivision array and cartoon image texture information.
S3201, according to the second target subdivision array and the second cartoon subdivision array, determining a second target affine transformation matrix of each triangular patch forming the reference cartoon image.
This step is implemented in the same way as the determination of the target affine transformation matrix in S3021, the difference being that the second target subdivision array and the second cartoon subdivision array used here are obtained based only on the corresponding first set number of face key points.
S3202, combining the cartoon image texture information based on the second target affine transformation matrixes to form a cartoon expression image frame of the speaker.
The operation in this step is essentially the same as that in S3022 above, performing texture filling based on the cartoon image texture information; however, the cartoon expression image frame formed by the filling in this step reflects only the speaker's expression change, not the posture change.
S3203, forming a current cartoon image frame of the speaker according to the current pose estimation information and the cartoon expression image frame, and displaying the current cartoon image frame in real time.
The cartoon expression image frame formed in the above steps is combined with the pose information so as to adjust the posture of the cartoon character in the cartoon expression image frame, and the image frame after the posture adjustment can be used as the current cartoon image frame of the speaker.
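The patent does not detail how the pose estimate is folded back in at this point; a plausible 2D realization for the roll component alone is sketched below (handling yaw and pitch would require a richer warp), so this is an assumption-laden illustration rather than the method itself:

```python
import cv2

def apply_pose(expression_frame, roll_deg):
    """Rotate the cartoon expression frame by the estimated roll angle
    about the image center; one simple way to reflect the head pose."""
    h, w = expression_frame.shape[:2]
    M = cv2.getRotationMatrix2D((w / 2, h / 2), roll_deg, 1.0)
    return cv2.warpAffine(expression_frame, M, (w, h),
                          borderMode=cv2.BORDER_REPLICATE)
```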
To illustrate more intuitively how the speaker terminal and the listening and speaking terminal in cloud video communication realize the cartoon display of the speaker on the listening and speaking terminal side through the character image processing method provided by the embodiments of the present application, fig. 7a shows a global effect display diagram after executing the character image processing method provided by the present application; as shown in fig. 7a, it specifically includes the interaction between the two terminals, namely the speaking terminal 31 and the listening terminal 32.
On the side of the speaking terminal 31, a reference character image determining section 310 (here taking the capture and processing of a front face image of the speaker as an example) and a current character image frame processing section 311 are included. The reference character image determining section 310 performs operations such as capturing the front face image of the speaker, detecting the key points, and triangulating based on the key points, yielding the character subdivision arrays corresponding to the reference character image (which may include the first character subdivision array and the second character subdivision array) that the current character image frame processing section 311 may acquire. The current character image frame processing section 311 realizes operations such as detecting the speaker's key points in the current character image frame and triangulating based on those key points, obtaining the current subdivision array of the current character image frame, and also realizes the operation of obtaining the current state deformation information by affine transformation based on the character subdivision array and the current subdivision array.
In the above example, the listening and speaking terminal 32 side combines the selected reference cartoon image 321, which has corresponding cartoon subdivision arrays (possibly including the first cartoon subdivision array and the second cartoon subdivision array) and cartoon image texture information. The affine transformation operation and the texture filling operation are then realized according to the cartoon subdivision arrays and the cartoon image texture information in combination with the current state deformation information from the speaking terminal 31 side, finally yielding the current cartoon image frame 322 of the speaker displayed on the listening and speaking terminal side.
In addition, fig. 7b gives an effect display diagram of the speaker's cartoon display when the current state deformation information is formed with motion and expression separated. Specifically, as shown in fig. 7b, the speaker terminal side can respectively obtain the current pose estimation information (pose estimation based on the face key points) and the target subdivision array (affine transformation of the subdivision array formed by triangulating only the face key points), which together form the current state deformation information sent to the listening and speaking terminal side.
Example four
Fig. 8 is a block diagram of a person image processing apparatus according to a fourth embodiment of the present application. The apparatus is integrated in a computer device serving as the main speaking terminal when the speaker is cartoon-displayed on the listening and speaking terminal side in a cloud video service. As shown in fig. 8, the apparatus includes: a person image acquisition module 41, an image deformation determination module 42, and a deformation information transfer module 43.
The person image acquiring module 41 is configured to intercept a current person image frame of a speaker from a current live-action image frame captured by a camera;
an image deformation determining module 42, configured to determine, according to the current person image frame in combination with a predetermined reference person image and a reference cartoon image, current state deformation information for cartoon-type display of the speaker;
and the deformation information transmission module 43 is configured to feed back the current state deformation information to a listening and speaking terminal, so that the listening and speaking terminal determines and displays a current cartoon image frame corresponding to the speaker according to the current state deformation information, where the listening and speaking terminal and the main speaking terminal maintain cloud video communication.
In the person image processing apparatus provided by the fourth embodiment of the present application, the main speaking terminal side provides the state deformation information for the cartoon display of the speaker on the listening and speaking terminal side. The cartoon display of the speaker on the listening and speaking terminal side reduces the leakage risk of the speaker's identity information while also enhancing the liveliness of the video content; moreover, compared with the existing high-dimensional image transmission, this scheme transmits only the state deformation information, effectively ensuring the image transmission quality of the cloud video when the network bandwidth is low.
Further, the image deformation determining module 42 specifically includes:
the key point determining unit is used for detecting key points in the current human image frame to form a corresponding current key point sequence;
and the deformation information determining unit is used for determining the deformation information of the current state of the speaker for cartoon display according to the current key point sequence by combining the character image processing information of the reference character image and the cartoon image processing information of the reference cartoon image.
Further, the key point determining unit may be specifically configured to perform key point detection on the current human image frame, and form a first collection key point sequence including a first set number of human face key points and a second set number of identification key points; and performing key point smoothing processing on the first collected key point sequence to obtain a first current key point sequence of the current human image frame.
On the basis of the optimization, the character image processing information comprises a first character subdivision array formed by triangulation of the reference character image based on key points of a third set number, the cartoon image processing information comprises a first cartoon subdivision array formed by triangulation of the reference cartoon image based on key points of the third set number, and the third set number is the sum of the first set number and the second set number;
correspondingly, the deformation information determining unit can be specifically used for triangulating the first current key point sequence by adopting a set triangulation rule to obtain a first current subdivision array of the current person image frame; and according to the first current subdivision array, combining the first person subdivision array and the first cartoon subdivision array, determining a first target subdivision array required by the speaker through affine transformation, and using the first target subdivision array as the current state deformation information of the speaker.
Further, the key point determining unit may be further configured to perform key point detection on the current human image frame to form a second collected key point sequence including a first set number of human face key points; and performing key point smoothing processing on the second collected key point sequence to obtain a second current key point sequence of the current human image frame.
On the basis of the optimization, the character image processing information comprises a second character subdivision array formed by triangulation of the reference character image based on the first set number of key points, and the cartoon image processing information comprises a second cartoon subdivision array formed by triangulation of the reference cartoon image based on the first set number of key points;
correspondingly, the deformation information determining unit can be further specifically used for determining the current pose estimation information of the speaker according to the second current key point sequence; triangulation is carried out on the second current key point sequence by adopting a set triangulation rule, and a second current subdivision array of the current person image frame is obtained; according to the second current subdivision array, combining the second character subdivision array and a second cartoon subdivision array, and determining a second target subdivision array required by the cartoon display speaker through affine transformation; and taking the current pose estimation information and the second target subdivision array as the current state deformation information of the speaker.
On the basis of the optimization, the reference person image is a front face image which is acquired for the speaker through the camera in advance; alternatively, the reference human image is a human image frame preceding the current human image frame.
EXAMPLE five
Fig. 9 is a block diagram of a person image processing apparatus according to a fifth embodiment of the present application. The apparatus is integrated in a computer device serving as the listening and speaking terminal when the speaker is cartoon-displayed on the listening and speaking terminal side in a cloud video service. As shown in fig. 9, the apparatus includes: a deformation information receiving module 51 and a cartoon image conversion module 52.
The deformation information receiving module 51 is configured to receive the current state deformation information fed back by the speaker terminal and to acquire the image attribute information of a predetermined reference cartoon image, where the current state deformation information is determined by the apparatus of the fourth embodiment;
and a cartoon image conversion module 52, configured to form a current cartoon image frame of the speaker corresponding to the speaker terminal according to the current state deformation information and the image attribute information, and display the current cartoon image frame in real time.
The person image processing apparatus provided by the fifth embodiment of the present application effectively realizes, on the listening and speaking terminal side, the cartoon display of the speaker at the main speaking terminal side. Compared with the existing practice of transmitting the speaker's character image frames directly to the listening and speaking terminal in real time, the cartoonized character image reduces the leakage risk of the speaker's identity information, and the cartoon display of the speaker in cloud classroom teaching or cloud conferences also enhances the liveliness of the video content. Furthermore, the main speaking terminal only needs to feed back one piece of state deformation information to the listening and speaking terminal to realize the cartoon display of the speaker, without transmitting high-dimensional images, effectively ensuring the image transmission quality of the cloud video when the network bandwidth is low.
Further, when the current state deformation information is a first target subdivision array required by the speaker cartoon display, the image attribute information comprises cartoon image texture information and a first cartoon subdivision array formed by triangulation of the reference cartoon image based on a fourth set number of key points; accordingly, the cartoon image conversion module 52 may be specifically configured to:
determining a first target affine transformation matrix of each triangular patch forming the reference cartoon image according to the first target subdivision array and the first cartoon subdivision array;
and combining the cartoon image texture information based on the first target affine transformation matrixes to form a current cartoon image frame of the speaker and display the current cartoon image frame in real time.
As another embodiment of this embodiment, when the current state deformation information is the current pose estimation information of the speaker and a second target subdivision array required by cartoon display, the image attribute information includes cartoon image texture information and a second cartoon subdivision array formed by triangulation of the reference cartoon image based on a fifth set number of key points; correspondingly, the cartoon image conversion module 52 may be further configured to:
determining a second target affine transformation matrix of each triangular patch forming the reference cartoon image according to the second target subdivision array and the second cartoon subdivision array;
combining the cartoon image texture information based on the second target affine transformation matrixes to form cartoon expression image frames of the speaker;
and forming the current cartoon image frame of the speaker according to the current pose estimation information and the cartoon expression image frame and displaying the current cartoon image frame in real time.
On the basis of the optimization, the reference cartoon image is selected and sent by a speaker of the speaker terminal in advance, and the image attribute information of the reference cartoon image is prestored in a set attribute information table.
EXAMPLE six
Fig. 10 is a schematic diagram of the hardware structure of a main speaking terminal according to a sixth embodiment of the present application. Specifically, the main speaking terminal includes a processor and a storage device. At least one instruction is stored in the storage device, and the instruction is executed by the processor, so that the computer device executes the person image processing method according to the first or second method embodiment.
Referring to fig. 10, the main speaking terminal may specifically include: a processor 60, a storage device 61, a display screen 62, an input device 63, an output device 64, and a communication device 65. The number of processors 60 in the main speaking terminal may be one or more, and one processor 60 is taken as an example in fig. 10. The number of storage devices 61 in the main speaking terminal may be one or more, and one storage device 61 is taken as an example in fig. 10. The processor 60, the storage device 61, the display screen 62, the input device 63, the output device 64, and the communication device 65 of the main speaking terminal may be connected by a bus or other means; connection by a bus is taken as an example in fig. 10.
The storage device 61 is a computer-readable storage medium that can be used to store software programs, computer-executable programs, and modules, such as the program instructions/modules corresponding to the embodiments of the present application (for example, the person image acquisition module 41, the image deformation determination module 42, and the deformation information transfer module 43 in the person image processing apparatus of the fourth embodiment). The storage device 61 may mainly include a program storage area and a data storage area, where the program storage area may store an operating system and the application programs required for at least one function, and the data storage area may store data created according to the use of the main speaking terminal, and the like. Further, the storage device 61 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. In some examples, the storage device 61 may further include memory located remotely from the processor 60, which may be connected to the main speaking terminal via a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
In general, the display screen 62 is used for displaying data according to the instructions of the processor 60, and is also used for receiving touch operations applied to the display screen 62 and sending the corresponding signals to the processor 60 or other devices. Optionally, when the display screen 62 is an infrared screen, it further includes an infrared touch frame disposed around the display screen 62, which may also be configured to receive infrared signals and send them to the processor 60 or other devices.
The communication device 65 is used for establishing a communication connection with other terminals, and may be a wired communication device and/or a wireless communication device.
The input device 63 may be used for receiving input numeric or character information and generating key signal inputs related to user settings and function control of the main speaking terminal, and may also include a camera for acquiring images and a sound pickup for acquiring the audio in the video data. The output device 64 may include video output devices such as a display screen and audio output devices such as a loudspeaker. The specific composition of the input device 63 and the output device 64 may be set according to the actual conditions.
The processor 60 executes various functional applications and data processing of the main speaking terminal by running the software programs, instructions, and modules stored in the storage device 61, that is, implements the person image processing method of the first or second embodiment.
Specifically, in the embodiment, when the processor 60 executes one or more programs stored in the storage device 61, the following operations are specifically implemented: intercepting a current person image frame of a speaker from a current live-action image frame captured by a camera; determining current state deformation information for cartoon display of the speaker according to the current character image frame and the predetermined reference character image and reference cartoon image; and feeding back the current state deformation information to a listening and speaking terminal so that the listening and speaking terminal determines and displays a current cartoon image frame corresponding to the speaker according to the current state deformation information, wherein the listening and speaking terminal and the speaker terminal are in cloud video communication.
The main speaking terminal provided by this embodiment can be used to execute the person image processing method provided by the first or second embodiment, and possesses the corresponding functions and beneficial effects.
EXAMPLE seven
Fig. 11 is a schematic diagram of the hardware structure of a listening and speaking terminal according to a seventh embodiment of the present application. Specifically, the listening and speaking terminal includes a processor and a storage device. At least one instruction is stored in the storage device, and the instruction is executed by the processor, so that the computer device executes the person image processing method according to the third method embodiment.
Referring to fig. 11, the listening and speaking terminal may specifically include: a processor 70, a storage device 71, a display 72, an input device 73, an output device 74, and a communication device 75. The number of the processors 70 in the listening and speaking terminal may be one or more, and one processor 70 is taken as an example in fig. 11. The number of the storage devices 71 in the listening and speaking terminal may be one or more, and one storage device 71 is taken as an example in fig. 11. The processor 70, the storage device 71, the display 72, the input device 73, the output device 74, and the communication device 75 of the listening and speaking terminal may be connected by a bus or other means, and fig. 11 illustrates the connection by a bus as an example.
The storage device 71 is a computer-readable storage medium that can be used to store software programs, computer-executable programs, and modules, such as the program instructions/modules corresponding to the embodiments of the present application (for example, the deformation information receiving module 51 and the cartoon image conversion module 52 in the person image processing apparatus provided in the fifth embodiment). The storage device 71 may mainly include a program storage area and a data storage area, where the program storage area may store an operating system and the application programs required for at least one function, and the data storage area may store data created according to the use of the listening and speaking terminal, and the like. Further, the storage device 71 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. In some examples, the storage device 71 may further include memory located remotely from the processor 70, which may be connected to the listening and speaking terminal through a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
Generally, the display screen 72 is used for displaying data according to the instructions of the processor 70, and is also used for receiving touch operations applied to the display screen 72 and sending the corresponding signals to the processor 70 or other devices. Optionally, when the display screen 72 is an infrared screen, it further includes an infrared touch frame disposed around the display screen 72, which may also be configured to receive infrared signals and send them to the processor 70 or other devices.
The communication device 75 is used for establishing a communication connection with other terminals, and may be a wired communication device and/or a wireless communication device.
The input device 73 may be used for receiving input numeric or character information and generating key signal inputs related to user settings and function control of the listening and speaking terminal, and may also include a camera for acquiring images and a sound pickup for acquiring the audio in the video data. The output device 74 may include video output devices such as a display screen and audio output devices such as a loudspeaker. The specific composition of the input device 73 and the output device 74 may be set according to the actual conditions.
The processor 70 executes various functional applications and data processing of the listening and speaking terminal by running software programs, instructions and modules stored in the storage device 71, that is, implements the above-described person image processing method.
Specifically, in the embodiment, when the processor 70 executes one or more programs stored in the storage device 71, the following operations are specifically implemented: receiving current state deformation information fed back by the main speaking terminal, and acquiring image attribute information of a predetermined reference cartoon image, wherein the current state deformation information is determined by the method of the first embodiment or the second embodiment; and forming a current cartoon image frame of the speaker corresponding to the speaker terminal according to the current state deformation information and the image attribute information, and displaying the current cartoon image frame in real time.
The listening and speaking terminal provided by this embodiment can be used to execute the person image processing method provided by the third embodiment, and possesses the corresponding functions and beneficial effects.
Example eight
An eighth embodiment of the present invention further provides a storage medium containing computer-executable instructions, which when executed by a computer processor, are configured to perform a method for processing a human image, including:
intercepting a current person image frame of a speaker from a current live-action image frame captured by a camera; determining current state deformation information for cartoon display of the speaker according to the current character image frame and the predetermined reference character image and reference cartoon image; and feeding back the current state deformation information to a listening and speaking terminal so that the listening and speaking terminal determines and displays a current cartoon image frame corresponding to the speaker according to the current state deformation information, wherein the listening and speaking terminal and the speaker terminal are in cloud video communication.
Or receiving current state deformation information fed back by the speaker terminal, and acquiring image attribute information of a predetermined reference cartoon image, wherein the current state deformation information is determined by the method of the first embodiment or the second embodiment; and forming a current cartoon image frame of the speaker corresponding to the speaker terminal according to the current state deformation information and the image attribute information, and displaying the current cartoon image frame in real time.
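For illustration only, here is a minimal capture-crop-and-send loop for the first (speaker-terminal) branch. It is a sketch under stated assumptions: OpenCV's bundled Haar frontal-face cascade stands in for whatever person detector the terminal actually uses, and `compute_deformation_info` is a hypothetical hook for the key point, smoothing and triangulation steps spelled out in the claims below.

```python
import json

import cv2

# Haar cascade as a stand-in face/person detector (an assumption; the
# patent does not prescribe a particular detector).
cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def compute_deformation_info(person_frame):
    """Hypothetical hook: key point detection, smoothing, triangulation
    and affine fitting would go here (see the sketches after claims 3-6)."""
    return {"target_subdivision": []}

cap = cv2.VideoCapture(0)              # camera on the speaker terminal
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = cascade.detectMultiScale(gray, scaleFactor=1.3, minNeighbors=5)
    for (x, y, w, h) in faces[:1]:     # intercept the current person frame
        person = frame[y:y + h, x:x + w]
        payload = json.dumps(compute_deformation_info(person))
        # Only this compact payload would be fed back to the listening
        # terminal -- not the image itself.
cap.release()
```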
Of course, the storage medium containing computer-executable instructions provided by the embodiments of the present invention is not limited to the method operations described above, and may also perform related operations in the person image processing method provided by any embodiment of the present invention, with corresponding functions and advantages.
From the above description of the embodiments, it will be clear to those skilled in the art that the present invention may be implemented by software plus necessary general-purpose hardware, and certainly may also be implemented entirely by hardware, although the former is the better implementation in many cases. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product: the computer software product may be stored in a computer-readable storage medium, such as a floppy disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a flash memory (FLASH), a hard disk or an optical disk, and includes several instructions that enable a computer device (which may be a robot, a personal computer, a server, or a network device) to execute the person image processing method according to any embodiment of the present invention.
It should be noted that the units and modules included in the above person image processing apparatus are divided only according to functional logic, and the division is not limited thereto as long as the corresponding functions can be realized; in addition, the specific names of the functional units are only for convenience of distinguishing them from each other and are not intended to limit the protection scope of the present invention.
It should be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
It is to be noted that the foregoing is only illustrative of the preferred embodiments of the present invention and the technical principles employed. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the invention. Therefore, although the present invention has been described in greater detail by the above embodiments, the present invention is not limited to the above embodiments, and may include other equivalent embodiments without departing from the spirit of the present invention, and the scope of the present invention is determined by the scope of the appended claims.

Claims (16)

1. A character image processing method applied to a speaker terminal, comprising the following steps:
intercepting a current character image frame of a speaker from a current live-action image frame captured by a camera;
determining, by using a triangulation method, current state deformation information for cartoon display of the speaker according to the current character image frame in combination with a predetermined reference character image and a reference cartoon image, wherein the current state deformation information comprises the change of the speaker's current expression relative to the speaker's expression in the reference character image and the change of the speaker's current action relative to the speaker's action in the reference character image;
and feeding back the current state deformation information to a listening and speaking terminal so that the listening and speaking terminal determines and displays a current cartoon image frame corresponding to the speaker according to the current state deformation information, wherein the listening and speaking terminal and the speaker terminal are in cloud video communication.
2. The method of claim 1, wherein determining the current state deformation information for cartoon display of the speaker according to the current character image frame in combination with the predetermined reference character image and the reference cartoon image comprises:
detecting key points in the current character image frame to form a corresponding current key point sequence;
and determining, according to the current key point sequence in combination with the character image processing information of the reference character image and the cartoon image processing information of the reference cartoon image, the current state deformation information for cartoon display of the speaker.
3. The method of claim 2, wherein the detecting key points in the current character image frame to form a corresponding current key point sequence comprises:
carrying out key point detection on the current character image frame to form a first collected key point sequence containing a first set number of human face key points and a second set number of identification key points;
and performing key point smoothing processing on the first collected key point sequence to obtain a first current key point sequence of the current character image frame.
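For illustration only: the smoothing step in claim 3 is commonly realized as temporal filtering of the detected landmarks. The sketch below uses a simple exponential moving average; the filter choice, the `alpha` value and the 68-point layout are assumptions for this sketch, not the patent's specification.

```python
import numpy as np

class KeypointSmoother:
    """Exponential moving average over per-frame key point detections,
    reducing jitter before triangulation (one plausible smoothing)."""

    def __init__(self, alpha: float = 0.6):
        self.alpha = alpha      # weight given to the newest detection
        self.state = None       # smoothed (N, 2) key point array

    def update(self, keypoints: np.ndarray) -> np.ndarray:
        pts = np.asarray(keypoints, dtype=np.float64)
        if self.state is None:
            self.state = pts
        else:
            self.state = self.alpha * pts + (1 - self.alpha) * self.state
        return self.state

# Usage: feed raw detections frame by frame.
smoother = KeypointSmoother(alpha=0.6)
raw = np.random.rand(68, 2) * 480      # stand-in for detected face key points
smoothed = smoother.update(raw)
```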
4. The method of claim 3, wherein the character image processing information comprises a first character subdivision array formed by triangulating a third set number of key points of the reference character image, and the cartoon image processing information comprises a first cartoon subdivision array formed by triangulating a third set number of key points of the reference cartoon image, the third set number being the sum of the first set number and the second set number;
correspondingly, determining the current state deformation information for cartoon display of the speaker according to the current key point sequence in combination with the predetermined character image processing information of the reference character image and the cartoon image processing information of the reference cartoon image comprises:
performing triangulation on the first current key point sequence by adopting a set triangulation rule to obtain a first current subdivision array of the current character image frame;
and determining, according to the first current subdivision array in combination with the first character subdivision array and the first cartoon subdivision array, a first target subdivision array required for cartoon display of the speaker through affine transformation, and using the first target subdivision array as the current state deformation information of the speaker.
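For illustration only, here is one plausible reading of claim 4 in code: Delaunay-triangulate the reference character's key points once as the "set triangulation rule", then, for each triangle, fit the affine map taking the reference character's triangle to the current one and push the matching reference cartoon triangle through it, yielding the first target subdivision array. The use of `scipy.spatial.Delaunay` and the per-triangle affine fit are assumptions, not the patent's prescribed rule.

```python
import cv2
import numpy as np
from scipy.spatial import Delaunay

def target_subdivision(ref_person, ref_cartoon, current, simplices):
    """For each triangle (an index triple in `simplices`), fit the affine map
    ref_person -> current and warp the matching reference cartoon triangle
    through it, producing the target subdivision array."""
    targets = []
    for tri in simplices:
        src = np.float32(ref_person[tri])        # 3 reference character points
        dst = np.float32(current[tri])           # same 3 points, current frame
        m = cv2.getAffineTransform(src, dst)     # 2x3 affine matrix
        cart = np.float32(ref_cartoon[tri])
        moved = cv2.transform(cart[None], m)[0]  # move cartoon vertices
        targets.append(moved)
    return np.stack(targets)                     # shape (T, 3, 2)

# Toy usage with random key point sets of equal length.
n = 68
ref_person = np.random.rand(n, 2) * 480
ref_cartoon = np.random.rand(n, 2) * 480
current = ref_person + np.random.randn(n, 2) * 2
simplices = Delaunay(ref_person).simplices       # fixed triangulation rule
target = target_subdivision(ref_person, ref_cartoon, current, simplices)
```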
5. The method of claim 2, wherein the detecting key points in the current character image frame to form a corresponding current key point sequence comprises:
carrying out key point detection on the current character image frame to form a second collected key point sequence containing a first set number of human face key points;
and performing key point smoothing processing on the second collected key point sequence to obtain a second current key point sequence of the current character image frame.
6. The method of claim 5, wherein the character image processing information comprises a second character subdivision array formed by triangulating a first set number of key points of the reference character image, and the cartoon image processing information comprises a second cartoon subdivision array formed by triangulating a first set number of key points of the reference cartoon image;
correspondingly, determining the current state deformation information for cartoon display of the speaker according to the current key point sequence in combination with the predetermined character image processing information of the reference character image and the cartoon image processing information of the reference cartoon image comprises:
determining the current pose estimation information of the speaker according to the second current key point sequence;
performing triangulation on the second current key point sequence by adopting a set triangulation rule to obtain a second current subdivision array of the current character image frame;
determining, according to the second current subdivision array in combination with the second character subdivision array and the second cartoon subdivision array, a second target subdivision array required for cartoon display of the speaker through affine transformation;
and taking the current pose estimation information and the second target subdivision array as the current state deformation information of the speaker.
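For illustration only: pose estimation from a 2D key point sequence, as in claim 6, is commonly done with a PnP solve against a generic 3D face model. The six model points, the crude pinhole intrinsics and the use of `cv2.solvePnP` below are illustrative assumptions, not the patent's model.

```python
import cv2
import numpy as np

# Generic 3D face model points (nose tip, chin, eye corners, mouth corners);
# a widely used approximation, not the patent's model.
MODEL_POINTS = np.array([
    [0.0, 0.0, 0.0],           # nose tip
    [0.0, -330.0, -65.0],      # chin
    [-225.0, 170.0, -135.0],   # left eye outer corner
    [225.0, 170.0, -135.0],    # right eye outer corner
    [-150.0, -150.0, -125.0],  # left mouth corner
    [150.0, -150.0, -125.0],   # right mouth corner
], dtype=np.float64)

def estimate_pose(image_points, frame_size):
    """Return rotation and translation vectors of the head (the pose
    estimation information) from six matching 2D landmarks."""
    h, w = frame_size
    camera = np.array([[w, 0, w / 2],
                       [0, w, h / 2],
                       [0, 0, 1]], dtype=np.float64)  # crude pinhole intrinsics
    dist = np.zeros(4)                                # assume no lens distortion
    ok, rvec, tvec = cv2.solvePnP(MODEL_POINTS, image_points, camera, dist)
    return rvec, tvec

# Usage: pick the six matching landmarks out of the current key point sequence.
pts2d = np.array([[320, 240], [325, 380], [230, 180],
                  [410, 180], [260, 310], [380, 310]], dtype=np.float64)
rvec, tvec = estimate_pose(pts2d, frame_size=(480, 640))
```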
7. The method according to any one of claims 1 to 6, wherein the reference character image is a front face image captured in advance for the speaker by the camera; or, the reference character image is a previous character image frame relative to the current character image frame.
8. A character image processing method applied to a listening and speaking terminal, comprising the following steps:
receiving current state deformation information fed back by a speaker terminal, and acquiring image attribute information of a predetermined reference cartoon image, wherein the current state deformation information is determined by the method of any one of claims 1-7;
and forming a current cartoon image frame of the speaker corresponding to the speaker terminal according to the current state deformation information and the image attribute information, and displaying the current cartoon image frame in real time.
9. The method according to claim 8, wherein when the current state deformation information is a first target subdivision array required for cartoon display of the speaker, the image attribute information comprises cartoon image texture information and a first cartoon subdivision array formed by triangulating a fourth set number of key points of the reference cartoon image;
correspondingly, forming the current cartoon image frame of the speaker corresponding to the speaker terminal according to the current state deformation information and the image attribute information and displaying the current cartoon image frame in real time comprises:
determining a first target affine transformation matrix of each triangular patch forming the reference cartoon image according to the first target subdivision array and the first cartoon subdivision array;
and combining the cartoon image texture information based on the first target affine transformation matrices to form a current cartoon image frame of the speaker and displaying the current cartoon image frame in real time.
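For illustration only: determining a per-triangle affine matrix and combining it with the cartoon texture, as in claim 9, corresponds to a standard piecewise-affine warp. The bounding-rect-plus-mask pattern below is the usual OpenCV idiom, assumed here rather than quoted from the patent.

```python
import cv2
import numpy as np

def warp_triangle(src_img, dst_img, src_tri, dst_tri):
    """Warp one triangular patch of src_img onto dst_img using the affine
    matrix fitted between the two triangles (the per-patch step)."""
    src_tri = np.float32(src_tri)
    dst_tri = np.float32(dst_tri)
    x1, y1, w1, h1 = cv2.boundingRect(src_tri)
    x2, y2, w2, h2 = cv2.boundingRect(dst_tri)
    src_off = np.float32(src_tri - [x1, y1])
    dst_off = np.float32(dst_tri - [x2, y2])
    patch = src_img[y1:y1 + h1, x1:x1 + w1]
    m = cv2.getAffineTransform(src_off, dst_off)   # target affine matrix
    warped = cv2.warpAffine(patch, m, (w2, h2), flags=cv2.INTER_LINEAR,
                            borderMode=cv2.BORDER_REFLECT_101)
    mask = np.zeros((h2, w2, 3), np.uint8)
    cv2.fillConvexPoly(mask, np.int32(dst_off), (1, 1, 1))
    roi = dst_img[y2:y2 + h2, x2:x2 + w2]
    roi[:] = roi * (1 - mask) + warped * mask      # composite this patch

def render_cartoon(texture, cartoon_tris, target_tris):
    """Assemble the current cartoon image frame from all triangular patches."""
    out = np.zeros_like(texture)
    for src_tri, dst_tri in zip(cartoon_tris, target_tris):
        warp_triangle(texture, out, src_tri, dst_tri)
    return out
```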
10. The method according to claim 8, wherein when the current state deformation information is current pose estimation information of the speaker and a second target subdivision array required for cartoon display, the image attribute information comprises cartoon image texture information and a second cartoon subdivision array formed by triangulating a fifth set number of key points of the reference cartoon image;
correspondingly, forming the current cartoon image frame of the speaker corresponding to the speaker terminal according to the current state deformation information and the image attribute information and displaying the current cartoon image frame in real time comprises:
determining a second target affine transformation matrix of each triangular patch forming the reference cartoon image according to the second target subdivision array and the second cartoon subdivision array;
combining the cartoon image texture information based on the second target affine transformation matrices to form a cartoon expression image frame of the speaker;
and forming the current cartoon image frame of the speaker according to the current pose estimation information and the cartoon expression image frame and displaying the current cartoon image frame in real time.
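For illustration only: claim 10 additionally applies the pose estimation information to the cartoon expression frame. A minimal visualization, assuming the pose is reduced to its in-plane roll component, is a 2D rotation of the rendered frame; a full implementation would apply all three rotation angles.

```python
import cv2
import numpy as np

def apply_pose(expression_frame, roll_deg):
    """Rotate the cartoon expression frame by the estimated roll angle
    (a simplification: only the in-plane component of the pose is used)."""
    h, w = expression_frame.shape[:2]
    m = cv2.getRotationMatrix2D((w / 2, h / 2), roll_deg, 1.0)
    return cv2.warpAffine(expression_frame, m, (w, h))

# Usage with a dummy frame and a roll angle derived from rvec (see the
# pose-estimation sketch above).
frame = np.zeros((480, 480, 3), np.uint8)
current_frame = apply_pose(frame, roll_deg=7.5)
```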
11. The method according to any one of claims 8 to 10, wherein the reference cartoon image is selected and sent by a speaker of the speaker terminal in advance, and the image attribute information of the reference cartoon image is prestored in a set attribute information table;
or the reference cartoon image is a previous cartoon image frame relative to a current cartoon image frame to be displayed, and the image attribute information of the reference cartoon image is the image attribute information of the previous cartoon image frame.
12. A character image processing apparatus, comprising:
the character image acquisition module is used for intercepting the current character image frame of the speaker from the current live-action image frame captured by the camera;
the image deformation determining module is used for determining, by using a triangulation method, current state deformation information for cartoon display of the speaker according to the current character image frame in combination with a predetermined reference character image and a reference cartoon image, wherein the current state deformation information comprises the change of the speaker's current expression relative to the speaker's expression in the reference character image and the change of the speaker's current action relative to the speaker's action in the reference character image;
and the deformation information transmission module is used for feeding the current state deformation information back to the listening and speaking terminal, so that the listening and speaking terminal determines and displays the current cartoon image frame corresponding to the speaker according to the current state deformation information, wherein the listening and speaking terminal and the speaker terminal are in cloud video communication.
13. A character image processing apparatus, comprising:
a deformation information receiving module, configured to receive current state deformation information fed back by a speaker terminal, and acquire image attribute information of a predetermined reference cartoon image, wherein the current state deformation information is determined by the apparatus of claim 12;
and the cartoon image conversion module is used for forming a current cartoon image frame of the speaker corresponding to the speaker terminal according to the current state deformation information and the image attribute information and displaying the current cartoon image frame in real time.
14. A speaker terminal, comprising:
a processor and a memory; wherein the memory stores a computer program adapted to be loaded by the processor and to perform the method steps of any of claims 1-7.
15. A listening and speaking terminal, comprising:
a processor and a memory; wherein the memory stores a computer program adapted to be loaded by the processor and to perform the method steps of any of claims 8-11.
16. A storage medium containing computer-executable instructions for performing the method steps of any one of claims 1-11 when executed by a computer processor.
CN202010432534.7A 2020-05-20 2020-05-20 Figure image processing method and device, corresponding terminal and storage medium Active CN111614925B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010432534.7A CN111614925B (en) 2020-05-20 2020-05-20 Figure image processing method and device, corresponding terminal and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010432534.7A CN111614925B (en) 2020-05-20 2020-05-20 Figure image processing method and device, corresponding terminal and storage medium

Publications (2)

Publication Number Publication Date
CN111614925A CN111614925A (en) 2020-09-01
CN111614925B (en) 2022-04-26

Family

ID=72198946

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010432534.7A Active CN111614925B (en) 2020-05-20 2020-05-20 Figure image processing method and device, corresponding terminal and storage medium

Country Status (1)

Country Link
CN (1) CN111614925B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112132922A (en) * 2020-09-24 2020-12-25 扬州大学 Method for realizing cartoon of images and videos in online classroom
CN112330579A (en) * 2020-10-30 2021-02-05 中国平安人寿保险股份有限公司 Video background replacing method and device, computer equipment and computer readable medium
CN116684636B (en) * 2023-08-01 2023-10-13 清华大学 Semantic communication face reconstruction optimization method and system

Family Cites Families (24)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102222363B (en) * 2011-07-19 2012-10-03 杭州实时数码科技有限公司 Method for fast constructing high-accuracy personalized face model on basis of facial images
CN104205171A (en) * 2012-04-09 2014-12-10 英特尔公司 System and method for avatar generation, rendering and animation
CN105190700A (en) * 2013-06-04 2015-12-23 英特尔公司 Avatar-based video encoding
CN104935860A (en) * 2014-03-18 2015-09-23 北京三星通信技术研究有限公司 Method and device for realizing video calling
CN103971394B (en) * 2014-05-21 2017-03-15 中国科学院苏州纳米技术与纳米仿生研究所 Human face animation synthetic method
US9232189B2 (en) * 2015-03-18 2016-01-05 Avatar Merger Sub Ii, Llc. Background modification in video conferencing
CN104699247B (en) * 2015-03-18 2017-12-12 北京七鑫易维信息技术有限公司 A kind of virtual reality interactive system and method based on machine vision
US20170069124A1 (en) * 2015-04-07 2017-03-09 Intel Corporation Avatar generation and animations
CN107025678A (en) * 2016-01-29 2017-08-08 掌赢信息科技(上海)有限公司 A kind of driving method and device of 3D dummy models
CN107025679A (en) * 2016-01-29 2017-08-08 掌赢信息科技(上海)有限公司 The driving method and equipment of a kind of 2D dummy models
CN106101858A (en) * 2016-06-27 2016-11-09 乐视控股(北京)有限公司 A kind of video generation method and device
CN106331572A (en) * 2016-08-26 2017-01-11 乐视控股(北京)有限公司 Image-based control method and device
CN108876879B (en) * 2017-05-12 2022-06-14 腾讯科技(深圳)有限公司 Method and device for realizing human face animation, computer equipment and storage medium
CN109241810B (en) * 2017-07-10 2022-01-28 腾讯科技(深圳)有限公司 Virtual character image construction method and device and storage medium
CN107911644B (en) * 2017-12-04 2020-05-08 吕庆祥 Method and device for carrying out video call based on virtual face expression
CN108335345B (en) * 2018-02-12 2021-08-24 北京奇虎科技有限公司 Control method and device of facial animation model and computing equipment
US20190320141A1 (en) * 2018-04-17 2019-10-17 Hadif Hamad Mehad Mohamed Almheiri Method for initiating and organizing video calls in which video effects apply to video streams captured by cameras on users' devices
CN110969673B (en) * 2018-09-30 2023-12-15 西藏博今文化传媒有限公司 Live broadcast face-changing interaction realization method, storage medium, equipment and system
CN109410298B (en) * 2018-11-02 2023-11-17 北京恒信彩虹科技有限公司 Virtual model manufacturing method and expression changing method
CN109727303B (en) * 2018-12-29 2023-07-25 广州方硅信息技术有限公司 Video display method, system, computer equipment, storage medium and terminal
CN110148082B (en) * 2019-04-02 2023-03-28 杭州小影创新科技股份有限公司 Mobile terminal face image face real-time deformation adjusting method
CN110136229B (en) * 2019-05-27 2023-07-14 广州亮风台信息科技有限公司 Method and equipment for real-time virtual face changing
CN110620884B (en) * 2019-09-19 2022-04-22 平安科技(深圳)有限公司 Expression-driven-based virtual video synthesis method and device and storage medium
CN110677610A (en) * 2019-10-08 2020-01-10 Oppo广东移动通信有限公司 Video stream control method, video stream control device and electronic equipment

Also Published As

Publication number Publication date
CN111614925A (en) 2020-09-01

Similar Documents

Publication Publication Date Title
CN111614925B (en) Figure image processing method and device, corresponding terminal and storage medium
KR102380222B1 (en) Emotion recognition in video conferencing
US7057662B2 (en) Retractable camera apparatus
CN110390704B (en) Image processing method, image processing device, terminal equipment and storage medium
US20220222893A1 (en) Method and apparatus for generating three-dimensional face model, computer device, and storage medium
WO2020078027A1 (en) Image processing method, apparatus and device
US20080136895A1 (en) Mute Function for Video Applications
WO2019075666A1 (en) Image processing method and apparatus, terminal, and storage medium
US20090327418A1 (en) Participant positioning in multimedia conferencing
CN106846336A (en) Extract foreground image, replace the method and device of image background
CN107437272B (en) Interactive entertainment method and device based on augmented reality and terminal equipment
CN109147012B (en) Image processing method and device
CN104836981A (en) Intelligent meeting collaborative method and meeting terminal
CN110502974A (en) A kind of methods of exhibiting of video image, device, equipment and readable storage medium storing program for executing
WO2023093897A1 (en) Image processing method and apparatus, electronic device, and storage medium
CN113196239A (en) Intelligent management of content related to objects displayed within a communication session
CN110536095A (en) Call method, device, terminal and storage medium
KR20210113948A (en) Method and apparatus for generating virtual avatar
CN107610239B (en) Virtual try-on method and device for facial makeup
CN110427227B (en) Virtual scene generation method and device, electronic equipment and storage medium
CN111047509A (en) Image special effect processing method and device and terminal
US11631154B2 (en) Method, apparatus, device and storage medium for transforming hairstyle
CN109754464A (en) Method and apparatus for generating information
WO2023016107A1 (en) Remote interaction method, apparatus and system, and electronic device and storage medium
CN108353127A (en) Image stabilization based on depth camera

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant