CN114419177A - Personalized expression package generation method and system, electronic equipment and readable medium - Google Patents

Personalized expression package generation method and system, electronic equipment and readable medium

Info

Publication number
CN114419177A
Authority
CN
China
Prior art keywords
expression
feature
package
semantic
action
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210018166.0A
Other languages
Chinese (zh)
Inventor
单志辉
孙环荣
宫新伟
陈兆金
徐灵敏
赵世亭
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Xunze Network Technology Co ltd
Original Assignee
Shanghai Xunze Network Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Xunze Network Technology Co ltd filed Critical Shanghai Xunze Network Technology Co ltd
Priority to CN202210018166.0A
Publication of CN114419177A
Legal status: Pending (Current)

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 11/00 - 2D [Two Dimensional] image generation
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/08 - Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses a personalized expression package generation method and system, electronic equipment and a readable medium, the method comprising the following steps: S100, in response to an expression package tool triggering the expression package production configuration, receiving a plurality of source images containing a target face, each target face carrying different expression action information; S200, using a preset feature labeling model to extract expression-action-related features from the target face in each source image, generating an expression feature library for each expression action, and semantically labeling each expression feature library to obtain a plurality of expression feature libraries with semantic results; S300, using a feature fusion model to receive the expression feature information in each expression feature library, fusing the plurality of expression feature libraries with semantic results into a personalized expression package of the target face after ViT (Vision Transformer) processing, and displaying the corresponding target-face expression action image according to a semantic result. The invention can directly use a semantic result contained in the input content to display the expression action image matched with that semantic result.

Description

Personalized expression package generation method and system, electronic equipment and readable medium
Technical Field
The invention relates to the technical field of computer vision and deep learning, in particular to a method and a system for generating a personalized expression package, electronic equipment and a readable medium.
Background
With the continuous development of social networks, people have gradually moved from plain-text communication to communication using symbols, images, emoticons and the like. Emoticons enrich a user's chat content, compensate for the shortcomings of text-only communication such as dullness and imprecise expression of meaning, and improve communication efficiency; for example, an emotion that is difficult to describe in words can be conveyed by the expression information carried in an emoticon. Many expression packages currently take face images as their material, but most of them are ready-made expression package materials stored by the user; they cannot be recognized and defined from a face image, and instead require the user to define the face image manually.
Disclosure of Invention
The embodiments of the present application provide a personalized expression package generation method and system, an electronic device and a readable medium, to overcome the defects of the prior art that expressions are produced by rigid synthesis and that the user must select the expression action to be displayed, and to achieve the beneficial effect that an expression action image matched with a semantic result can be displayed directly from the semantic result contained in the input content.
In a first aspect, an embodiment of the present application provides a method for generating a personalized facial expression package, where the method includes:
s100, when an expression package tool triggers the production configuration of an expression package, receiving a plurality of source images with target faces, wherein each target face has different expression action information;
s200, performing expression action-related feature extraction on the target face in each source image by using a preset feature label model, generating expression feature libraries related to each expression action, and performing semantic labeling on each expression feature library to obtain a plurality of expression feature libraries with semantic results;
s300, receiving the expression feature information in each expression feature library by using a feature fusion model, fusing a plurality of expression feature libraries with semantic results into an individualized expression package of the target face after VIT visual processing, and displaying corresponding target face expression action images according to the semantic results.
Further, after step S300, the method further includes storing the personalized expression package of the target face generated by fusion, so that when the expression package tool triggers the expression package application configuration, a semantic result related to the personalized expression package is parsed from the on-screen input content, the personalized expression package of the target face is called from the expression library, and the target-face expression action image matched with the semantic result is displayed.
Further, after step S300, when the expression package tool triggers the expression package application configuration, text content input on the screen is received; when a semantic result associated with the personalized expression package is parsed from the text content, the personalized expression package in the expression library is called and the expression action image matched with the semantic result is displayed.
Further, in the step S200, the feature label model performs feature extraction related to expression actions by using a deep learning technique, generates an expression feature library related to each expression action, and performs semantic labeling related to the expression actions on each expression feature library.
Further, the feature labeling model adopts a convolutional neural network for feature extraction, which includes:
performing convolution down-sampling on each source image multiple times to complete overall feature extraction;
performing convolution down-sampling on the target face in each source image multiple times to complete key-point feature extraction;
and performing convolution down-sampling on each target face multiple times to complete extraction of the feature contours related to expression actions.
Further, in step S300, the method for fusing a plurality of facial expression feature libraries with semantic results into a personalized facial expression package of the target face includes:
s310, obtaining an expression feature library subjected to convolution down-sampling for multiple times;
s320, merging, at each down-sampling level, the convolution down-sampled feature information from the different expression feature libraries, inputting the merged features into a pre-constructed ViT network architecture, and up-sampling to obtain a first fusion feature;
s330, concatenating the first fusion features of different levels along the channel dimension and performing convolution up-sampling to obtain a second fusion feature;
s340, concatenating the first fusion feature and the second fusion feature again along the channel dimension to obtain a third fusion feature;
and S350, performing convolution up-sampling on the third fusion feature for multiple times to obtain a fused personalized expression package.
Further, after step S340, obtaining a target face image of the desired expression action by convolution down-sampling is also included.
Further, in the ViT network architecture, the merged feature information is received, and the semantically labeled feature information is partitioned to obtain a plurality of feature image patches;
and each feature image patch is linearly transformed, patch embedding is realized through dimension-reduction processing, and after spatial position encoding is applied to the patch embeddings they are fed into a Transformer encoder in sequence, so as to fuse the features of the various expression actions.
In a second aspect, an embodiment of the present application provides a personalized facial expression package generating system, where the method described in any one of the first aspects is adopted, and the system includes:
the system comprises an image receiving module, a display module and a display module, wherein the image receiving module is configured to respond to an expression package tool to trigger expression package manufacturing configuration, and receive a plurality of source images with target faces, wherein each target face has different expression action information;
the semantic labeling module is configured to extract features related to expression actions of the target face in the source images by using a preset feature labeling model, generate expression feature libraries related to the expression actions, and perform semantic labeling on the expression feature libraries to obtain a plurality of expression feature libraries with semantic results;
and the expression generation module is configured to receive expression feature information in each expression feature library by using a feature fusion model, fuse the plurality of expression feature libraries with semantic results into an individualized expression package of the target face after VIT visual processing, and display a corresponding target face expression action image according to the semantic results.
In a third aspect, an embodiment of the present application provides an electronic device, including:
one or more processors;
a memory for storing one or more programs;
the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method as in any one of the first aspects.
In a fourth aspect, the present application provides a computer-readable medium, on which a computer program is stored, which when executed by a processor implements the method according to any one of the first aspect.
The technical scheme provided in the embodiment of the application has at least the following technical effects:
1. Compared with the traditional expression image synthesis technology, the present application replaces rigid image-copy "nesting" with a fusion technique. In the traditional synthesis technology, the face image is copied directly onto the required expression image, or expression images of different styles are selected. In this embodiment, after the expression feature images are extracted, they are partitioned into patches and processed by the deep-learning ViT (Vision Transformer) technique to form an integrated personalized expression package covering multiple expression actions, and the expression the user needs is generated and displayed automatically according to the display requirement.
2. The present application combines natural language processing: the expression feature images are semantically annotated so that they carry corresponding semantic information, enter the ViT network architecture as a one-dimensional input sequence for learning and training, and form expression feature images with semantic results. When the user needs a particular expression action image, the image corresponding to the semantic result can therefore be displayed directly from the on-screen input content, without selecting it from a set of expression packages. In other words, once the personalized expression package has been produced, entering a related expression keyword is enough to present the corresponding expression action image; for example, entering 'pout' causes the system to return the 'pout' expression generated from the face supplied by the user. The expression action image in this embodiment may be a still image or an animated image. The generation configuration of the personalized expression package in this embodiment allows a custom face image to be turned into the action expression images of the required expression package, for example making an expression package from images of the user or of collected public figures, which increases affinity and interest.
Drawings
Fig. 1 is a flowchart of a method for generating a personalized facial expression package according to an embodiment of the present application;
FIG. 2 is a schematic structural diagram of a fusion algorithm in the first embodiment of the present application;
fig. 3 is a block diagram of a personalized facial expression package generating system according to a second embodiment of the present application.
Detailed Description
In order to better understand the technical solution, the technical solution will be described in detail with reference to the drawings and the specific embodiments.
Embodiment 1
Referring to fig. 1-2, an embodiment of the present application provides a method for generating a personalized facial expression package, where the method includes:
step S100, when the expression package manufacturing configuration is triggered in response to an expression package tool, receiving a plurality of source images with target faces, wherein each target face has different expression action information.
The expression package tool in the personalized expression package generation method in the embodiment is loaded in a user terminal, the user terminal may be a smart phone, a tablet computer, or the like, and the loading mode may be an independent APP application program, a wechat applet, or a plug-in embedded into an application program such as an input method. And triggering and opening a source image receiving configuration when the expression package tool triggers the expression package making configuration function, wherein the source image can be picture information stored in the user terminal or picture information acquired by a camera on the current user terminal. For example, when the emoticon manufacturing configuration is triggered to be opened, a camera on the user terminal is called, and image acquisition of different emotions is performed according to the number of required source images, such as emotions like laughing, smiling, head bending, mouth pounding, crying and the like.
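The collection step just described can be sketched as a simple loop. This is only an illustrative sketch: the expression list and the get_stored_picture / open_camera callables are hypothetical names standing in for the terminal's storage and camera interfaces, which the embodiment does not specify.

```python
# Minimal sketch of the production-configuration trigger (names are assumptions).
REQUIRED_EXPRESSIONS = ["laugh", "smile", "head tilt", "pout", "cry"]

def collect_source_images(get_stored_picture, open_camera):
    """Return a mapping from expression action to one source image of the target face."""
    sources = {}
    for expr in REQUIRED_EXPRESSIONS:
        img = get_stored_picture(expr)          # picture already stored on the terminal
        if img is None:
            img = open_camera(prompt=expr)      # otherwise acquire it with the camera
        sources[expr] = img
    return sources
```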
Step S200, performing expression action-related feature extraction on the target face in each source image by using a preset feature label model, generating expression feature libraries related to each expression action, and performing semantic labeling on each expression feature library to obtain a plurality of expression feature libraries with semantic results.
In the step S200, the feature label model performs feature extraction related to expression actions by using a deep learning technique, generates an expression feature library related to each expression action, and performs semantic labeling related to expression actions on each expression feature library.
The feature labeling model uses a convolutional neural network for feature extraction, which includes: performing convolution down-sampling on each source image multiple times to complete overall feature extraction; performing convolution down-sampling on the target face in each source image multiple times to complete key-point feature extraction; and performing convolution down-sampling on each target face multiple times to complete extraction of the feature contours related to expression actions.
To explain further, features are extracted from each source image, and after multiple convolution down-sampling passes the result of each pass is labeled Si; key-point features are extracted from the target face in each source image, and the result of each pass is labeled Li; the feature contours related to expression actions are extracted from each target face, and the result of each pass is labeled ei; an expression feature library comprising Si, Li and ei is thereby obtained.
In this embodiment, the number of convolution down-sampling passes is set to 5, so that Si = {S0, S1, S2, S3, S4}, Li = {L0, L1, L2, L3, L4} and ei = {e0, e1, e2, e3, e4}.
That is, the source image is convolution down-sampled five times, and the results of the passes are labeled S0, S1, S2, S3, S4 in order. 68 key points are extracted on the target face, and the eye and mouth regions of the target face are extracted; taking the expression actions "pout, smile, head tilt" as an example, five convolution down-sampling passes are performed and the results are labeled L0, L1, L2, L3, L4 in order. The contours of the eyes and mouth of the target face are extracted, showing the eye and mouth outlines for pouting, smiling and head tilting; these are likewise convolution down-sampled five times, and the results are labeled e0, e1, e2, e3, e4 in order.
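The embodiment does not give an implementation of the feature labeling model, but the three-branch, five-pass down-sampling described above can be sketched as follows. This is a minimal sketch assuming PyTorch; the channel widths, the shared DownBlock structure, the input resolutions and the separate face and eye/mouth crops are illustrative assumptions rather than details taken from the text.

```python
import torch
import torch.nn as nn

class DownBlock(nn.Module):
    """One convolution + stride-2 down-sampling pass."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.body(x)

class FiveLevelExtractor(nn.Module):
    """Runs 5 convolution down-sampling passes and keeps every intermediate result,
    mirroring the S0..S4 / L0..L4 / e0..e4 notation of the embodiment."""
    def __init__(self, in_ch=3, widths=(32, 64, 128, 256, 512)):
        super().__init__()
        chans = [in_ch] + list(widths)
        self.stages = nn.ModuleList(DownBlock(chans[i], chans[i + 1]) for i in range(len(widths)))

    def forward(self, x):
        feats = []
        for stage in self.stages:
            x = stage(x)
            feats.append(x)        # one feature map per down-sampling pass
        return feats               # [f0, f1, f2, f3, f4]

# One branch per feature type described in the text.
source_branch   = FiveLevelExtractor()   # whole source image            -> Si
keypoint_branch = FiveLevelExtractor()   # target-face crop (68 points)  -> Li
contour_branch  = FiveLevelExtractor()   # eye/mouth regions             -> ei

source_img = torch.randn(1, 3, 256, 256)   # dummy source image
face_crop  = torch.randn(1, 3, 256, 256)   # dummy aligned face crop
eye_mouth  = torch.randn(1, 3, 256, 256)   # dummy eye/mouth region crop

Si = source_branch(source_img)     # [S0, S1, S2, S3, S4]
Li = keypoint_branch(face_crop)    # [L0, L1, L2, L3, L4]
ei = contour_branch(eye_mouth)     # [e0, e1, e2, e3, e4]
expression_feature_library = {"S": Si, "L": Li, "e": ei}
```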
Step S300, receiving the expression feature information in each expression feature library with a feature fusion model, and fusing the plurality of expression feature libraries with semantic results into a personalized expression package of the target face after ViT processing, so that the corresponding target-face expression action image can be displayed according to a semantic result.
Step S300 is followed by storing the personalized expression package of the target face generated by fusion, so that when the expression package tool triggers the expression package application configuration, a semantic result related to the personalized expression package is parsed from the on-screen input content, the personalized expression package of the target face is called from the expression library, and the target-face expression action image matched with the semantic result is displayed.
Further, after step S300, when the expression package tool triggers the expression package application configuration, text content input on the screen is received; when a semantic result associated with the personalized expression package is parsed from the text content, the personalized expression package in the expression library is called and the expression action image matched with the semantic result is displayed.
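As an illustration, the application step above amounts to a keyword lookup against the stored personalized expression package. The following is a minimal sketch under assumptions: the label set, the file names and the substring-matching rule are placeholders; the embodiment does not specify how semantic results are parsed from the input text.

```python
from typing import Optional

# Expression library produced by the fusion step; labels and file names are placeholders.
expression_library = {
    "pout": "target_face_pout.png",
    "smile": "target_face_smile.png",
    "head tilt": "target_face_head_tilt.png",
}

def parse_semantic_result(text: str) -> Optional[str]:
    """Return the first semantic label whose keyword appears in the on-screen input."""
    lowered = text.lower()
    for label in expression_library:
        if label in lowered:
            return label
    return None

def apply_expression_package(text: str) -> Optional[str]:
    """Expression-package application configuration: parse a semantic result from the
    input text and return the matching target-face expression action image."""
    label = parse_semantic_result(text)
    return expression_library.get(label) if label else None

print(apply_expression_package("that makes me smile"))   # -> target_face_smile.png
```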
In step S300, the method for fusing a plurality of expression feature libraries with semantic results into a personalized expression package of a target face includes:
and step S310, obtaining the expression feature library after convolution and downsampling for many times.
Step S320, merging, at each down-sampling level, the convolution down-sampled feature information from the different expression feature libraries, inputting the merged features into the pre-constructed ViT network architecture, and up-sampling to obtain a first fusion feature.
Step S330, concatenating the first fusion features of different levels along the channel dimension and performing convolution up-sampling to obtain a second fusion feature.
Step S340, concatenating the first fusion feature and the second fusion feature again along the channel dimension to obtain a third fusion feature.
And S350, performing convolution up-sampling on the third fusion feature for multiple times to obtain a fused personalized expression package.
To explain further, the method of fusing the features extracted after the 5 convolution down-sampling passes is as follows: {S4, L4, e4} are merged, fed into the ViT network architecture and then up-sampled to obtain a first fusion feature E1; {S3, L3, e3} are merged and fed into the ViT network architecture to obtain a first fusion feature E2; the first fusion features E1 and E2 are concatenated along the channel dimension and convolution up-sampled to obtain a second fusion feature E3; {S2, L2, e2} are merged and fed into the ViT network architecture to obtain a first fusion feature E4, and the second fusion feature E3 and the first fusion feature E4 are concatenated along the channel dimension to obtain a third fusion feature; the third fusion feature is then convolution up-sampled multiple times to obtain the fused personalized expression package. In this embodiment, the third fusion feature is convolution up-sampled three times. The resulting personalized expression package can present expression actions such as pouting, smiling and head tilting, with only one expression action presented at a time.
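To show how this multi-level fusion might be wired together, the following is a minimal PyTorch sketch. The channel counts, spatial sizes, the bilinear up-sampling, and the reduction of the ViT block to a 1x1 projection are assumptions made only to keep the example short; the real block performs the full ViT processing described further below.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ViTFusionBlock(nn.Module):
    """Stand-in for the pre-constructed ViT network that fuses one level of merged
    features; reduced here to a 1x1 projection to keep the sketch short."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, out_ch, kernel_size=1)

    def forward(self, x):
        return self.proj(x)

def up(x, scale=2):
    return F.interpolate(x, scale_factor=scale, mode="bilinear", align_corners=False)

# Dummy features from the three branches; channel counts and sizes are assumptions.
S4, L4, e4 = (torch.randn(1, 512, 8, 8) for _ in range(3))
S3, L3, e3 = (torch.randn(1, 256, 16, 16) for _ in range(3))
S2, L2, e2 = (torch.randn(1, 128, 32, 32) for _ in range(3))

vit4, vit3, vit2 = ViTFusionBlock(1536, 256), ViTFusionBlock(768, 256), ViTFusionBlock(384, 128)
fuse_conv = nn.Conv2d(512, 256, kernel_size=3, padding=1)

E1 = up(vit4(torch.cat([S4, L4, e4], dim=1)))        # merge level 4 -> ViT -> up-sample
E2 = vit3(torch.cat([S3, L3, e3], dim=1))            # merge level 3 -> ViT
E3 = up(fuse_conv(torch.cat([E1, E2], dim=1)))       # channel concat -> convolution up-sample
E4 = vit2(torch.cat([S2, L2, e2], dim=1))            # merge level 2 -> ViT
third_fusion = torch.cat([E3, E4], dim=1)            # third fusion feature

# Three convolution up-sampling passes to obtain the fused expression image.
decoder = nn.Sequential(
    nn.Upsample(scale_factor=2), nn.Conv2d(384, 128, 3, padding=1), nn.ReLU(inplace=True),
    nn.Upsample(scale_factor=2), nn.Conv2d(128, 64, 3, padding=1), nn.ReLU(inplace=True),
    nn.Upsample(scale_factor=2), nn.Conv2d(64, 3, 3, padding=1),
)
personalized_package = decoder(third_fusion)          # (1, 3, 256, 256) image tensor
print(personalized_package.shape)
```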
After step S340, the method further includes obtaining a target face image of the desired expression action by convolution down-sampling. That is, in order to apply the personalized expression package, a target face image of the target expression action needs to be acquired through a reverse operation of the fusion operation.
Further, in the ViT network architecture, the merged feature information is received, and the semantically labeled feature information is partitioned to obtain a plurality of feature image patches; each feature image patch is linearly transformed, patch embedding is realized through dimension-reduction processing, and after spatial position encoding is applied to the patch embeddings they are fed into a Transformer encoder in sequence, so as to fuse the features of the various expression actions.
In the VIT network architecture, based on a transform technology, image information in an expression feature library is divided into image blocks patches, position embedding imbedding is carried out on the expression feature information while linear embedding is carried out, and the position embedding imbedding is carried out on the expression feature information and is input into a standard transform encoder in sequence. To further illustrate, in this embodiment, the standard Transformer encoder receives a semantic result token of the one-dimensional sequence as an input, and in order to process the two-dimensional expression feature image, the expression feature image is expressed as an input
Figure BDA0003460923260000081
Deformed into a series of flattened two-dimensional feature image blocks, represented as
Figure BDA0003460923260000082
Where (H, W) denotes the resolution of the original image, p2Representing the resolution of each characteristic image block. N ═ HW/p2Is the effective sequence length of the VIT network structure. The VIT network architecture uses the same width at all layers, i.e., each vectorized feature image block is mapped onto a model dimension for a trainable linear projection, and the corresponding output feature image block is embedded. The following equation is expressed:
Figure BDA0003460923260000083
embedding in a sequence of feature image blocks
Figure BDA0003460923260000084
Previously, a deep learning embedding was added, in a transform encoder
Figure BDA0003460923260000085
The states in the output may be represented as an image y, as in the formula:
Figure BDA0003460923260000086
in the pre-training and fine-tuning stage, the classification head is attached to
Figure BDA0003460923260000087
The Transformer encoder consists of multiple interactive layers of multi-headed attention (MSA) and MLP, each feature image block preceded by layerorm (ln), and residual concatenation applied after each block. MLP contains two layers that exhibit GELU nonlinearity, as shown by the formula: z' ofl=MSA(LN(zl-1))+zl-1,l=1...L;zl=MLP(LN(z`l))+z`lL is 1.
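The patch embedding and pre-LayerNorm encoder just described can be sketched in PyTorch as follows. This is a minimal sketch under assumptions: the image size, patch size, depth, width, head count and the use of nn.MultiheadAttention are illustrative choices; the embodiment does not fix these hyper-parameters or name a specific library.

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """Pre-LayerNorm block: z' = MSA(LN(z)) + z ; z = MLP(LN(z')) + z'."""
    def __init__(self, dim, heads=8, mlp_ratio=4):
        super().__init__()
        self.ln1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ln2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio), nn.GELU(), nn.Linear(dim * mlp_ratio, dim)
        )

    def forward(self, z):
        h = self.ln1(z)
        z = z + self.attn(h, h, h, need_weights=False)[0]   # MSA with residual connection
        return z + self.mlp(self.ln2(z))                    # MLP with residual connection

class ViTEncoder(nn.Module):
    """Patchify -> linear projection E -> prepend x_class -> add E_pos -> L blocks."""
    def __init__(self, img_size=64, patch=8, in_ch=3, dim=256, depth=6):
        super().__init__()
        num_patches = (img_size // patch) ** 2                                   # N = HW / P^2
        self.patchify = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)   # projection E
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))                    # learnable x_class
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))      # E_pos
        self.blocks = nn.ModuleList(TransformerBlock(dim) for _ in range(depth))
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):
        z = self.patchify(x).flatten(2).transpose(1, 2)      # (B, N, D) patch embeddings
        cls = self.cls_token.expand(x.shape[0], -1, -1)
        z = torch.cat([cls, z], dim=1) + self.pos_embed      # z_0
        for blk in self.blocks:
            z = blk(z)
        z = self.norm(z)
        return z[:, 0], z[:, 1:]                             # y = LN(z_L^0), plus patch tokens

encoder = ViTEncoder()
y, tokens = encoder(torch.randn(2, 3, 64, 64))
print(y.shape, tokens.shape)   # torch.Size([2, 256]) torch.Size([2, 64, 256])
```

In a fusion pipeline like the one sketched earlier, the returned patch tokens (rather than only $y$) would presumably be what is up-sampled and concatenated across levels; the embodiment does not spell this detail out.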
It can be seen that, compared with the traditional expression image synthesis technology, this embodiment replaces rigid image-copy "nesting" with a fusion technique. In the traditional synthesis technology, the face image is copied directly onto the required expression image, or expression images of different styles are selected; in this embodiment, after the expression feature images are extracted, they are partitioned into patches and processed by the deep-learning ViT (Vision Transformer) technique to form an integrated personalized expression package covering multiple expression actions, and the expression the user needs is generated and displayed automatically according to the display requirement.
By combining natural language processing, the expression feature images are semantically annotated so that they carry corresponding semantic information and enter the ViT network architecture as a one-dimensional input sequence for learning and training, forming expression feature images with semantic results. In other words, once the personalized expression package has been produced, entering a related expression keyword is enough to present the corresponding expression action image; for example, entering 'pout' causes the system to return the 'pout' expression generated from the face supplied by the user. The expression action image in this embodiment may be a still image or an animated image. The generation configuration of the personalized expression package in this embodiment allows a custom face image to be turned into the action expression images of the required expression package, for example making an expression package from images of the user or of collected public figures, which increases affinity and interest.
Embodiment 2
Referring to fig. 3, the present embodiment provides a personalized expression package generating system, which employs the method according to any one of the embodiments, and the system includes:
the image receiving module 100 is configured to receive a plurality of source images with target faces when an expression package tool triggers an expression package making configuration, wherein each target face has different expression action information.
The semantic labeling module 200 is configured to perform expression and motion related feature extraction on the target face in each source image by using a preset feature label model, generate an expression feature library related to each expression motion, and perform semantic labeling on each expression feature library to obtain a plurality of expression feature libraries with semantic results.
The expression generation module 300 is configured to receive expression feature information in each expression feature library by using a feature fusion model, and fuse a plurality of expression feature libraries with semantic results into an individualized expression package of a target face after VIT visual processing, so as to display a corresponding target face expression action image according to the semantic results.
Embodiment 3
This embodiment provides an electronic device, including: one or more processors; a memory for storing one or more programs; the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of Embodiment 1.
This embodiment also provides a computer-readable medium on which a computer program is stored, the computer program, when executed by a processor, carrying out the method of Embodiment 1.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the invention.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.

Claims (11)

1. A method for generating a personalized expression package, the method comprising:
s100, when an expression package tool triggers the production configuration of an expression package, receiving a plurality of source images with target faces, wherein each target face has different expression action information;
s200, performing expression action-related feature extraction on the target face in each source image by using a preset feature label model, generating expression feature libraries related to each expression action, and performing semantic labeling on each expression feature library to obtain a plurality of expression feature libraries with semantic results;
s300, receiving the expression feature information in each expression feature library by using a feature fusion model, fusing a plurality of expression feature libraries with semantic results into an individualized expression package of the target face after VIT visual processing, and displaying corresponding target face expression action images according to the semantic results.
2. The method for generating the personalized expression package according to claim 1, further comprising, after the step S300, storing the personalized expression package of the target face generated by fusion, so that when the expression package tool triggers expression package application configuration, after parsing a semantic result related to the personalized expression package from on-screen input content, calling the personalized expression package of the target face from the expression library, and displaying a target facial expression action image matched with the semantic result.
3. The method for generating the personalized expression package according to claim 1, wherein after the step S300, when an expression package application configuration is triggered in response to an expression package tool, receiving text content input on a screen, analyzing one semantic result with the personalized expression package in the text content, triggering and calling the personalized expression package in the expression library, and displaying an expression action image matched with the semantic result.
4. The method as claimed in claim 1, wherein in step S200, the feature label model performs feature extraction related to expression and action by using a deep learning technique, generates an expression feature library related to each expression and action, and performs semantic labeling related to expression and action on each expression feature library.
5. The method of generating personalized expression packages according to claim 4, wherein the feature labeling model employs a convolutional neural network for feature extraction, which comprises:
performing convolution down-sampling processing on each source image for multiple times to complete feature extraction;
performing convolution down-sampling processing on the target face in each source image for multiple times to complete key point feature extraction;
and after carrying out convolution down-sampling processing on each target face for multiple times, finishing feature contour extraction related to expression and action.
6. The method for generating personalized expression packages according to claim 5, wherein in step S300, the method for fusing a plurality of expression feature libraries with semantic results into the personalized expression package of the target face comprises:
s310, obtaining an expression feature library subjected to convolution down-sampling for multiple times;
s320, merging the feature information after convolution down-sampling processing in different expression feature libraries each time, inputting the merged feature information into a pre-constructed VIT network architecture, and up-sampling to obtain a first fusion feature;
s330, after the first fusion features of different times are spliced according to channels, convolution up-sampling is carried out to obtain second fusion features;
s340, splicing the first fusion feature and the second fusion feature again according to the channel to obtain a third fusion feature;
and S350, performing convolution up-sampling on the third fusion feature for multiple times to obtain a fused personalized expression package.
7. The method of generating a personalized expression package according to claim 6, further comprising, after the step S340, obtaining the target facial image of the desired expression action by convolution down-sampling.
8. The method for generating personalized expression packages according to claim 6, wherein in the VIT network architecture, the combined feature information is received, the feature information with semantic tags is subjected to blocking processing, and a plurality of feature image blocks are obtained;
and performing linear transformation on each feature image block, realizing feature image block embedding through dimension reduction processing, and inputting the feature image blocks into a Transformer encoder in sequence after spatial position coding is performed on the feature image block embedding, so as to perform fusion of the various action expression features.
9. A system for generating a personalized expression package, using the method of any one of claims 1 to 8, the system comprising:
the system comprises an image receiving module, a display module and a display module, wherein the image receiving module is configured to respond to an expression package tool to trigger expression package manufacturing configuration, and receive a plurality of source images with target faces, wherein each target face has different expression action information;
the semantic labeling module is configured to extract features related to expression actions of the target face in the source images by using a preset feature labeling model, generate expression feature libraries related to the expression actions, and perform semantic labeling on the expression feature libraries to obtain a plurality of expression feature libraries with semantic results;
and the expression generation module is configured to receive expression feature information in each expression feature library by using a feature fusion model, fuse the plurality of expression feature libraries with semantic results into an individualized expression package of the target face after VIT visual processing, and display a corresponding target face expression action image according to the semantic results.
10. An electronic device, comprising:
one or more processors;
a memory for storing one or more programs;
the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any of claims 1-8.
11. A computer-readable medium, on which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1-8.
CN202210018166.0A 2022-01-07 2022-01-07 Personalized expression package generation method and system, electronic equipment and readable medium Pending CN114419177A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210018166.0A CN114419177A (en) 2022-01-07 2022-01-07 Personalized expression package generation method and system, electronic equipment and readable medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210018166.0A CN114419177A (en) 2022-01-07 2022-01-07 Personalized expression package generation method and system, electronic equipment and readable medium

Publications (1)

Publication Number Publication Date
CN114419177A true CN114419177A (en) 2022-04-29

Family

ID=81271855

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210018166.0A Pending CN114419177A (en) 2022-01-07 2022-01-07 Personalized expression package generation method and system, electronic equipment and readable medium

Country Status (1)

Country Link
CN (1) CN114419177A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115330913A (en) * 2022-10-17 2022-11-11 广州趣丸网络科技有限公司 Three-dimensional digital human mouth-shape generation method and device, electronic equipment and storage medium


Similar Documents

Publication Publication Date Title
CN111858954B (en) Task-oriented text-generated image network model
WO2021073417A1 (en) Expression generation method and apparatus, device and storage medium
CN111401216B (en) Image processing method, model training method, image processing device, model training device, computer equipment and storage medium
CN111553267B (en) Image processing method, image processing model training method and device
CN111369646B (en) Expression synthesis method integrating attention mechanism
CN115565238B (en) Face-changing model training method, face-changing model training device, face-changing model training apparatus, storage medium, and program product
Saunders et al. Anonysign: Novel human appearance synthesis for sign language video anonymisation
CN113536999A (en) Character emotion recognition method, system, medium and electronic device
JP2023543964A (en) Image processing method, image processing device, electronic device, storage medium and computer program
Shahzad et al. Role of zoning in facial expression using deep learning
Gündüz et al. Turkish sign language recognition based on multistream data fusion
CN114419177A (en) Personalized expression package generation method and system, electronic equipment and readable medium
Renjith et al. Indian sign language recognition: A comparative analysis using cnn and rnn models
Sönmez et al. Convolutional neural networks with balanced batches for facial expressions recognition
CN114677569B (en) Character-image pair generation method and device based on feature decoupling
CN113761281A (en) Virtual resource processing method, device, medium and electronic equipment
Ezekiel et al. Investigating GAN and VAE to train DCNN
CN113129399A (en) Pattern generation
Kong et al. DualPathGAN: Facial reenacted emotion synthesis
US20240169701A1 (en) Affordance-based reposing of an object in a scene
Sreerenganathan et al. Assistive Application for the Visually Impaired using Machine Learning and Image Processing
CN116304163B (en) Image retrieval method, device, computer equipment and medium
CN115147850B (en) Training method of character generation model, character generation method and device thereof
Alongi Towards multi-granular explainable AI: increasing the explainability level for deepfake detectors
Kumar et al. Computer Vision and Creative Content Generation: Text-to-Sketch Conversion

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination