CN117576245B - Method and device for converting style of image, electronic equipment and storage medium - Google Patents

Method and device for converting style of image, electronic equipment and storage medium

Info

Publication number: CN117576245B
Authority: CN (China)
Prior art keywords: image, style, network, sample, target
Legal status: Active
Application number: CN202410055822.3A
Other languages: Chinese (zh)
Other versions: CN117576245A
Inventors: 刘艺, 蓝玮毓
Current Assignee: Tencent Technology Shenzhen Co Ltd
Original Assignee: Tencent Technology Shenzhen Co Ltd
Application filed by Tencent Technology Shenzhen Co Ltd; priority to CN202410055822.3A
Publication of application CN117576245A; application granted and published as CN117576245B


Classifications

    • G06T 11/00: 2D [Two Dimensional] image generation
    • G06N 3/0464: Neural networks; convolutional networks [CNN, ConvNet]
    • G06N 3/0475: Neural networks; generative networks
    • G06N 3/094: Neural network learning methods; adversarial learning
    • G06V 10/806: Image or video recognition using machine learning; fusion of extracted features
    • G06V 10/82: Image or video recognition using neural networks
    • G06V 40/168: Human faces; feature extraction; face representation


Abstract

The present application relates to the field of data processing technologies, and in particular to a method and an apparatus for converting the style of an image, an electronic device, and a storage medium. The method includes: acquiring a style indication image of a target style and an original image associated with face annotation data; adopting a style conversion model, extracting style image features based on the style indication image, extracting original image features based on the original image, and reversely generating a target image of the target style based on a fusion result of the style image features and the original image features. In the target image, a face image of the target style is generated in the content area corresponding to the face annotation data. The style conversion model is constructed based on a target generation network obtained through training. In this way, the annotated original image is reused, and a complex image data collection process for target-style images is avoided.

Description

Method and device for converting style of image, electronic equipment and storage medium
Technical Field
The present application relates to the field of data processing technologies, and in particular, to a method and apparatus for converting a style of an image, an electronic device, and a storage medium.
Background
In the related art, when there is a need to recognize cartoon faces in subsequent images, a cartoon face detection model has to be trained specifically to detect cartoon faces in such images, considering that cartoon faces differ greatly from real faces and that cartoon images and real images belong to different image styles.
At present, training a cartoon face detection model generally requires obtaining, according to the cartoon style to be detected, a large number of cartoon images of the corresponding style as sample images, and configuring corresponding cartoon face annotation data for each sample image to obtain the corresponding training samples.
However, the number of cartoon images of a given cartoon style is usually limited, so it is difficult to collect enough sample images to meet the model training needs, and manually annotating the sample data takes a great deal of time and cost. This greatly increases the training difficulty and training cost of the cartoon face detection model, limits the development of cartoon face detection technology, and reduces the training efficiency of the cartoon face detection model.
Disclosure of Invention
The embodiments of the application provide a style conversion method and apparatus for images, an electronic device, and a storage medium, which are used to convert and generate images of a specified style and to improve the training efficiency of a model that performs face detection on images of the specified style.
In a first aspect, a method for converting a style of an image is provided, including:
acquiring a style indication image of a target style;
Acquiring an original image associated with face annotation data, wherein the face annotation data is used for identifying the position of a face in the original image;
Extracting style image characteristics based on the style indication image by adopting a style conversion model, extracting original image characteristics based on the original image, and reversely generating a target image of the target style based on a fusion result of the style image characteristics and the original image characteristics;
the face image with the target style is generated in a content area corresponding to the face annotation data in the target image; and the style conversion model is constructed based on the target generation network obtained by training after the countermeasure training of the initial generation network and the initial discrimination network is completed.
In a second aspect, an image style conversion device is provided, including:
The first acquisition unit is used for acquiring a style indication image of a target style;
The second acquisition unit is used for acquiring an original image associated with face annotation data, wherein the face annotation data are used for identifying the position of a face in the original image;
The generating unit is used for adopting a style conversion model, extracting style image characteristics based on the style indication image, extracting original image characteristics based on the original image, and reversely generating a target image of the target style based on a fusion result of the style image characteristics and the original image characteristics;
the face image with the target style is generated in a content area corresponding to the face annotation data in the target image; and the style conversion model is constructed based on the target generation network obtained by training after the countermeasure training of the initial generation network and the initial discrimination network is completed.
Optionally, the device further includes a training unit, and the style conversion model is obtained by the training unit in the following manner:
obtaining each training sample, wherein one training sample comprises: a sample style image of the target style, an other-style image of a style other than the target style, a sample image, face annotation data of the sample image, and a face image cut out from the sample image according to the face annotation data;
and carrying out multiple rounds of alternate training on the built initial generation network and the built initial discrimination network according to the training samples until a preset convergence condition is met, obtaining a target generation network based on the initial generation network training, and building a style conversion model based on the target generation network.
Optionally, each training sample is generated by the training unit in the following manner:
Obtaining original samples for real face detection, wherein one original sample comprises: a sample image, and face annotation data of the sample image;
For each original sample, the following is performed: cutting at least one face image from the sample image according to the face annotation data of the sample image in the original sample, acquiring a sample style image of a target style and other style images which are not of the target style, and forming at least one training sample based on the original sample, the at least one face image, the sample style image and the other style images.
Optionally, during a round of training for the initial generation network, the training unit is configured to perform the following operations:
generating a predicted image based on the read sample style image and the sample image and generating a predicted face image based on the read sample style image and the face image by adopting the initial generation network;
for the predicted image, the sample style image and other style images in the training sample, adopting a style discrimination sub-network in the initial discrimination network to output a style discrimination result, and adopting an authenticity discrimination sub-network in the initial discrimination network to output an authenticity discrimination result;
cutting out a prediction subgraph from the predicted image based on the face annotation data of the sample image, and adjusting the network parameters of the initial generation network based on the pixel value differences between the prediction subgraph and the predicted face image, and on the differences between the style discrimination result and the authenticity discrimination result, respectively, and their corresponding real discrimination results.
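For illustration, a minimal sketch of such a generator update objective is given below. It is not the patented implementation; the use of PyTorch, the tensor layout, the L1/BCE loss choices and the lambda weights are all assumptions.

```python
import torch
import torch.nn.functional as F

def generator_loss(pred_image, pred_face_image, face_bbox,
                   style_logits, real_logits,
                   lambda_pix=10.0, lambda_style=1.0, lambda_real=1.0):
    """Illustrative generator objective for one sample.

    pred_image      : generated target-style image, shape (C, H, W)
    pred_face_image : target-style image generated from the cropped face
    face_bbox       : (x1, y1, x2, y2) taken from the face annotation data
    style_logits    : style discrimination result for pred_image
    real_logits     : authenticity discrimination result for pred_image
    """
    x1, y1, x2, y2 = face_bbox
    # Cut the prediction subgraph out of the predicted image at the annotated face region.
    pred_subgraph = pred_image[:, y1:y2, x1:x2]
    # Resize the separately generated predicted face image to the subgraph size before comparing.
    pred_face = F.interpolate(pred_face_image.unsqueeze(0),
                              size=pred_subgraph.shape[-2:], mode="bilinear",
                              align_corners=False).squeeze(0)
    pixel_loss = F.l1_loss(pred_subgraph, pred_face)

    # Adversarial terms: push the discriminators to judge the prediction as
    # "target style" and "real" (label 1 in this sketch).
    style_adv = F.binary_cross_entropy_with_logits(style_logits, torch.ones_like(style_logits))
    real_adv = F.binary_cross_entropy_with_logits(real_logits, torch.ones_like(real_logits))

    return lambda_pix * pixel_loss + lambda_style * style_adv + lambda_real * real_adv
```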
Optionally, during a round of training for the initial discrimination network, the training unit is configured to perform the following operations:
generating a predicted image based on the read sample style image and the sample image and generating a predicted face image based on the read sample style image and the face image by adopting the initial generation network;
for the predicted image, the sample style image and other style images in the training sample, adopting a style discrimination sub-network in the initial discrimination network to output a style discrimination result, and adopting an authenticity discrimination sub-network in the initial discrimination network to output an authenticity discrimination result;
and adjusting the network parameters of the initial discrimination network based on the differences between the style discrimination result and the authenticity discrimination result, respectively, and their corresponding real discrimination results, and on the feature differences among the image features of a designated layer extracted in the initial discrimination network for the predicted image, the sample style image and the other-style image, respectively.
Optionally, when adjusting the network parameters of the initial discrimination network based on the differences between the style discrimination result and the authenticity discrimination result and their corresponding real discrimination results, and on the feature differences among the image features of the designated layer extracted in the initial discrimination network for the predicted image, the sample style image and the other-style image, the training unit is configured to:
adjust the network parameters of the authenticity discrimination sub-network based on the differences between the style discrimination result and the authenticity discrimination result and their corresponding real discrimination results, and on the feature differences among the image features output by a designated network layer in the authenticity discrimination sub-network for the predicted image, the sample style image and the other-style image, respectively;
and adjust the network parameters of the style discrimination sub-network based on the differences between the style discrimination result and the authenticity discrimination result and their corresponding real discrimination results, and on the feature differences among the image features output by the designated network layer in the style discrimination sub-network for the predicted image, the sample style image and the other-style image, respectively.
Optionally, when generating the predicted image based on the read sample style image and the sample image, the training unit is configured to:
extracting style image characteristics based on sample style images in the read training samples;
extracting sample image characteristics based on sample images in the training samples;
And reversely generating a predicted image based on the fusion result of the style image characteristics and the sample image characteristics.
Optionally, after the generating the target image of the target style in the reverse direction, the generating unit is further configured to:
determining the face annotation data of the original image as the face annotation data of the target image, and constructing a training sample of the target style based on the target image and the face annotation data thereof;
And carrying out multiple rounds of iterative training on the constructed initial face detection model by adopting the training sample of the target style until a preset convergence condition is met, so as to obtain the target face detection model.
Optionally, after outputting the trained target face detection model, the generating unit is further configured to:
Acquiring a video frame sequence of the target style;
Adopting the target face detection model, respectively carrying out face detection processing on each video frame in the video frame sequence, and identifying face area information corresponding to each video frame;
And selecting a target video frame with the face state meeting the set condition from the video frames based on the face region information corresponding to each video frame.
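As a rough sketch of this selection flow (the detector interface, the scoring rule, and the min_face_area threshold are illustrative assumptions, not prescribed by the application):

```python
def select_cover_frame(frames, detector, min_face_area=96 * 96):
    """Pick the video frame whose detected faces best satisfy a simple size/confidence rule."""
    best_frame, best_score = None, float("-inf")
    for frame in frames:
        # face_regions: list of (x1, y1, x2, y2, confidence) boxes for this video frame.
        face_regions = detector(frame)
        score = 0.0
        for x1, y1, x2, y2, conf in face_regions:
            area = (x2 - x1) * (y2 - y1)
            if area >= min_face_area:          # ignore tiny faces
                score += conf * area           # favour large, confident faces
        if score > best_score:
            best_frame, best_score = frame, score
    return best_frame
```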
In a third aspect, an electronic device is presented comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the above method when executing the computer program.
In a fourth aspect, a computer readable storage medium is provided, on which a computer program is stored, which computer program, when being executed by a processor, implements the above method.
In a fifth aspect, a computer program product is proposed, comprising a computer program which, when executed by a processor, implements the above method.
The application has the following beneficial effects:
The application provides a style conversion method, a device, electronic equipment and a storage medium of an image, and discloses a style indication image for acquiring a target style; acquiring an original image associated with face annotation data, wherein the face annotation data is used for identifying the position of a face in the original image; then, a style conversion model is adopted, based on the style indication image, style image characteristics are extracted, based on the original image, original image characteristics are extracted, and based on the fusion result of the style image characteristics and the original image characteristics, a target image of the target style is reversely generated; the face image with the target style is generated in a content area corresponding to the face annotation data in the target image; and the style conversion model is constructed based on the target generation network obtained by training after the countermeasure training of the initial generation network and the initial discrimination network is completed.
In this way, by means of the style conversion model, the original image associated with the face annotation data can be subjected to style conversion processing according to the target style indicated by the style indication image, so that a target image is obtained, and in the target image, a face image with the target style is generated in a content area corresponding to the face annotation data of the original image; based on the method, the face labeling data of the original image can be directly used as the face labeling data of the target image because the face image of the target style is generated in the content area corresponding to the face labeling data, so that no additional labeling is needed for the target image, a complex data labeling process is avoided, the multiplexing of the labeled original image is realized, and the complex image data collection process is avoided for the image of the target style;
In addition, by means of the style conversion capability learned by the style conversion model, the original image can be converted into the target image of the target style, so that the generation efficiency of the image of the target style is improved, based on the generation efficiency, the rapid generation of a training sample can be realized when the face detection model for detecting the face of the target style is trained, the training efficiency of the model can be improved, and the training cost of the model can be reduced; moreover, the method can respond to the change of the image style in time to generate the image in the corresponding style, so that the actual business needs can be better met.
Drawings
Fig. 1 is a schematic diagram of a possible application scenario in an embodiment of the present application;
FIG. 2 is a schematic diagram of a process for obtaining a style conversion model according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a process for constructing a training sample according to an embodiment of the present application;
fig. 4 is a schematic diagram of a network structure of an initially generated network constructed in an embodiment of the present application;
FIG. 5A is a schematic diagram of an iterative training process for an initially generated network in accordance with an embodiment of the present application;
FIG. 5B is a schematic diagram illustrating the input of the initial generation network during one forward propagation in an embodiment of the present application;
FIG. 5C is a schematic diagram of two forward propagation steps in a round of iterative training in accordance with an embodiment of the present application;
FIG. 5D is a schematic diagram of three forward propagation steps in a round of iterative training in accordance with an embodiment of the present application;
fig. 5E is a schematic diagram of a network structure of a discrimination sub-network according to an embodiment of the present application;
FIG. 6 is a schematic diagram of a style conversion process of an image according to an embodiment of the present application;
FIG. 7A is a schematic diagram of a training process for obtaining a style conversion model according to an embodiment of the present application;
FIG. 7B is a schematic diagram of a process for training a face detection model based on an output result of a style conversion model in an embodiment of the present application;
FIG. 7C is a schematic diagram of a process for image processing based on a cartoon face detection model in an embodiment of the application;
FIG. 8 is a schematic diagram of a logical structure of an image style conversion device according to an embodiment of the present application;
Fig. 9 is a schematic diagram of a hardware composition structure of an electronic device to which the embodiment of the present application is applied;
fig. 10 is a schematic diagram of a hardware composition structure of another electronic device to which the embodiment of the present application is applied.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present application more apparent, the technical solutions of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application, and it is apparent that the described embodiments are some embodiments of the technical solutions of the present application, but not all embodiments. All other embodiments, based on the embodiments described in the present document, which can be obtained by a person skilled in the art without any creative effort, are within the scope of protection of the technical solutions of the present application.
The terms first, second and the like in the description and in the claims and in the above-described figures, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the application described herein may be capable of operation in sequences other than those illustrated or otherwise described.
In the present embodiment, the term "module" or "unit" refers to a computer program or a part of a computer program having a predetermined function and working together with other relevant parts to achieve a predetermined object, and may be implemented in whole or in part by using software, hardware (such as a processing circuit or a memory), or a combination thereof. Also, a processor (or multiple processors or memories) may be used to implement one or more modules or units. Furthermore, each module or unit may be part of an overall module or unit that incorporates the functionality of the module or unit.
Some terms in the embodiments of the present application are explained below to facilitate understanding by those skilled in the art.
Artificial Intelligence (AI): a theory, method, technology, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce new intelligent machines that can react in a way similar to human intelligence. Artificial intelligence research, namely the study of the design principles and implementation methods of various intelligent machines, enables machines to perceive, reason, and make decisions.
Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, involving both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, pre-training model technologies, operation/interaction systems, mechatronics, and the like. The pre-training model, also called a large model or foundation model, can be widely applied to downstream tasks in all major directions of artificial intelligence after fine-tuning. Artificial intelligence software technologies mainly include computer vision technology, speech processing technology, natural language processing technology, machine learning/deep learning, and other directions.
Computer Vision (CV) technology: computer vision is a science that studies how to make machines "see"; more specifically, it uses cameras and computers instead of human eyes to identify and measure targets, and performs further graphic processing so that the results become images more suitable for human observation or for transmission to instruments for detection. As a scientific discipline, computer vision studies related theories and technologies in an attempt to build artificial intelligence systems that can acquire information from images or multidimensional data. Large-model technology has brought an important transformation to the development of computer vision: pre-trained models in the vision field, such as Swin-Transformer, ViT, V-MoE, and MAE, can be quickly and widely applied to specific downstream tasks through fine-tuning. Computer vision techniques typically include image processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D techniques, virtual reality, augmented reality, and simultaneous localization and mapping, as well as common biometric recognition techniques such as face recognition and fingerprint recognition.
Style conversion (Style Transfer): relates to computer vision technology. In the embodiments of the application, the content and style of two images are combined to generate a new image with a unique visual effect. The application realizes the style conversion function by means of a style conversion model. In the specific application process of style conversion there are two input images: an original image serving as the content image, and a style indication image serving as the style image, where the content image provides the referenceable subject and structure, and the style image provides the referenceable color, texture, and artistic style. Typically, the output image generated after style conversion preserves the main structure and objects of the content image while adopting the artistic style of the style image.
Face Detection (Face Detection): is a subtask in the field of computer vision, whose goal is to automatically locate and recognize faces in digital images or video frames. Face detection algorithms typically output a rectangular bounding box representing the face position found in the image. This is the first step of face-related processing such as face recognition, face attribute analysis, expression recognition, and the like.
Generative Adversarial Network (GAN): a deep learning model consisting of two models, a generator and a discriminator. The generator is used to generate fake data similar to the real data, and the discriminator is used to distinguish real data from fake data. The generator and the discriminator learn from each other in an adversarial manner; eventually the generator can generate high-quality fake data, and the discriminator can distinguish real data from fake data more accurately. GAN has wide applications in the fields of image generation, speech synthesis, natural language processing, and the like.
The following briefly describes the design concept of the embodiment of the present application:
In the related art, when there is a need to recognize cartoon faces in subsequent images, a cartoon face detection model has to be trained specifically to detect cartoon faces in such images, considering that cartoon faces differ greatly from real faces and that cartoon images and real images belong to different image styles.
At present, in order to detect cartoon faces, a cartoon face detection model is usually trained specifically, which requires acquiring a large number of cartoon images of the corresponding style as sample images and annotating those sample images to obtain training samples.
However, because the number of cartoon images is very limited and they are difficult to produce, collecting cartoon face data is very difficult, and it is hard to gather enough sample images to meet the model training requirements; moreover, face annotation is time-consuming, which greatly increases the training difficulty and training cost of the cartoon face detection model. In addition, as cartoon styles change, the sample images need to be continuously updated, which raises the cost of producing training samples and limits the further development of cartoon face detection technology.
In view of the above, the application provides a method, a device, an electronic device and a storage medium for converting a style of an image, and discloses a style indication image for acquiring a target style; acquiring an original image associated with face annotation data, wherein the face annotation data is used for identifying the position of a face in the original image; then, a style conversion model is adopted, based on a style indication image, style image characteristics are extracted, based on an original image, original image characteristics are extracted, and a target image of a target style is reversely generated based on a fusion result of the style image characteristics and the original image characteristics; the method comprises the steps of generating a face image with a target style in a content area corresponding to face annotation data in a target image; the style conversion model is built based on the target generation network obtained by training after the countermeasure training of the initial generation network and the initial discrimination network is completed.
In this way, by means of the style conversion model, the original image associated with the face annotation data can be subjected to style conversion processing according to the target style indicated by the style indication image, so that a target image is obtained, and in the target image, a face image with the target style is generated in a content area corresponding to the face annotation data of the original image; based on the method, the face labeling data of the original image can be directly used as the face labeling data of the target image because the face image of the target style is generated in the content area corresponding to the face labeling data, so that no additional labeling is needed for the target image, a complex data labeling process is avoided, the multiplexing of the labeled original image is realized, and the complex image data collection process is avoided for the image of the target style;
In addition, by means of the style conversion capability learned by the style conversion model, the original image can be converted into the target image of the target style, so that the generation efficiency of the image of the target style is improved, based on the generation efficiency, the rapid generation of a training sample can be realized when the face detection model for detecting the face of the target style is trained, the training efficiency of the model can be improved, and the training cost of the model can be reduced; moreover, the method can respond to the change of the image style in time to generate the image in the corresponding style, so that the actual business needs can be better met.
The preferred embodiments of the present application will be described below with reference to the accompanying drawings of the specification, it being understood that the preferred embodiments described herein are for illustration and explanation only, and not for limitation of the present application, and that the embodiments of the present application and the features of the embodiments may be combined with each other without conflict.
Fig. 1 is a schematic diagram of a possible application scenario in an embodiment of the present application. The application scenario diagram includes a client device 110, and a processing device 120.
In some possible embodiments of the present application, the processing device 120 may train to obtain a style conversion model, and train the initial face detection model according to the target image of the target style generated by the style conversion model to obtain a target face detection model; further, the processing device 120 may perform face detection processing based on the image for which the recognition request is directed in response to the recognition request triggered by the related object on the terminal device 110, and detect the face region information corresponding to the image.
In other possible embodiments of the present application, the processing device 120 may train to obtain a style conversion model, generate images of various styles according to the style conversion model, and train to obtain a target face detection model according to images of various styles, where one style conversion model can generate a target image of a corresponding style according to an input style indication image of different styles; further, the processing device 120 may determine, in response to an identification request triggered by the related object on the terminal device 110, an image style for which the identification request is directed; and then adopting a target face detection model to perform face detection processing on the image aimed at by the identification request, and detecting face region information corresponding to the image.
The identification request triggered by the related object may be initiated on any one of an applet application, a client application and a web application, which is not particularly limited in the present application.
Client devices 110 include, but are not limited to, cell phones, tablet computers, notebooks, electronic book readers, intelligent voice interaction devices, intelligent appliances, vehicle terminals, aircraft, and the like.
The processing device 120 may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDNs (Content Delivery Network, content delivery networks), basic cloud computing services such as big data and artificial intelligent platforms, and the like; in a possible implementation manner, the processing device may be a terminal device with a certain processing capability, such as a tablet computer, a notebook computer, or the like.
In the embodiment of the present application, communication between the client device 110 and the processing device 120 may be performed through a wired network or a wireless network. In the following description, the relevant processing procedure will be described only from the viewpoint of the processing apparatus 120.
The style conversion process of the image is schematically described below in connection with possible application scenarios:
and selecting a picture in the cartoon video by using the first scene.
In the service scene corresponding to the application scene, the processing device needs to perform face detection on each video frame in the cartoon video for the purpose of selecting the video frame from the cartoon video as a cover map or for the purpose of selecting the video frame meeting the requirements from the cartoon video.
Based on this, in order to realize face detection of a cartoon video, it is necessary to train a target face detection model capable of realizing a face detection function, and therefore, it is necessary to generate a training sample of a cartoon style by means of a style conversion model, and further train an initial face detection model into the target face detection model by means of the training sample.
Therefore, a style conversion model needs to be obtained through training, so that an animation-style target image can be generated based on an original image associated with face annotation data and an animation-style indication image by means of the style conversion model, and a training sample of an initial face detection model is built based on the target image.
And detecting a face area from the image of the appointed style by applying the second scene.
In the business scenario corresponding to the application scenario two, the processing device needs to perform face detection for the corresponding style image for the purpose of identifying the face area from the various styles of images.
Based on the above, in order to realize face detection on images of different styles, a target face detection model capable of realizing a face detection function needs to be trained; therefore, it is necessary to generate images of different styles by means of the same style conversion model, so that the target face detection model can be trained by means of the images of different styles.
In a feasible implementation mode, the style conversion model can be obtained by means of fine tuning training of images of different styles, so that the images of different styles can be generated through the style conversion model; and constructing each training sample for training the initial face detection model based on the images of different styles.
In addition, it should be understood that in the embodiments of the present application, the style conversion process for images is involved, and when the embodiments described in the present application are applied to specific products or technologies, the collection, use and processing of relevant data complies with relevant laws and regulations and standards of relevant countries and regions.
The following describes a processing procedure related to style conversion of an image from the viewpoint of a processing apparatus with reference to the accompanying drawings:
It should be noted that, in some possible implementation manners in the embodiment of the present application, the processing device may obtain a style conversion model according to self-training to implement style conversion of the image; or in other possible implementations, the processing device may implement style conversion of the image according to a style conversion model obtained from other devices. In the following description of the present application, only the processing device is used to self-train to obtain a style conversion model as an example, and the related processing procedure is described:
Referring to fig. 2, which is a schematic diagram of a process of obtaining a style conversion model in an embodiment of the present application, a process of obtaining a style conversion model by a processing device will be described with reference to fig. 2 below:
step 201: the processing device obtains each training sample.
In the embodiment of the application, the processing device can construct each training sample aiming at the training of the style conversion model according to the actual processing requirement, or the processing device can acquire each training sample constructed by other devices, wherein one training sample comprises: a sample-style image of a target style, other-style images of a non-target style, a sample image, facial annotation data for the sample image, and a face image cropped from the sample image according to the facial annotation data.
Under the condition that the processing equipment constructs each training sample by itself, when each training sample is generated, the processing equipment firstly acquires each original sample for carrying out real face detection, wherein one original sample comprises the following components: one sample image, and face annotation data of the one sample image; then, for each original sample, the following operations are performed: cutting at least one face image from the sample image according to the face annotation data of the sample image in the original sample, acquiring a sample style image of a target style and other style images of a non-target style, and forming at least one training sample based on the original sample, the at least one face image, the sample style image and the other style images.
Specifically, in the process of constructing each training sample, for each obtained original sample for performing real face detection, the processing device may use sample data for real face detection, which is an open source, as each original sample, where, for face labeling data of a sample image in the original sample, the face labeling data includes labeling data for indicating a position of a face region in the image; a training sample includes: a sample style image of the target style and other style images of the non-target style, a sample image, facial annotation data for the sample image, and a face image taken from the sample image.
In the embodiment of the application, the target style and other styles are relative to the actual processing requirement, the style of the image refers to the artistic effect presented by the image, and according to the actual processing requirement, the target style can be any one of the styles of cartoon style, antique style, oil painting style and the like; based on the above, the processing device may determine the style corresponding to the image to be subjected to face detection, that is, the style of the face image desired after the style conversion, as the target style, and randomly select other styles based on the target style.
In addition, in a feasible embodiment, sample style images included in different training samples may be the same, or sample style images included in part of the training samples may be the same; other style images included in different training samples may correspond to the same style, or other style images included in a portion of the training samples correspond to the same style; alternatively, other style images included in different training samples may be the same, or other style images included in a part of the training samples may be the same, which is not particularly limited by the present application.
It should be understood that, because real faces have been analyzed and studied extensively, the existing datasets for real face detection are very large and a great deal of open-source face data is available. Optionally, the application can obtain each original sample from this open-source face data according to the actual service requirements, and then generate each training sample from each original sample.
For example, assume that only the location of the face region is of interest in the actual business requirements; the face data of the open source may include: labeling data for indicating positions of face regions and labeling data for indicating positions of key points of the respective faces; then, when each original sample is acquired, only the labeling data indicating the position of the face region may be used as the face labeling data associated with the sample image in the original sample.
For example, referring to fig. 3, which is a schematic diagram of a process of constructing a training sample in the embodiment of the present application, it is assumed that a target style is a cartoon style, and an obtained original sample includes a sample image a and face labeling data of the sample image a; when a training sample is constructed, a frame can be extracted from the video of the animation style as a sample style image, other style images of a non-animation style (such as oil painting style) can be obtained, and a face image is intercepted from the sample image A based on face annotation data of the sample image A; further, a training sample is constructed based on the obtained oil painting style other style image, the cartoon style sample style image, the sample image A, the face annotation data of the sample image A, and the face image intercepted from the sample image A.
Therefore, each training sample can be quickly constructed according to each original sample for real face detection, and the labeling result in the original sample can be directly used when the training sample is constructed, so that the labeling process when the training sample is generated is avoided, the generation efficiency of the training sample is improved, and the construction cost of the training sample is reduced.
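A minimal sketch of this construction step is shown below, assuming the original sample is a dictionary holding an image path and face bounding boxes of the form (x1, y1, x2, y2); the dictionary layout and the use of PIL are illustrative assumptions, not part of the patent.

```python
import random
from PIL import Image

def build_training_samples(original_sample, style_frames, other_style_images):
    """original_sample: {"image_path": ..., "face_boxes": [(x1, y1, x2, y2), ...]}.
    style_frames / other_style_images: pools of target-style and non-target-style images."""
    sample_image = Image.open(original_sample["image_path"]).convert("RGB")
    samples = []
    for box in original_sample["face_boxes"]:
        face_image = sample_image.crop(box)            # cut one face image per annotation
        samples.append({
            "sample_image": sample_image,
            "face_annotation": box,                     # reused as-is, no re-labelling needed
            "face_image": face_image,
            "sample_style_image": random.choice(style_frames),       # target style
            "other_style_image": random.choice(other_style_images),  # non-target style
        })
    return samples
```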
Step 202: and the processing equipment performs multiple rounds of alternate training on the built initial generation network and the built initial discrimination network according to each training sample until a preset convergence condition is met, a target generation network obtained based on the initial generation network training is obtained, and a style conversion model is built based on the target generation network.
In the embodiment of the application, the processing equipment adopts an alternate training mode to perform multiple rounds of training on the built initial generation network and the built initial discrimination network, wherein the alternate training refers to round stream training on the initial generation network and the initial discrimination network, so that the initial generation network and the initial discrimination network compete with each other and learn each other in the training process; when the initial generation network is trained, keeping the network parameters in the initial discrimination network unchanged, and only adjusting the network parameters of the initial generation network; similarly, when the initial discrimination network is trained, the network parameters in the initial generation network are kept unchanged, and only the network parameters of the initial discrimination network are adjusted.
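The alternation described here can be sketched as follows (an illustrative PyTorch-style outline rather than the exact training code; the optimizers, learning rate, and loss-function interfaces are assumptions):

```python
import torch

def train_alternately(generator, discriminator, loader, g_loss_fn, d_loss_fn,
                      num_rounds, lr=1e-4):
    """Round-by-round alternate training: only one network is updated at a time."""
    g_opt = torch.optim.Adam(generator.parameters(), lr=lr)
    d_opt = torch.optim.Adam(discriminator.parameters(), lr=lr)
    for _ in range(num_rounds):
        for batch in loader:
            # Generator round: keep the discrimination network's parameters unchanged.
            for p in discriminator.parameters():
                p.requires_grad_(False)
            g_opt.zero_grad()
            g_loss_fn(generator, discriminator, batch).backward()
            g_opt.step()

            # Discriminator round: keep the generation network's parameters unchanged
            # (d_loss_fn is expected to detach the generator outputs it uses).
            for p in discriminator.parameters():
                p.requires_grad_(True)
            d_opt.zero_grad()
            d_loss_fn(generator, discriminator, batch).backward()
            d_opt.step()
```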
When the initial generation network is constructed, it may be built on any one of the generator network of AnimeGAN, the generator network of AnimeGANv2, a generator network in another GAN, or a Diffusion network. When the initial generation network is built on the network structure of the AnimeGANv2 generator and the cartoon style is the target style, video frames extracted from high-definition cartoon videos may be used as sample style images of the target style. Furthermore, compared with AnimeGAN, AnimeGANv2 uses layer normalization to alleviate the artifact problem and uses a lighter-weight generation network, enabling more reliable style conversion results.
Taking an initial generation network constructed based on AnimeGANv2 as an example, referring to fig. 4, which is a schematic diagram of the network structure of the initial generation network constructed in the embodiment of the present application, the processing device may combine the generator network in AnimeGANv2 with an adaptive instance normalization (AdaIN) style-transfer module to obtain the network structure illustrated in fig. 4. The constructed initial generation network includes a convolution layer (denoted Conv), layer normalization (denoted LN), an activation function layer (denoted Lrelu), an inverted residual block (IRB), a summation module (denoted SUM), a tanh activation layer, and a resizing layer (denoted Resize); the references to "k", "s", and "c" in fig. 4 are parameters of the convolution layers and denote the convolution kernel size (kernel), convolution stride (stride), and number of channels (channel), respectively.
It should be noted that the input of the AdaIN module consists of two image features, and the related processing formula is as follows:
AdaIN(x, y) = σ(y) * ((x - μ(x)) / σ(x)) + μ(y)
where σ(y) and μ(y) are, respectively, the standard deviation and the mean of the style image features (or style image feature map) of the sample style image, x is the image feature of the other input, and μ(x) and σ(x) are its mean and standard deviation.
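A minimal PyTorch-style sketch of an AdaIN module implementing the above formula is given below for illustration; the class name and the epsilon value are assumptions.

```python
import torch
import torch.nn as nn

class AdaIN(nn.Module):
    """Adaptive instance normalization: re-colours the content feature map x
    with the per-channel mean/std of the style feature map y."""

    def __init__(self, eps: float = 1e-5):
        super().__init__()
        self.eps = eps

    def _stats(self, feat: torch.Tensor):
        # feat: (N, C, H, W) -> per-sample, per-channel mean and standard deviation.
        mean = feat.mean(dim=(2, 3), keepdim=True)
        std = feat.var(dim=(2, 3), keepdim=True).add(self.eps).sqrt()
        return mean, std

    def forward(self, x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        x_mean, x_std = self._stats(x)
        y_mean, y_std = self._stats(y)
        return y_std * (x - x_mean) / x_std + y_mean
```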
In addition, the initial discrimination network constructed by the application specifically comprises a style discrimination sub-network and an authenticity discrimination sub-network, wherein the style discrimination sub-network and the authenticity discrimination sub-network can be constructed by adopting the same network structure; the network structure used in constructing the style discrimination sub-network and the authenticity discrimination sub-network may be any one of the discriminator network in BigGAN and the discriminator network in ANIMEGANV.
In the embodiment of the application, a discriminator network structure in BigGAN can be preferably adopted to construct a style discrimination sub-network and an authenticity discrimination sub-network, wherein compared with a ANIMEGANV original discriminator network, the discriminator network structure in BigGAN is more complex and finer; the authenticity judging sub-network is used for distinguishing the generated image from the natural image, the style judging sub-network is used for distinguishing the target style from other styles of images, and the authenticity judging sub-network and the style judging sub-network have the same input.
In the embodiment of the present application, when the initial generation network is constructed, open-source pre-trained parameters may be used as the initial parameters of the initial generation network, that is, as the starting point for its initial training. When the initial discrimination network is constructed, considering that its discrimination task differs greatly from existing tasks, open-source pre-trained parameters may be dispensed with and the network fully retrained instead, so as to provide a recognition capability better suited to the target style.
In the following, a round of iterative training process for the initial generation network is described in the alternative training process:
referring to fig. 5A, a schematic diagram of a round of iterative training process for an initial generation network according to an embodiment of the present application is shown, and a round of iterative training process for an initial generation network is described below with reference to fig. 5A:
step 501: the processing device employs an initial generation network, generates a predicted image based on the read sample-style image and the sample image, and generates a predicted face image based on the read sample-style image and the face image.
In the embodiment of the application, the processing device can read a corresponding number of training samples according to the total number of training samples used in a preset round of iterative training, and execute two forward propagation processes based on the read training samples in the round of iterative training.
In a forward propagation process, processing equipment adopts an initial generation network, and based on a sample style image in a read training sample, style image characteristics are extracted; meanwhile, based on sample images in the training samples, extracting sample image features; and then, reversely generating a predicted image based on the fusion result of the style image characteristics and the sample image characteristics.
Referring to fig. 5B, which is an input schematic diagram of an initial generation network in a forward propagation process in the embodiment of the present application, it can be known from the content illustrated in fig. 5B that, assuming that the value of batchsize is 1, when performing one forward propagation in a round of iterative training process, a sample image is taken as input, a sample style image is encoded to obtain style image features, and the style image features are injected into AdaIN modules at different positions, so as to finally obtain a predicted image output by the initial generation network.
In this way, the initial generation network is adopted, the feature extraction can be carried out based on the sample style image and the sample image, and the prediction image with the same image size as the sample image can be generated under the fusion effect of the extracted image features.
Similarly, in the other forward propagation process, the processing device adopts an initial generation network, and extracts style image characteristics based on sample style images in the read training samples, and simultaneously, extracts face image characteristics based on face images in the training samples; and then, reversely generating a predicted face image based on the fusion result of the style image features and the face image features.
For example, referring to fig. 5C, which is a schematic diagram of two forward propagation in one iterative training process in the embodiment of the present application, as can be seen from the content illustrated in fig. 5C, assuming that batchsize is n in model training, in one forward propagation process, sample images (denoted as sample images 1-n) and sample style images (denoted as sample style images 1-n) in the n training samples are simultaneously read and input into the initial generation network to obtain n corresponding predicted images, denoted as predicted images 1-n. Similarly, in the other forward propagation process, the face images (marked as face images 1-n) and sample style images (marked as sample style images 1-n) in the n currently read training samples are input into an initial generation network to obtain n corresponding predicted face images, and the corresponding n predicted face images are marked as predicted face images 1-n.
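The two forward propagations of a generator round can be sketched as follows; the collated batch keys and the generator signature generator(content_images, style_images) are illustrative assumptions.

```python
def generator_forward_passes(generator, batch):
    """batch holds n training samples already collated into tensors:
    'sample_image', 'face_image', 'sample_style_image' of shape (n, C, H, W)."""
    style = batch["sample_style_image"]

    # Forward propagation 1: sample images + style features -> predicted images 1..n.
    predicted_images = generator(batch["sample_image"], style)

    # Forward propagation 2: cropped face images + the same style features
    # -> predicted face images 1..n.
    predicted_faces = generator(batch["face_image"], style)

    return predicted_images, predicted_faces
```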
Step 502: the processing device adopts a style discrimination sub-network in an initial discrimination network for the predicted image, a sample style image in a training sample and other style images, outputs a style discrimination result, and adopts an authenticity discrimination sub-network in the initial discrimination network to output an authenticity discrimination result.
In the embodiment of the present application, when step 502 is executed, the same content is input to both the style discrimination sub-network and the authenticity discrimination sub-network. By means of the different functions implemented by the two sub-networks, the processing device obtains the style discrimination result output by the style discrimination sub-network and the authenticity discrimination result output by the authenticity discrimination sub-network, where the style discrimination result is used for determining whether the style of the input image is the target style, and the authenticity discrimination result is used for determining whether the input image is a real image.
Optionally, before inputting the images into the style discrimination sub-network and the authenticity discrimination sub-network, data enhancement processing may be performed on the input images, where possible data enhancement modes include: simple pixel-level perturbations that do not change the face position, such as photometric transformation, grayscale conversion, color jitter, and blurring or sharpening; or mixed-sample enhancement, such as CutMix and MixUp.
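The optional enhancement step can be sketched as follows. This is a minimal illustration assuming torchvision's tensor transforms are available; the specific parameter values (jitter strengths, blur kernel, Beta distribution alpha) are illustrative assumptions, not values prescribed by the patent.

```python
import torch
from torchvision import transforms

# Pixel-level perturbations that do not change the face position.
pixel_level_aug = transforms.Compose([
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2, hue=0.05),
    transforms.RandomGrayscale(p=0.1),
    transforms.GaussianBlur(kernel_size=3, sigma=(0.1, 1.0)),
])

def mixup(images_a, images_b, alpha=0.4):
    """MixUp-style blending of two image batches of the same shape."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    return lam * images_a + (1.0 - lam) * images_b

# Example: augment one image, then mix two augmented images.
image = torch.rand(3, 256, 256)
augmented = pixel_level_aug(image)
mixed = mixup(augmented.unsqueeze(0), pixel_level_aug(image).unsqueeze(0))
```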
In the embodiment of the present application, when determining the corresponding style discrimination results and authenticity discrimination results for the predicted image output by the initial generation network, the sample style image in the training sample, and the other style image, three forward propagation processes may be performed in the style discrimination sub-network and the authenticity discrimination sub-network, one for each of the three images to be processed. In addition, according to the order and the content of the images input into the style discrimination sub-network and the authenticity discrimination sub-network, the real discrimination result corresponding to each style discrimination result and the real discrimination result corresponding to each authenticity discrimination result can be directly determined.
For example, referring to fig. 5D, which is a schematic diagram of the three forward propagations in a round of iterative training in the embodiment of the present application. As illustrated in fig. 5D, assuming that the batch size in model training is n, three forward propagations are required when training the style discrimination sub-network and the authenticity discrimination sub-network. In the processing corresponding to forward propagation 1, the n predicted images (denoted as predicted images 1-n) generated by the initial generation network based on the read n training samples are respectively input into the style discrimination sub-network and the authenticity discrimination sub-network to obtain the corresponding n style discrimination results and authenticity discrimination results. In the same way, in the processing corresponding to forward propagation 2, the sample style images (denoted as sample style images 1-n) in the n currently read training samples are respectively input into the style discrimination sub-network and the authenticity discrimination sub-network to obtain the corresponding n style discrimination results and authenticity discrimination results. Similarly, in the processing corresponding to forward propagation 3, the other style images (denoted as other style images 1-n) in the n currently read training samples are respectively input into the style discrimination sub-network and the authenticity discrimination sub-network to obtain the corresponding n style discrimination results and authenticity discrimination results.
Step 503: the processing equipment cuts out a prediction subgraph in the prediction image based on the face labeling data of the sample image, and adjusts network parameters of the initial generation network based on pixel value differences between the prediction subgraph and the prediction face image and result differences between the style discrimination result and the true discrimination result respectively.
In the embodiment of the application, in the process of adjusting network parameters of the initial generation network, the initial generation network is optimized by means of position consistency loss and overall generation countermeasure loss.
Specifically, after obtaining the predicted image generated by the initial generation network and the style discrimination result and authenticity discrimination result output by the initial discrimination network, the processing device cuts out a prediction subgraph from the predicted image according to the face labeling data of the sample image, and can then determine a position consistency loss by calculating the pixel value difference between the prediction subgraph and the predicted face image. The position consistency loss can be realized by means of an L1 loss function, and the related calculation formula is as follows:

$$L_{pos} = \sum_{(i,j)\in\Omega} \left| P_{i,j} - F_{i,j} \right|$$

where $P_{i,j}$ represents the pixel value at position (i, j) in the prediction subgraph, $F_{i,j}$ represents the pixel value at position (i, j) in the predicted face image, the predicted face image and the prediction subgraph have the same image size, and $\Omega$ covers each pixel position in the prediction subgraph and the predicted face image.
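A minimal sketch of this position consistency term, assuming the face labeling data is an axis-aligned box (x1, y1, x2, y2) shared by the batch; the function name and the use of a summed L1 reduction are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def position_consistency_loss(pred_image, pred_face, bbox):
    """L1 difference between the face region cropped from the predicted
    image and the separately predicted face image.

    pred_image: (N, 3, H, W) predicted image from the generator
    pred_face:  (N, 3, h, w) predicted face image from the generator
    bbox:       (x1, y1, x2, y2) face annotation shared by the batch
    """
    x1, y1, x2, y2 = bbox
    pred_subgraph = pred_image[:, :, y1:y2, x1:x2]          # cut out the prediction subgraph
    # Resize only if sizes differ; in the description above they are assumed equal.
    if pred_subgraph.shape[-2:] != pred_face.shape[-2:]:
        pred_subgraph = F.interpolate(pred_subgraph, size=pred_face.shape[-2:],
                                      mode="bilinear", align_corners=False)
    return F.l1_loss(pred_subgraph, pred_face, reduction="sum")

loss = position_consistency_loss(torch.rand(2, 3, 256, 256),
                                 torch.rand(2, 3, 96, 96),
                                 bbox=(80, 60, 176, 156))
```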
When the processing device calculates the overall generation countermeasure loss, it adopts the following formula:

$$\min_{G}\max_{D} V(D,G) = \mathbb{E}_{x\sim p_{data}}\left[\log D(x)\right] + \mathbb{E}_{z}\left[\log\left(1 - D\left(G(z)\right)\right)\right]$$

where D refers to the authenticity discrimination sub-network or the style discrimination sub-network, and G refers to the initial generation network; for the authenticity discrimination sub-network or the style discrimination sub-network, x refers to the sample style image and the other style image, D(x) represents the discrimination score output for the input image, and log computes the cross entropy; for the initial generation network, z represents the sample image and the sample style image of the target style, and G(z) represents the generated predicted image of the target style.
And then, the processing equipment adjusts network parameters of the initial generation network according to the obtained generation countermeasure loss and the position consistency loss.
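A hedged sketch of one such generator update, reusing `position_consistency_loss` from the sketch above. The binary-cross-entropy formulation of the countermeasure loss, the assumption that each discrimination sub-network returns a probability of shape (N, 1), and the weighting coefficient `lambda_pos` are all illustrative assumptions rather than the patent's prescribed implementation.

```python
import torch
import torch.nn.functional as F

def generator_step(gen, disc_real_fake, disc_style, optimizer,
                   sample_img, style_img, face_img, bbox, lambda_pos=10.0):
    """One parameter update of the initial generation network."""
    pred_image = gen(sample_img, style_img)          # forward propagation 1
    pred_face = gen(face_img, style_img)             # forward propagation 2

    # Generation countermeasure loss: the generator wants both discrimination
    # sub-networks to score its output as real / target-style (probabilities in [0, 1]).
    ones = torch.ones(pred_image.size(0), 1)
    adv_loss = (F.binary_cross_entropy(disc_real_fake(pred_image), ones) +
                F.binary_cross_entropy(disc_style(pred_image), ones))

    # Position consistency loss from the previous sketch.
    pos_loss = position_consistency_loss(pred_image, pred_face, bbox)

    loss = adv_loss + lambda_pos * pos_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```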
In this way, in the process of training the initial generation network, the position of the generated face can be constrained by calculating the position consistency loss, so that the initial generation network is guided to learn to generate a face image of the target style within the face labeling area. In other words, by forcibly constraining the generation position of the face, the generation requirement for the face can be met and the problem that current generation networks are insensitive to the position at which content is generated is solved, so that the trained target generation network learns the capability of regenerating a face image of the target style without changing the position of the face, thereby improving the generation quality and generation accuracy of the image.
For the initial discrimination network, the processing device performs the following operations in a round of training: adopting the initial generation network, generating a predicted image based on the read sample style image and sample image, and generating a predicted face image based on the read sample style image and face image; then, for the predicted image, the sample style image in the training sample and the other style image, adopting the style discrimination sub-network in the initial discrimination network to output style discrimination results, and adopting the authenticity discrimination sub-network in the initial discrimination network to output authenticity discrimination results; and then adjusting the network parameters of the initial discrimination network based on the result differences between the style discrimination results and authenticity discrimination results and the corresponding real discrimination results, and based on the feature differences among the image features extracted at a specified layer for the predicted image, the sample style image and the other style image, respectively.
In the process of performing a round of training on the initial discrimination network, the processing device uses the initial generation network to generate the predicted image and the predicted face image, which is the same as the process described in the above step 501, and uses the style discrimination sub-network and the authenticity discrimination sub-network in the initial discrimination network to output the style discrimination result and the authenticity discrimination result, which is the same as the process described in the above step 502, which will not be described in detail in the present application.
In the process of adjusting the network parameters of the initial discrimination network based on the output results of the initial generation network and the initial discrimination network, the processing device adjusts the network parameters of the authenticity discrimination sub-network based on the result differences between the respective discrimination results and the corresponding real discrimination results, and based on the feature differences among the image features output by a specified network layer in the authenticity discrimination sub-network for the predicted image, the sample style image and the other style image, respectively; and adjusts the network parameters of the style discrimination sub-network based on the corresponding result differences, and based on the feature differences among the image features output by a specified network layer in the style discrimination sub-network for the predicted image, the sample style image and the other style image, respectively.
Specifically, for the authenticity discrimination sub-network, the processing device may calculate the overall generation countermeasure loss based on the result differences between the style discrimination result and the authenticity discrimination result and the corresponding real discrimination results, and calculate a corresponding first contrast loss according to the feature differences among the image features output by a specified network layer in the authenticity discrimination sub-network for the predicted image, the sample style image and the other style image, respectively, where the calculation formula of the first contrast loss is as follows:

$$L_{c1} = \max\left(0,\; \gamma_1 + \left\| f(z_1) - f(z_2) \right\| - \left\| f(z_1) - f(y_1) \right\| \right)$$

where $L_{c1}$ is the calculated first contrast loss; $f(\cdot)$ is the image feature output by the specified network layer in the authenticity discrimination sub-network with the image in the parentheses as input; $\gamma_1$ is a custom parameter, specifically the margin separating negative features from positive features; $z_1$ refers to the sample style image; $z_2$ refers to the other style image; and $y_1$ refers to the predicted image generated by the initial generation network.
Similarly, for the style discrimination sub-network, the processing device may calculate the overall generation countermeasure loss based on the result differences between the style discrimination result and the authenticity discrimination result and the corresponding real discrimination results, and calculate a corresponding second contrast loss according to the feature differences among the image features output by the specified network layer in the style discrimination sub-network for the predicted image, the sample style image and the other style image, respectively, where the calculation formula of the second contrast loss is as follows:

$$L_{c2} = \max\left(0,\; \gamma_2 + \left\| g(z_1) - g(y_1) \right\| - \left\| g(z_1) - g(z_2) \right\| \right)$$

where $L_{c2}$ is the calculated second contrast loss; $g(\cdot)$ is the image feature output by the specified network layer in the style discrimination sub-network with the image in the parentheses as input; $\gamma_2$ is a custom parameter, specifically the margin separating negative features from positive features; $z_1$ refers to the sample style image; $z_2$ refers to the other style image; and $y_1$ refers to the predicted image generated by the initial generation network.
In addition, for the selection of the specified network layer, referring to fig. 5E, which is a schematic diagram of a network structure of the style discrimination sub-network according to the embodiment of the present application, among the network layers illustrated in fig. 5E, a network layer before the fully connected layer may be selected as the specified network layer according to actual processing requirements, that is, the boxed content in fig. 5E.
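The two margin-based contrast losses can be sketched as follows. This is an illustrative sketch under assumptions: each discrimination sub-network is assumed to expose the output of its specified layer through a `features` method returning flattened vectors, Euclidean feature distance is used, and the positive/negative pairings follow the purposes stated below (natural images together for the authenticity sub-network; target-style images together for the style sub-network).

```python
import torch
import torch.nn.functional as F

def contrast_loss(feat_anchor, feat_positive, feat_negative, margin):
    """Margin loss that pulls the positive pair together and pushes
    the negative pair apart in the discriminator's feature space."""
    d_pos = F.pairwise_distance(feat_anchor, feat_positive)
    d_neg = F.pairwise_distance(feat_anchor, feat_negative)
    return F.relu(margin + d_pos - d_neg).mean()

def discriminator_contrast_losses(disc_real_fake, disc_style,
                                  pred_image, sample_style, other_style,
                                  gamma1=1.0, gamma2=1.0):
    # First contrast loss (authenticity sub-network): the two natural images
    # form the positive pair, the generated predicted image is the negative.
    f = disc_real_fake.features
    loss1 = contrast_loss(f(sample_style), f(other_style), f(pred_image), gamma1)

    # Second contrast loss (style sub-network): the target-style pair
    # (sample style image, predicted image) is positive, other styles negative.
    g = disc_style.features
    loss2 = contrast_loss(g(sample_style), g(pred_image), g(other_style), gamma2)
    return loss1, loss2
```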
In this way, when training an initial discrimination network that includes a dual discrimination network, generated images and natural images can be distinguished by means of the authenticity discrimination sub-network, and images of the target style and of other styles can be distinguished by means of the style discrimination sub-network; moreover, the inputs to the two sub-networks are identical, both consisting of the generated predicted image, the sample style image and the other style image. The difference between the discrimination scores of the initial discrimination network on real images and on generated images is minimized, so that the initial discrimination network learns to distinguish real images from generated images, which in turn helps the initial generation network generate more realistic images. In addition, the first contrast loss guides the authenticity discrimination sub-network to learn the feature difference between generated images and natural images, and the second contrast loss guides the style discrimination sub-network to learn the feature difference between images of the target style and images of other styles, which further assists the initial generation network in learning the data distribution of the target style more effectively.
Furthermore, the processing device may perform multiple rounds of alternate training on the initial generation network and the initial discrimination network according to the training steps performed in the one round of training process according to actual processing requirements until a preset convergence condition is met, so as to obtain a target generation network based on the initial generation network training, and further construct a style conversion model based on the target generation network.
It should be noted that the preset convergence condition may be that the quality of the images generated by the initial generation network meets the requirement and that the number of times the loss value of the model is continuously lower than a first set value reaches a second set value, where the first set value and the second set value are set according to actual processing requirements. The image quality is measured by means of the Inception Score (IS) indicator or the Frechet Inception Distance (FID) indicator, where a higher IS value indicates better image quality, and a smaller FID value indicates better image quality. In a possible implementation manner, a corresponding first threshold value and second threshold value may be set for the FID indicator and the IS indicator, respectively, so that when the FID indicator of the generated images is lower than the first threshold value and the IS indicator is higher than the second threshold value, it may be determined that the image quality meets the requirement.
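The convergence check can be sketched as below. The threshold values and the loss-counting logic are illustrative assumptions; computing the FID and IS values themselves is outside the sketch.

```python
def quality_ok(fid_value, is_value, fid_threshold=30.0, is_threshold=3.0):
    """Image quality is deemed acceptable when FID is low enough
    and the Inception Score is high enough."""
    return fid_value < fid_threshold and is_value > is_threshold

def should_stop(loss_history, fid_value, is_value,
                first_set_value=0.05, second_set_value=5):
    """Stop when the loss has stayed below `first_set_value` for
    `second_set_value` consecutive rounds and the generated image
    quality meets the FID/IS requirement."""
    consecutive = 0
    for loss in reversed(loss_history):
        if loss < first_set_value:
            consecutive += 1
        else:
            break
    return consecutive >= second_set_value and quality_ok(fid_value, is_value)

# Example: losses of the most recent rounds plus current quality metrics.
print(should_stop([0.2, 0.06, 0.04, 0.03, 0.03, 0.02, 0.02],
                  fid_value=24.1, is_value=3.8))  # True
```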
For example, the processing device may store the network parameters obtained by training the initial generation network and the initial discrimination network at different training phases, and finally select, from the training results of the respective training phases, the network parameters capable of generating the images of the best quality.
In this way, the initial generation network and the initial discrimination network are subjected to countermeasure training, so that the initial generation network can be trained to obtain the target generation network capable of generating the high-quality target-style image, the target generation network has good image generation capability, and the target-style face image can be generated in the original face area in the real image under the guidance of the target-style image.
Referring to fig. 6, which is a schematic diagram of a style conversion process of an image according to an embodiment of the present application, a style conversion process of an image performed based on a constructed style conversion model is described below with reference to fig. 6:
Step 601: the processing device obtains a style indication image of a target style.
Specifically, in order to provide style reference data for the style conversion process of the image, the processing device needs to acquire a style indication image of the target style, where the style indication image may be taken from a video of the target style and contains face content.
For example, assuming that the target style is a cartoon style, one frame of image including a cartoon face may be extracted in a cartoon video as a style indicating image.
Step 602: the processing device obtains an original image associated with face annotation data, wherein the face annotation data is used for identifying a face position in the original image.
Specifically, the processing device may acquire an original image associated with face annotation data from data that is open-source and used for real face detection, or may select a photo associated with face annotation data of a related object as the original image if authorization of the related object is obtained, or may use a video frame extracted from a video and associated with face annotation data as the original image if authorization is obtained.
The face labeling data may specifically be coordinate data capable of locating the face region, for example, the upper left corner coordinates and the lower right corner coordinates of the rectangular frame of the face region.
Step 603: the processing equipment adopts a style conversion model, extracts style image characteristics based on the style indication image, extracts original image characteristics based on the original image, and reversely generates a target image of a target style based on a fusion result of the style image characteristics and the original image characteristics.
Specifically, after the processing device acquires the style indication image and the original image, a style conversion model constructed based on a trained target generation network is adopted, style image characteristics are extracted for the style indication image, and original image characteristics are extracted for the original image; further, by fusing the style image features and the original image features, the influence of the style image features is exerted in the original image features, so that the face image of the target style can be generated in the content area corresponding to the face annotation data in the image while the image of the target style is generated.
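As a sketch of this conversion step: the preprocessing, the weight file name and the loading of the `Generator` sketch above are hypothetical; the only assumption carried over from the description is that the style conversion model takes the original image and the style indication image as its two inputs and returns the target-style image.

```python
import torch
from PIL import Image
from torchvision import transforms

to_tensor = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.ToTensor(),
])

def convert_style(model, original_path, style_path, device="cpu"):
    """Generate a target-style image from an original image and a style indication image."""
    original = to_tensor(Image.open(original_path).convert("RGB")).unsqueeze(0).to(device)
    style = to_tensor(Image.open(style_path).convert("RGB")).unsqueeze(0).to(device)
    model.eval()
    with torch.no_grad():
        target = model(original, style)   # fuse original image features with style image features
    return target.squeeze(0).cpu()

# Hypothetical usage: reuse the Generator sketch above with trained weights.
# model = Generator(); model.load_state_dict(torch.load("target_generator.pt"))
# target_image = convert_style(model, "original.jpg", "cartoon_style.jpg")
```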
In this way, the face data of the target style can be accurately generated in correspondence with the content region to which the face annotation data is associated, so that the face annotation data of the original image can be directly used as the face annotation data of the generated target image, which corresponds to the target image to which the target style is directly generated and to which the face annotation data is associated.
Further, after generating the target image based on the style conversion model, the processing device may determine face annotation data of the original image as face annotation data of the target image, and construct a training sample of the target style based on the target image and the face annotation data thereof; and performing multiple rounds of iterative training on the constructed initial face detection model by adopting a training sample of the target style until a preset convergence condition is met, so as to obtain the target face detection model.
It should be noted that, in the embodiment of the present application, the initial face detection model may be constructed by using any one of the YOLO family of networks, R-CNN networks, etc., where the YOLO network is a very effective object detection algorithm that integrates multiple effective practices in the detection field and has the advantages of high speed, high accuracy and real-time performance. Optionally, in the embodiment of the present application, the processing device may construct the initial face detection model based on an open-source pre-trained network structure, so as to improve the convergence speed of the initial face detection model. The preset convergence condition may be that the number of model training rounds reaches a set value, etc., which is not particularly limited in the present application. In addition, considering that one possible application scenario of the scheme of the present application is generating training samples for the initial face detection model, the present application does not specifically limit the training mode adopted when training the initial face detection model.
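The description above names YOLO-family detectors as one option; as an illustrative stand-in with a well-known API, the following hedged sketch fine-tunes torchvision's pretrained Faster R-CNN on generated target-style images whose boxes are inherited from the original face annotation data. The dataset layout, hyperparameters and the single "face" class are assumptions for this sketch, not the patent's prescribed detector.

```python
import torch
from torch.utils.data import Dataset, DataLoader
from torchvision.models.detection import fasterrcnn_resnet50_fpn

class TargetStyleFaceDataset(Dataset):
    """Pairs each generated target-style image with the face boxes
    inherited from the original image's face annotation data."""
    def __init__(self, images, boxes):
        self.images = images          # list of (3, H, W) float tensors
        self.boxes = boxes            # list of (num_faces, 4) tensors, [x1, y1, x2, y2]

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        target = {"boxes": self.boxes[idx],
                  "labels": torch.ones(len(self.boxes[idx]), dtype=torch.int64)}  # class 1 = face
        return self.images[idx], target

def train_face_detector(dataset, num_epochs=10, lr=1e-4):
    model = fasterrcnn_resnet50_fpn(weights="DEFAULT")   # pretrained weights speed up convergence
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    loader = DataLoader(dataset, batch_size=2, shuffle=True,
                        collate_fn=lambda batch: tuple(zip(*batch)))
    model.train()
    for _ in range(num_epochs):
        for images, targets in loader:
            losses = model(list(images), list(targets))   # dict of detection losses
            loss = sum(losses.values())
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model
```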
In this way, by means of the style conversion model, a large number of target images of the target style, which are associated with the annotation data, can be obtained through conversion based on the face image to be converted (namely the original image) and the face image of the target style (namely the style indication image) which are associated with the annotation data; moreover, the local initial face detection model is trained by using a large amount of labeling data, so that the trained target face detection model can learn the face characteristics in the target-style image, and the detection effect on the target-style face image is improved.
Furthermore, after the processing equipment trains to obtain the target face detection model, the processing equipment can acquire a video frame sequence of a target style; then, a target face detection model is adopted, face detection processing is respectively carried out on each video frame in the video frame sequence, and face region information corresponding to each video frame is identified; and selecting a target video frame with the face state meeting the set condition from the video frames based on the face region information corresponding to each video frame.
Specifically, in a specific application process, the processing device may perform face detection on each video frame in the video frame sequence of the target style by means of the target face detection model obtained by training, so as to identify face region information in each video frame; in addition, by means of the face region information, the ratio of the face region to the whole image in each video frame can be determined, the number of the face regions included in the video frame can be determined, and further, various video frame acquisition requirements can be met.
For example, if a video frame containing no person (an empty scene) needs to be acquired, one video frame may be selected from among the video frames in which no face area is recognized.
For another example, if video frames including as many objects as possible need to be acquired, the total number of target objects may be determined according to the total number of main objects included in the animation of the target style, and one video frame having a total number of detected face areas higher than the total number of target objects is selected in each video frame.
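A hedged sketch of the frame-selection logic, assuming each detection result is a list of (x1, y1, x2, y2) face boxes per frame; the function names and the simple first-match selection rule are illustrative and mirror the two examples above.

```python
def face_area_ratio(boxes, frame_width, frame_height):
    """Total face area divided by the frame area."""
    area = sum((x2 - x1) * (y2 - y1) for x1, y1, x2, y2 in boxes)
    return area / float(frame_width * frame_height)

def select_empty_frame(detections):
    """Pick the index of a frame in which no face area is recognized."""
    for idx, boxes in enumerate(detections):
        if len(boxes) == 0:
            return idx
    return None

def select_crowded_frame(detections, target_object_total):
    """Pick the index of a frame whose detected face count exceeds
    the total number of target objects."""
    for idx, boxes in enumerate(detections):
        if len(boxes) > target_object_total:
            return idx
    return None

# Example: detection results for three video frames.
detections = [[], [(10, 10, 60, 60)],
              [(0, 0, 50, 50), (60, 0, 110, 50), (0, 60, 50, 110)]]
print(select_empty_frame(detections))        # 0
print(select_crowded_frame(detections, 2))   # 2
print(round(face_area_ratio(detections[2], 256, 256), 3))
```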
Therefore, the trained target face detection model can be applied to an actual service scene, so that face detection can be performed on the target-style image, and the image content meeting the requirements can be rapidly acquired.
The following describes the relevant processing procedure by taking the conversion of an image into a cartoon style as an example with reference to the accompanying drawings:
referring to fig. 7A, which is a schematic diagram illustrating the process of training to obtain the style conversion model in an embodiment of the present application. As shown in fig. 7A, constructing the style conversion model involves alternate countermeasure training of the initial generation network and the initial discrimination network, where the initial discrimination network includes an authenticity discrimination sub-network and a style discrimination sub-network. The three inputs of the initial generation network illustrated in fig. 7A are only schematic; in actual processing, these images are processed through two forward propagations, and similarly, the three inputs of the authenticity discrimination sub-network and the style discrimination sub-network are processed through three forward propagations during actual training.
Continuing to refer to fig. 7A, in the alternate countermeasure training process, the related loss values include two types, namely the face position consistency loss and the style consistency loss, where the style consistency loss includes the generation countermeasure loss, the first contrast loss determined for the authenticity discrimination sub-network, and the second contrast loss determined for the style discrimination sub-network.
When network parameters are adjusted for the initial generation network, the adjustment is based on the face position consistency loss and the generation countermeasure loss; when network parameters are adjusted for the authenticity discrimination sub-network, the adjustment is based on the generation countermeasure loss and the first contrast loss; and when network parameters are adjusted for the style discrimination sub-network, the adjustment is based on the generation countermeasure loss and the second contrast loss.
Further, referring to fig. 7B, which is a schematic diagram of a process of training a face detection model based on an output result of a style conversion model in an embodiment of the present application, as will be described with reference to fig. 7B, after the processing device constructs the style conversion model based on the trained target generation network, the processing device inputs a style indication image and a real face image of a cartoon style into the style conversion model to obtain a target image of the cartoon style; and constructing a training sample by face annotation data of the target image and the real face image, and training the pre-trained end-to-end cartoon face detection model to obtain a trained cartoon face detection model.
Further, referring to fig. 7C, which is a schematic diagram of the process of image processing based on the cartoon face detection model in the embodiment of the present application. As illustrated in fig. 7C, the technical scheme provided by the present application can be well applied to various functional frame selection systems for cartoons. After the trained face detection model is used to perform face detection on video frames, the position, size and area-ratio information of the face regions can be obtained, and this information provides effective help for downstream frame selection tasks.
In a specific application scenario, suppose that an image containing a person needs to be selected from a video; the video frames containing people can then be rapidly located through the face information obtained by the face detection model. In addition, using the face detection model for face detection can improve not only the accuracy of frame selection but also its efficiency. In the prior art, people need to be searched for manually in the video, which makes the frame selection process time-consuming and labor-intensive and prone to omissions or mistakes; by contrast, using a face detection model for face detection, the required video frames can be found quickly in a short time. Therefore, time and labor costs can be saved, and the quality and efficiency of frame selection can be improved.
In the technical scheme provided by the present application, the generation position of the face is strictly constrained in the process of training the initial generation network and the initial discrimination network, so that the style conversion model is sensitive to the generation position within the image, deviation of the generated face position is avoided, the position-deviation problem of generation networks is solved, the generation quality and accuracy of the image are improved, and sample data suitable for face detection training can be generated. Moreover, considering that the distribution gap between the target style and the real style of images is large, and this difference may degrade the performance of the face detection model, the present application designs two discriminators in order to reduce the difference between the two styles: one discriminator is used for authenticity judgment, and the other is used for distinguishing styles. In addition, to further improve the generation quality so that the similarity between classes increases, the first contrast loss and the second contrast loss are also employed to strengthen the constraint.
Furthermore, large-scale and high-quality cartoon face data can be generated by means of the style conversion model, the algorithm model is used for replacing artificial labeling, huge labeling cost can be saved, iteration efficiency is improved, and generalization capability of the cartoon face detection model is enhanced.
Based on the same inventive concept, referring to fig. 8, which is a schematic logic structure diagram of an image style conversion device according to an embodiment of the present application, the image style conversion device 800 includes a first acquiring unit 801, a second acquiring unit 802, and a generating unit 803, where,
A first acquiring unit 801 configured to acquire a style indication image of a target style;
A second obtaining unit 802, configured to obtain an original image associated with face labeling data, where the face labeling data is used to identify a face position in the original image;
A generating unit 803, configured to use a style conversion model, extract a style image feature based on the style indication image, extract an original image feature based on the original image, and reversely generate a target image of a target style based on a fusion result of the style image feature and the original image feature;
Wherein a face image of the target style is generated in a content area corresponding to the face annotation data in the target image; the style conversion model is constructed based on the target generation network obtained by training after completing the countermeasure training of the initial generation network and the initial discrimination network.
Optionally, the apparatus further includes a training unit 804, where the style conversion model is obtained by the training unit 804 in the following manner:
Obtaining each training sample, wherein one training sample comprises the following steps: a sample style image of a target style, other style images of a non-target style, a sample image, face annotation data of the sample image, and a face image cut out from the sample image according to the face annotation data;
And carrying out multiple rounds of alternate training on the built initial generation network and the built initial discrimination network according to each training sample until a preset convergence condition is met, obtaining a target generation network based on the initial generation network training, and building a style conversion model based on the target generation network.
Optionally, each training sample is generated by training unit 804 in the following manner:
Obtaining original samples for real face detection, wherein one original sample comprises the following steps: one sample image, and face annotation data of the one sample image;
for each original sample, the following is performed: cutting at least one face image from the sample image according to the face annotation data of the sample image in the original sample, acquiring a sample style image of a target style and other style images of a non-target style, and forming at least one training sample based on the original sample, the at least one face image, the sample style image and the other style images.
Optionally, during a round of training for the initial generation network, the training unit 804 is configured to perform the following operations:
Generating a predicted image based on the read sample style image and the sample image and generating a predicted face image based on the read sample style image and the face image by adopting an initial generation network;
for a predicted image, a sample style image in a training sample and other style images, a style discrimination sub-network in an initial discrimination network is adopted to output a style discrimination result, and an authenticity discrimination sub-network in the initial discrimination network is adopted to output an authenticity discrimination result;
Based on the face labeling data of the sample image, a prediction sub-image is cut out from the prediction image, and based on the pixel value difference between the prediction sub-image and the prediction face image and the result difference between the style discrimination result and the true discrimination result, the network parameters of the initial generation network are adjusted.
Optionally, during a round of training for the initial discrimination network, the training unit 804 is configured to perform the following operations:
Generating a predicted image based on the read sample style image and the sample image and generating a predicted face image based on the read sample style image and the face image by adopting an initial generation network;
for a predicted image, a sample style image in a training sample and other style images, a style discrimination sub-network in an initial discrimination network is adopted to output a style discrimination result, and an authenticity discrimination sub-network in the initial discrimination network is adopted to output an authenticity discrimination result;
Based on the result difference between the style discrimination result and the true discrimination result, and the characteristic difference of the image characteristic of the appointed layer extracted by the predicted image, the sample style image and the other style images, the network parameters of the initial discrimination network are adjusted.
Optionally, when adjusting the network parameters of the initial discrimination network based on the result differences between the style discrimination result and the authenticity discrimination result and the corresponding real discrimination results, and on the feature differences of the image features extracted at the specified layer for the predicted image, the sample style image and the other style images respectively, inside the initial discrimination network, the training unit 804 is configured to:
based on the result difference between the style discrimination result and the true discrimination result and the corresponding true discrimination result, and based on the appointed network layer in the true discrimination sub-network, corresponding to the predicted image, the sample style image and other style images, respectively outputting the characteristic difference between the image characteristics, and adjusting the network parameters of the true discrimination sub-network;
Based on the result difference between the style discrimination result and the true discrimination result, and based on the appointed network layer in the style discrimination sub-network, corresponding to the feature difference among the image features respectively output by the predicted image, the sample style image and other style images, the network parameters of the style discrimination sub-network are adjusted.
Optionally, when generating the predicted image based on the read sample style image and the sample image, the training unit 804 is configured to:
extracting style image characteristics based on sample style images in the read training samples;
extracting sample image characteristics based on sample images in the training samples;
and reversely generating a predicted image based on the fusion result of the style image characteristics and the sample image characteristics.
Optionally, after reversely generating the target image of the target style, the generating unit 803 is further configured to:
determining the face annotation data of the original image as the face annotation data of the target image, and constructing a training sample of the target style based on the target image and the face annotation data thereof;
And carrying out multiple rounds of iterative training on the constructed initial face detection model by adopting a training sample of a target style until a preset convergence condition is met, so as to obtain the target face detection model.
Optionally, after outputting the trained target face detection model, the generating unit 803 is further configured to:
Acquiring a video frame sequence of a target style;
Adopting a target face detection model, respectively carrying out face detection processing on each video frame in a video frame sequence, and identifying face region information corresponding to each video frame;
and selecting a target video frame with the face state meeting the set condition from the video frames based on the face region information corresponding to each video frame.
For convenience of description, the above parts are described as being functionally divided into modules (or units) respectively. Of course, the functions of each module (or unit) may be implemented in the same piece or pieces of software or hardware when implementing the present application.
Having described the style conversion method and apparatus of an image according to an exemplary embodiment of the present application, next, an electronic device according to another exemplary embodiment of the present application is described.
Those skilled in the art will appreciate that the various aspects of the application may be implemented as a system, method, or program product. Accordingly, aspects of the application may be embodied in the following forms, namely: an entirely hardware embodiment, an entirely software embodiment (including firmware, micro-code, etc.), or an embodiment combining hardware and software aspects, which may be collectively referred to herein as a "circuit," "module" or "system".
The embodiment of the application also provides electronic equipment based on the same conception as the embodiment of the method. Referring to fig. 9, a schematic diagram of a hardware component of an electronic device to which an embodiment of the present application is applied, in one embodiment, the electronic device may be the processing device 120 shown in fig. 1. In this embodiment, the electronic device may be configured as shown in fig. 9, including a memory 901, a communication module 903, and one or more processors 902.
A memory 901 for storing a computer program executed by the processor 902. The memory 901 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, a program required for running an instant communication function, and the like; the storage data area can store various instant messaging information, operation instruction sets and the like.
The memory 901 may be a volatile memory, such as a random-access memory (RAM); the memory 901 may also be a non-volatile memory, such as a read-only memory (ROM), a flash memory, a hard disk drive (HDD) or a solid-state drive (SSD); or the memory 901 may be any other medium that can be used to carry or store a desired computer program in the form of instructions or data structures and that can be accessed by a computer, but is not limited thereto. The memory 901 may also be a combination of the above memories.
The processor 902 may include one or more central processing units (central processing unit, CPUs) or digital processing units, or the like. A processor 902 for implementing the style conversion method of the image when calling the computer program stored in the memory 901.
The communication module 903 is used to communicate with the client device and the server.
The specific connection medium between the memory 901, the communication module 903, and the processor 902 is not limited in the embodiment of the present application. In fig. 9, the memory 901 and the processor 902 are connected by a bus 904, which is depicted in bold in fig. 9; the connection manner between the other components is merely illustrative and not limiting. The bus 904 may be divided into an address bus, a data bus, a control bus, and the like. For ease of description, only one thick line is depicted in fig. 9, but this does not mean that there is only one bus or one type of bus.
The memory 901 stores a computer storage medium in which computer executable instructions for implementing the style conversion method of an image according to an embodiment of the present application are stored. The processor 902 is configured to perform the style conversion method of the image as described above, as shown in fig. 5A.
In another embodiment, the electronic device may be another electronic device, and referring to fig. 10, a schematic diagram of a hardware composition of another electronic device to which the embodiment of the present application is applied, where the electronic device may specifically be the client device 110 shown in fig. 1. In this embodiment, the structure of the electronic device may include, as shown in fig. 10: communication component 1010, memory 1020, display unit 1030, camera 1040, sensor 1050, audio circuit 1060, bluetooth module 1070, processor 1080 and the like.
The communication component 1010 is used to communicate with a server. In some embodiments, a wireless fidelity (WiFi) module circuit may be included; the WiFi module belongs to short-range wireless transmission technology, and the electronic device may help the user send and receive information through the WiFi module.
Memory 1020 may be used to store software programs and data. Processor 1080 performs various functions and data processing of client device 210 by executing software programs or data stored in memory 1020. The memory 1020 may store an operating system and various application programs, and may also store a computer program for executing a style conversion method of an image according to an embodiment of the present application.
The display unit 1030 may be used to display information entered by a user or provided to a user, as well as a graphical user interface (GUI) of various menus of the client device 210. In particular, the display unit 1030 may include a display screen 1032 disposed on the front of the client device 210. The display unit 1030 may be used to display a page of a conversion operation of an image of the target style, or the like, in the embodiment of the present application.
The display unit 1030 may also be used to receive input numeric or character information and generate signal inputs related to user settings and function control of the client device 210. In particular, the display unit 1030 may include a touch screen 1031 disposed on the front of the client device 210 and may collect touch operations thereon or thereabout by a user.
The touch screen 1031 may be covered on the display screen 1032, or the touch screen 1031 may be integrated with the display screen 1032 to implement the input and output functions of the client device 210, and after integration, the touch screen may be simply referred to as a touch screen. The display unit 1030 may display an application program and corresponding operation steps in the present application.
The camera 1040 may be used to capture still images, and the user may comment on the images captured by the camera 1040 through the application. The photographed object generates an optical image through the lens and projects it onto a photosensitive element. The photosensitive element may be a charge-coupled device (CCD) or a complementary metal-oxide-semiconductor (CMOS) phototransistor. The photosensitive element converts the optical signal into an electrical signal, which is then passed to the processor 1080 to be converted into a digital image signal.
The client device may also include at least one sensor 1050, such as an acceleration sensor 1051, a distance sensor 1052, a fingerprint sensor 1053, and a temperature sensor 1054. The client device may also be configured with other sensors such as gyroscopes, barometers, hygrometers, thermometers, infrared sensors, light sensors, motion sensors, and the like.
Audio circuitry 1060, speakers 1061, microphone 1062 may provide an audio interface between a user and the client device 210. Audio circuit 1060 may transmit the received electrical signal after conversion of the audio data to speaker 1061 for conversion by speaker 1061 into an audio signal output. On the other hand, microphone 1062 converts the collected sound signals into electrical signals, which are received by audio circuitry 1060 and converted into audio data, which are output to communications component 1010 for transmission to, for example, another client device 210, or to memory 1020 for further processing.
The bluetooth module 1070 is used for exchanging information with other bluetooth devices having a bluetooth module through a bluetooth protocol.
Processor 1080 is a control center of the client device and connects the various parts of the overall terminal using various interfaces and lines, performs various functions of the client device and processes data by running or executing software programs stored in memory 1020 and invoking data stored in memory 1020. In some embodiments, processor 1080 may include at least one processing unit; processor 1080 may also integrate the application processor and the baseband processor. Processor 1080 of the present application may run an operating system, applications, user interface displays, and touch responses, as well as methods for style conversion of images according to embodiments of the present application. In addition, a processor 1080 is coupled to the display unit 1030.
In some possible embodiments, aspects of the method for style conversion of an image provided by the present application may also be implemented in the form of a program product comprising a computer program for causing an electronic device to perform the steps in the method for style conversion of an image according to the various exemplary embodiments of the present application described herein above when the program product is run on the electronic device, e.g. the electronic device may perform the steps as shown in fig. 5A.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. The readable storage medium can be, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include the following: an electrical connection having one or more wires, a portable disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The program product of embodiments of the present application may take the form of a portable compact disc read only memory (CD-ROM) and comprise a computer program and may be run on an electronic device. However, the program product of the present application is not limited thereto, and in this document, a readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with a command execution system, apparatus, or device.
The readable signal medium may comprise a data signal propagated in baseband or as part of a carrier wave in which a readable computer program is embodied. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with a command execution system, apparatus, or device.
A computer program embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer programs for performing the operations of the present application may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The computer program may execute entirely on the consumer electronic device, partly on the consumer electronic device, as a stand-alone software package, partly on the consumer electronic device and partly on a remote electronic device or entirely on the remote electronic device or server. In the case of remote electronic devices, the remote electronic device may be connected to the consumer electronic device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external electronic device (e.g., connected through the internet using an internet service provider).
It should be noted that although several units or sub-units of the apparatus are mentioned in the above detailed description, such a division is merely exemplary and not mandatory. Indeed, the features and functions of two or more of the elements described above may be embodied in one element in accordance with embodiments of the present application. Conversely, the features and functions of one unit described above may be further divided into a plurality of units to be embodied.
Furthermore, although the operations of the methods of the present application are depicted in the drawings in a particular order, this is not required or suggested that these operations must be performed in this particular order or that all of the illustrated operations must be performed in order to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step to perform, and/or one step decomposed into multiple steps to perform.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having a computer-usable computer program embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing apparatus produce means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present application have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the following claims be interpreted as including the preferred embodiments and all such alterations and modifications as fall within the scope of the application.
It will be apparent to those skilled in the art that various modifications and variations can be made to the present application without departing from the spirit or scope of the application. Thus, it is intended that the present application also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.

Claims (14)

1. A style conversion method of an image, comprising:
acquiring a style indication image of a target style;
Acquiring an original image associated with face annotation data, wherein the face annotation data is used for identifying the position of a face in the original image;
Extracting style image characteristics based on the style indication image by adopting a style conversion model, extracting original image characteristics based on the original image, and reversely generating a target image of the target style based on a fusion result of the style image characteristics and the original image characteristics;
The face image with the target style is generated in a content area corresponding to the face annotation data in the target image; the style conversion model is constructed based on a target generation network obtained by training after finishing the countermeasure training of the initial generation network and the initial discrimination network;
In the training process of the initial discrimination network, the following operations are executed:
generating a prediction image based on the sample style image and the sample image in the read training sample by adopting the initial generation network;
for the predicted image, the sample style image and other style images in the training sample, adopting a style discrimination sub-network in the initial discrimination network to output a style discrimination result, and adopting an authenticity discrimination sub-network in the initial discrimination network to output an authenticity discrimination result;
Based on the result difference between the style discrimination result and the true discrimination result and the corresponding true discrimination result, and based on a designated network layer in the true discrimination sub-network, corresponding to the predicted image, the sample style image and other style images, respectively outputting characteristic differences among the image characteristics, and adjusting network parameters of the true discrimination sub-network; and adjusting network parameters of the style discrimination sub-network based on the result difference between the style discrimination result and the true discrimination result and the corresponding true discrimination result, and based on the appointed network layer in the style discrimination sub-network, corresponding to the feature difference among the image features respectively output by the prediction image, the sample style image and other style images.
2. The method of claim 1, wherein the style conversion model is obtained by:
obtaining each training sample, wherein one training sample comprises the following steps: a sample style image of the target style, a other style image than the target style, a sample image, face annotation data of the sample image, and a face image cut out from the sample image according to the face annotation data;
and carrying out multiple rounds of alternate training on the built initial generation network and the built initial discrimination network according to the training samples until a preset convergence condition is met, obtaining a target generation network based on the initial generation network training, and building a style conversion model based on the target generation network.
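
A minimal sketch of the alternating training in claim 2 follows, assuming the per-sample update steps are supplied as callables (for example, steps like those sketched after claims 1 and 4); the stopping rule shown is only an illustrative stand-in for the preset convergence condition.

```python
# Hypothetical alternating-training driver for claim 2; the stopping rule is
# only a stand-in for "a preset convergence condition is met".
from typing import Callable, Sequence

def alternate_training(samples: Sequence,
                       update_discriminator: Callable[[object], float],
                       update_generator: Callable[[object], float],
                       max_rounds: int = 50,
                       tol: float = 1e-3) -> None:
    """Alternate discriminator and generator updates over the training samples."""
    prev_avg = float("inf")
    for _ in range(max_rounds):
        gen_losses = []
        for sample in samples:
            update_discriminator(sample)                 # discriminator round
            gen_losses.append(update_generator(sample))  # generator round
        avg = sum(gen_losses) / max(len(gen_losses), 1)
        if abs(prev_avg - avg) < tol:                    # assumed convergence test
            break
        prev_avg = avg
```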
3. The method of claim 2, wherein each training sample is generated by:
obtaining original samples used for real face detection, wherein one original sample comprises: a sample image and face annotation data of the sample image;
for each original sample, performing the following: cutting at least one face image out of the sample image in the original sample according to the face annotation data of the sample image, acquiring a sample style image of the target style and an other style image that is not of the target style, and forming at least one training sample based on the original sample, the at least one face image, the sample style image and the other style image.
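
A small sketch of the sample assembly in claim 3; the annotation format (a list of (left, top, right, bottom) boxes) and the field names are assumptions, not the patent's data model.

```python
# Hypothetical training-sample assembly for claim 3 (box format and field
# names are assumptions).
from dataclasses import dataclass
from typing import List, Tuple
from PIL import Image

Box = Tuple[int, int, int, int]  # (left, top, right, bottom)

@dataclass
class TrainingSample:
    sample_image: Image.Image
    face_boxes: List[Box]            # face annotation data
    face_crops: List[Image.Image]    # face images cut from the sample image
    style_image: Image.Image         # sample style image of the target style
    other_style_image: Image.Image   # image that is not of the target style

def build_training_sample(sample_img: Image.Image, face_boxes: List[Box],
                          style_img: Image.Image,
                          other_style_img: Image.Image) -> TrainingSample:
    """Cut at least one face image per annotated box and bundle everything."""
    face_crops = [sample_img.crop(box) for box in face_boxes]
    return TrainingSample(sample_img, face_boxes, face_crops,
                          style_img, other_style_img)
```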
4. The method of claim 2, wherein during a round of training of the initial generation network, the following operations are performed:
generating, by using the initial generation network, a predicted image based on the read sample style image and sample image, and generating a predicted face image based on the read sample style image and face image;
for the predicted image, the sample style image and other style images in the training sample, outputting a style discrimination result by using the style discrimination sub-network in the initial discrimination network, and outputting an authenticity discrimination result by using the authenticity discrimination sub-network in the initial discrimination network;
cutting a predicted sub-image out of the predicted image based on the face annotation data of the sample image, and adjusting network parameters of the initial generation network based on a pixel value difference between the predicted sub-image and the predicted face image, and on result differences between the style discrimination result and the authenticity discrimination result, respectively, and the corresponding ground-truth discrimination results.
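
An illustrative generator update for claim 4, assuming a generator callable taking (style image, content image) and discriminators that return (logits, features) as in the sketch after claim 1; the loss weights and the resizing of the cut-out sub-image are assumptions.

```python
# Hypothetical generator update for claim 4 (interfaces and weights assumed).
import torch
import torch.nn.functional as F

def generator_step(gen, d_auth, d_style, style_img, sample_img, face_img,
                   face_box, opt_g, pixel_weight=10.0):
    pred_img = gen(style_img, sample_img)   # predicted (stylised) full image
    pred_face = gen(style_img, face_img)    # predicted (stylised) face image
    # cut the predicted sub-image at the annotated face location (NCHW tensors)
    left, top, right, bottom = face_box
    pred_sub = pred_img[:, :, top:bottom, left:right]
    pred_sub = F.interpolate(pred_sub, size=pred_face.shape[-2:],
                             mode="bilinear", align_corners=False)
    pixel_loss = F.l1_loss(pred_sub, pred_face)          # pixel value difference
    # adversarial terms: the generator tries to make both sub-networks
    # classify the predicted image as real / target style
    out_auth, _ = d_auth(pred_img)
    out_style, _ = d_style(pred_img)
    adv_loss = (F.binary_cross_entropy_with_logits(out_auth, torch.ones_like(out_auth))
              + F.binary_cross_entropy_with_logits(out_style, torch.ones_like(out_style)))
    loss = adv_loss + pixel_weight * pixel_loss
    opt_g.zero_grad()
    loss.backward()
    opt_g.step()
    return loss.item()
```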
5. The method of any of claims 1, 3-4, wherein generating a predicted image based on the read sample style image and the sample image comprises:
extracting style image features based on the sample style image in the read training sample;
extracting sample image features based on the sample image in the training sample;
and inversely generating a predicted image based on a fusion result of the style image features and the sample image features.
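
A minimal encoder/decoder sketch of the extract-fuse-decode flow in claim 5, assuming channel concatenation as the fusion; the patent does not commit to a particular fusion operator or network depth, so this is illustrative only.

```python
# Hypothetical generator for the fusion-then-decode flow in claim 5
# (channel concatenation as the fusion is an assumption; both inputs are
# assumed to share the same spatial size).
import torch
import torch.nn as nn

class StyleTransferGenerator(nn.Module):
    def __init__(self, ch: int = 64):
        super().__init__()
        self.style_enc = nn.Sequential(           # style image features
            nn.Conv2d(3, ch, 3, 2, 1), nn.ReLU(),
            nn.Conv2d(ch, ch, 3, 2, 1), nn.ReLU())
        self.content_enc = nn.Sequential(          # sample / original image features
            nn.Conv2d(3, ch, 3, 2, 1), nn.ReLU(),
            nn.Conv2d(ch, ch, 3, 2, 1), nn.ReLU())
        self.decoder = nn.Sequential(              # decodes the fusion result
            nn.Conv2d(2 * ch, ch, 3, 1, 1), nn.ReLU(),
            nn.Upsample(scale_factor=2, mode="nearest"),
            nn.Conv2d(ch, ch, 3, 1, 1), nn.ReLU(),
            nn.Upsample(scale_factor=2, mode="nearest"),
            nn.Conv2d(ch, 3, 3, 1, 1), nn.Tanh())

    def forward(self, style_img: torch.Tensor, content_img: torch.Tensor):
        style_feat = self.style_enc(style_img)
        content_feat = self.content_enc(content_img)
        fused = torch.cat([style_feat, content_feat], dim=1)  # fusion result
        return self.decoder(fused)   # image generated back from the fused features
```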
6. The method of any of claims 1-4, wherein after the target image of the target style is inversely generated, the method further comprises:
determining the face annotation data of the original image as face annotation data of the target image, and constructing a training sample of the target style based on the target image and the face annotation data thereof;
and performing multiple rounds of iterative training on a constructed initial face detection model by using the training sample of the target style until a preset convergence condition is met, to obtain a target face detection model.
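
A rough sketch of claim 6's label reuse and iterative training follows; the detector interface (a compute_loss method), the dataset field names, and the convergence test are all hypothetical, not APIs defined by the patent.

```python
# Hypothetical training loop for claim 6: the original image's face annotation
# is reused as the label for the stylised target image.
import torch

def train_styled_face_detector(detector, generator, optimizer, dataset,
                               max_rounds: int = 100, target_loss: float = 0.05):
    for _ in range(max_rounds):
        total = 0.0
        for sample in dataset:  # .style_image / .original_image / .face_boxes assumed
            with torch.no_grad():
                styled = generator(sample.style_image, sample.original_image)
            loss = detector.compute_loss(styled, sample.face_boxes)  # hypothetical API
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            total += loss.item()
        if total / max(len(dataset), 1) < target_loss:  # stand-in convergence condition
            break
    return detector
```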
7. The method of claim 6, wherein after the trained target face detection model is output, the method further comprises:
acquiring a video frame sequence of the target style;
performing face detection processing on each video frame in the video frame sequence by using the target face detection model, to identify face region information corresponding to each video frame;
and selecting, from the video frames and based on the face region information corresponding to each video frame, a target video frame whose face state meets a set condition.
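
A short sketch of the frame selection in claim 7; interpreting "face state meeting the set condition" as the largest detected face area is an assumption, and detect_faces stands in for the target face detection model.

```python
# Hypothetical target-frame selection for claim 7 ("set condition" assumed to
# be the largest detected face area).
from typing import Callable, List, Optional, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom)

def select_target_frame(frames: Sequence,
                        detect_faces: Callable[[object], List[Box]]) -> Optional[int]:
    """Return the index of the video frame with the largest detected face, if any."""
    best_idx, best_area = None, 0
    for idx, frame in enumerate(frames):
        for left, top, right, bottom in detect_faces(frame):  # face region information
            area = max(0, right - left) * max(0, bottom - top)
            if area > best_area:
                best_idx, best_area = idx, area
    return best_idx
```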
8. An image style conversion apparatus, comprising:
a first acquisition unit, configured to acquire a style indication image of a target style;
a second acquisition unit, configured to acquire an original image associated with face annotation data, wherein the face annotation data is used to identify the position of a face in the original image;
a generating unit, configured to extract, by using a style conversion model, style image features based on the style indication image and original image features based on the original image, and to inversely generate a target image of the target style based on a fusion result of the style image features and the original image features;
wherein a face image of the target style is generated in a content area of the target image corresponding to the face annotation data, and the style conversion model is constructed based on a target generation network obtained after adversarial training of an initial generation network and an initial discrimination network is completed;
wherein, during training of the initial discrimination network, the following operations are performed:
generating, by using the initial generation network, a predicted image based on a sample style image and a sample image in a read training sample;
for the predicted image, the sample style image and other style images in the training sample, outputting a style discrimination result by using a style discrimination sub-network in the initial discrimination network, and outputting an authenticity discrimination result by using an authenticity discrimination sub-network in the initial discrimination network;
adjusting network parameters of the authenticity discrimination sub-network based on result differences between the style discrimination result and the authenticity discrimination result, respectively, and the corresponding ground-truth discrimination results, and based on feature differences among the image features output by a designated network layer in the authenticity discrimination sub-network for the predicted image, the sample style image and the other style images, respectively; and adjusting network parameters of the style discrimination sub-network based on the result differences between the style discrimination result and the authenticity discrimination result, respectively, and the corresponding ground-truth discrimination results, and based on feature differences among the image features output by the designated network layer in the style discrimination sub-network for the predicted image, the sample style image and the other style images, respectively.
9. The apparatus of claim 8, further comprising a training unit, wherein the style conversion model is obtained by the training unit by:
obtaining training samples, wherein one training sample comprises: a sample style image of the target style, an other style image that is not of the target style, a sample image, face annotation data of the sample image, and a face image cut out from the sample image according to the face annotation data;
and performing multiple rounds of alternate training on a constructed initial generation network and a constructed initial discrimination network according to the training samples until a preset convergence condition is met, obtaining a target generation network based on the trained initial generation network, and constructing the style conversion model based on the target generation network.
10. The apparatus according to claim 8 or 9, wherein after the target image of the target style is inversely generated, the generating unit is further configured to:
determine the face annotation data of the original image as face annotation data of the target image, and construct a training sample of the target style based on the target image and the face annotation data thereof;
and perform multiple rounds of iterative training on a constructed initial face detection model by using the training sample of the target style until a preset convergence condition is met, to obtain a target face detection model.
11. The apparatus of claim 10, wherein after the trained target face detection model is output, the generating unit is further configured to:
acquire a video frame sequence of the target style;
perform face detection processing on each video frame in the video frame sequence by using the target face detection model, to identify face region information corresponding to each video frame;
and select, from the video frames and based on the face region information corresponding to each video frame, a target video frame whose face state meets a set condition.
12. An electronic device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the method of any of claims 1-7.
13. A computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the method according to any of claims 1-7.
14. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any of claims 1-7.
CN202410055822.3A 2024-01-15 2024-01-15 Method and device for converting style of image, electronic equipment and storage medium Active CN117576245B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410055822.3A CN117576245B (en) 2024-01-15 2024-01-15 Method and device for converting style of image, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410055822.3A CN117576245B (en) 2024-01-15 2024-01-15 Method and device for converting style of image, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN117576245A CN117576245A (en) 2024-02-20
CN117576245B true CN117576245B (en) 2024-05-07

Family

ID=89862636

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410055822.3A Active CN117576245B (en) 2024-01-15 2024-01-15 Method and device for converting style of image, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN117576245B (en)

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109816589A (en) * 2019-01-30 2019-05-28 北京字节跳动网络技术有限公司 Method and apparatus for generating cartoon style transformation model
CN112150489A (en) * 2020-09-25 2020-12-29 北京百度网讯科技有限公司 Image style conversion method and device, electronic equipment and storage medium
CN112232425A (en) * 2020-10-21 2021-01-15 腾讯科技(深圳)有限公司 Image processing method, image processing device, storage medium and electronic equipment
CN112967174A (en) * 2021-01-21 2021-06-15 北京达佳互联信息技术有限公司 Image generation model training method, image generation device and storage medium
CN112967180A (en) * 2021-03-17 2021-06-15 福建库克智能科技有限公司 Training method for generating countermeasure network, and image style conversion method and device
WO2023061169A1 (en) * 2021-10-11 2023-04-20 北京字节跳动网络技术有限公司 Image style migration method and apparatus, image style migration model training method and apparatus, and device and medium
CN114610677A (en) * 2022-03-10 2022-06-10 腾讯科技(深圳)有限公司 Method for determining conversion model and related device
CN115187706A (en) * 2022-06-28 2022-10-14 北京汉仪创新科技股份有限公司 Lightweight method and system for face style migration, storage medium and electronic equipment
CN115393177A (en) * 2022-07-21 2022-11-25 阿里巴巴(中国)有限公司 Face image processing method and electronic equipment
CN116310008A (en) * 2023-05-11 2023-06-23 深圳大学 Image processing method based on less sample learning and related equipment

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Heterogeneous face image synthesis based on generative adversarial networks: progress and challenges; Huang Fei; Gao Fei; Zhu Jingjie; Dai Lingna; Yu Jun; Journal of Nanjing University of Information Science & Technology (Natural Science Edition); 2019-11-28 (06); pp. 40-61 *

Also Published As

Publication number Publication date
CN117576245A (en) 2024-02-20

Similar Documents

Publication Publication Date Title
RU2714096C1 (en) Method, equipment and electronic device for detecting a face vitality
EP4198814A1 (en) Gaze correction method and apparatus for image, electronic device, computer-readable storage medium, and computer program product
US10832069B2 (en) Living body detection method, electronic device and computer readable medium
TWI766201B (en) Methods and devices for biological testing and storage medium thereof
WO2021139324A1 (en) Image recognition method and apparatus, computer-readable storage medium and electronic device
CN111444826B (en) Video detection method, device, storage medium and computer equipment
CN110465089B (en) Map exploration method, map exploration device, map exploration medium and electronic equipment based on image recognition
CN113705316A (en) Method, device and equipment for acquiring virtual image and storage medium
WO2023168957A1 (en) Pose determination method and apparatus, electronic device, storage medium, and program
WO2023197648A1 (en) Screenshot processing method and apparatus, electronic device, and computer readable medium
CN114241379A (en) Passenger abnormal behavior identification method, device and equipment and passenger monitoring system
CN111881740A (en) Face recognition method, face recognition device, electronic equipment and medium
Gündüz et al. Turkish sign language recognition based on multistream data fusion
CN114359775A (en) Key frame detection method, device, equipment, storage medium and program product
CN114360073A (en) Image identification method and related device
WO2024041235A1 (en) Image processing method and apparatus, device, storage medium and program product
CN117576245B (en) Method and device for converting style of image, electronic equipment and storage medium
CN115311723A (en) Living body detection method, living body detection device and computer-readable storage medium
CN117011449A (en) Reconstruction method and device of three-dimensional face model, storage medium and electronic equipment
CN114581978A (en) Face recognition method and system
CN114064973B (en) Video news classification model establishing method, classification method, device and equipment
CN117079336B (en) Training method, device, equipment and storage medium for sample classification model
CN117011218A (en) Defect detection method and device, electronic equipment and storage medium
Naik et al. The evolution of military operations: artificial intelligence to detect hand gestures in defence
CN116977157A (en) Image processing method, device, equipment, medium and program product

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant