CN116758379B - Image processing method, device, equipment and storage medium - Google Patents

Image processing method, device, equipment and storage medium

Info

Publication number
CN116758379B
CN116758379B
Authority
CN
China
Prior art keywords
image
sample
model
target
processing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202311017779.3A
Other languages
Chinese (zh)
Other versions
CN116758379A (en)
Inventor
张菁芸
侯锦坤
郭润增
王少鸣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202311017779.3A priority Critical patent/CN116758379B/en
Publication of CN116758379A publication Critical patent/CN116758379A/en
Application granted granted Critical
Publication of CN116758379B publication Critical patent/CN116758379B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74 Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06V10/778 Active pattern-learning, e.g. online learning of image or video features
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Medical Informatics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

Embodiments of the application provide an image processing method, an image processing device, image processing equipment and a storage medium. The image processing method comprises: acquiring a target image having a first imaging style; performing style conversion processing on the target image according to a style conversion rule, and performing recognition processing on the reference image obtained from that processing to obtain an image recognition result. The reference image has a second imaging style. The style conversion rule is obtained by training and learning based on first content features of the sample images of a sample group and second content features of those sample images, where the first content features describe image object features of the target object that are common across different sample images, and the second content features describe detail features of the target object in a sample image. The sample images of a sample group are images with different imaging styles acquired for the same target object by different image acquisition modules. In this way, the effectiveness and accuracy of image recognition after imaging-style conversion can be improved.

Description

Image processing method, device, equipment and storage medium
Technical Field
The present application relates to the field of computer technologies, and in particular, to an image processing method, apparatus, device, and storage medium.
Background
With the development of artificial intelligence technology, image recognition is widely applied in various business scenarios. For example, in face-swipe scenarios (such as face-swipe payment or face-swipe login), business processing may be carried out by recognizing face images. However, images collected by different image acquisition modules differ in imaging style, for example in sharpness and exposure brightness. If the image acquisition module used in a business scenario is replaced, the imaging style of the collected images changes, and the feature extraction and recognition procedures tuned to the previous image acquisition module are adversely affected by that change, so recognition effectiveness and accuracy are reduced.
Disclosure of Invention
The embodiment of the application provides an image processing method, an image processing device, image processing equipment and a storage medium, which can improve the effectiveness and the accuracy of image identification after imaging style conversion.
In one aspect, an embodiment of the present application provides an image processing method, including:
acquiring a target image to be processed, wherein the target image has a first imaging style;
performing style conversion processing on the target image according to the style conversion rule to obtain a reference image corresponding to the target image, wherein the reference image has a second imaging style; the style conversion rule is obtained by training and learning based on first content features of sample images of the sample group and second content features of sample images of the sample group, wherein the first content features are used for describing common image object features of target objects in the sample images among different sample images, and the second content features are used for describing detail features of the target objects in the sample images;
performing recognition processing on the reference image to obtain an image recognition result;
each sample image of the sample group comprises images acquired by adopting different image acquisition modules aiming at a target object, and the sample images acquired by the different image acquisition modules have different imaging styles.
In another aspect, an embodiment of the present application provides an image processing apparatus, including:
The device comprises an acquisition unit, a processing unit and a processing unit, wherein the acquisition unit is used for acquiring a target image to be processed, and the target image has a first imaging style;
The processing unit is used for carrying out style conversion processing on the target image according to the style conversion rule to obtain a reference image corresponding to the target image, wherein the reference image has a second imaging style; the style conversion rule is obtained by training and learning based on first content features of sample images of the sample group and second content features of sample images of the sample group, wherein the first content features are used for describing common image object features of target objects in the sample images among different sample images, and the second content features are used for describing detail features of the target objects in the sample images;
The processing unit is further used for performing recognition processing on the reference image to obtain an image recognition result;
each sample image of the sample group comprises images acquired by adopting different image acquisition modules aiming at a target object, and the sample images acquired by the different image acquisition modules have different imaging styles.
In an embodiment, the style conversion rule includes an image processing model, and the processing unit is further configured to:
Acquiring a sample set, wherein the sample set comprises a plurality of sample groups, each sample group at least comprises two sample images, and different sample images in each sample group are images acquired by different image acquisition modules on the same target object;
Training a target model through a sample set to obtain an image processing model;
Wherein the first content feature comprises: semantic feature information corresponding to a first sample image in a sample group, obtained by performing common feature extraction processing on the first sample image; the semantic feature information includes one or both of facial-feature information of the target object (i.e., information about the five facial features) and facial depth information of the target object.
In one embodiment, the processing unit is specifically configured to, when training the target model through the sample set to obtain the image processing model:
Acquiring semantic feature information from a first sample image included in a target sample group in a sample set through a first model; the first model is obtained by carrying out knowledge distillation processing based on the second model, and is used for carrying out common feature extraction processing of the sample image;
Performing feature extraction processing on a second sample image included in the target sample group through a second model to obtain detail features of the second sample image;
Training a target model according to semantic feature information of the first sample image and detail features of the second sample image and according to the test images set for the target sample group to obtain an image processing model.
In an embodiment, the processing unit is further configured to, prior to obtaining, by the first model, semantic feature information from a first sample image comprised by a target sample group in the sample set:
Invoking a student network to perform feature extraction processing on a first sample image in the target sample group to obtain a first content feature of the first sample image, and invoking a second model to perform feature extraction processing on the first sample image in the target sample group to obtain a second content feature of the first sample image;
Invoking a second model to perform feature extraction processing on a second sample image in the target sample group to obtain a second content feature of the second sample image;
And carrying out knowledge distillation processing aiming at the student network according to the second content characteristics of the first sample image and the second content characteristics of the second sample image obtained by the second model and the first content characteristics of the first sample image obtained by the student network so as to obtain a first model corresponding to the student network.
In one embodiment, the processing unit is specifically configured to, when performing the knowledge distillation process for the student network according to the second content feature of the first sample image obtained by the second model and the second content feature of the second sample image obtained by the student network, and according to the first content feature of the first sample image obtained by the student network:
taking the content characteristics which meet the characteristic consistency condition between the second content characteristics of the first sample image and the second content characteristics of the second sample image obtained by the second model as the reference content characteristics of the first sample image;
Determining target loss corresponding to the first sample image in the distillation process according to the reference content characteristics of the first sample image and the first content characteristics of the first sample image obtained by the student network;
And when the target loss meets the parameter adjustment condition, performing parameter adjustment on the student network to obtain a first model corresponding to the student network.
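For illustration only, the following sketch shows one possible reading of the feature-consistency selection and the resulting target loss; the element-wise tolerance, the MSE form of the distance and all function names are assumptions, not limitations of this embodiment.

```python
import torch
import torch.nn.functional as F

def select_reference_features(t_feat1, t_feat2, tol=0.1):
    # Assumed consistency condition: components of the second model's
    # (teacher's) features for the two sample images that are close to each
    # other are treated as the reference content features of the first image.
    mask = (t_feat1 - t_feat2).abs() < tol
    return t_feat1, mask

def distillation_target_loss(student_feat, t_feat1, t_feat2):
    # Target loss: the student's first content features should match the
    # reference content features on the consistent components only
    # (MSE is an assumed choice of distance).
    ref, mask = select_reference_features(t_feat1, t_feat2)
    return F.mse_loss(student_feat[mask], ref[mask])
```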
In one embodiment, the processing unit is specifically configured to, when determining the target loss corresponding to the first sample image during the distillation process according to the reference content feature of the first sample image and the first content feature of the first sample image obtained by the student network:
According to first content characteristics of a first sample image obtained by a student network, respectively carrying out category prediction processing according to different distillation temperatures T to obtain a first prediction result of the first sample image and a second prediction result of the first sample image;
performing category prediction processing through the second model based on the reference content features of the first sample image to obtain a soft label of the first sample image, and obtaining a hard label set for the first sample image;
determining a first prediction loss of the first sample image according to the first prediction result and the soft label, and determining a second prediction loss of the first sample image according to the second prediction result and the hard label;
And determining a target loss corresponding to the first sample image in the knowledge distillation process by using the first prediction loss and the second prediction loss.
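As a minimal, non-limiting sketch, the two prediction losses described above can be combined in the standard knowledge-distillation form below; the temperature value, the weighting factor alpha, and the KL/cross-entropy choices are assumptions.

```python
import torch.nn.functional as F

def kd_target_loss(student_logits, teacher_logits, hard_label, T=4.0, alpha=0.7):
    # First prediction: class prediction at distillation temperature T,
    # compared against the soft label produced by the second model.
    soft_label = F.softmax(teacher_logits / T, dim=-1)
    first_loss = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                          soft_label, reduction="batchmean") * (T * T)
    # Second prediction: class prediction at temperature 1, compared
    # against the hard label set for the first sample image.
    second_loss = F.cross_entropy(student_logits, hard_label)
    # Target loss used during the knowledge distillation process.
    return alpha * first_loss + (1.0 - alpha) * second_loss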
In one embodiment, the processing unit is specifically configured to, when training the target model according to the semantic feature information of the first sample image and the detail feature of the second sample image and according to the test image set for the target sample group to obtain the image processing model:
Invoking a target model, and generating a predicted image based on semantic feature information of the first sample image and detail features of the second sample image;
acquiring a first test image and a second test image which are set for a sample group, wherein the first test image is matched with the first sample image, and the second test image is matched with the second sample image;
Invoking a discrimination network to perform discrimination processing on the predicted image according to the first test image to obtain a discrimination result, wherein the discrimination result is used for indicating the authenticity of a target object in the predicted image;
Content consistency comparison is carried out on the predicted image and the second test image to obtain a comparison result, and the comparison result is used for indicating whether the image content between the predicted image and the second test image meets consistency conditions or not;
And updating the parameters of the target model according to the comparison result and the discrimination result to obtain an image processing model.
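A compressed, hypothetical PyTorch-style training step for this embodiment is sketched below; the hinge adversarial loss (suggested by Fig. 6a), the cosine content-consistency term, and all identifiers are assumptions rather than the claimed implementation.

```python
import torch
import torch.nn.functional as F

def train_step(generator, discriminator, second_model,
               sem_feat, detail_feat, test_img1, test_img2, g_opt, d_opt):
    # The target model (generator) produces the predicted image from the
    # semantic features of the first sample image and the detail features
    # of the second sample image.
    pred = generator(sem_feat, detail_feat)

    # Discrimination processing against the first test image (hinge loss assumed).
    d_loss = (F.relu(1.0 - discriminator(test_img1)).mean()
              + F.relu(1.0 + discriminator(pred.detach())).mean())
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()

    # Generator update: adversarial term plus content consistency with the
    # second test image, measured in the second model's feature space.
    adv_loss = -discriminator(pred).mean()
    with torch.no_grad():
        ref_feat = second_model(test_img2)
    consistency = 1.0 - F.cosine_similarity(second_model(pred), ref_feat, dim=-1).mean()
    g_loss = adv_loss + consistency
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()
    return d_loss.item(), g_loss.item()
```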
In one embodiment, the processing unit is specifically configured to, when performing content consistency comparison on the predicted image and the second test image to obtain a comparison result:
Invoking a second model to perform feature extraction processing on the predicted image to obtain content features of the predicted image, and invoking the second model to perform feature extraction processing on the second test image to obtain the content features of the second test image;
and carrying out consistency comparison on the content characteristics of the predicted image and the content characteristics of the second test image to obtain a comparison result.
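A minimal sketch of such a comparison, assuming cosine similarity with a fixed threshold stands in for the unspecified consistency condition:

```python
import torch
import torch.nn.functional as F

def content_consistency(second_model, predicted_img, test_img2, threshold=0.9):
    # Extract content features of both images with the second model and
    # compare them; the threshold value is an assumption.
    with torch.no_grad():
        f_pred = second_model(predicted_img)
        f_test = second_model(test_img2)
    similarity = F.cosine_similarity(f_pred, f_test, dim=-1).mean()
    return similarity.item() >= threshold   # comparison result
```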
In an embodiment, the processing unit is further configured to, before invoking the object model, generate the predicted image based on the semantic feature information of the first sample image and the detail features of the second sample image:
Singular value decomposition is carried out on parameters of each layer of network in the target model, so that a parameter decomposition result corresponding to each layer of network is obtained, wherein the parameter decomposition result comprises a singular value matrix;
and carrying out normalization processing on the singular value matrix included in the parameter decomposition result corresponding to each layer of network, and obtaining a normalized parameter matrix according to the normalized singular value matrix, wherein the normalized parameter matrix comprises normalized parameters corresponding to the corresponding layer of network.
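Under the assumption that "normalization" means dividing by the largest singular value (akin to spectral normalization), the per-layer procedure might look like this sketch:

```python
import torch

def normalize_layer_weights(model):
    # Singular value decomposition of each layer's parameters, followed by
    # normalization of the singular value matrix and reconstruction of a
    # normalized parameter matrix (division by the largest singular value
    # is an assumed choice of normalization).
    for param in model.parameters():
        if param.dim() < 2:                      # skip biases / 1-D parameters
            continue
        w = param.data.reshape(param.shape[0], -1)
        u, s, vh = torch.linalg.svd(w, full_matrices=False)
        s_norm = s / s.max()
        w_norm = u @ torch.diag(s_norm) @ vh     # normalized parameter matrix
        param.data.copy_(w_norm.reshape(param.shape))
```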
In an embodiment, the processing unit is further configured to, after invoking the object model, generate the predicted image based on the semantic feature information of the first sample image and the detail features of the second sample image:
determining generation loss of the predicted image according to the predicted image, the weight parameters of the target model and the first sample image, and determining a regularization term based on parameters included in the target model;
Parameters of the target model are updated based on the generation loss of the predicted image and the regularization term.
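For illustration, one hypothetical form of this update is given below; the L1 generation loss against the first sample image and the L2 regularization term over the model parameters are assumptions, since the embodiment does not fix their exact form.

```python
import torch.nn.functional as F

def update_with_regularization(generator, predicted_img, first_sample_img,
                               optimizer, reg_weight=1e-4):
    # Generation loss determined from the predicted image and the first
    # sample image (L1 distance assumed here).
    gen_loss = F.l1_loss(predicted_img, first_sample_img)
    # Regularization term based on the parameters of the target model
    # (an L2 penalty is assumed).
    reg_term = sum(p.pow(2).sum() for p in generator.parameters())
    loss = gen_loss + reg_weight * reg_term
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss
```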
In one embodiment, the target model is further added with random noise vectors for training in the training process, and the random noise vectors are generated according to random numbers obtained by sampling from normal distribution; the processing unit is further configured to:
the random noise vector is truncated by adopting a truncation parameter to obtain a truncated noise vector; the truncated noise vector comprises a plurality of random noise values, and each random noise value is in a preset range;
the truncated noise vector is used to generate a predicted image.
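A minimal sketch of this truncation step, assuming the "preset range" is implemented by clamping standard-normal samples to [-truncation, truncation]:

```python
import torch

def truncated_noise(batch_size, dim, truncation=0.5):
    # Sample from a standard normal distribution and clamp every value into
    # the preset range; "truncation" plays the role of the truncation parameter.
    z = torch.randn(batch_size, dim)
    return torch.clamp(z, -truncation, truncation)

# The truncated noise vector is then used together with the content features
# when the target model generates a predicted image, e.g. (hypothetical call):
# predicted = generator(sem_feat, detail_feat, truncated_noise(sem_feat.size(0), 128))
```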
In one embodiment, the image recognition result is obtained by calling an image recognition model to perform recognition processing on the reference image, and the image recognition model is obtained by training with images having the second imaging style.
In still another aspect, an embodiment of the present application provides an image processing apparatus including an input interface and an output interface, the image processing apparatus further including: a processor and a computer storage medium;
Wherein the computer storage medium stores one or more instructions, and the processor is adapted to load and execute the one or more instructions to perform the image processing method mentioned above.
In yet another aspect, embodiments of the present application provide a computer storage medium storing one or more instructions adapted to be loaded by a processor and to perform the above-mentioned image processing method.
In yet another aspect, embodiments of the present application provide a computer program product comprising a computer program; the computer program, when executed by a processor, implements the image processing method mentioned above.
In the embodiment of the application, a target image to be processed can be acquired, the target image has a first imaging style, then style conversion processing can be performed on the target image according to style conversion rules to obtain a reference image corresponding to the target image, the reference image has a second imaging style, and then the reference image can be identified through identification rules related to the second imaging style or a model trained based on sample images of the second imaging style to obtain an image identification result. Therefore, through the style conversion processing, the original imaging style can be converted into a new imaging style, and a new image is obtained to perform subsequent recognition processing, so that recognition based on the target image is completed. Thus, the image of any imaging style can be effectively identified. The style conversion rule is obtained by training and learning based on first content features of sample images of the sample group and second content features of the sample images of the sample group, wherein the first content features are used for describing common image object features of target objects in the sample images among different sample images, the second content features are used for describing detail features of the target objects in the sample images, each sample image of the sample group comprises images acquired by adopting different image acquisition modules aiming at the target objects, and the sample images acquired by the different image acquisition modules have different imaging styles. Therefore, in the process of forming the style conversion rule, sample images of different imaging styles formed by different image acquisition modules aiming at a target object can be referred to, common features and detail features among the sample images of different imaging styles are learned, and further, a reliable style conversion rule is obtained so as to support effective conversion of the target image, key information in the target image is reserved while the style is converted, and therefore more accurate identification can be performed.
Drawings
FIG. 1a is a block diagram of an image processing system according to an embodiment of the present application;
FIG. 1b is a schematic view of an image processing scenario according to an embodiment of the present application;
fig. 2 is a schematic flow chart of an image processing method according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a target image and a reference image according to an embodiment of the present application;
FIG. 4 is a flowchart of another image processing method according to an embodiment of the present application;
FIG. 5a is a block flow diagram of a training student network provided by an embodiment of the present application;
FIG. 5b is a block flow diagram of knowledge distillation provided by an embodiment of the application;
FIG. 5c is a schematic diagram of a target model according to an embodiment of the present application;
FIG. 5d is a schematic structural diagram of a module in a target model according to an embodiment of the present application;
fig. 5e is a schematic structural diagram of a residual block in a discrimination network according to an embodiment of the present application;
FIG. 6a is a schematic diagram illustrating a relationship between hinge loss and classification results according to an embodiment of the present application;
FIG. 6b is a training frame diagram of a target model provided by an embodiment of the present application;
FIG. 7 is a block flow diagram of an image processing method according to an embodiment of the present application;
fig. 8 is a schematic structural view of an image processing apparatus according to an embodiment of the present application;
Fig. 9 is a schematic structural view of an image processing apparatus according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all embodiments of the application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
The application provides an image processing scheme, which relates to an image processing system, an image processing method and related equipment. The first imaging style can be any imaging style: a target image of any imaging style can be converted into a reference image with a unified imaging style, and the reference image is then recognized directly, so the influence of the original imaging style on recognition is avoided and recognition effectiveness is improved. In particular, when an image recognition model (referred to simply as a recognition model) is invoked for recognition processing, no matter how the image acquisition module that captures the target image is upgraded, the recognition model does not need to be retrained; instead, the target image is converted through style conversion processing into a reference image that the recognition model can recognize more accurately, reducing cost and improving efficiency.
The architecture of the image processing system according to the embodiment of the present application will be described below with reference to the accompanying drawings.
Fig. 1a is a schematic diagram of an image processing system according to an embodiment of the application. As shown in fig. 1a, the image processing system comprises an image acquisition device 101, an image processing device 102 and a database 103. The image acquisition device 101 and the image processing device 102 establish a communication connection with each other in a wired or wireless manner, and the database 103 may likewise establish wired or wireless communication connections with the image acquisition device 101 and the image processing device 102, respectively.
The image acquisition device 101 may be a computer device with an image acquisition function, and may include an image acquisition module for capturing the images needed for business processing in a business scenario, for example face images in a face payment scenario. The image acquisition module is a hardware module, for example: a camera connected to a desktop computer, a shooting module built into a terminal device, a digital single-lens reflex camera, and the like. In one possible arrangement, the image acquisition device and the image processing device may be different hardware modules of the same computer device, for example the image acquisition device is the shooting module of a terminal device and the image processing device is the processor of that terminal device; or they may be independent computer devices, for example the image acquisition device is a terminal device and the image processing device is a server, which is not limited by the application. Images acquired by the image acquisition device 101 may be stored in the database 103 as sample images for training, or may be processed in real time as target images to be processed. It can be appreciated that images of the same target object acquired by different image acquisition modules differ in imaging style, which is reflected, for example, in sharpness, exposure brightness, color cast and distortion.
The image processing apparatus 102 is operable to execute the image processing method of the present application, and the flow involved in the executed image processing method roughly includes: first, a target image to be processed may be acquired, the target image having a first imaging style. In one implementation, the image processing apparatus 102 may receive the image acquired and transmitted by the image acquisition apparatus 101 as a target image, or the image processing apparatus 102 may acquire a target image to be processed from the image acquisition apparatus 101. In another implementation, the image processing apparatus 102 may acquire a specified image from the database 103 as a target image to be processed. The target image includes, but is not limited to, according to whether biological information is present in the image: a biometric image refers to an image including a biometric feature, such as a face image, and a non-biometric image refers to an image not including a biometric feature, such as an image including only an item.
Then, style conversion processing is performed on the target image according to the style conversion rule to obtain a reference image corresponding to the target image, wherein the reference image has a second imaging style. The second imaging style is a baseline style, and the reference image may also be referred to as a baseline image (i.e., an image having the baseline style). The baseline style is the unified style into which target images of different imaging styles need to be converted. For example, the image processing device 102 may acquire an image Q1 captured by a mipi camera and an image Q2 captured by a usb-lite camera, and perform style conversion processing on both to obtain a reference image corresponding to Q1 and a reference image corresponding to Q2, where the two reference images have the same imaging style (i.e., the second imaging style). It can be seen that a target image of any imaging style can be converted into a corresponding reference image having the second imaging style. The style conversion rule is obtained by training and learning based on first content features of sample images of a sample group and second content features of the sample images of the sample group, wherein the first content features are used for describing image object features of the target object that are common across different sample images, and the second content features are used for describing detail features of the target object in a sample image; each sample group comprises images acquired for the same target object by different image acquisition modules, and the sample images acquired by different image acquisition modules have different imaging styles. In this way, the style conversion rule is obtained by training and learning on sample images with different imaging styles, and during that training it can be learned, based on the different content features, how to extract common features and detail features so as to better process the target image. The style conversion rule can be understood as a multi-style conversion algorithm through which images with different imaging styles can be converted into images with the baseline style so that they can be recognized effectively.
Then, the image processing apparatus 102 may perform recognition processing on the converted reference image to obtain an image recognition result. The image recognition result may be used to indicate the business processing result obtained by recognition based on the target image, and the image recognition result matches the business scenario. For example, if the target image is a face image in a payment scenario, the image recognition result may be a payment-success or payment-failure result obtained through face recognition. In one possible embodiment, the image processing device 102 may invoke an image recognition model to perform recognition processing on the reference image, where the image recognition model is a recognition model trained from sample images having the second imaging style. In this way, if the image acquisition module in the image acquisition device 101 is upgraded, there is no need to re-collect images of the new style or retrain the recognition model; instead, the target image to be processed is converted into an image of the baseline style supported by the image recognition model and recognized directly, which avoids the labor and time cost of retraining the image recognition model, reduces cost and improves recognition efficiency. Experiments show that with this scheme, when the image acquisition module is upgraded, the time originally spent re-collecting training samples and retraining the model is replaced by the time needed for image conversion: the overall processing flow can be reduced from the original 30-45 days to about 12 milliseconds (ms), greatly improving image recognition efficiency and saving the labor and time costs of recognition.
According to the image processing flow, in the process of carrying out recognition processing based on the target image, the target image can be firstly converted into the base line image according to the style conversion rule, and the base line image abandons the original imaging style and is unified into the identifiable base line style, so that the recognition result related to the target image can be obtained by recognizing the base line image with the base line style, the effective recognition of the image can be realized, and the image recognition efficiency can be improved.
The image capturing device 101 may be a terminal device, and the image processing device 102 may include any one or both of a terminal device and a server, and the terminal device includes, but is not limited to: the application is not limited to smart phones, tablet computers, intelligent wearable devices, intelligent voice interaction devices, intelligent home appliances, personal computers, vehicle-mounted terminals, intelligent cameras and other devices. The present application is not limited with respect to the number of terminal devices. The server may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDNs (Content Delivery Network, content delivery networks), basic cloud computing services such as big data and artificial intelligent platforms, but is not limited thereto. The present application is not limited with respect to the number of servers.
The image processing system further comprises a database 103. In terms of deployment location, the database 103 can be a local database of the image processing device 102 or a cloud database that can establish a connection with the image processing device 102. In terms of access attributes, the database 103 may be a public database, i.e., a database open to all image processing devices, or a private database, i.e., a database open only to a particular image processing device, such as image processing device 102. The database 103 may be used to provide data support for the image processing process, including but not limited to: providing the target image to be processed, providing sample images for training, and providing the comparison images required during recognition; it can also be used to store image recognition results. For example, in a face recognition scenario, a comparison may be performed between a face image in the database 103 and the recognized face image to determine whether it is a face image stored in the database 103, a business processing result may then be determined based on recognition of the face image, and the business processing result may be stored in the database 103.
The image processing system provided by the embodiment of the application can be applied to at least one of the following business scenarios: a biometric payment scenario, a biometric login scenario, a biometric encyclopedia scenario, and so on. Illustratively, in connection with the system architecture diagram shown in FIG. 1a, when the image acquisition device and the image processing device are the same computer device, an exemplary image processing scenario may be as shown in FIG. 1b. In a biometric payment scenario, payment processing can be performed based on face image recognition: the face image acquired by the terminal device can first be converted into a baseline image, the baseline image is then recognized directly to obtain an image recognition result, and the image recognition result can be stored in a database. When the image recognition result indicates that recognition succeeded, payment by face image succeeds; when the image recognition result indicates that recognition failed, payment by face image fails. With the image processing method provided by the application, even if the hardware module (such as the image acquisition module) used to acquire biological information (such as face images) is replaced and the acquired images therefore have a different imaging style, there is no need to collect image data of the new style as training samples, to retrain the image recognition model, to have the background service annotate data, or to update and redeploy the image recognition model; instead, the target image acquired by the newly upgraded hardware module is converted into a reference image and recognized directly, which greatly reduces the consumption of time, labor and money, reducing cost and improving efficiency. Illustratively, experiments show that when the mipi cameras used to collect images in face-swipe business scenarios are replaced by usb-lite cameras, the face-swipe interception rate for users wearing masks can be effectively reduced and user experience improved.
Next, an image processing method provided by an embodiment of the present application will be described.
Fig. 2 is a schematic flow chart of an image processing method according to an embodiment of the application. The image processing method may be performed by an image processing apparatus (such as the image processing apparatus 102 in the image processing system shown in fig. 1 a), and the image processing method may include what is described in the following S201 to S203.
S201, acquiring a target image to be processed, wherein the target image has a first imaging style.
The target image to be processed may be an image acquired by an image acquisition module or an image processed by an image processing tool (e.g., a PS tool). The image acquisition module is a hardware module with an image acquisition function and can shoot a target object in a corresponding service scene to obtain an image. In one implementation, the image acquisition module may acquire an image based on the received image acquisition instruction, and take the acquired image as the target image to be processed. For example, in a payment service scenario, if a user selects face-brushing payment, the smart phone may receive an image acquisition instruction and shoot an image containing a face through the camera to identify the face image for payment verification. In a plant identification scenario, a smart phone can shoot a plant in the real world through a camera to identify the category of the plant. The image processing device may acquire the target image to be processed from the image acquisition module, or the image acquisition module may send the acquired target image to the image processing device in real time for subsequent processing.
When the target image is an image acquired by the image acquisition module, the target image has a first imaging style corresponding to the image acquisition module. Based on the hardware performance of the image acquisition module and the algorithm, the image acquisition module has corresponding imaging characteristics when forming an image, so that the image acquisition module has unique imaging style, the imaging style of the image can be understood to be the style of image formation, and the imaging style corresponding to the image acquisition module can reflect the reality degree of the image acquisition module to the shot image. Imaging styles corresponding to different image acquisition modules are different, for example: some images captured by the cameras can reflect the colors in reality truly and objectively, but some images captured by the cameras are biased to a certain color tone and fail to reflect the colors in reality more truly.
S202, performing style conversion processing on the target image according to the style conversion rule to obtain a reference image corresponding to the target image, wherein the reference image has a second imaging style.
Under the corresponding service scene, the identification processing is required based on the acquired target image, if the identification processing is directly performed on the target image, the target image may not be effectively identified due to the diversification of imaging styles corresponding to different image acquisition modules, so that the service processing result is affected. Therefore, the application can firstly perform style conversion processing on the target image according to the style conversion rule, and the general logic of the style conversion processing can be as follows: extracting feature information of the target image according to the style conversion rule, wherein the feature information is key information to be reserved and is not influenced by imaging styles, and generating an image according to the feature information to obtain a reference image. And discarding the original first imaging style of the target image through style conversion processing, and obtaining a reference image with a second imaging style. In this way, the reference image has a different imaging style than the target image, but the reference image retains some information that is highly consistent with the presence of the target image. For example, a target image and a reference image as shown in fig. 3. The target image is a face image, and after style conversion processing, the reference image obtained by conversion and the face image have highly uniform face characteristics, including but not limited to: facial features that are highly similar in terms of facial position, facial size, facial shape, facial depth, etc. That is, certain features of the same target object can be highly uniform regardless of changes in imaging style.
It can be appreciated that, for images of different imaging styles, style conversion can be performed in the manner described above to form a reference image of uniform imaging style. For example, the image processing device may acquire images acquired by different image acquisition modules as target images, for example, 3 images respectively taken by cameras of 3 terminal devices, and for these target images, all the images may be uniformly converted into reference images having the second imaging style. That is, the reference images corresponding to the target images of different imaging styles all have the same imaging style (i.e., the second imaging style, or may be referred to as the baseline style), and the resulting reference image may be referred to as the baseline image. It is noted that the reference images corresponding to different target images have a highly uniform semantic correspondence with the corresponding target images, so that the baseline image can be used as a recognition reference to represent the recognition of the target object, and the result of the business processing based on the recognition of the target object can be determined based on the image recognition result obtained by recognizing the baseline image.
The style conversion rule is obtained by training and learning based on the first content characteristics of the sample images of the sample group and the second content characteristics of the sample images of the sample group. Each sample image of the sample group comprises images acquired by adopting different image acquisition modules aiming at a target object, and the sample images acquired by the different image acquisition modules have different imaging styles.
The target object may be a living or non-living object in the real world, such as a human face, a plant or an animal, or an object in the virtual world, such as a virtual character or a virtual plant, or the like. The plurality of (i.e., at least two) sample images may form a sample group, and different sample images in the sample group are images with different imaging styles acquired for the same target object, for example, face images acquired by using different cameras for photographing the same face. In the process of constructing the sample group, the sample image is formed by shooting through the image acquisition module, so that the sample for training can be quickly acquired.
The first content features are used to describe common image object features of the target object in the sample image between different sample images, and the second content features are used to describe detail features of the target object in the sample image. The common image object features refer to object features common to the same target object in sample images with different imaging styles, and the common image object features corresponding to the target object do not change with the imaging styles. For face images acquired by different image acquisition modules for the same face, the images have high uniform semantic correspondence, so that common image object features, such as uniform face features of facial features, facial depth features and the like in the face images of users, are obtained. Common image object features may be understood as features describing a highly uniform semantic correspondence between different images acquired by different image acquisition devices for the same target object.
S203, performing recognition processing on the reference image to obtain an image recognition result.
The image recognition result is a recognition result of the service scene matching corresponding to the target image. The image recognition result may be used to indicate a result of a business process that is recognized based on the target image. For example, if the image recognition result indicates that the face image is the same object as the object in the face image pre-entered under the corresponding payment account, then the payment can be successfully made based on the recognition of the face image.
In one embodiment, when the image processing device performs recognition processing on the reference image, the image recognition model may be invoked to perform recognition processing on the reference image, so as to obtain an image recognition result. The image recognition model is a recognition model obtained through image training with the second imaging style and can be used for recognizing the image with the second imaging style. The image recognition result can be obtained more accurately and rapidly by calling the recognition model to perform recognition processing, and the image processing of any imaging style can be compatible by converting the image into the reference image of the second imaging style in advance under the service scene needing to be recognized based on the target image, so that the effective recognition based on the target image is realized. In another embodiment, when the image processing device performs recognition processing on the reference image, the image processing device may also perform recognition processing on the reference image through a recognition rule related to the second imaging style, so as to obtain an image recognition result. The recognition rule related to the second imaging style supports accurate and effective recognition of the image of the second imaging style, and therefore the effectiveness and accuracy of the image recognition result can be guaranteed.
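Putting S201-S203 together, a minimal inference sketch might look as follows; `style_converter` and `recognizer` are placeholder names for the trained image processing model and the image recognition model respectively.

```python
import torch

def process_image(target_image, style_converter, recognizer):
    # S202: style conversion processing - the target image with the first
    # imaging style is converted into a reference image with the second
    # (baseline) imaging style.
    with torch.no_grad():
        reference_image = style_converter(target_image)
        # S203: recognition processing on the reference image.
        image_recognition_result = recognizer(reference_image)
    return image_recognition_result
```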
According to the image processing method provided by the embodiment of the application, the style conversion processing is carried out on the target image to be processed through the style conversion rule, so that the original imaging style can be converted into a new imaging style, the new image is obtained to carry out subsequent recognition processing, and further recognition based on the target image is completed. Thus, the image of any imaging style can be effectively identified. In the formation process of the style conversion rule, sample images of different imaging styles formed by different image acquisition modules aiming at a target object can be referred to, common features and detail features among the sample images of different imaging styles are learned, and further, a reliable style conversion rule is obtained so as to support effective conversion of the target image, key information in the target image is reserved while the style is converted, and therefore more accurate identification can be performed.
Fig. 4 is a flowchart of another image processing method according to an embodiment of the application. The image processing method may be performed by an image processing apparatus (such as the image processing apparatus 102 in the image processing system shown in fig. 1 a). The embodiment mainly describes a training learning process of style conversion rules. In one embodiment, the style conversion rules include an image processing model having image generation capabilities that can be used to generate a completely new style of reference image. The training procedure for the image processing model includes what is described in the following S401 to S402.
S401, acquiring a sample set.
The sample set comprises a plurality of sample groups, each sample group comprises at least two sample images, and the different sample images in each sample group are images acquired by different image acquisition modules of the same target object. Based on the composition of the sample group, the aforementioned first content features include: semantic feature information corresponding to a first sample image in a sample group, obtained by performing common feature extraction processing on the first sample image. The semantic feature information includes one or both of facial-feature information of the target object and facial depth information of the target object. The facial-feature information of the target object includes, but is not limited to: feature information describing the positions of the target object's facial features (eyes, eyebrows, nose, mouth and ears), feature information describing their sizes, feature information describing their shapes, and the like; the facial depth information of the target object is used to describe the facial depth of the target object.
For example, a sample group may include two face images of the same face acquired by two image acquisition devices in different scenes; the two face images have different imaging styles, the facial expressions differ, and the background and the appearance (such as hairstyle and clothing) differ, but the first content feature is semantic feature information common to the two face images, such as feature information on the positions and shapes of the facial features. Through images of different imaging styles acquired by different image acquisition modules, unified semantic feature information of a target object under different imaging styles can be learned during model training, which improves the accuracy of the model when performing style conversion on images of multiple styles. In addition, based on the acquisition sources of the sample images, one sample group in the sample set can correspond to one target object and different sample groups can correspond to different target objects; by using images of different target objects, more characteristics of target objects can be learned during model training, improving the generalization of the model.
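For concreteness, the sample-set organisation described above can be represented roughly as follows; the field names and the assumption of exactly two acquisition modules per group are illustrative only.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class SampleGroup:
    # One sample group: images of the same target object captured by
    # different image acquisition modules, hence with different imaging styles.
    object_id: str
    first_sample: str    # image path from acquisition module A (first imaging style)
    second_sample: str   # image path from acquisition module B (second imaging style)

def build_sample_set(pairs: List[Tuple[str, str, str]]) -> List[SampleGroup]:
    # Each tuple is (object_id, path_from_module_a, path_from_module_b);
    # actual image loading and augmentation are omitted from this sketch.
    return [SampleGroup(oid, a, b) for oid, a, b in pairs]
```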
S402, training a target model through a sample set to obtain an image processing model.
The target model can be understood as an initial image processing model that has an initial image generation capability, illustratively a generative adversarial network such as a GAN or BigGAN model that can be used to generate images. Given only this initial image generation capability, the images generated by the target model deviate to some extent in how well they reproduce the second imaging style, the common features, and so on, so the target model needs to be trained with sample images of different imaging styles. It can be understood that training the target model is an iterative process in which its parameters are continuously updated, and the image processing model is obtained when a convergence condition is satisfied. The convergence conditions include, but are not limited to: the change in training loss across consecutive iterations is within a loss threshold, the number of iterations reaches a preset number, the training loss reaches a preset value, and so on.
The image processing model and the target model may be identical in model structure and model parameters (e.g., weight parameters) may be different. The image processing model has better image generation capability than the target model and can be used to better transform the target image to enhance the generation quality of the reference image. In one implementation, the image processing apparatus is capable of training the target model in units of sample sets including sample groups, and in particular, the target model may be trained based on a first content feature determined by a first sample image in a sample group and a second content feature determined by a second sample image in the same sample group.
In one embodiment, the image processing apparatus may specifically execute the contents shown in (1) to (3) below when training the object model by the sample set.
(1) Semantic feature information is obtained from a first sample image included in a target sample group in a sample set through a first model.
The first model is obtained by carrying out knowledge distillation processing based on the second model, and the first model is used for carrying out common feature extraction processing of the sample image. Knowledge distillation is one way to obtain efficient small-scale networks, the main idea of which is to migrate "knowledge" in a complex teacher model (also called teacher's network) with strong learning ability into a simple student model (also called student's network). That is, the first model is a trained model with the ability to extract common features of the sample image. The target sample set is a sample set selected from a sample set, and may include at least a first sample image and a second sample image. The first model may be called to perform a common feature extraction process on the first sample image, so as to obtain semantic feature information of the first sample image, where the semantic feature information may be understood as feature information shared between the first sample image and the second sample image at a semantic level, and may be used to describe image content in the first sample image and image content in the second sample image that meets a consistency condition, and represent a highly unified semantic correspondence between the first sample image and the second sample image, for example, the sample image is a face image, where the semantic correspondence includes: facial location, facial depth, and the like.
In one embodiment, before performing step (1) above, knowledge distillation processing may be performed based on the second model to obtain the first model, which specifically includes what is shown as ①-③ below.
① And calling the student network to perform feature extraction processing on the first sample image in the target sample group to obtain a first content feature of the first sample image, and calling the second model to perform feature extraction processing on the first sample image in the target sample group to obtain a second content feature of the first sample image.
In a specific training process, the second model in the application can be a teacher model (also called a teacher network). The second model is a pre-trained model and, compared with the student network (or the first model), is a more complex model whose parameter quantity far exceeds that of the student network (or the first model), so that the second model has better class prediction capability and can be used for guiding the training direction of the student network. The first sample image can be input into the student network and the teacher network respectively, and the first content features of the first sample image output by the student network and the second content features of the first sample image output by the teacher network are obtained through their respective feature extraction processing. The first content features of the first sample image obtained through the student network include semantic feature information; however, compared with the semantic feature information of the first sample image obtained through the trained first model, this semantic feature information may not yet be comprehensive and accurate enough with respect to the common image object features of the target object among different sample images. It improves as the parameters of the student network are continuously adjusted, so that by the time the first model is obtained, more accurate semantic feature information can be extracted from the first sample image by the first model. The second content feature of the first sample image is a high-level content feature that can be used to describe detail features of the target object in the first sample image, such as the skin texture of a face in a face image.
② And calling a second model to perform feature extraction processing on a second sample image in the target sample group to obtain a second content feature of the second sample image.
Because the second sample image and the first sample image are sample images belonging to the same sample group, and the two sample images are images acquired by different image acquisition devices on the same target object, the second sample image can be used as a reference of the first sample image, and the learning direction of the student network is guided through the highly uniform semantic correspondence of the sample images. The second content features of the second sample image are high-level content features, and can be used for describing detail features of the target object in the second sample image. It should be noted that the second model for feature extraction of the second sample image and the second model for feature extraction of the first sample image may be two mutually independent teacher networks having the same structure and the same parameters, or may be the same teacher network.
③ And carrying out knowledge distillation processing aiming at the student network according to the second content characteristics of the first sample image and the second content characteristics of the second sample image obtained by the second model and the first content characteristics of the first sample image obtained by the student network so as to obtain a first model corresponding to the student network.
The second content features of the first sample image and the second content features of the second sample image obtained through the second model can be used as guiding references of the student network so as to abandon the original imaging style of the first sample image in the training process and pay attention to the actual image content in the first sample image. The first model is a trained student network obtained through a knowledge distillation mode, knowledge possessed by the second model is taught to the student network in the knowledge distillation processing process, so that the first model obtained through training can better extract common semantic feature information, and a semantic corresponding relation with the second sample image is better obtained.
In the above ①-③, the first content features of the first sample image are extracted through the student network, the second content features of the first sample image and of the second sample image are respectively extracted through the second model, and knowledge distillation for the student network is performed based on these content features, so as to obtain a first model with better feature extraction capability. The content features extracted by the second model from sample images of different imaging styles corresponding to the same target object can be used to guide the training of the student network, so that the first content features are extracted more accurately; noise introduced during image generation by the differences between imaging styles can be removed through knowledge distillation, and the highly unified semantic correspondence is distilled out as low-level content to serve as prior knowledge for subsequently generated images.
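As a compact sketch of steps ① and ② above (assuming the student and teacher networks are PyTorch feature extractors; the function and argument names are illustrative), the content features used for distillation could be collected as follows:

```python
import torch

def collect_distillation_features(student, teacher, first_sample, second_sample):
    """Gather the content features used in the knowledge distillation steps.

    student / teacher: nn.Module feature extractors (illustrative names).
    first_sample / second_sample: image tensors from the same sample group.
    """
    # Step 1: first content feature (student) and second content feature (teacher)
    #         of the first sample image.
    f1_student = student(first_sample)
    with torch.no_grad():                  # the teacher is pre-trained and kept fixed
        f1_teacher = teacher(first_sample)
        # Step 2: second content feature of the second sample image.
        f2_teacher = teacher(second_sample)
    # Step 3 then performs knowledge distillation for the student network based on
    # (f1_teacher, f2_teacher) as the guiding reference and f1_student.
    return f1_student, f1_teacher, f2_teacher
```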
It can be appreciated that in the training process, in addition to training the student network with the target sample group in the sample set, knowledge distillation processing for the student network can be performed according to the flow shown in ①-③ for the other sample groups in the sample set, so as to continuously train the student network; when the trained student network has feature extraction capability that meets expectations, it can be used as the first model.
In a possible implementation manner, when the image processing device executes the content shown in ③, knowledge distillation may be specifically performed based on the content features of each sample image as follows:
First, a content feature satisfying a feature consistency condition between a second content feature of the first sample image obtained by the second model and a second content feature of the second sample image may be used as a reference content feature of the first sample image.
Specifically, the image processing apparatus may compare, for consistency, the content features of the two sample images obtained by the second model, and then use the content features satisfying a consistency condition as the reference content features of the first sample image. Satisfying the consistency condition may include: the feature similarity being larger than a preset similarity threshold, the two content features being completely consistent, and the like. The reference content features of the first sample image are the common content features obtained after screening the second content features of the first sample image, based on the second content features of the second sample image, through consistency comparison. The learning of the student network can be guided based on the reference content features, so that the semantic feature information common to the images can be better extracted and used as prior information for image generation.
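For illustration only (the feature shapes, the per-channel comparison granularity and the similarity threshold below are assumptions), the consistency screening could be sketched as a cosine-similarity mask over the teacher features:

```python
import torch
import torch.nn.functional as F

def reference_content_features(f1_teacher, f2_teacher, sim_threshold=0.9):
    """Screen the first image's teacher features, keeping the channels that are
    consistent with the second image's teacher features (illustrative sketch)."""
    # Assumed shape: (B, C, H, W) teacher feature maps.
    b, c = f1_teacher.shape[:2]
    # Per-channel cosine similarity between the two feature maps.
    sim = F.cosine_similarity(f1_teacher.reshape(b, c, -1),
                              f2_teacher.reshape(b, c, -1), dim=-1)      # (B, C)
    mask = (sim > sim_threshold).to(f1_teacher.dtype).view(b, c, 1, 1)   # consistency condition
    # Reference content features: the screened (common) content features.
    return f1_teacher * mask
```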
Then, a target loss corresponding to the first sample image during the knowledge distillation process is determined based on the reference content features of the first sample image and the first content features of the first sample image obtained by the student network.
In one embodiment, the specific implementation of the determination of the target loss may include the following: according to first content characteristics of a first sample image obtained by a student network, respectively carrying out category prediction processing according to different distillation temperatures T to obtain a first prediction result of the first sample image and a second prediction result of the first sample image; performing category prediction processing based on the reference content characteristics of the first sample image through the second model to obtain a soft tag of the first sample image, and obtaining a hard tag set by the first sample image; determining a first prediction loss of the first sample image according to the first prediction result and the soft tag, and determining a second prediction loss of the first sample image according to the second prediction result and the hard tag; and determining a target loss corresponding to the first sample image in the knowledge distillation process by using the first predicted loss and the second predicted loss.
Specifically, class prediction processing may be performed on the first content features of the first sample image obtained by the student network through a classifier, where the classifier refers to a network with normalization (softmax) capability, and different prediction results may be output according to different distillation temperatures T. Illustratively, the distillation temperature T=t is used to output the first prediction result and the distillation temperature T=1 is used to output the second prediction result; when T is 1, the class prediction uses the original softmax function. The class prediction performed by the second model based on the second content features (reference content features) of the first sample image uses the same distillation temperature as that used for the first prediction result, e.g. T=t. The first prediction result (Soft prediction) obtained through the class prediction processing may be used to indicate the probability that the target object in the first sample image belongs to each class, and the second prediction result (Hard prediction) may be used to indicate whether the target object in the first sample image is of the target class. Soft labels are probability labels that describe the classification probabilities of the target object in the first sample image under each class, and Hard labels are deterministic labels that describe which class the target object in the first sample image should belong to and which it should not. Illustratively, if the first sample image is a face image, a soft label predicted by the second model of [0.8, 0.2] indicates that the face in the face image is the face of user U1 with a probability of 0.8 and the face of user U2 with a probability of 0.2, and a hard label of [1, 0] indicates that the face in the face image is the face of user U1 and not the face of user U2.
A loss calculation based on the first prediction result and the soft label of the first sample image yields a first prediction loss (or distillation loss) of the first sample image, and a loss calculation based on the second prediction result and the hard label of the first sample image yields a second prediction loss (or student loss) of the first sample image. In the loss calculation process, the corresponding loss value can be calculated with a loss function such as cross-entropy loss or KL divergence, and finally the obtained first prediction loss and second prediction loss can be weighted and summed to obtain the target loss of the first sample image.
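The target loss described above might be computed as in the following sketch; the classifier logits, the temperature value T and the weighting coefficient alpha are assumptions made for illustration:

```python
import torch.nn.functional as F

def distillation_target_loss(student_logits, teacher_logits, hard_label,
                             T=4.0, alpha=0.7):
    """Weighted sum of the distillation loss (soft label, temperature T) and the
    student loss (hard label, temperature 1); T and alpha are illustrative values.
    hard_label: ground-truth class indices (index form of the one-hot hard label)."""
    # First prediction result: class prediction at distillation temperature T.
    soft_pred = F.log_softmax(student_logits / T, dim=-1)
    # Soft label produced by the second model (teacher) at the same temperature T.
    soft_label = F.softmax(teacher_logits / T, dim=-1)
    # First prediction loss (distillation loss), here via KL divergence.
    distill_loss = F.kl_div(soft_pred, soft_label, reduction="batchmean") * (T * T)
    # Second prediction result vs. hard label: second prediction loss (student loss).
    student_loss = F.cross_entropy(student_logits, hard_label)
    # Target loss: weighted sum of the two prediction losses.
    return alpha * distill_loss + (1.0 - alpha) * student_loss
```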
In the method for determining the target loss, the learning of the small model (corresponding to the student network) is guided through the two types of labels, namely the soft label and the hard label, so that the knowledge of the large model (corresponding to the second model) can be better captured. The soft labels embody the richness of the knowledge, so that the student network can learn not only the knowledge of the characteristic information, but also the knowledge of the information between classes, and further the student network learns the generalization capability of the second model.
Finally, when the target loss meets the parameter adjustment condition, parameter adjustment is carried out on the student network so as to obtain a first model corresponding to the student network.
In a possible embodiment, the parameter adjustment condition refers to a condition that is required to be reached to adjust the parameters of the student network, and the target loss meeting the parameter adjustment condition may refer to any one of the following: the target loss is larger than a preset loss threshold, and the difference between the target loss obtained by the current calculation and the target loss obtained by the previous iteration is larger than a preset difference threshold. If the target loss meets the parameter adjustment condition, indicating that the learning of the student network does not achieve the desired training effect and does not have the desired feature extraction capability, the parameters of the student network may be adjusted to obtain an adjusted student network (including the adjusted parameters), and the above-described steps ①-③ are repeatedly performed based on the adjusted student network. Through continuous iterative training, parameters of the student network can be updated continuously, and finally a first model corresponding to the student network is obtained.
It can be understood that when the target loss does not meet the parameter adjustment condition, it is indicated that the student network can achieve the expected training effect and has better feature extraction capability, and the student network obtained by the current training can be directly used as the first model without parameter adjustment.
According to the knowledge distillation process based on the content characteristics of each sample image, the common content characteristics among the sample images can be determined through content consistency comparison among different sample images of the same sample group, and training of a student network is guided based on the common content characteristics, so that semantic characteristic information is better extracted to train a target model, and training effects are guaranteed.
For the above description of ①-③, schematic diagrams of the processing logic are provided as shown in fig. 5a and fig. 5b. As shown in fig. 5a, the first sample image and the second sample image are different face images acquired by different cameras for the same user. The first sample image may be input into a teacher network (Et) and a student network (Es); it is processed by the teacher network (Et) to obtain the second content features of the first sample image and by the student network (Es) to obtain the first content features of the first sample image, and the second sample image may be processed by the teacher network (Et) to obtain its corresponding second content features, which are high-level content features. Then, content consistency comparison can be performed based on the high-level content features corresponding to the two face images to obtain the reference content features, and knowledge distillation is performed based on the reference content features and the first content features of the first sample image obtained by the student network. The knowledge distillation process is shown in fig. 5b: the first sample image is input into a student network (distilled student model) and a teacher network (teacher model), corresponding prediction results are obtained through class prediction, the target loss is calculated based on the prediction results and the labels, and the parameters of the student network are then adjusted based on the target loss. Thus, in the block diagram shown in fig. 5a, the target loss lx acts on the fully connected layer; when the target loss does not meet the parameter adjustment condition, the semantic feature information in the first content features extracted by the student network is processed by the fully connected layer to obtain the semantic feature information used for training the target model. This semantic feature information is illustrated, for example, by the face key points shown in fig. 5a (depicted as opaque dots at the corresponding face positions). It should be understood that the depiction of the semantic feature information in fig. 5a (the schematic line drawing of a face) is merely a drawing for ease of understanding; the output after the fully connected layer in fig. 5a mainly includes the semantic feature information and does not include information such as face details.
It should be noted that, before the training of the student network is completed, the first content features of the first sample image obtained through the student network are not used as prior information for training the target model; only after the training, for example when the minimized target loss reaches the preset loss threshold, can the extracted semantic feature information be used as prior information for training the target model.
(2) And carrying out feature extraction processing on a second sample image included in the target sample group through a second model to obtain detail features of the second sample image.
In particular, the detail features of the second sample image may be used to describe the detail content of the second sample image. In one practical approach, since the second model is a pre-trained model, it is not altered during either the knowledge distillation process for the student network or the training process for the target model; for example, the same teacher model is used in both. Therefore, the image processing apparatus can directly reuse the detail features contained in the second content features extracted from the second sample image by the second model in the knowledge distillation stage, without repeating the feature extraction processing, which saves processing resources and speeds up the overall training of the target model.
(3) Training a target model according to semantic feature information of the first sample image and detail features of the second sample image and according to the test images set for the target sample group to obtain an image processing model.
Because the first model and the second model are both trained models, semantic feature information of the first sample image and detail features of the second sample image which are extracted correspondingly are relatively accurate feature information, and when the target model is trained, the semantic feature information of the first sample image can be used as prior information (or known as prior knowledge), and the detail features of the second sample image can be used as supplement to generate a new sample image. In addition, each sample group in the sample set is provided with one or more test images, so that when the target sample group is adopted for training, the test images arranged for the target sample group can be used as references of a target model in the training process to teach training of the target model, and the image processing model can be understood as a trained target model.
In one embodiment, the image processing apparatus may be implemented as follows in steps 3.1 to 3.5 when executing the content shown in (3) above.
And 3.1, calling a target model, and generating a predicted image based on semantic feature information of the first sample image and detail features of the second sample image.
The target model is a model capable of realizing image generation, for example a BigGAN model, which performs well on image generation problems; the BigGAN model comprises a generation network (G) and a discrimination network (D). During model training, the discrimination network may assist in the training of the generation network. Exemplary schematic diagrams related to the target model are shown in fig. 5c to fig. 5e: fig. 5c is a typical structure diagram of the generation network (G) of the BigGAN model, fig. 5d is a schematic diagram of the network structure of a residual block in the generation network, and fig. 5e is a schematic diagram of a residual block in the discrimination network.
In one embodiment, the second model used for feature extraction processing of the second sample image may be dynamically cross-layer connected with the target model. Through dynamic cross-layer connection, the detail features extracted from the second sample image can be provided to the target model in real time, so that the speed of inputting the detail features of the second sample image into the target model is improved. The semantic feature information extracted from the first sample image by calling the first model can also be input into the target model, so that the target model can refer to both the detail features of the second sample image and the semantic feature information of the first sample image for image generation. The semantic feature information of the first sample image serves as prior information for the generated image, the detail features of the second sample image serve as a supplement, and images acquired by different image acquisition modules for the same target object are referred to during image generation, so that the predicted image is generated more accurately. The predicted image is an image with the second imaging style, close to the second sample image in image detail and close to the first sample image in semantic features.
In one embodiment, before generating the predicted image through the target model, the image processing device may perform a spectrum normalization process (Spectral Normalization) on parameters of the target model, and through the spectrum normalization process, network parameters may be normalized, so as to improve the stability of training the target model and the generation quality of the predicted image. The method comprises the following steps: singular value decomposition is carried out on parameters of each layer of network in the target model, so that a parameter decomposition result corresponding to each layer of network is obtained, wherein the parameter decomposition result comprises a singular value matrix; and carrying out normalization processing on the singular value matrix included in the parameter decomposition result corresponding to each layer of network, and obtaining a normalized parameter matrix according to the normalized singular value matrix, wherein the normalized parameter matrix comprises normalized parameters corresponding to the corresponding layer of network.
Specifically, singular value decomposition (Singular Value Decomposition, SVD) can be performed on the parameters of each layer of the network of the target model, so as to obtain a corresponding parameter decomposition result for each layer. Illustratively, the target model may include a multi-layer neural network, and the weight matrix W (including the weight parameters) of a layer may be subjected to SVD, thereby decomposing the weight matrix into two orthogonal matrices (denoted as U and V) and one diagonal matrix S. The diagonal matrix S is the singular value matrix, and the elements on its diagonal are the singular values of the weight matrix. Only the singular value matrix in the parameter decomposition result then needs to be normalized, scaling the element values contained in the singular value matrix to a predefined range, e.g. [0, 1]. Through normalization, the largest singular value in the singular value matrix can be limited to 1, so that the 1-Lipschitz condition (Lipschitz continuity condition) is satisfied. The normalized parameter matrix is then reconstructed from the normalized singular value matrix by the inverse of the decomposition. For example, after the singular value matrix S is normalized, a normalized singular value matrix S' is obtained; S' may then be multiplied with the two orthogonal matrices U and V to obtain a normalized weight matrix W', and the weight parameters contained in W' are used as the new weights of the corresponding layer of the target model for training.
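A minimal sketch of the SVD-based normalization described above, assuming a two-dimensional weight matrix (the function name and the rescaling choice are illustrative):

```python
import torch

def spectral_normalize_weight(weight):
    """Normalize a 2-D weight matrix so that its largest singular value becomes 1
    (a sketch of the SVD-based procedure described above)."""
    # Parameter decomposition result: W = U * diag(S) * V^T.
    U, S, Vh = torch.linalg.svd(weight, full_matrices=False)
    # Normalize the singular value matrix so that the maximum singular value is 1.
    S_norm = S / S.max()
    # Reconstruct the normalized parameter matrix from the normalized singular values.
    return U @ torch.diag(S_norm) @ Vh
```

In practice, frameworks such as PyTorch also offer torch.nn.utils.spectral_norm, which enforces a similar constraint via power iteration instead of a full SVD at every update.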
It can be understood that after the parameters in the target model are updated, the parameters of each layer of the network in the updated target model can be normalized in the above manner, and the target model containing the normalized parameters is then called to generate a predicted image, so that the target model satisfies Lipschitz continuity and can be trained better. In one possible implementation, every time the parameters are updated, the parameters in the target model are divided by their maximum singular value, so that after normalization all singular values are no greater than 1 and the maximum stretch coefficient of each layer of the network in the target model with respect to the input x does not exceed 1. After Spectral Norm (i.e., spectral normalization), each layer g_l of the neural network satisfies (g_l(x) − g_l(y))/(x − y) ≤ 1, and the entire target model f(x) = g_N(g_{N−1}(… g_1(x) …)) therefore naturally also satisfies Lipschitz continuity.
Through the spectral normalization processing, the vanishing or exploding gradient problems in the training process can be avoided, the convergence speed of the target model and the diversity of the predicted images are improved, and adjusting the magnitude of the parameters enhances the training stability of the model and the quality of the generated images.
Step 3.2, a first test image and a second test image set for the target sample group are acquired.
After the predicted image is obtained, since the target model has not yet achieved a good image generation capability, the authenticity and accuracy of the predicted image need to be evaluated by means of some other images. Here, the test images set for the target sample group may first be acquired, including a first test image (denoted as y) and a second test image (denoted as x). Because the generation of the predicted image references features of different sample images in different dimensions, the first test image is matched with the first sample image and the second test image is matched with the second sample image. Matching here includes, but is not limited to: having the same imaging style, or being a copy of the corresponding sample image. For example, the second test image may be a copy of the second sample image, and the first test image may be an image actually acquired of the target object, having the same imaging style as the first sample image. The authenticity and the accuracy of the predicted image can then be evaluated by means of the first test image and the second test image, respectively.
And 3.3, calling a judging network, and judging the predicted image according to the first test image to obtain a judging result, wherein the judging result is used for indicating the authenticity of the target object in the predicted image.
After the target model generates the resulting predicted image, it may be input into a discriminant network (or discriminant). The discrimination network may be used to determine whether the input (i.e., the predicted image) is real data or data generated by a model (e.g., the generation network), i.e., whether it is true or false. The prediction image can be discriminated by means of the first test image, and in a specific discriminating process, the similarity between the prediction image and the first test image can be compared, so that a discriminating result can be obtained. The discrimination result obtained by the discrimination process may be a scalar, which may be a score or probability, or the like. Alternatively, the larger the scalar, the higher the authenticity of the predicted image, and the higher the authenticity of the target object in the predicted image.
It can be appreciated that the discrimination network can also be spectrally normalized in a similar manner prior to generating the predicted image, so that the discrimination network satisfies Lipschitz continuity, resulting in a faster convergence rate of the overall training. In addition, after the parameters in the discrimination network are updated through the loss corresponding to the training, the parameters in the discrimination network can be further normalized to ensure the stability of the training.
And 3.4, carrying out content consistency comparison on the predicted image and the second test image to obtain a comparison result. The comparison result is used for indicating whether the image content between the predicted image and the second test image meets the consistency condition.
Wherein satisfying the consistency condition includes, but is not limited to: the predicted image and the second test image having consistent image content, the proportion of consistent image content being larger than a preset proportion threshold, the two image contents being identical, or the feature similarity being larger than a preset similarity threshold, the content features of the two being identical, and the like. Through content consistency comparison, the accuracy of the image generated by the target model can be evaluated, and the confidence of the model can be determined. When the confidence of the target model is high enough, the generated predicted image is accurate enough, and the discrimination network may take the fake for real, i.e., discriminate the predicted image as true.
In one embodiment, when content consistency comparison is performed on the predicted image and the second test image, the second model can be called to perform feature extraction processing on the predicted image to obtain the content features of the predicted image, and the second model can be called to perform feature extraction processing on the second test image to obtain the content features of the second test image; the content features of the predicted image and the content features of the second test image are then compared for consistency to obtain the comparison result.
In particular, the second model may be used to better extract content features of the predicted image, and the content features may be used to describe detail features of the predicted image, with similar effects as the aforementioned second content features. The content features extracted from the predicted image by the second model can be used for describing the detail features of the predicted image, and the content features extracted from the second test image can be used for describing the detail features of the second test image. If the second test image is a copy of the second sample image, the second content features of the second sample image may be directly obtained as content features of the second test image that are consistently compared with the content features of the predicted image. In one implementation manner, when consistency comparison is performed on the content features, specifically, the similarity between the content features of the predicted image and the content features of the second test image may be calculated, if the similarity is greater than a preset similarity threshold, a comparison result for indicating that the image content between the predicted image and the second test image meets the consistency condition may be obtained, and otherwise, a comparison result for indicating that the image content between the predicted image and the second test image does not meet the consistency condition may be obtained.
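As an illustration only (feature shapes and the similarity threshold are assumptions), the consistency comparison of step 3.4 could be sketched as a cosine similarity between the two feature vectors extracted by the second model:

```python
import torch.nn.functional as F

def content_consistency(pred_feat, test_feat, sim_threshold=0.9):
    """Compare the content features of the predicted image and of the second test
    image; returns True if the consistency condition is considered met (sketch)."""
    # Flatten each sample's features and compute a per-sample cosine similarity.
    sim = F.cosine_similarity(pred_feat.flatten(1), test_feat.flatten(1), dim=-1)
    return bool((sim > sim_threshold).all())
```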
In the above manner, the feature extraction processing is performed on the predicted image and the second test image by means of the second model, so that more accurate high-level content features can be obtained, and further content consistency comparison is realized based on comparison between the content features of the predicted image and the second test image, so that a reliable comparison result can be obtained, and the quality of the predicted image can be accurately estimated by the comparison result, so that better iterative training is performed on the target model based on the comparison result.
And 3.5, updating parameters of the target model according to the comparison result and the discrimination result to obtain an image processing model.
In the process of training the target model, the training objective of the target model is to generate images that are as realistic as possible, while the training objective of the discrimination network is to distinguish, as far as possible, the images generated by the target model from real images; through this adversarial training, the image processing model can finally be obtained. In one implementation, in each iteration, the parameters of the generation network can be fixed while the parameters of the discrimination network are updated; specifically, after the discrimination result of the discriminator is obtained, the parameters of the discrimination network may be updated according to the discrimination result. Then the parameters of the discrimination network can be fixed and the target model updated; specifically, the parameters of the target model can be updated according to the comparison result, the loss determined based on the predicted image, and the discrimination result. The higher the authenticity of the predicted image indicated by the discrimination result output by the discrimination network, the better the target model obtained by adjusting its parameters.
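The alternating update described above (fix the generator, update the discriminator; then fix the discriminator, update the generator) might be organized as in the following sketch; the generator/discriminator call signatures, optimizers and loss helpers are all assumptions:

```python
def adversarial_step(generator, discriminator, g_opt, d_opt,
                     semantic_feat, detail_feat, first_test_image,
                     d_loss_fn, g_loss_fn):
    """One illustrative iteration of the adversarial training described above."""
    # Phase 1: fix the generation network, update the discrimination network.
    pred_image = generator(semantic_feat, detail_feat).detach()
    d_loss = d_loss_fn(discriminator(first_test_image), discriminator(pred_image))
    d_opt.zero_grad()
    d_loss.backward()
    d_opt.step()

    # Phase 2: fix the discrimination network, update the target model (generator).
    pred_image = generator(semantic_feat, detail_feat)
    g_loss = g_loss_fn(discriminator(pred_image))
    g_opt.zero_grad()
    g_loss.backward()
    g_opt.step()
    return d_loss.item(), g_loss.item()
```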
In one possible embodiment, after generating the predicted image by the target model, the manner of determining the loss based on the predicted image may be as follows: determining generation loss of the predicted image according to the predicted image, the weight parameters of the target model and the first sample image, and determining a regularization term based on parameters included in the target model; parameters of the target model are updated based on the generation loss of the predicted image and the regularization term.
When determining the generated loss, a corresponding loss function may be employed for calculation. Illustratively, hinge Loss (Hinge Loss), perceptron Loss, etc. may be employed. When hinge loss is adopted, a parameter loss term is calculated according to the weight parameter of the target model, then the loss sum value of the predicted image under each characteristic dimension is calculated according to the predicted image, the weight parameter of the target model and the first sample image, and the generated loss can be obtained based on the parameter loss term and the loss sum value. Furthermore, a regularization term may be determined based on parameters included by the target model, which may prevent the target model from overfitting. The specific expressions for the hinge loss and regularization term are as follows:
For example, the hinge loss and the regularization term may take the forms L = Σ_{i=1}^{N} max(0, 1 − y_i·(w·x_i)) and z = λ·‖w‖², where N represents the feature length, y_i represents the i-th feature of the predicted image, w represents the weight parameters of the target model, x_i represents the i-th feature of the second test image x, ‖w‖² represents the L2 regularization (the squared modulus of the weight parameters), z represents the regularization term, and λ is a coefficient that may take different values for the regularization term.
In the training process, the generation loss approaches 0 only when the classification is correct and the confidence of the predicted image is high enough; the relationship between the hinge loss and the classification result is shown in fig. 6a. The loss value (loss size) of the hinge loss is larger in the direction of erroneous classification (incorrectly classified), while in the direction of correct classification (correctly classified) it decreases as the classification accuracy improves, approaching 0 as the distance from the boundary (distance from boundary) increases. Compared with the perceptron loss, the hinge loss places higher and stricter requirements on the learning of the target model, so the predicted images generated under the hinge loss are held to a stricter standard than those generated under the perceptron loss. The parameters of the target model may be updated according to the generation loss and the regularization term described above.
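For illustration, a hinge-style generation loss with an L2 regularization term matching the variable descriptions above could be sketched as follows (this is a reconstruction under stated assumptions, not the claimed formula):

```python
import torch

def hinge_generation_loss(pred_feats, test_feats, weight, reg_lambda=1e-4):
    """Hinge-style loss summed over the N feature dimensions plus an L2
    regularization term on the model weights (illustrative reconstruction)."""
    # Loss sum over feature dimensions: max(0, 1 - y_i * x_i).
    per_dim = torch.clamp(1.0 - pred_feats * test_feats, min=0.0)
    loss_sum = per_dim.sum()
    # Regularization term z based on the weight parameters (squared L2 norm).
    reg_term = reg_lambda * weight.pow(2).sum()
    return loss_sum + reg_term
```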
Through the steps 3.1-3.5, semantic feature information of the first sample image can be used as prior information to jointly act with detail features of the second sample image to generate a predicted image, then generation quality of the predicted image is respectively estimated from two dimensions of reality and accuracy through different test images, the generation quality can be used for reflecting image generation capacity of a target model, and an image processing model with better image generation capacity can be obtained through continuous updating of parameters of the target model.
Based on the description of steps 3.1 to 3.5 above, a training framework diagram of the target model is provided as shown in fig. 6b. As shown in fig. 6b, the output obtained by processing the first sample image through the trained student network (Es) (i.e., the semantic feature information) may be input into the generation network (G), the output obtained by processing the second sample image through the teacher network (Ec) (i.e., the second content features) may be input into the generation network (G) through a dynamic cross-layer connection, and the generation network can then generate the predicted image based on these two inputs, so that fusion calculation of the images acquired by the different image acquisition modules is realized. Then, the predicted image can be input into the discrimination network (D) and the teacher network respectively; the first test image (y) is also input into the discrimination network to assist in judging whether the predicted image is true or false, i.e., whether it is a real image or an image generated by the generation network, and the second test image (x) is also input into the teacher network to assist in judging whether the predicted image has the highly unified semantic correspondence. The parameters of the generation network are updated through the respective output results of the teacher network and the discrimination network, finally obtaining the image processing model. The image processing model can perform style conversion processing on the first sample image to obtain a reference image.
It will be appreciated that the line drawings such as the target image, the reference image, the first/second sample images, the predicted image, and the line drawings depicting the semantic feature information are merely schematic images drawn for ease of understanding. In practical applications, the images are images or feature maps having specific forms or formats; the formats may be, for example, pictures in some common image format or screenshots of videos, etc., and the present application is not limited thereto.
In one possible embodiment, a random noise vector is additionally added to train the target model in the training process, wherein the random noise vector is generated from random numbers sampled from a normal distribution. The random noise vector can be used together with the semantic feature information of the first sample image and the detail features of the second sample image in the generation of the predicted image. The larger the variation range of the random numbers included in the random noise vector input into the target model, the larger the variation of the generated predicted image relative to the standard template; the diversity of the predicted images increases, but their authenticity may decrease. Therefore, to further increase the robustness of model training, a truncation trick may be used to truncate the random numbers in the random noise vector that are sampled from the normal distribution, thereby enabling the model to increase the authenticity of the generated predicted image. In one implementation, the random noise vector may be truncated using a truncation parameter to obtain a truncated noise vector, where the truncated noise vector includes a plurality of random noise values and each random noise value is within a preset range. The truncated noise vector is then used to generate the predicted image. A truncated noise vector is understood to be a noise vector Z generated from random numbers of a truncated normal distribution N(0, 1). Specifically, the sampling range of the desired random numbers in the normal distribution may be determined based on the truncation parameter, and if the random noise vector includes a random number outside a certain range (i.e., the range interval determined by the truncation parameter), that random number may be resampled so that it falls within the range interval determined by the truncation parameter. That is, the random noise vector is truncated and the random numbers that exceed a specified threshold (here corresponding to the truncation parameter) are resampled, which can improve the quality of individual predicted images.
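The truncation trick described above could be sketched by resampling any noise value that falls outside the range determined by the truncation parameter (the default truncation value below is only an example):

```python
import torch

def truncated_noise(shape, truncation=0.5):
    """Sample a noise vector from N(0, 1) and resample every value whose magnitude
    exceeds the truncation parameter (illustrative sketch of the truncation trick)."""
    z = torch.randn(shape)
    out_of_range = z.abs() > truncation
    while out_of_range.any():
        # Resample only the out-of-range entries until all fall within the interval.
        z[out_of_range] = torch.randn(int(out_of_range.sum()))
        out_of_range = z.abs() > truncation
    return z
```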
According to the image processing method provided by the embodiments of the present application, model training can be performed on sample images acquired for the same object by different image acquisition modules; during training, content that is semantically consistent across sample images of different imaging styles is extracted as prior information by means of knowledge distillation, and the target model is effectively trained, so that the target model learns the highly unified semantic correspondence of the target object under different imaging styles. Through fusion calculation of the images acquired by the image acquisition modules, an image processing model that converts styles and generates new images can be obtained; this model has a good capability of extracting the baseline style, can effectively process the target image, and is beneficial to subsequent recognition processing.
Based on the overall training flow of the target model set forth in the embodiment shown in fig. 4 and the processing flow of the target image shown in the embodiment shown in fig. 2, a framework diagram of the image processing method is provided as shown in fig. 7. As shown in fig. 7, the framework includes a training phase and an inference phase. In the training phase, training samples are extracted from baseline images; specifically, prior information can be obtained through knowledge distillation processing, and images are then generated based on the prior information to train the target model, so as to obtain the image processing model, which can be used for generating baseline images. In the inference phase, target conversion samples (which can be understood as target images) can be processed; these are actual industrial data, such as face images in a face-brushing payment scenario. The target conversion sample is processed by the image processing model through a multivariate style conversion algorithm to generate a baseline image; the generated baseline image contains the common features of the target object, for example the highly unified semantic correspondence of the face, and the baseline image is then recognized by the recognition model to obtain the image recognition result.
With the above image processing method, in a biometric payment service, when a hardware module for collecting biometric information is replaced, there is no need to re-collect image data of the new style or to retrain the recognition model.
Based on the description of the embodiment of the image processing method, the embodiment of the application also discloses an image processing device; the image processing apparatus may be a computer program (including program code) running in a computer device, and the image processing apparatus may perform the steps of the method flow shown in fig. 2 or fig. 4. Referring to fig. 8, the image processing apparatus may operate the following units:
an acquiring unit 801, configured to acquire a target image to be processed, where the target image has a first imaging style;
A processing unit 802, configured to perform style conversion processing on the target image according to a style conversion rule, so as to obtain a reference image corresponding to the target image, where the reference image has a second imaging style; the style conversion rule is obtained by training and learning based on first content features of sample images of the sample group and second content features of sample images of the sample group, wherein the first content features are used for describing common image object features of target objects in the sample images among different sample images, and the second content features are used for describing detail features of the target objects in the sample images;
The processing unit 802 is further configured to perform recognition processing on the reference image to obtain an image recognition result;
each sample image of the sample group comprises images acquired by adopting different image acquisition modules aiming at a target object, and the sample images acquired by the different image acquisition modules have different imaging styles.
In one embodiment, the style conversion rule includes an image processing model, and the processing unit 802 is further configured to:
Acquiring a sample set, wherein the sample set comprises a plurality of sample groups, each sample group at least comprises two sample images, and different sample images in each sample group are images acquired by different image acquisition modules on the same target object;
Training a target model through a sample set to obtain an image processing model;
Wherein the first content feature comprises: semantic feature information corresponding to the first sample image, obtained by performing common feature extraction processing on the first sample image in the sample group; the semantic feature information includes one or both of facial feature (five sense organs) information of the target object and facial depth information of the target object.
In one embodiment, the processing unit 802 is specifically configured to, when training the target model through the sample set to obtain the image processing model:
Acquiring semantic feature information from a first sample image included in a target sample group in a sample set through a first model; the first model is obtained by carrying out knowledge distillation processing based on the second model, and is used for carrying out common feature extraction processing of the sample image;
Performing feature extraction processing on a second sample image included in the target sample group through a second model to obtain detail features of the second sample image;
Training a target model according to semantic feature information of the first sample image and detail features of the second sample image and according to the test images set for the target sample group to obtain an image processing model.
In one embodiment, the processing unit 802 is further configured to, prior to obtaining, by the first model, semantic feature information from a first sample image included in a target sample group in the sample set:
Invoking a student network to perform feature extraction processing on a first sample image in the target sample group to obtain a first content feature of the first sample image, and invoking a second model to perform feature extraction processing on the first sample image in the target sample group to obtain a second content feature of the first sample image;
Invoking a second model to perform feature extraction processing on a second sample image in the target sample group to obtain a second content feature of the second sample image;
And carrying out knowledge distillation processing aiming at the student network according to the second content characteristics of the first sample image and the second content characteristics of the second sample image obtained by the second model and the first content characteristics of the first sample image obtained by the student network so as to obtain a first model corresponding to the student network.
In one embodiment, the processing unit 802 is specifically configured to, when performing the knowledge distillation processing for the student network according to the second content feature of the first sample image obtained by the second model and the second content feature of the second sample image obtained by the student network, and according to the first content feature of the first sample image obtained by the student network:
taking the content characteristics which meet the characteristic consistency condition between the second content characteristics of the first sample image and the second content characteristics of the second sample image obtained by the second model as the reference content characteristics of the first sample image;
Determining target loss corresponding to the first sample image in the distillation process according to the reference content characteristics of the first sample image and the first content characteristics of the first sample image obtained by the student network;
And when the target loss meets the parameter adjustment condition, performing parameter adjustment on the student network to obtain a first model corresponding to the student network.
In one embodiment, the processing unit 802 is specifically configured to, when determining, based on the reference content feature of the first sample image and the first content feature of the first sample image obtained by the student network, a target loss corresponding to the first sample image during the distillation process:
According to first content characteristics of a first sample image obtained by a student network, respectively carrying out category prediction processing according to different distillation temperatures T to obtain a first prediction result of the first sample image and a second prediction result of the first sample image;
performing category prediction processing based on the reference content characteristics of the first sample image through the second model to obtain a soft tag of the first sample image, and obtaining a hard tag set by the first sample image;
determining a first prediction loss of the first sample image according to the first prediction result and the soft tag, and determining a second prediction loss of the first sample image according to the second prediction result and the hard tag;
And determining a target loss corresponding to the first sample image in the knowledge distillation process by using the first predicted loss and the second predicted loss.
In one embodiment, the processing unit 802 is specifically configured to, when training the target model according to the semantic feature information of the first sample image and the detail feature of the second sample image and according to the test image set for the target sample group, obtain an image processing model:
Invoking a target model, and generating a predicted image based on semantic feature information of the first sample image and detail features of the second sample image;
acquiring a first test image and a second test image which are set for a sample group, wherein the first test image is matched with the first sample image, and the second test image is matched with the second sample image;
Invoking a discrimination network to perform discrimination processing on the predicted image according to the first test image to obtain a discrimination result, wherein the discrimination result is used for indicating the authenticity of a target object in the predicted image;
Content consistency comparison is carried out on the predicted image and the second test image to obtain a comparison result, and the comparison result is used for indicating whether the image content between the predicted image and the second test image meets consistency conditions or not;
And updating the parameters of the target model according to the comparison result and the discrimination result to obtain an image processing model.
In one embodiment, the processing unit 802 is specifically configured to, when performing content consistency comparison on the predicted image and the second test image to obtain a comparison result:
Invoking a second model to perform feature extraction processing on the predicted image to obtain content features of the predicted image, and invoking the second model to perform feature extraction processing on the second test image to obtain the content features of the second test image;
and carrying out consistency comparison on the content characteristics of the predicted image and the content characteristics of the second test image to obtain a comparison result.
In one embodiment, the processing unit 802 is further configured to, before invoking the target model, generate the predicted image based on the semantic feature information of the first sample image and the detail features of the second sample image:
Singular value decomposition is carried out on parameters of each layer of network in the target model, so that a parameter decomposition result corresponding to each layer of network is obtained, wherein the parameter decomposition result comprises a singular value matrix;
and carrying out normalization processing on the singular value matrix included in the parameter decomposition result corresponding to each layer of network, and obtaining a normalized parameter matrix according to the normalized singular value matrix, wherein the normalized parameter matrix comprises normalized parameters corresponding to the corresponding layer of network.
In one embodiment, the processing unit 802 is further configured to, after invoking the target model, generate a predicted image based on the semantic feature information of the first sample image and the detail features of the second sample image:
determining generation loss of the predicted image according to the predicted image, the weight parameters of the target model and the first sample image, and determining a regularization term based on parameters included in the target model;
Parameters of the target model are updated based on the generation loss of the predicted image and the regularization term.
In one embodiment, the target model is further added with random noise vectors for training in the training process, and the random noise vectors are generated according to random numbers obtained by sampling from normal distribution; the processing unit 802 is further configured to:
the random noise vector is truncated by adopting a truncation parameter to obtain a truncated noise vector; the truncated noise vector comprises a plurality of random noise values, and each random noise value is in a preset range;
the truncated noise vector is used to generate a predicted image.
In one embodiment, the image recognition result is obtained by calling an image recognition model to perform recognition processing on the reference image, and the image recognition model is obtained by training an image with a second imaging style.
According to the embodiment of the application, aiming at the target image to be processed, the target image is subjected to style conversion processing through the style conversion rule, the original imaging style can be converted into a new imaging style, the new image is obtained to carry out subsequent recognition processing, and recognition based on the target image is further completed. Therefore, the image of any imaging style can be effectively identified, and the effectiveness of image identification after imaging style conversion is improved. In the formation process of the style conversion rule, sample images of different imaging styles formed by different image acquisition modules aiming at a target object can be referred to, common features and detail features among the sample images of different imaging styles are learned, and further, a reliable style conversion rule is obtained so as to support effective conversion of the target image, key information in the target image is reserved while the style is converted, and therefore more accurate identification can be performed.
Based on the above method and apparatus embodiments, an embodiment of the application further provides an image processing device. Referring to fig. 9, the image processing device includes at least a processor 901, an input interface 902, an output interface 903, and a computer storage medium 904, which may be connected by a bus or in other ways. The computer storage medium 904 may reside in the memory of the image processing device; it is used to store a computer program comprising program instructions, and the processor 901 is used to execute the program instructions stored in the computer storage medium 904. The processor 901 (or CPU, Central Processing Unit) is the computing and control core of the image processing device and is adapted to implement one or more instructions, in particular to load and execute one or more instructions so as to implement the corresponding method flow or function.
In one possible implementation, the processor 901 of an embodiment of the present application may be configured to perform:
acquiring a target image to be processed, wherein the target image has a first imaging style;
performing style conversion processing on the target image according to the style conversion rule to obtain a reference image corresponding to the target image, wherein the reference image has a second imaging style; the style conversion rule is obtained by training and learning based on first content features of sample images of the sample group and second content features of sample images of the sample group, wherein the first content features are used for describing common image object features of target objects in the sample images among different sample images, and the second content features are used for describing detail features of the target objects in the sample images;
performing recognition processing on the reference image to obtain an image recognition result;
wherein the sample images of the sample group include images of a target object acquired by different image acquisition modules, and the sample images acquired by the different image acquisition modules have different imaging styles.
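To make this flow concrete, here is a hedged sketch of the inference path, assuming the trained image processing model and image recognition model are available as PyTorch modules; the names `style_model` and `recognition_model` are placeholders, not identifiers from the application.

```python
import torch

@torch.no_grad()
def recognize(target_image: torch.Tensor,
              style_model: torch.nn.Module,
              recognition_model: torch.nn.Module):
    reference_image = style_model(target_image)   # first imaging style -> second imaging style
    result = recognition_model(reference_image)   # recognition processing on the reference image
    return result
```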
In one embodiment, the style conversion rule includes an image processing model, and the processor 901 is further configured to:
Acquiring a sample set, wherein the sample set comprises a plurality of sample groups, each sample group comprises at least two sample images, and different sample images in each sample group are images of the same target object acquired by different image acquisition modules;
Training a target model through a sample set to obtain an image processing model;
Wherein the first content feature comprises semantic feature information obtained by performing common-feature extraction processing on a first sample image in the sample group; the semantic feature information includes one or both of facial feature information of the target object and facial depth information of the target object.
In one embodiment, the processor 901, when training the target model through the sample set, is specifically configured to:
Acquiring semantic feature information from a first sample image included in a target sample group in a sample set through a first model; the first model is obtained by carrying out knowledge distillation processing based on the second model, and is used for carrying out common feature extraction processing of the sample image;
Performing feature extraction processing on a second sample image included in the target sample group through a second model to obtain detail features of the second sample image;
Training a target model according to semantic feature information of the first sample image and detail features of the second sample image and according to the test images set for the target sample group to obtain an image processing model.
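As an illustrative sketch of this data flow (module names are assumptions, not the application's API): the first model extracts semantic feature information from the first sample image, the second model extracts detail features from the second sample image, and both are fed to the target model to produce the predicted image that is later checked against the test images.

```python
import torch

def forward_sample_group(img_a, img_b, first_model, second_model, target_model):
    semantic = first_model(img_a)    # semantic feature information (e.g. facial features, depth)
    detail = second_model(img_b)     # detail features of the second sample image
    predicted = target_model(semantic, detail)  # predicted image produced by the target model
    return predicted
```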
In one embodiment, the processor 901, prior to obtaining the semantic feature information from the first sample image included in the target sample group in the sample set by the first model, is further configured to:
Invoking a student network to perform feature extraction processing on a first sample image in the target sample group to obtain a first content feature of the first sample image, and invoking a second model to perform feature extraction processing on the first sample image in the target sample group to obtain a second content feature of the first sample image;
Invoking a second model to perform feature extraction processing on a second sample image in the target sample group to obtain a second content feature of the second sample image;
And carrying out knowledge distillation processing aiming at the student network according to the second content characteristics of the first sample image and the second content characteristics of the second sample image obtained by the second model and the first content characteristics of the first sample image obtained by the student network so as to obtain a first model corresponding to the student network.
In one embodiment, when performing the knowledge distillation processing for the student network according to the second content features of the first sample image and the second content features of the second sample image obtained by the second model, and according to the first content features of the first sample image obtained by the student network, the processor 901 is specifically configured to:
taking the content characteristics which meet the characteristic consistency condition between the second content characteristics of the first sample image and the second content characteristics of the second sample image obtained by the second model as the reference content characteristics of the first sample image;
Determining target loss corresponding to the first sample image in the distillation process according to the reference content characteristics of the first sample image and the first content characteristics of the first sample image obtained by the student network;
And when the target loss meets the parameter adjustment condition, performing parameter adjustment on the student network to obtain a first model corresponding to the student network.
In one embodiment, the processor 901 is specifically configured to, when determining, based on the reference content feature of the first sample image and the first content feature of the first sample image obtained by the student network, a target loss corresponding to the first sample image during the distillation process:
According to first content characteristics of a first sample image obtained by a student network, respectively carrying out category prediction processing according to different distillation temperatures T to obtain a first prediction result of the first sample image and a second prediction result of the first sample image;
performing category prediction processing based on the reference content characteristics of the first sample image through the second model to obtain a soft tag of the first sample image, and obtaining a hard tag set for the first sample image;
determining a first prediction loss of the first sample image according to the first prediction result and the soft tag, and determining a second prediction loss of the first sample image according to the second prediction result and the hard tag;
And determining a target loss corresponding to the first sample image in the knowledge distillation process by using the first predicted loss and the second predicted loss.
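The temperature-based losses described above correspond to the standard knowledge-distillation formulation; a hedged sketch follows, where the weighting factor `alpha` and the temperature value are assumptions, and `teacher_logits` stands for the second model's category prediction on the reference content features of the first sample image.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, hard_labels, T=4.0, alpha=0.7):
    # first prediction: softened student distribution at temperature T vs. the teacher's soft tag
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    # second prediction: ordinary student prediction vs. the hard tag
    hard_loss = F.cross_entropy(student_logits, hard_labels)
    # target loss combining both prediction losses
    return alpha * soft_loss + (1.0 - alpha) * hard_loss
```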
In one embodiment, the processor 901 is specifically configured to, when training the target model according to the semantic feature information of the first sample image and the detail feature of the second sample image and according to the test image set for the target sample group, obtain an image processing model:
Invoking a target model, and generating a predicted image based on semantic feature information of the first sample image and detail features of the second sample image;
acquiring a first test image and a second test image which are set for a sample group, wherein the first test image is matched with the first sample image, and the second test image is matched with the second sample image;
Invoking a discrimination network to perform discrimination processing on the predicted image according to the first test image to obtain a discrimination result, wherein the discrimination result is used for indicating the authenticity of a target object in the predicted image;
Content consistency comparison is carried out on the predicted image and the second test image to obtain a comparison result, and the comparison result is used for indicating whether the image content between the predicted image and the second test image meets consistency conditions or not;
And updating the parameters of the target model according to the comparison result and the discrimination result to obtain an image processing model.
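A hedged sketch of one such training step is shown below. The adversarial loss form (binary cross-entropy) and the use of an L1 distance on second-model features for the content-consistency term are assumptions; the application only specifies that a discrimination result and a comparison result drive the parameter update.

```python
import torch
import torch.nn.functional as F

def train_step(semantic, detail, first_test_img, second_test_img,
               target_model, discriminator, second_model, g_opt, d_opt):
    # generator forward: predicted image from semantic + detail features
    pred = target_model(semantic, detail)

    # discriminator update: judge authenticity of the prediction against the first test image
    d_real = discriminator(first_test_img)
    d_fake = discriminator(pred.detach())
    d_loss = (F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real))
              + F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake)))
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()

    # generator update: discrimination result plus content consistency with the second test image
    adv = F.binary_cross_entropy_with_logits(discriminator(pred), torch.ones_like(d_fake))
    consistency = F.l1_loss(second_model(pred), second_model(second_test_img))
    g_loss = adv + consistency
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()
    return d_loss.item(), g_loss.item()
```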
In one embodiment, the processor 901 is specifically configured to, when performing content consistency comparison on the predicted image and the second test image to obtain a comparison result:
Invoking a second model to perform feature extraction processing on the predicted image to obtain content features of the predicted image, and invoking the second model to perform feature extraction processing on the second test image to obtain the content features of the second test image;
and carrying out consistency comparison on the content characteristics of the predicted image and the content characteristics of the second test image to obtain a comparison result.
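For illustration, the comparison could be realized as a similarity test on the extracted content features; the cosine-similarity measure and the threshold value below are assumptions made for this sketch.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def content_consistent(pred_img, second_test_img, second_model, threshold=0.9) -> bool:
    f_pred = second_model(pred_img).flatten(1)        # content features of the predicted image
    f_test = second_model(second_test_img).flatten(1) # content features of the second test image
    similarity = F.cosine_similarity(f_pred, f_test, dim=1).mean()
    return bool(similarity >= threshold)              # comparison result: consistency condition met or not
```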
In one embodiment, before invoking the target model to generate the predicted image based on the semantic feature information of the first sample image and the detail features of the second sample image, the processor 901 is further configured to:
performing singular value decomposition on the parameters of each network layer in the target model to obtain a parameter decomposition result corresponding to each network layer, wherein the parameter decomposition result comprises a singular value matrix;
and normalizing the singular value matrix included in the parameter decomposition result corresponding to each network layer, and obtaining a normalized parameter matrix according to the normalized singular value matrix, wherein the normalized parameter matrix comprises the normalized parameters of the corresponding network layer.
In one embodiment, after invoking the target model to generate a predicted image based on the semantic feature information of the first sample image and the detail features of the second sample image, the processor 901 is further configured to:
determining a generation loss of the predicted image according to the predicted image, the weight parameters of the target model and the first sample image, and determining a regularization term based on the parameters included in the target model;
and updating the parameters of the target model based on the generation loss of the predicted image and the regularization term.
In one embodiment, a random noise vector is additionally fed into the target model during training, the random noise vector being generated from random numbers sampled from a normal distribution; the processor 901 is further configured to:
truncating the random noise vector with a truncation parameter to obtain a truncated noise vector, wherein the truncated noise vector comprises a plurality of random noise values and each random noise value lies within a preset range;
the truncated noise vector is used to generate the predicted image.
In one embodiment, the image recognition result is obtained by invoking an image recognition model to perform recognition processing on the reference image, the image recognition model being trained with images having the second imaging style.
According to the embodiments of the application, a target image to be processed is subjected to style conversion processing through the style conversion rule, so that its original imaging style is converted into a new imaging style, and the resulting new image is used for subsequent recognition processing, thereby completing recognition based on the target image. Images of any imaging style can therefore be recognized effectively, which improves the effectiveness of image recognition after imaging-style conversion. When the style conversion rule is formed, sample images of different imaging styles captured of a target object by different image acquisition modules can be referenced, and the common features and the detail features among these sample images are learned, yielding a reliable style conversion rule that supports effective conversion of the target image. Key information in the target image is preserved while its style is converted, so that more accurate recognition can be performed.
It should further be noted that an embodiment of the present application also provides a computer-readable storage medium storing a computer program, where the computer program comprises program instructions; when the program instructions are executed by a processor, the methods in the embodiments corresponding to fig. 2 and fig. 4 can be performed, so a detailed description is not repeated here. For technical details not disclosed in the embodiments of the computer-readable storage medium of the present application, please refer to the description of the method embodiments of the present application. As an example, the program instructions may be deployed on one computer device, or executed on multiple computer devices located at one site, or executed on multiple computer devices distributed across multiple sites and interconnected by a communication network.
According to one aspect of the present application, a computer program product is provided, comprising a computer program stored in a computer-readable storage medium. A processor of a computer device reads the computer program from the computer-readable storage medium and executes it, so that the computer device can perform the methods in the embodiments corresponding to fig. 2 and fig. 4; a detailed description is therefore not repeated here.
Those skilled in the art will appreciate that all or part of the above method embodiments may be implemented by a computer program stored in a computer-readable storage medium; when executed, the program may include the flows of the above method embodiments. The storage medium may be a magnetic disk, an optical disc, a read-only memory (ROM), a random access memory (RAM), or the like.
The above disclosure describes only preferred embodiments of the present application; it should be understood that the scope of the application is not limited thereto, and that all or part of the procedures for implementing the above embodiments may be modified by those skilled in the art within the scope of the appended claims.

Claims (12)

1. An image processing method, comprising:
acquiring a target image to be processed, wherein the target image has a first imaging style, and is a face image in a payment scene;
Performing style conversion processing on the target image through an image processing model to obtain a reference image corresponding to the target image, wherein the reference image has a second imaging style; the image processing model is obtained by training and learning based on first content features of sample images of a sample group and second content features of the sample images of the sample group, wherein the first content features are used for describing common image object features of a target object in the sample images among different sample images, and the second content features are used for describing detail features of the target object in the sample images; the common image object features refer to object features common to the same target object in sample images of different imaging styles, and the common image object features do not change with the changes of the imaging styles; the reference image having the second imaging style is a baseline image;
performing identification processing on the reference image to obtain an image identification result; when the image recognition result indicates successful recognition, successful payment is performed based on the face image;
Each sample image of the sample group comprises images acquired by adopting different image acquisition modules aiming at the target object, and the sample images acquired by the different image acquisition modules have different imaging styles;
The training process of the image processing model comprises the following steps: obtaining a sample set, the sample set comprising a plurality of sample groups; training a target model according to semantic feature information of a first sample image in a sample group and detail features of a second sample image in the sample group and according to a test image set for each sample group to obtain an image processing model; the semantic feature information of the first sample image is obtained by performing feature extraction processing on the first sample image through a first model, and the detail feature of the second sample image is obtained by performing feature extraction processing on the second sample image through a second model; the semantic feature information includes: the facial feature information of the target object and the facial depth information of the target object;
wherein the training a target model according to semantic feature information of a first sample image in a sample group and detail features of a second sample image in the sample group, and according to a test image set for each sample group, to obtain an image processing model comprises the following steps: invoking a target model, and generating a predicted image based on semantic feature information of the first sample image and detail features of the second sample image;
and after the invoking of the target model to generate a predicted image based on semantic feature information of the first sample image and detail features of the second sample image, the method further comprises: determining a generation loss of the predicted image according to the predicted image, the weight parameter of the target model and the first sample image, and determining a regularization term based on parameters included in the target model; updating parameters of the target model based on the generation loss of the predicted image and the regularization term; wherein the generation loss determination process includes: calculating a parameter loss term according to the weight parameter of the target model, calculating the loss sum value of the predicted image under each feature dimension according to the predicted image, the weight parameter of the target model and the first sample image, and determining the generation loss based on the parameter loss term and the loss sum value.
2. The method of claim 1, wherein the first model is obtained by performing a knowledge distillation process based on a second model, the first model being used to perform a common feature extraction process for the sample image.
3. The method of claim 2, further comprising, prior to feature extraction processing of the first sample image by a first model:
Invoking a student network to perform feature extraction processing on a first sample image in a sample group to obtain a first content feature of the first sample image, and invoking the second model to perform feature extraction processing on the first sample image in the sample group to obtain a second content feature of the first sample image;
invoking the second model to perform feature extraction processing on a second sample image in the sample group to obtain a second content feature of the second sample image;
and carrying out knowledge distillation processing aiming at the student network according to the second content characteristics of the first sample image and the second content characteristics of the second sample image obtained by the second model and the first content characteristics of the first sample image obtained by the student network so as to obtain a first model corresponding to the student network.
4. The method of claim 3, wherein the performing a knowledge distillation process for the student network according to the second content features of the first sample image and the second content features of the second sample image obtained by the second model, and according to the first content features of the first sample image obtained by the student network, comprises:
Taking the content characteristics which meet the characteristic consistency condition between the second content characteristics of the first sample image and the second content characteristics of the second sample image obtained by the second model as the reference content characteristics of the first sample image;
determining target loss corresponding to the first sample image in the distillation process according to the reference content characteristics of the first sample image and the first content characteristics of the first sample image obtained by the student network;
and when the target loss meets parameter adjustment conditions, carrying out parameter adjustment on the student network to obtain a first model corresponding to the student network.
5. The method of claim 4, wherein said determining a target loss corresponding to said first sample image during the distillation process based on said reference content characteristic of said first sample image and said first content characteristic of said first sample image obtained by said student network comprises:
According to the first content characteristics of the first sample image obtained by the student network, respectively carrying out category prediction processing according to different distillation temperatures T to obtain a first prediction result of the first sample image and a second prediction result of the first sample image;
Performing category prediction processing based on the reference content characteristics of the first sample image through a second model to obtain a soft tag of the first sample image, and obtaining a hard tag set for the first sample image;
determining a first prediction loss of the first sample image according to the first prediction result and the soft tag, and determining a second prediction loss of the first sample image according to the second prediction result and the hard tag;
And determining a target loss corresponding to the first sample image in the knowledge distillation process by adopting the first predicted loss and the second predicted loss.
6. The method of claim 1, wherein training the target model based on semantic feature information of a first sample image in a sample group, detail features of a second sample image in the sample group, and based on test images set for each sample group, to obtain an image processing model, comprises:
Invoking a target model, and generating a predicted image based on semantic feature information of the first sample image and detail features of the second sample image;
Acquiring a first test image and a second test image which are set for the sample group, wherein the first test image is matched with the first sample image, and the second test image is matched with the second sample image;
invoking a discrimination network, and carrying out discrimination processing on the predicted image according to the first test image to obtain a discrimination result, wherein the discrimination result is used for indicating the authenticity of a target object in the predicted image;
content consistency comparison is carried out on the predicted image and the second test image to obtain a comparison result, wherein the comparison result is used for indicating whether image content between the predicted image and the second test image meets a consistency condition or not;
And updating the parameters of the target model according to the comparison result and the discrimination result to obtain an image processing model.
7. The method of claim 6, wherein said comparing the content consistency of the predicted image and the second test image to obtain a comparison result comprises:
Invoking a second model to perform feature extraction processing on the predicted image to obtain content features of the predicted image, and invoking the second model to perform feature extraction processing on the second test image to obtain content features of the second test image;
And carrying out consistency comparison on the content characteristics of the predicted image and the content characteristics of the second test image to obtain a comparison result.
8. The method of claim 6, wherein the invoking the target model is preceded by generating a predicted image based on semantic feature information of the first sample image and detail features of the second sample image, the method further comprising:
Singular value decomposition is carried out on parameters of each layer of network in the target model, and a parameter decomposition result corresponding to each layer of network is obtained, wherein the parameter decomposition result comprises a singular value matrix;
and carrying out normalization processing on the singular value matrix included in the parameter decomposition result corresponding to each layer of network, and obtaining a normalized parameter matrix according to the normalized singular value matrix, wherein the normalized parameter matrix comprises normalized parameters corresponding to the corresponding layer of network.
9. The method of claim 6, wherein the target model is further trained with the addition of random noise vectors during training, the random noise vectors being generated from respective random numbers sampled from a normal distribution; the method further comprises the steps of:
The random noise vector is truncated by adopting a truncation parameter to obtain a truncated noise vector; the truncated noise vector comprises a plurality of random noise values, and each random noise value is in a preset range;
The truncated noise vector is used to generate a predicted image.
10. An image processing apparatus, comprising:
The acquisition unit is used for acquiring a target image to be processed, wherein the target image has a first imaging style and is a face image in a payment scene;
The processing unit is used for carrying out style conversion processing on the target image through the image processing model to obtain a reference image corresponding to the target image, wherein the reference image has a second imaging style; the image processing model is obtained by training and learning based on first content features of sample images of a sample group and second content features of the sample images of the sample group, wherein the first content features are used for describing common image object features of a target object in the sample images among different sample images, and the second content features are used for describing detail features of the target object in the sample images; the common image object features refer to object features common to the same target object in sample images of different imaging styles, and the common image object features do not change with the changes of the imaging styles; the reference image having the second imaging style is a baseline image;
the processing unit is also used for carrying out identification processing on the reference image to obtain an image identification result; when the image recognition result indicates successful recognition, successful payment is performed based on the face image;
Each sample image of the sample group comprises images acquired by adopting different image acquisition modules aiming at the target object, and the sample images acquired by the different image acquisition modules have different imaging styles;
The processing unit is used for acquiring a sample set when being used for training an image processing model, wherein the sample set comprises a plurality of sample groups; training a target model according to semantic feature information of a first sample image in a sample group and detail features of a second sample image in the sample group and according to a test image set for each sample group to obtain an image processing model; the semantic feature information of the first sample image is obtained by performing feature extraction processing on the first sample image through a first model, and the detail feature of the second sample image is obtained by performing feature extraction processing on the second sample image through a second model; the semantic feature information includes: the facial feature information of the target object and the facial depth information of the target object;
The processing unit is used for calling a target model in the process of training to obtain an image processing model, and generating a predicted image based on semantic feature information of the first sample image and detail features of the second sample image; and after the target model is called to generate a predicted image based on semantic feature information of the first sample image and detail features of the second sample image, the processing unit is further configured to: determining a generation loss of the predicted image according to the predicted image, the weight parameter of the target model and the first sample image, and determining a regularization term based on parameters included in the target model; updating parameters of the target model based on the generation loss of the predicted image and the regularization term; wherein the generation loss determination process includes: calculating a parameter loss term according to the weight parameter of the target model, calculating the loss sum value of the predicted image under each feature dimension according to the predicted image, the weight parameter of the target model and the first sample image, and determining the generation loss based on the parameter loss term and the loss sum value.
11. A computer device comprising an input interface and an output interface, further comprising: a processor and a computer storage medium;
Wherein the processor is adapted to implement one or more instructions, the computer storage medium storing one or more instructions adapted to be loaded by the processor and to perform the image processing method of any of claims 1-9.
12. A computer storage medium storing one or more instructions adapted to be loaded by a processor and to perform the image processing method according to any one of claims 1-9.
CN202311017779.3A 2023-08-14 2023-08-14 Image processing method, device, equipment and storage medium Active CN116758379B (en)


Publications (2)

CN116758379A (en) — published 2023-09-15
CN116758379B (en) — published 2024-05-28 (granted publication)




Legal Events

PB01 — Publication
SE01 — Entry into force of request for substantive examination
REG — Reference to a national code (country code: HK; legal event code: DE; ref document number: 40092654)
GR01 — Patent grant