CN113705301A - Image processing method and device

Image processing method and device

Info

Publication number
CN113705301A
Authority
CN
China
Prior art keywords
image
model
auxiliary
sample
sample image
Legal status
Pending
Application number
CN202110283292.4A
Other languages
Chinese (zh)
Inventor
彭健腾
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202110283292.4A
Publication of CN113705301A

Classifications

    • G: Physics
    • G06: Computing; calculating or counting
    • G06F: Electric digital data processing
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/21: Design or setup of recognition systems or techniques; extraction of features in feature space; blind source separation
    • G06F18/214: Generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06F18/22: Matching criteria, e.g. proximity measures
    • G06F18/24: Classification techniques

Abstract

The embodiments of the present application provide an image processing method and apparatus, applied in the field of image processing. The method includes the following steps: acquiring a sample image set for model training; performing image conversion processing on the sample image set to obtain an auxiliary image set; selecting a training data set for a target task from the sample image set and the auxiliary image set, and training a first model with the training data set of the target task; selecting a training data set for an auxiliary task of the target task from the sample image set and the auxiliary image set, and training a second model with the training data set of the auxiliary task, where the second model and the first model share the model structure and the model parameters of a feature extractor; and determining the trained first model as a target model for executing the target task. With the method and apparatus, the accuracy of image recognition can be improved.

Description

Image processing method and device
Technical Field
The present application relates to the field of internet technologies, and in particular to the field of image processing technologies, and more specifically to an image processing method and apparatus.
Background
With the continuous development and evolution of deep learning, neural network models have been widely applied in the field of image processing, including practical applications such as neural-network-based image recognition.
At present, the mainstream image recognition approach is a classification method based on deep learning: a classification model is trained with a deep learning method, and the category of an image is predicted directly by the classification model. In existing schemes, the sample images used for model training are usually collected manually and labeled through manual analysis. Because only a small number of sample images carry labels, only a small amount of data can participate in model training, and the recognition accuracy of a model trained on such a small number of sample images is therefore limited.
Disclosure of Invention
The embodiment of the application provides an image processing method, an image processing device, computer equipment and a storage medium, which can respectively train a first model and a second model according to a sample image set and an auxiliary image set, and improve the accuracy of image recognition.
In one aspect, an embodiment of the present application provides an image processing method, including:
obtaining a sample image set for model training, wherein the sample image set comprises a first sample image set and a second sample image set, the first sample image set comprises a first sample image and a class label of the first sample image, the second sample image set comprises a second sample image, and the first sample image and the second sample image are both images in a target field corresponding to a target task;
carrying out image conversion processing on the sample image set to obtain an auxiliary image set; the auxiliary image set at least comprises a first augmented image set, the first augmented image set comprises a first augmented image and a class label of the first augmented image, the first augmented image is obtained by performing image enhancement processing on a first sample image, and the class label of the first augmented image is consistent with the class label of the first sample image;
selecting a training data set of a target task from the sample image set and the auxiliary image set, and training a first model by adopting the training data set of the target task; the training data set of the target task comprises at least a first sample image set and a first augmented image set;
selecting a training data set of an auxiliary task of the target task from the sample image set and the auxiliary image set, and training a second model by adopting the training data set of the auxiliary task; the second model shares the model structure and the model parameters of the feature extractor with the first model;
and determining the trained first model as a target model for executing the target task, wherein the target model is used for identifying images in the target field.
In one aspect, an embodiment of the present application provides an image processing apparatus, including:
an acquisition unit, configured to acquire a sample image set for model training, where the sample image set includes a first sample image set and a second sample image set, the first sample image set includes a first sample image and a class label of the first sample image, the second sample image set includes a second sample image, and the first sample image and the second sample image are both images in a target field corresponding to a target task;
the processing unit is used for carrying out image conversion processing on the sample image set to obtain an auxiliary image set; the auxiliary image set at least comprises a first augmented image set, the first augmented image set comprises a first augmented image and a class label of the first augmented image, the first augmented image is obtained by performing image enhancement processing on a first sample image, and the class label of the first augmented image is consistent with the class label of the first sample image;
the training unit is used for selecting a training data set of a target task from the sample image set and the auxiliary image set and training a first model by adopting the training data set of the target task; the training data set of the target task comprises at least a first sample image set and a first augmented image set;
the training unit is used for selecting a training data set of an auxiliary task of the target task from the sample image set and the auxiliary image set and training a second model by adopting the training data set of the auxiliary task; the second model shares the model structure and the model parameters of the feature extractor with the first model;
and the determining unit is used for determining the trained first model as a target model for executing the target task, and the target model is used for identifying the image in the target field.
In one aspect, an embodiment of the present application provides a computer device, including a memory and a processor, where the memory stores a computer program, and the computer program, when executed by the processor, causes the processor to execute the image processing method described above.
In one aspect, the present application provides a computer storage medium storing a computer program, where the computer program includes program instructions, and the program instructions, when executed by a processor, execute the image processing method described above.
An aspect of the embodiments of the present application provides a computer program product or a computer program, where the computer program product or the computer program includes computer instructions, and the computer instructions are stored in a computer-readable storage medium, and when the computer instructions are executed by a processor of a terminal device, the computer instructions perform the methods in the foregoing embodiments.
With the image processing method provided by the embodiments of the present application, image conversion processing can be performed on the sample image set used for model training to obtain an auxiliary image set; model training is then carried out using both the sample image set and the auxiliary image set, which eases the difficulty of collecting training data, enriches the training data available for model training, and allows a more accurate model to be obtained. In addition, in the process of training the first model, a second model is selected for auxiliary training, and the second model and the first model share the model structure and model parameters of the feature extractor. Training data are selected from the sample image set and the auxiliary image set in a targeted manner for the first model and the second model, and the training data may include labeled images or unlabeled images. Through this model structure and the self-supervised and semi-supervised training mode, the trained first model can achieve better performance, so that when the trained first model is used to execute the target task and recognize images in the target field, a better image recognition result can be obtained, thereby improving the accuracy of image recognition.
Drawings
In order to illustrate the technical solutions of the embodiments of the present application more clearly, the drawings needed for describing the embodiments are briefly introduced below. It is obvious that the drawings in the following description show only some embodiments of the present application, and other drawings can be obtained by those skilled in the art from these drawings without creative effort.
Fig. 1 is a schematic structural diagram of an image processing system according to an embodiment of the present application;
fig. 2a is a scene schematic diagram of a sample image set provided in an embodiment of the present application;
FIG. 2b is a schematic diagram of a scenario for training a first model according to an embodiment of the present application;
FIG. 2c is a schematic diagram of a scenario for training a second model according to an embodiment of the present application;
FIG. 2d is a schematic diagram of another scenario for training a second model according to an embodiment of the present application;
fig. 2e is a schematic view of a scene of image processing provided in an embodiment of the present application;
FIG. 2f is a schematic view of another image processing scenario provided in an embodiment of the present application;
FIG. 3 is a flow chart of an image processing method according to an embodiment of the present application;
FIG. 4a is a schematic flowchart of an image segmentation process provided in an embodiment of the present application;
FIG. 4b is a schematic flowchart of another image segmentation process provided in the embodiment of the present application;
FIG. 4c is a schematic diagram of a process for training a first model according to an embodiment of the present application;
FIG. 5 is a flow chart of another image processing method provided in the embodiments of the present application;
FIG. 6a is a schematic flowchart of a process for training a first auxiliary model according to an embodiment of the present application;
FIG. 6b is a schematic flowchart of a process for training a second auxiliary model according to an embodiment of the present application;
fig. 7 is a schematic structural diagram of an image processing apparatus according to an embodiment of the present application;
fig. 8 is a schematic structural diagram of a computer device according to an embodiment of the present application.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present application. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present application, as detailed in the appended claims.
Artificial Intelligence (AI) is a theory, method, technique and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge, and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive branch of computer science that attempts to understand the essence of intelligence and produce a new kind of intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the capabilities of perception, reasoning and decision-making.
Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, including both hardware-level technology and software-level technology. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. Artificial intelligence software technology mainly includes computer vision technology, speech processing technology, natural language processing technology, and machine learning/deep learning.
The scheme provided by the embodiment of the application belongs to a computer vision technology and a deep learning technology belonging to the field of artificial intelligence.
Computer Vision (CV) technology is a science that studies how to make machines "see"; it refers to using cameras and computers in place of human eyes to perform machine vision tasks such as recognition, tracking and measurement on a target, and to perform further image processing so that the processed image becomes more suitable for human eyes to observe or for transmission to an instrument for detection. As a scientific discipline, computer vision studies related theories and techniques in an attempt to build artificial intelligence systems that can capture information from images or multidimensional data. Computer vision technologies generally include image processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D technologies, virtual reality, augmented reality, simultaneous localization and mapping, and other technologies, and also include common biometric technologies such as face recognition and fingerprint recognition.
Machine Learning (ML) and Deep Learning (DL) form a multi-disciplinary field involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory and other subjects. The field specializes in studying how a computer can simulate or implement human learning behaviors to acquire new knowledge or skills and reorganize existing knowledge structures so as to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental way to endow computers with intelligence, and it is applied in all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from instruction.
The present application mainly involves computer vision technology for performing image processing on the images in the sample image set, specifically performing image conversion processing on the sample image set to obtain an auxiliary image set, and involves deep learning technology for training the first model and the second model. Subsequently, when both the first model and the second model meet the model convergence condition, the trained first model is used as the target model corresponding to the target task, and the target model can be called to recognize an image to be processed, so as to obtain the image category of the image to be processed in the target field.
The present application can be applied to the following scenarios: a sample image set for model training may be obtained, the first model and the second model may be trained with the sample image set and an auxiliary image set obtained by performing image conversion processing on the sample image set, and the trained first model may be used as the target model. The target model may be, for example, a model for recognizing images in the portrait field, a model for recognizing cancer cells in medical images, or a model for recognizing animal types in animal-field images.
If the target model is a model for recognizing images in the portrait field and the image to be processed is an image in the portrait field, then when the image to be processed needs to be recognized, the target model obtained by training in the present application can be called to recognize the image to be processed and determine the image category of the image to be processed in the portrait field, for example, whether the image to be processed is a non-mainstream image. Subsequently, in a video auditing system or a video copyright identification system, if the image to be processed is identified as a non-mainstream image, auditors can be reminded to review it with particular attention, or the image can be intercepted directly.
If the target model is a model for recognizing cancer cells in medical images and the image to be processed is an image in the medical field, then when the image to be processed needs to be recognized, the target model trained in the present application can be called to recognize the image to be processed and determine the image category of the image to be processed in the medical field, for example, whether the image to be processed contains cancer cells.
If the target model is a model for recognizing animal types in animal-field images and the image to be processed is an image in the animal field, then when the image to be processed needs to be recognized, the target model trained in the present application can be called to recognize the image to be processed and determine the image category of the image to be processed in the animal field, for example, which animal type (cat, dog, rabbit, etc.) the image to be processed shows.
Referring to fig. 1, fig. 1 is a schematic structural diagram of an image processing system according to an embodiment of the present disclosure. The image processing system includes a server 140 and a terminal device cluster, where the terminal device cluster may include: terminal device 110, terminal device 120, ..., terminal device 130, and the like. The terminal device cluster and the server 140 may be directly or indirectly connected through wired or wireless communication, and the application is not limited herein.
The server 140 shown in fig. 1 may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a Network service, cloud communication, a middleware service, a domain name service, a security service, a CDN (Content Delivery Network), a big data and artificial intelligence platform, and the like.
The terminal device 110, the terminal device 120, the terminal device 130, and the like shown in fig. 1 may be a mobile phone, a tablet computer, a notebook computer, a palmtop computer, a Mobile Internet Device (MID), a vehicle, a roadside device, an aircraft, or a wearable device such as a smart watch, a smart bracelet or a pedometer, or any other intelligent device having an image processing function.
Taking the terminal device 110 as an example, the terminal device 110 obtains a sample image set for model training, where the sample image set includes a first sample image set and a second sample image set, the first sample image set includes a first sample image and a class label of the first sample image, the second sample image set includes a second sample image, and the first sample image and the second sample image are both images in a target field corresponding to the target task. Terminal device 110 may send the sample image set to server 140. The server 140 performs image conversion processing on the sample image set to obtain an auxiliary image set; the auxiliary image set comprises a first augmented image set, the first augmented image set comprises a first augmented image and a category label of the first augmented image, the first augmented image is obtained after image enhancement processing is carried out on the first sample image, and the category label of the first augmented image is consistent with the category label of the first sample image. The server 140 selects a training data set of the target task from the sample image set and the auxiliary image set, and trains the first model by using the training data set of the target task; the training data set of the target task comprises at least a first sample image set and a first augmented image set; the server 140 selects a training data set of an auxiliary task of the target task from the sample image set and the auxiliary image set, and trains a second model by adopting the training data set of the auxiliary task; the second model shares the model structure and the model parameters of the feature extractor with the first model; and determining the trained first model as a target model for executing the target task, wherein the target model can be used for recognizing images in the target field.
The server 140 transmits the trained target model to the terminal device 110. Subsequently, when the terminal device 110 obtains an image identification request submitted by a user, the image identification request carries an image to be processed in the target field, and then the terminal device 110 may call the target model to perform image identification on the image to be processed, so as to determine the image category of the image to be processed.
In a possible implementation manner, when the terminal device 110 obtains an image identification request submitted by a user, the image identification request carries an image to be processed in a target field, then, the terminal device 110 sends the image to be processed to the server 140, and the server 140 calls a target model to perform image identification on the image to be processed, so as to determine an image category of the image to be processed. Finally, the server 140 transmits the image category of the image to be processed to the terminal device 110. Subsequently, the terminal device 110 may display the image category of the image to be processed on the image recognition interface.
It should be noted that the operation steps of performing image conversion processing on the sample image set to obtain the auxiliary image set, selecting a training data set of the target task from the sample image set and the auxiliary image set and training the first model with it, selecting a training data set of the auxiliary task of the target task from the sample image set and the auxiliary image set and training the second model with it, and obtaining the trained target model are not necessarily all performed by the server 140; they may also be performed by terminal device 110 or any terminal device in the terminal device cluster.
It is to be understood that the system architecture diagram described in this embodiment of the present application is intended to illustrate the technical solution of the embodiment more clearly and does not constitute a limitation on the technical solution provided in the embodiment. As those of ordinary skill in the art will appreciate, with the evolution of the system architecture and the emergence of new service scenarios, the technical solution provided in this embodiment of the present application is equally applicable to similar technical problems.
In the embodiment of the present application, the first model refers to a model that needs to be used to execute a target task (or called a main task), that is, the trained first model needs to be used to execute the target task; for example, if the target task is a face recognition task, then the trained first model is the model used to perform face recognition. The second model refers to an auxiliary model used for assisting the training of the first model, and the first model and the second model share the model structure and the model parameters of the feature extractor, which can enable the first model to simultaneously execute an auxiliary task; for example, the target task is portrait recognition, and the auxiliary task is to recognize the position of the left eye in the portrait; the second model may be an auxiliary model (e.g., a location-based classification model) capable of identifying the location of the object in the image, and the model structure and model parameters of the feature extractor are updated through training of the second model; and the feature extractor of the trained first model has the capability of extracting the features for executing the target task and the capability of extracting the features for executing the auxiliary task, so that the trained first model can execute the auxiliary task while executing the target task, and the capability of the first model is improved.
The second model is a model that assists the first model in training, and thus the second model may also be referred to as an auxiliary model. In the embodiment of the present application, the first model and the second model may both refer to a model including a feature extractor, and the first model and the second model share a model structure and model parameters of the feature extractor. It should be particularly noted that the sharing of the model structure of the feature extractor by the first model and the second model means: the model structures of the feature extractor in the first model and the feature extractor in the second model are the same and consistent, and when the model structure of any one of the feature extractor in the first model and the feature extractor in the second model changes, the model structure of the other one changes the same. The first model and the second model share the model parameters of the feature extractor, namely: the feature extractor in the first model and the feature extractor in the second model keep the same model parameters, and when the first model completes one or more times of training, the model parameters of the feature extractor in the first model are updated, so that the model parameters of the feature extractor in the second model are correspondingly updated; similarly, when the second model completes one or more training, the model parameters of the feature extractor in the second model are updated, and then the model parameters of the feature extractor in the first model are updated accordingly.
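The parameter sharing described above can be made concrete with a short sketch. The following is a minimal, hypothetical PyTorch-style illustration (not taken from the patent text): both models are built around the same backbone module instance, so a parameter update made while training either model is immediately reflected in the other. All class and variable names here are assumptions introduced for illustration only.

```python
import torch.nn as nn

class Backbone(nn.Module):
    """Stand-in feature extractor shared by the first and second models."""
    def __init__(self, feat_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, feat_dim),
        )

    def forward(self, x):
        return self.net(x)

shared_extractor = Backbone()
# First model: shared extractor + first classifier (target task).
first_model = nn.Sequential(shared_extractor, nn.Linear(128, 3))
# Second model: the *same* extractor object + second classifier (auxiliary task).
second_model = nn.Sequential(shared_extractor, nn.Linear(128, 9))
# Because both models reference one Backbone instance, updating its parameters
# through either model keeps the two feature extractors identical at all times.
```

Holding a single shared module instance is only one possible way to keep the two feature extractors consistent; the patent text equally allows two synchronized copies whose parameters are updated in lockstep.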
It will be appreciated that the number and structure of the second models are determined by the actual auxiliary task. For example, in the above example, if the auxiliary task is to identify the location of an object in the image, the second model may be a first auxiliary model, which may be a location-based classification model. As another example, if the auxiliary task is to identify feature similarities between images, the second model may be a second auxiliary model, which may be a discriminant model based on feature similarity. As yet another example, if the auxiliary task requires both recognizing the position of an object in the images and recognizing feature similarities between images, the second model may include both the first auxiliary model and the second auxiliary model.
The image processing flow of the embodiment of the present application will be described in detail below with reference to the drawings.
Referring to fig. 2a, fig. 2a is a scene schematic diagram of a sample image set according to an embodiment of the present disclosure. A user can collect sample data (the sample image set) for model training online in a data-crawling manner. The sample image set may include a first sample image set and a second sample image set, where the first sample image set includes a first sample image and a class label of the first sample image, the second sample image set includes a second sample image, and the first sample image and the second sample image are both images in the target field corresponding to the target task. It should be noted that the first sample image refers to an image with a category label, and the number of first sample images in the first sample image set may be one or more; the number is not limited in this application. The second sample image refers to an image without a category label, and likewise the number of second sample images in the second sample image set may be one or more; the number is not limited in this application either. For example, as shown in fig. 2a, the target field is the portrait field, and the target task is to identify the image category of an image belonging to the portrait field, specifically to identify whether the image is a mainstream image or a non-mainstream image, where a non-mainstream image refers to a portrait with features such as an exaggerated hairstyle, garishly colored hair, or a decadent style. The first sample image set may be an image set including mainstream images and non-mainstream images (both carrying category labels), and the second sample image set may be an unlabeled portrait set.
In addition to images in the target field, the sample image set may also include images in a non-target field, which are referred to as third sample images. For example, in the above example the target field is the portrait field, so a third sample image may be a non-portrait image, such as an object, a cartoon, or a landscape. That is, the sample image set may further include a third sample image set, where the third sample image set includes a third sample image and a class label of the third sample image (as in the above example, the class label of a non-portrait image is "non-portrait image"). Similarly, the number of third sample images in the third sample image set may be one or more, and the number is not limited in this application.
In one possible implementation, the computer device may perform image conversion processing on the sample image set to obtain an auxiliary image set. The auxiliary image set at least includes a first augmented image set, the first augmented image set includes a first augmented image and a category label of the first augmented image, the first augmented image is obtained by performing image enhancement processing on the first sample image, and the category label of the first augmented image is consistent with the category label of the first sample image. Specifically, the image conversion processing includes, but is not limited to, image enhancement processing and image segmentation processing. The image enhancement processing may be, for example: random cropping; horizontally flipping the image; rotating the image; adding noise; modifying the image brightness; modifying the image saturation; covering a small portion of the image content; or Gaussian blur. Furthermore, these operations may be combined, for example first rotating the image and then adding Gaussian blur. The image segmentation processing may be, for example, splitting the image into three, four, or nine parts.
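As an illustration of the image enhancement operations listed above, the following sketch builds an augmentation pipeline with torchvision; the specific operations and parameter values are assumptions, since the patent does not fix them. The class label of each augmented image is simply inherited from its source image.

```python
from torchvision import transforms

# Hypothetical augmentation pipeline covering several of the listed operations.
augment = transforms.Compose([
    transforms.RandomResizedCrop(224),                       # random cropping
    transforms.RandomHorizontalFlip(),                       # horizontal flipping
    transforms.RandomRotation(degrees=15),                   # image rotation
    transforms.ColorJitter(brightness=0.3, saturation=0.3),  # brightness / saturation changes
    transforms.GaussianBlur(kernel_size=5),                  # Gaussian blur
    transforms.ToTensor(),
    transforms.RandomErasing(p=0.5, scale=(0.02, 0.1)),      # cover a small part of the content
])

def make_first_augmented_sample(pil_image, class_label):
    """The augmented image keeps the class label of its first sample image."""
    return augment(pil_image), class_label
```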
Referring to fig. 2b, fig. 2b is a schematic view of a scenario for training a first model according to an embodiment of the present disclosure. In one possible implementation, the computer device selects a training data set of the target task from the sample image set and the auxiliary image set, and trains the first model using this training data set. The model structure of the first model can be seen in fig. 2b: the first model includes a first feature extractor and a first classifier connected to the first feature extractor. The training data set of the target task includes at least the first sample image set and the first augmented image set. For example, as shown in fig. 2b, the first sample images of the first sample image set and the first augmented images of the first augmented image set may be input into the first model as its training data. If the first sample image is an image in the portrait field, its category label is either "mainstream image" or "non-mainstream image"; after image enhancement processing is performed on the mainstream images and the non-mainstream images, the first augmented image set is obtained, each first augmented image also carries a category label, and the category label of a first augmented image is consistent with that of its first sample image. For example, if the category label carried by a first sample image is "mainstream image", the category label carried by the corresponding first augmented image is also "mainstream image"; if the category label carried by a first sample image is "non-mainstream image", the category label carried by the corresponding first augmented image is also "non-mainstream image". Since each image in the first sample image set and the first augmented image set carries a label (mainstream or non-mainstream), these images are passed through the first model (including the first feature extractor and the fully connected layer FC1, where FC1 is the first classifier). The first model learns the class labels of the images, so the classification result output by the first model is either "non-mainstream" or "mainstream". The first model is trained with the first sample image set and the first augmented image set as described above until the first model reaches a convergence condition, at which point training of the first model is stopped. The first model may be considered to have reached the convergence condition in any of the following cases: the loss function of the first model is smaller than a set threshold; the loss function of the first model has become stable and no longer changes as the training process continues; all training data used to train the first model (e.g., each image in the first sample image set and the first augmented image set) has participated in training; or the number of training iterations of the first model reaches a reference training-count threshold (where the reference training-count threshold is much smaller than the training threshold usually required when training a model); and so on.
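A minimal sketch of this supervised training step is given below (PyTorch-style, illustrative only). The three-class label space of mainstream / non-mainstream / non-portrait follows the examples in this description, while the backbone architecture, optimizer and learning rate are assumptions, not values taken from the patent.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

feature_extractor = nn.Sequential(               # first feature extractor (stand-in)
    nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
)
fc1 = nn.Linear(32, 3)                            # first classifier FC1: 3 example classes
first_model = nn.Sequential(feature_extractor, fc1)
optimizer = torch.optim.Adam(first_model.parameters(), lr=1e-4)

def first_model_step(images, class_labels):
    """One training step on labeled first sample / first augmented images."""
    logits = first_model(images)                  # predicted class scores
    loss = F.cross_entropy(logits, class_labels)  # gap between prediction and class label
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```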
In one possible implementation, the computer device selects a training data set of an auxiliary task of the target task from the sample image set and the auxiliary image set, and trains the second model using the training data set of the auxiliary task. Referring to fig. 2c, fig. 2c is a schematic view of a scenario for training a second model according to an embodiment of the present disclosure. In one embodiment, the auxiliary task may be to identify the position of an object in the image; in this case the model structure of the second model may be as shown in fig. 2c, and the second model may include a second feature extractor and a second classifier connected to the second feature extractor. That the second model shares the model structure and model parameters of the feature extractor with the first model specifically means that the first feature extractor and the second feature extractor share a model structure and model parameters. In this embodiment, the auxiliary image set may further include a first cut image set and a second cut image set, where the first cut image set includes a first cut image and a position label of the first cut image, and the first cut image is obtained by performing image segmentation processing on the first sample image; the second cut image set includes a second cut image and a position label of the second cut image, and the second cut image is obtained by performing image segmentation processing on the second sample image. The training data set of the auxiliary task here then comprises the first cut image set and the second cut image set. These images are passed through the second model (including the second feature extractor and the fully connected layer FC2, where FC2 is the second classifier), and the classification result output by the second model is the position label of the image block. Similarly, when the second model reaches a convergence condition, training of the second model is stopped. The second model may be considered to have reached the convergence condition in any of the following cases: the loss function of the second model is smaller than a set threshold; the loss function of the second model has become stable and no longer changes as the training process continues; all training data used to train the second model has participated in training; or the number of training iterations of the second model reaches a reference training-count threshold (where the reference training-count threshold is much smaller than the training threshold usually required when training a model); and so on.
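The following sketch illustrates this kind of auxiliary position-classification step under the same assumptions (PyTorch-style, with hypothetical names such as shared_extractor, fc2 and optimizer): each cut image block is labeled with its position index, and the second classifier FC2 predicts that position from features produced by the shared extractor.

```python
import torch.nn.functional as F

def position_task_step(image_blocks, position_labels, shared_extractor, fc2, optimizer):
    """image_blocks: cut blocks from first/second sample images;
    position_labels: e.g. 0/1/2 for top/middle/bottom in a three-way cut."""
    features = shared_extractor(image_blocks)        # extractor shared with the first model
    logits = fc2(features)                           # second classifier FC2
    loss = F.cross_entropy(logits, position_labels)  # gap between predicted and true position
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```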
In another embodiment, the auxiliary task may be to identify feature similarities between images. Referring to fig. 2d, fig. 2d is a schematic view of another scenario for training a second model according to an embodiment of the present disclosure. In this embodiment, the model structure of the second model can be seen in fig. 2d: the second model may include two third feature extractors and a similarity discriminator connected to both third feature extractors. A twin (Siamese) network model is used here, and the model structures and model parameters of the two third feature extractors are identical. That the second model shares the model structure and model parameters of the feature extractor with the first model specifically means that the first feature extractor and the two third feature extractors share a model structure and model parameters. In this embodiment, the auxiliary image set may include: a first cut image set, a second cut image set, a first augmented image set, and a second augmented image set. The first cut image set includes a first cut image and a position label of the first cut image, and the first cut image is obtained by performing image segmentation processing on the first sample image; the second cut image set includes a second cut image and a position label of the second cut image, and the second cut image is obtained by performing image segmentation processing on the second sample image; the second augmented image set includes a second augmented image, which is obtained by performing image enhancement processing on the second sample image. The training data set of the auxiliary task here then comprises the first cut image set, the second cut image set, the first augmented image set, the second augmented image set, and the sample image set. The classification result output by the second model is the feature similarity between images. Similarly, when the second model reaches a convergence condition, training of the second model is stopped. The second model may be considered to have reached the convergence condition in any of the following cases: the loss function of the second model is smaller than a set threshold; the loss function of the second model has become stable and no longer changes as the training process continues; all training data used to train the second model has participated in training; or the number of training iterations of the second model reaches a reference training-count threshold (where the reference training-count threshold is much smaller than the training threshold usually required when training a model); and so on.
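A hedged sketch of such a twin-network second model is shown below. The text does not specify the form of the similarity discriminator, so a small fully connected head over the absolute feature difference is used purely as an illustration; the pairing and loss in the trailing comment are likewise assumptions.

```python
import torch
import torch.nn as nn

class SimilarityModel(nn.Module):
    """Twin branches share one feature extractor; a discriminator scores similarity."""
    def __init__(self, shared_extractor, feat_dim=128):
        super().__init__()
        self.extractor = shared_extractor            # same instance as the first model's extractor
        self.discriminator = nn.Sequential(
            nn.Linear(feat_dim, 64), nn.ReLU(),
            nn.Linear(64, 1),                        # similarity score (logit)
        )

    def forward(self, image_a, image_b):
        feat_a = self.extractor(image_a)             # third feature extractor, branch 1
        feat_b = self.extractor(image_b)             # third feature extractor, branch 2
        return self.discriminator(torch.abs(feat_a - feat_b))

# Training pairs could be, for example, (sample image, its augmented or cut version)
# labeled "similar" and pairs drawn from different source images labeled "dissimilar",
# trained with a binary cross-entropy loss (an assumption, not stated in the patent).
```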
In the embodiment of the present application, one joint training cycle of the first model and the second model includes: first, training the first model and updating the model parameters of the first feature extractor; second, synchronously updating the model parameters of the second feature extractor based on the update of the model parameters of the first feature extractor; third, training the second model and updating the model parameters of the second feature extractor; and fourth, synchronously updating the model parameters of the first feature extractor in turn, based on the update of the model parameters of the second feature extractor. This completes one joint training pass; such a pass can be called one cycle, and the first model and the second model can be jointly and alternately trained by repeating this cycle. Of course, in one joint training pass the second model may also be trained first and the first model afterwards; the present application does not limit the training order. Regardless of which model is trained first, the feature extractor used when training the other model is the feature extractor of the previously trained model, so that the feature extractor can learn the capabilities required by both models simultaneously. After both the first model and the second model reach the convergence condition, training of the first model and the second model is finished.
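Under the assumption that the shared extractor is a single module instance (as in the earlier sketches), one such alternating cycle can be written as below. Because the parameters are physically shared, no explicit copying step is needed, which is one simple way, not the only way, to realize the synchronous updates described above; the function names are illustrative.

```python
def joint_training_cycle(target_batch, auxiliary_batch,
                         first_model_step, second_model_step):
    """One illustrative joint-training cycle: train the first model, then the
    second model. The shared feature extractor is updated by both steps, so each
    step implicitly synchronizes the other model's extractor parameters."""
    target_loss = first_model_step(*target_batch)         # steps 1 and 2
    auxiliary_loss = second_model_step(*auxiliary_batch)  # steps 3 and 4
    return target_loss, auxiliary_loss

# Repeating this cycle until both models satisfy their convergence conditions
# corresponds to the joint, alternating training described above.
```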
Subsequently, the trained first model can be used as the target model corresponding to the target task, where the target model is used for recognizing images in the target field. For example, please refer to fig. 2e, which is a schematic view of an image processing scene according to an embodiment of the present disclosure. If the target field is the portrait field, the method can be applied to a video auditing system and a video copyright identification system for up-front filtering decisions on non-mainstream videos. Specifically, the method can be used to identify whether a target image or video is a non-mainstream image or video, automatically assign a distinguishing label, and remind auditors to review it with particular attention or intercept it directly. As shown in the left image (10) in fig. 2e, if a user selects an image to be processed in an album and uploads the selected image to a website or platform, the website or platform can identify the image category of the image to be processed with the scheme of the present application; specifically, it can determine whether the image to be processed belongs to the non-mainstream images, where a non-mainstream image refers to a portrait with features such as an exaggerated hairstyle, garishly colored hair, or a decadent style. As shown in the right image (20) of fig. 2e, if the image to be processed is a non-mainstream image, the system may display a prompt popup that includes an "intercept" control and an "exit" control. The user (a website operation and maintenance person) can click the "intercept" control, and the system then intercepts and filters the image to be processed. Through this scheme, a harmonious and healthy website or platform can be maintained.
Of course, besides the user manually choosing to intercept and filter non-mainstream images, the system can also perform the filtering automatically after recognizing a non-mainstream image. Referring to fig. 2f, fig. 2f is a schematic view of another image processing scene provided in the embodiment of the present application. As shown in the left image (30) in fig. 2f, the user uploads the image to be processed; then, according to the scheme of the present application, if the image to be processed is identified as a non-mainstream image, the system automatically filters it and displays a prompt interface, where the prompt interface may include a prompt message such as: "non-mainstream image, automatically filtered". With this scheme, the system can automatically identify the image category of the image to be processed and automatically filter non-mainstream images, which reduces the workload of website operation and maintenance personnel and helps build a good network atmosphere.
With the image processing method provided by this scheme, the second model used for executing the auxiliary task assists the training of the first model used for executing the target task. As a result, a better image recognition result can be obtained when the finally trained target model (the trained first model) executes the target task to recognize images in the target field, and in every business scenario actually involving image recognition, the target model provided by this scheme can recognize images better, improving the accuracy of image recognition.
Referring to fig. 3, fig. 3 is a schematic flowchart of an image processing method according to an embodiment of the present application. The method is applied to a computer device; as shown in fig. 3, the image processing method may include steps S310 to S350, described as follows:
step S310: a sample image set is obtained for model training, the sample image set including a first sample image set and a second sample image set.
In specific implementation, the first sample image set includes a first sample image and a class label of the first sample image, the second sample image set includes a second sample image, and the first sample image and the second sample image are both images in the target field corresponding to the target task. It should be noted that the first sample image refers to an image with a category label, and the number of first sample images in the first sample image set may be one or more; the number is not limited in this application. The second sample image refers to an image without a category label, and likewise the number of second sample images in the second sample image set may be one or more; the number is not limited in this application either. For example, the target field is the portrait field, and the target task is to identify the image category of an image belonging to the portrait field, specifically to identify whether the image is a mainstream image or a non-mainstream image, where a non-mainstream image refers to a portrait with features such as an exaggerated hairstyle, garishly colored hair, or a decadent style. The first sample image set may be an image set including mainstream images and non-mainstream images (both carrying labels), and the second sample image set may be an unlabeled portrait set.
In one possible implementation, the computer device may obtain the sample image set for model training by crawling online data. Specifically, the data crawling may include: (1) crawling non-mainstream image data by means of a web crawler using various keywords related to human faces, hair, lips and the like, from websites with public images such as Google, Baidu and Yahoo; (2) parsing and crawling subsets of several large image classification data sets, for example Google's large open-source data set OpenImage, and screening out the data related to face images.
Further, after the sample image set is crawled online, the sample data can be cleaned. Since data cleaning can be labor-intensive, the data cleaning operation does not need to process all of the collected training data; uncleaned data is treated as unlabeled data. Data cleaning (data cleansing) is a process of reviewing and verifying data, and aims to remove duplicate information, correct existing errors, and ensure data consistency.
The target field may include the portrait field, the medical field, the animal field, the natural image field, and the like. If the target field is the portrait field, the target task may be to identify the image category of an image in the portrait field, specifically to identify whether the image is a mainstream image or a non-mainstream image.
The category label carried by the first sample image may be "mainstream image" or "non-mainstream image", where a non-mainstream image refers to a portrait with features such as an exaggerated hairstyle, garishly colored hair, or a decadent style. The first sample image set may be an image set including mainstream images and non-mainstream images (both carrying category labels), and the second sample image set may be an unlabeled portrait set. Of course, in addition to images in the target field, the sample image set may include images in non-target fields, such as non-portrait images (animals, scenery, etc.). The images in the non-target field may be referred to as the third sample image set. The third sample image set includes a third sample image and a class label of the third sample image, and the third sample image is an image in a non-target field relative to the target task. The category label of the third sample image may then be "non-portrait image".
For example, in the sample image set, most of the data is unlabeled portrait data, referred to as set U (the second sample image set). Among the remaining labeled data, mainstream images (referred to as set Lz) account for the majority and non-mainstream images (referred to as set Lf) account for the minority; set Lz and set Lf together form the first sample image set. In addition, there is a large amount of online image data that is not portrait data, such as objects, cartoons and landscapes; such data is called set N (the third sample image set) and also needs to be prepared.
Step S320: and carrying out image conversion processing on the sample image set to obtain an auxiliary image set.
In specific implementation, the auxiliary image set includes a first augmented image set, the first augmented image set includes a first augmented image and a category label of the first augmented image, the first augmented image is obtained after image enhancement processing is performed on the first sample image, and the category label of the first augmented image is consistent with the category label of the first sample image. The computer device may perform image conversion processing on the sample image set to obtain the auxiliary image set, where the image conversion processing may include image enhancement processing and image segmentation processing. The auxiliary task is a task constructed to assist the target task; it is a task that the model learns automatically, and new labels (category labels or position labels) can be generated automatically for the images in the sample image set. For example, the auxiliary task may be to recognize the position of the left eye in a portrait, or to determine the feature similarity between images, and so on.
Specifically, the image enhancement processing may include: (1) random cropping; (2) horizontally flipping the image; (3) rotating the image; (4) adding noise; (5) modifying the image brightness; (6) modifying the image saturation; (7) covering a small part of the image content; (8) Gaussian blur. Of course, the above image enhancement schemes can also be combined randomly, for example first rotating the image and then adding Gaussian blur. The image enhancement processing may also use various kinds of blurring (not limited to Gaussian blur, for example smart blur and fast blur); noise may be added in the form of salt-and-pepper noise, motion blur, and the like; the image may also be covered with a pattern of arbitrary shape; or the image may be given a stylized filtering using various filters, such as gradient imaging of the image.
Specifically, for the image segmentation processing: compared with images of ordinary people (mainstream images), non-mainstream images have strong distinguishing features, i.e., unusual hairstyles, rich hair colors, and non-mainstream makeup (e.g., black lipstick, dark eye shadow, ear studs, nose rings, etc.). These features can be exploited to further augment the data.
As shown in fig. 4a, fig. 4a is a schematic flowchart of an image segmentation process provided in this embodiment. Each portrait image in the training images (sets U, Lz, Lf) may be segmented uniformly from top to bottom into three parts (each part is referred to as an image block), and the corresponding position labels then include 3 labels: upper, middle, lower. Of course, instead of uniformly dividing the image into 3 parts, the segmentation can also be extended to 4 parts: the image is halved both horizontally and vertically so that 4 image blocks are obtained, and the corresponding position labels then include 4 labels: top left, bottom left, top right, bottom right. Fig. 4b is a schematic flowchart of another image segmentation process provided in this embodiment of the present application. In addition, the segmentation can be extended to 9 parts, i.e., the image is trisected both horizontally and vertically and thus cut into 9 blocks in a 3x3 grid, so that the corresponding position labels include 9 labels, each label corresponding to a particular cell of the 3x3 grid.
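A small sketch of the three-way, four-way and nine-way cuts described above is given below. The grid sizes follow the text; everything else (function name, label encoding) is an assumption for illustration.

```python
from PIL import Image

def split_image(img: Image.Image, mode: str = "3"):
    """Cut an image into position-labeled blocks.
    mode "3": three horizontal strips (labels 0/1/2 = top/middle/bottom);
    mode "4": 2x2 grid (4 labels); mode "9": 3x3 grid (9 labels)."""
    w, h = img.size
    if mode == "3":
        rows, cols = 3, 1
    elif mode == "4":
        rows, cols = 2, 2
    else:  # "9"
        rows, cols = 3, 3
    blocks = []
    for r in range(rows):
        for c in range(cols):
            box = (c * w // cols, r * h // rows,
                   (c + 1) * w // cols, (r + 1) * h // rows)
            blocks.append((img.crop(box), r * cols + c))  # (image block, position label)
    return blocks
```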
Finally, image segmentation processing is performed on the images in each image set, obtaining all of the labeled image blocks for U, Lz and Lf: the set of all image blocks obtained by segmenting the images in set U, the set of all image blocks obtained by segmenting the images in set Lz, and the set of all image blocks obtained by segmenting the images in set Lf.
It should be noted that, if a third sample image set N is included in the sample image set, then since the set N is a non-human image set whose image data is already sufficiently abundant, the set N does not need to undergo the image enhancement processing or the image segmentation processing.
Step S330: and selecting a training data set of the target task from the sample image set and the auxiliary image set, and training a first model by adopting the training data set of the target task.
In particular, the training data set of the target task includes at least a first sample image set and a first augmented image set. In addition, a third sample image set is also included in the sample image set, and the third sample image set includes a third sample image and a class label of the third sample image, and the third sample image is an image in a non-target domain corresponding to the target task. In general, the training data set for the target task includes: a first sample image set, a first augmented image set, and a third sample image set. And the first model comprises a first feature extractor and a first classifier connected with the first feature extractor.
In a possible implementation manner, the computer device calls a first model to identify a first sample image in a first sample image set, so as to obtain a category prediction label corresponding to the first sample image; the computer equipment calls a first model to identify a first augmented image in the first augmented image set to obtain a category prediction label corresponding to the first augmented image; the computer equipment calls the first model to identify a third sample image in the third sample image set to obtain a category prediction label corresponding to the third sample image; the computer device adjusts model parameters of the first model based on a difference between the class label and the class prediction label of the first sample image, a difference between the class label and the class prediction label of the first augmented image, and a difference between the class label and the class prediction label of the third sample image.
Specifically, the category labels are: non-human image, mainstream image, non-mainstream image. The category label corresponding to each image in the third sample image set is non-human image, and the category label corresponding to each image in the first sample image set is mainstream image or non-mainstream image. Since the image enhancement processing does not change the label of an image, the category label of each image in the first augmented image set is also mainstream image or non-mainstream image, and the labels carried by the images in the first augmented image set correspond one-to-one to the labels carried by the images in the first sample image set. For example, if the label carried by a certain image in the first sample image set is mainstream image, the label carried by the corresponding image in the first augmented image set is also mainstream image. Of course, in addition to mainstream image or non-mainstream image, the category label corresponding to each image in the first sample image set may include additional label types, such as "wedding", "ancient costume", and the like. Since the pictures of these two categories ("wedding", "ancient costume") are close in appearance to the non-mainstream ("shamate") images, they can be used as additional supervision information to improve the accuracy with which the classification network classifies the images. In addition, the added label types can be used for identifying wedding images and ancient-costume images among network images, as an additional output of the classification task, thereby improving the value of the classification task.
For example, please refer to fig. 4c, which is a schematic flowchart of a process for training the first model according to an embodiment of the present disclosure. The network shown in fig. 4c may be called network A. This classification network A, consisting of one feature extractor G (the first feature extractor) plus a final fully connected layer FC1 (the first classifier), learns the labels of the images. Since a classifier is being trained, the loss function used after the fully connected layer FC1 may be any function commonly used for classification, such as softmax, weighted cross entropy, or a large-margin loss (center loss, CosFace, ArcFace), and so on. The feature extractor may be composed of any high-performance neural network structure, such as a deep residual network (ResNet) or a densely connected network (DenseNet). Of course, the feature extractor may also be a Visual Geometry Group network (VGG), AlexNet, SE-ResNet, ResNeXt, etc. It should be noted that the network A shown in fig. 4c corresponds to the main task (target task) in model training.
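The sketch below illustrates one possible realization of network A, assuming a torchvision ResNet backbone as the feature extractor G and a three-class output (non-human / mainstream / non-mainstream); the class names, helper names and hyperparameters are assumptions for illustration, not the patent's reference implementation.

```python
# Network A: feature extractor G + fully connected layer FC1, trained with a softmax cross-entropy loss.
import torch
import torch.nn as nn
from torchvision import models

class FeatureExtractorG(nn.Module):
    def __init__(self):
        super().__init__()
        backbone = models.resnet50(weights=None)   # any high-performance backbone would do
        # drop the original classification head, keep the globally pooled features
        self.body = nn.Sequential(*list(backbone.children())[:-1])
        self.out_dim = backbone.fc.in_features

    def forward(self, x):
        return self.body(x).flatten(1)

class NetworkA(nn.Module):
    def __init__(self, num_classes=3):
        super().__init__()
        self.G = FeatureExtractorG()               # first feature extractor
        self.fc1 = nn.Linear(self.G.out_dim, num_classes)  # first classifier (FC1)

    def forward(self, x):
        return self.fc1(self.G(x))

def train_step_a(net_a, images, labels, optimizer):
    # images/labels are drawn from the first sample set, the first augmented set
    # and the third (non-human) sample set
    optimizer.zero_grad()
    loss = nn.functional.cross_entropy(net_a(images), labels)  # softmax loss
    loss.backward()
    optimizer.step()
    return loss.item()
```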
Step S340: and selecting a training data set of an auxiliary task of the target task from the sample image set and the auxiliary image set, and training a second model by adopting the training data set of the auxiliary task.
In particular, the second model includes a first auxiliary model and a second auxiliary model. The first auxiliary model may be a position-based classification model, and the second auxiliary model may be a discriminant model based on feature similarity. The first auxiliary model comprises a second feature extractor and a second classifier connected with the second feature extractor, and the first feature extractor shares the model structure and the model parameters with the second feature extractor. The second auxiliary model comprises two third feature extractors and a similarity discriminator, and the first feature extractor shares the model structure and the model parameters with the third feature extractors. Therefore, the second model includes a second feature extractor with a second classifier connected to it, and two third feature extractors with a similarity discriminator respectively connected to them; the first feature extractor, the second feature extractor and the third feature extractors share the model structure and the model parameters.
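Continuing the sketch above, sharing the model structure and parameters of the feature extractor can be realized either by reusing a single module instance in all networks or by copying state dictionaries between separate instances; this is an illustrative implementation choice, not mandated by the original text.

```python
# Sharing the feature extractor across the first model and the auxiliary models
# (builds on FeatureExtractorG / NetworkA from the earlier sketch).
import torch.nn as nn

feature_G = FeatureExtractorG()            # shared backbone

net_a = NetworkA()
net_a.G = feature_G                        # first feature extractor of the first model

fc2 = nn.Linear(feature_G.out_dim, 3)      # FC2 of the first auxiliary model (upper/middle/lower)

def first_auxiliary_forward(block):
    return fc2(feature_G(block))           # second feature extractor + second classifier

def second_auxiliary_features(original, augmented):
    # the "two" third feature extractors of the twin network use the same weights
    return feature_G(original), feature_G(augmented)

# with separate module instances, the parameters can instead be synchronized explicitly:
# extractor_2.load_state_dict(extractor_1.state_dict())
```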
Of course, the model parameters of the second feature extractor may be used as new model parameters of the first feature extractor, and the first model may be trained again, that is, the first model and the second model may be alternately trained.
Of course, since the trained second model is used for constructing an auxiliary task, the second model in the embodiment of the present application may have a third auxiliary model, even a fourth auxiliary model, and the like in addition to the first auxiliary model and the second auxiliary model, which is not limited in the embodiment of the present application. Suitable auxiliary tasks may be constructed as the case may be. The classification result corresponding to the first auxiliary model is a position label of the image, and the discrimination result corresponding to the second auxiliary model is the feature similarity between the images.
Step S350: and determining the trained first model as a target model for executing the target task.
In specific implementation, the first model corresponds to the target task and the second model corresponds to the auxiliary task; therefore, when both the trained first model and the trained second model satisfy the model convergence condition, the target model is the trained first model corresponding to the target task, and the target model is used for recognizing images in the target field. Target fields include, but are not limited to: the human image field, the medical field, the animal field, the natural scenery field, and the like.
In a possible implementation manner, when any one of the first model and the second model does not satisfy the model convergence condition, the alternating training of the first model and the second model needs to be continued until both the trained first model and the trained second model satisfy the model convergence condition. And then, taking the trained first model as a target model for executing the target task.
In the embodiment of the present application, one joint training pass of the first model and the second model includes: firstly, training the first model and updating the model parameters of the first feature extractor; secondly, synchronously updating the model parameters of the second feature extractor based on the update of the model parameters of the first feature extractor; thirdly, training the second model and updating the model parameters of the second feature extractor; and fourthly, synchronously updating the model parameters of the first feature extractor based on the update of the model parameters of the second feature extractor. In this way one joint training pass is completed; such a pass may be called one cycle, and the first model and the second model can be jointly and alternately trained by repeating this cycle. Of course, in one joint training pass the second model may also be trained first and then the first model; the application does not limit the training order. Whichever model is trained first, the feature extractor of the other model, when it is trained, is the feature extractor of the previously trained model, so that the feature extractor can simultaneously learn the model performance required by both models. After the first model and the second model both reach the convergence condition, the training of the first model and the second model ends.
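A minimal sketch of one such cycle of alternating joint training follows; `train_step_a` comes from the network A sketch above, while `train_step_b`, `train_step_c` and the data loaders are assumed helper functions for the auxiliary networks (network B is sketched further below), and convergence checks are omitted.

```python
# One cycle of the alternating (joint) training; repeat until all networks converge.
def joint_training_cycle(net_a, net_b, net_c,
                         loader_a, loader_b, loader_c,
                         opt_a, opt_b, opt_c):
    # 1) train the first model (network A); this updates the shared feature extractor G
    for images, labels in loader_a:
        train_step_a(net_a, images, labels, opt_a)

    # 2) because G is shared, the second feature extractor is already synchronized;
    #    train the first auxiliary model (network B, position classification)
    for blocks, position_labels in loader_b:
        train_step_b(net_b, blocks, position_labels, opt_b)

    # 3) train the second auxiliary model (network C, feature similarity)
    for originals, augmentations in loader_c:
        train_step_c(net_c, originals, augmentations, opt_c)

    # 4) G now carries the updates of all tasks; repeat the cycle until
    #    networks A, B and C all satisfy the convergence condition
```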
In one possible implementation manner, the target model comprises a trained first feature extractor and a trained first classifier, and the computer device displays an image recognition interface, wherein the image recognition interface comprises an image recognition control and an image import control; when the image import control is triggered, acquiring an image to be processed in the target field, and displaying the image to be processed on an image recognition interface; when the image recognition control is triggered, calling a trained first feature extractor to perform feature extraction on an image to be processed to obtain image features corresponding to the image to be processed, and calling a trained first classifier to recognize the image features corresponding to the image to be processed to obtain an image category corresponding to the image to be processed; and displaying the image category corresponding to the image to be processed on an image recognition interface.
For example, the image processing method provided by the embodiment of the application can be applied to the human image field, the medical field, the animal field, the natural field, and the like. Taking the portrait domain as an example for detailed description, as shown in fig. 2e, first, as shown in the left diagram in fig. 2e (10), the user may select one image to be processed in the album, and then click the "send" button (image import control), the computer device obtains the image to be processed in the target domain, and displays the image to be processed in the image recognition interface. Then, as shown in the right diagram of fig. 2e (20), when the user clicks an "image recognition" button (image recognition control) in the image recognition interface, the computer device may invoke a target model (first model after training) to perform image recognition on the image to be processed, specifically, perform feature extraction on the image to be processed by using the first feature extractor after training to obtain image features corresponding to the image to be processed, and invoke the first classifier after training to recognize image features corresponding to the image to be processed to obtain image categories corresponding to the image to be processed. Finally, the image category (assumed as a non-mainstream image) corresponding to the image to be processed can be displayed on the image recognition interface.
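A minimal sketch of the recognition flow triggered by the image recognition control is given below, assuming the target model exposes the trained feature extractor and classifier as `G` and `fc1` (as in the network A sketch above); the preprocessing steps and class names are illustrative assumptions.

```python
# Inference with the trained target model on an imported image.
import torch
from PIL import Image
from torchvision import transforms

CLASS_NAMES = ["non-human image", "mainstream image", "non-mainstream image"]

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])

@torch.no_grad()
def recognize(target_model, image_path):
    img = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
    features = target_model.G(img)        # trained first feature extractor
    logits = target_model.fc1(features)   # trained first classifier
    return CLASS_NAMES[logits.argmax(dim=1).item()]
```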
Subsequently, the image to be processed may be post-processed according to the image category corresponding to the image to be processed; for example, the image to be processed may be intercepted, or a warning may be issued to the user who sent or uploaded the image to be processed, or that user's account may be banned, and the like.
By the image processing method provided by the embodiment of the application, the sample image set for model training can be subjected to image conversion processing to obtain an auxiliary image set; model training is carried out by utilizing the sample image set and the auxiliary image set, so that the problem of training data collection is solved, training data for model training is enriched, and more accurate models can be obtained by training. In addition, in the process of training the first model, a second model is selected for auxiliary training, and the second model and the first model can share the model structure and the model parameters of the feature extractor; and training data are selected from the sample image set and the auxiliary image set in a targeted manner according to the first model and the second model, wherein the training data can comprise images with labels or images without labels, and the trained first model can have better performance through the model structure and the self-supervision and semi-supervision training mode, so that when the trained first model is adopted to execute a target task to recognize images in a target field, a better image recognition result can be obtained, and the accuracy of image recognition is further improved.
Referring to fig. 5, fig. 5 is a schematic flowchart of another image processing method according to an embodiment of the present disclosure. The method is applied to a computer device, and the embodiment of fig. 5 is a specific embodiment corresponding to step S340 in the embodiment of fig. 3. As shown in fig. 5, the data processing method may include steps S510 to S530. Wherein:
step S510: training the first auxiliary model using the training data set of the auxiliary task.
In particular implementations, the first auxiliary model is a position-based classification model. The auxiliary image set comprises a first segmented image set and a second segmented image set; the first segmented image set comprises a first segmented image and a position label of the first segmented image, the first segmented image being obtained by performing image segmentation processing on the first sample image; the second segmented image set comprises a second segmented image and a position label of the second segmented image, the second segmented image being obtained by performing image segmentation processing on the second sample image. The training data set of the auxiliary task includes the first segmented image set and the second segmented image set. The position label may specifically be: upper, middle, lower; or top left, top right, bottom left, bottom right, etc. Alternatively, the position label may be a position coordinate, and the position coordinate may be an abscissa and an ordinate, or a longitude-latitude coordinate, and the like.
In one possible implementation mode, the computer equipment calls a first auxiliary model to identify a segmentation image in a training data set of an auxiliary task, and a position prediction label corresponding to the segmentation image is obtained; the segmented images in the training dataset of the auxiliary task comprise the first segmented image and/or the second segmented image. The computer device adjusts model parameters of the first auxiliary model according to a difference between the position label and the position prediction label of the segmented image.
For example, the image blocks obtained from the sets U, Lz, and Lf all contain human figures, and each block is the upper, middle, or lower part of a complete image. Thus, a classifier can be constructed that learns whether an input image block is the upper, middle, or lower portion of the image. Similarly, such a classifier is composed of the feature extractor G (the second feature extractor) and one fully connected layer FC2 (the second classifier). Referring to fig. 6a, fig. 6a is a schematic flowchart of a process for training the second model according to an embodiment of the present disclosure. The network shown in fig. 6a may be referred to as network B. With this structure, the feature extractor can learn the spatial features of images containing people: hair features are generally distributed in the upper part of an image, while facial features are distributed in the middle of the image and, compared with the hair, lie below it. In training, the feature extractor G trained in the previous step (network A) is used. After the above two training steps (network A and network B), the feature extractor G can learn both the distinction between mainstream and non-mainstream characteristics and the spatial position information of images containing people. Since a classifier is being trained, the loss function used after FC2 may be any function commonly used for classification, such as softmax, weighted cross entropy, etc.
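A minimal sketch of network B follows, assuming the shared feature extractor from the earlier sketch and a three-way position head FC2; the names and hyperparameters are illustrative assumptions.

```python
# Network B: shared feature extractor G + FC2 classifying the position label of an image block.
import torch.nn as nn

class NetworkB(nn.Module):
    def __init__(self, feature_G, num_positions=3):
        super().__init__()
        self.G = feature_G                            # second feature extractor, shared with network A
        self.fc2 = nn.Linear(feature_G.out_dim, num_positions)   # upper / middle / lower

    def forward(self, block):
        return self.fc2(self.G(block))

def train_step_b(net_b, blocks, position_labels, optimizer):
    optimizer.zero_grad()
    loss = nn.functional.cross_entropy(net_b(blocks), position_labels)  # softmax loss after FC2
    loss.backward()
    optimizer.step()
    return loss.item()
```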
Step S520: and adjusting the model parameters of the third feature extractor in the second auxiliary model to the model parameters of the second feature extractor in the trained first auxiliary model.
In particular, the second model shares the model structure and the model parameters of the feature extractor with the first model. And the second model comprises a first auxiliary model and a second auxiliary model, the first auxiliary model comprises a second feature extractor and a second classifier connected with the second feature extractor, and the second auxiliary model comprises two third feature extractors and similar discriminators respectively connected with the two third feature extractors. The second model shares the model structure and the model parameters of the feature extractor with the first model, and specifically, the second model is obtained by adjusting the model parameters of the third feature extractor to the model parameters of the trained second feature extractor.
In the present application, in the case where the second model includes the first auxiliary model and the second auxiliary model, the first auxiliary model is trained first, and then the second auxiliary model is trained. Therefore, after the computer device finishes training the first auxiliary model, the model parameters of the third feature extractor in the second auxiliary model are consistent with the model parameters of the second feature extractor in the trained first auxiliary model.
If the second auxiliary model is trained first and then the first auxiliary model is trained, after the second auxiliary model is trained, keeping the model parameters of the second feature extractor in the first auxiliary model consistent with the model parameters of the third feature extractor in the trained second auxiliary model. In general, whichever auxiliary model is trained first, model parameters of the untrained auxiliary model are kept consistent with those of another auxiliary model which is trained first.
Step S530: the second auxiliary model is trained using the training data set of the auxiliary task.
In specific implementation, the second auxiliary model is a discriminant model based on feature similarity. The auxiliary image set further comprises a second augmented image set, the second augmented image set comprises a second augmented image, and the second augmented image is obtained after image enhancement processing is carried out on a second sample image; the training data set of the auxiliary task includes a sample image set and an auxiliary image set, that is, the training data set of the auxiliary task specifically includes: a sample image set and a first augmented image set and a second augmented image set.
In a possible implementation manner, the computer device invokes the second auxiliary model to perform feature extraction on the sample images in the sample image set, so as to obtain image features corresponding to the sample images. The computer equipment calls a second auxiliary model to perform feature extraction on the augmented images in the auxiliary image set to obtain image features corresponding to the augmented images; the computer device trains a second auxiliary model according to the feature similarity between the image features corresponding to the sample images and the image features corresponding to the augmented images.
For example, the sets U', Lz', Lf' are augmentations of the image sets U, Lz, Lf, and therefore should contain similar content. In this way, the information of the sets U and U' can be used even though these two sets carry no supervision label of mainstream or non-mainstream. Here, a siamese (twin) network can be used to measure whether the contents of the input images are similar. Referring to fig. 6b, fig. 6b is a schematic flowchart of a process for training the second auxiliary model according to an embodiment of the present disclosure. As shown in fig. 6b, the feature extractor G appears as two identical models, hence the name twin network, but the inputs of the two branches differ: the upper branch takes inputs from U, Lz, Lf, and the lower branch takes inputs from U', Lz', Lf'. We refer to this network as network C; it may include two third feature extractors and a similarity discriminator respectively connected to the third feature extractors. At this step we still use the feature extractor G (the third feature extractor) used in the two steps above.
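A minimal sketch of the twin-network forward pass of network C is given below: the same feature extractor G processes both branches, and the similarity discriminator compares the resulting features using the cosine measure of formula (1) below; the function names are illustrative assumptions.

```python
# Network C forward pass: one shared extractor, two branches, cosine similarity between features.
import torch.nn.functional as F

def network_c_similarity(feature_G, image_i, image_j):
    g_i = feature_G(image_i)   # branch 1: original image from U, Lz, Lf
    g_j = feature_G(image_j)   # branch 2: augmented image from U', Lz', Lf'
    return F.cosine_similarity(g_i, g_j, dim=1)   # cos(i, j), formula (1)
```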
In one possible implementation, the computer device obtains a cosine distance between an image feature corresponding to the sample image and an image feature corresponding to the augmented image; the computer equipment determines the characteristic similarity between the sample image and the augmented image according to the cosine distance; the computer device determines a loss function of the second model according to the feature similarity sum between all images of the sample image set and all images of the auxiliary image set; the computer device adjusts model parameters of the second auxiliary model according to the loss function.
For example, the inputs to the twin network are an image i from the sets U, Lz, Lf and its augmentation j (so image j is from the sets U', Lz', Lf'). If the category label carried by image i matches the category label carried by image j, this means that j is the augmentation of i and the two images have the same content. So, after passing through the feature extractor G, the images i and j yield two features G(i) and G(j). The final loss function needs to measure the similarity of these two features: the more similar the two features are, the smaller the penalty value of the loss function should be; conversely, the less similar they are, the greater the penalty value. If j is not an augmentation of i, the loss function should instead make the features G(i) and G(j) differ as much as possible.
When training the network C, first, n images (set B) are randomly drawn from the sets U, Lz, Lf, and one augmentation of each image (set B') is obtained by using an image augmentation method. Therefore, the input to the network for one batch is 2n images. The original image i from the sets U, Lz, Lf is input into the first branch of the network and the augmented image j is input into the second branch. After the two images pass through the feature extractor G with the same parameters, the features G(i) and G(j) are obtained. We use the cosine distance to measure the similarity between features, simi. Of course, instead of calculating the similarity between features using the cosine distance, the similarity may also be determined using the Chebyshev distance, vector inner products, the Hamming distance, the edit distance, and the like.
Wherein, the cosine distance is shown as formula (1):
cos(i,j)=[G(i)·G(j)]/[||G(i)||*||G(j)||] (1)
in the formula (1), the value of cos(i, j) is between 0 and 1, and the larger the value is, the greater the similarity between image i and image j.
In addition, the similarity between features, simi, is shown in equation (2):
simi(i,j)=exp{cos(i,j)}/λ (2)
in the formula (2), λ is a fixed parameter, and we can set λ to 0.1.
Then we define the loss function Loss(i, j) produced by image i and image j, as shown in equation (3):
Loss(i,j)=-log{simi(i,j)/[∑{k=1,2,...,n,k≠i'}simi(i,k)]} (3)
in equation (3), i' is the augmentation of image i. The denominator of the formula represents the sum of the feature similarities between image i and all augmented images other than its own augmentation i'. That is, if there are n original images, there are also n augmented images, and the denominator of the formula is the sum of n-1 feature similarities simi.
Finally, we define the loss function for the entire batch as shown in equation (4):
L=∑{i∈B}∑{j∈B’}Loss(i,j) (4)
therefore, the final loss function takes into account the similarities between all the original images in B and the augmented images in B' within a batch.
In addition, the loss function of the entire batch can be modified as shown in equation (5):
L=∑{i∈B}∑{j∈B’}Loss(i,j)+Loss(j,i) (5)
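An illustrative sketch of the batch loss of formulas (2)–(5) follows. It evaluates Loss(i, j) for each matched pair of an original image i and its augmentation i' (the standard contrastive reading of the sums over B and B'), keeps λ outside the exponent exactly as formula (2) is written (a common variant instead places λ inside the exponent as a temperature), and assumes the features come from the shared extractor G; all names are illustrative.

```python
# Batch loss over a set B of n originals and their n augmentations B' (formulas (1)-(5)).
import torch
import torch.nn.functional as F

def batch_loss(feature_G, originals, augmentations, lam=0.1, symmetric=True):
    g_b = F.normalize(feature_G(originals), dim=1)        # features of B
    g_bp = F.normalize(feature_G(augmentations), dim=1)    # features of B'
    cos = g_b @ g_bp.t()                                   # cos(i, k), formula (1)
    simi = torch.exp(cos) / lam                            # formula (2), written literally

    n = simi.size(0)
    eye = torch.eye(n, dtype=torch.bool, device=simi.device)
    positives = simi.diagonal()                            # simi(i, i'), i' = augmentation of i
    negatives = simi.masked_fill(eye, 0).sum(dim=1)        # sum over k != i', formula (3) denominator
    loss = -torch.log(positives / negatives)               # Loss(i, i'), formula (3)
    total = loss.sum()                                      # formula (4)

    if symmetric:                                           # formula (5): also add Loss(j, i)
        simi_t = simi.t()                                    # compare augmentations against originals
        pos_t = simi_t.diagonal()
        neg_t = simi_t.masked_fill(eye, 0).sum(dim=1)
        total = total + (-torch.log(pos_t / neg_t)).sum()
    return total
```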
finally, after the three networks (network A, network B, and network C) are constructed, we cyclically train the three steps in sequence until the networks of all three steps converge: first, the classification network A of step 1 is trained; then the trained feature extractor G is brought into step 2, and the classification network B of step 2 is trained; then the feature extractor G of step 2 is brought into step 3, and the network C of step 3 is trained; then the feature extractor G of step 3 is brought back into step 1, the classification network A of step 1 is trained, and so on in a cycle.
In the present application, in the case where the second model includes the first auxiliary model (network B) and the second auxiliary model (network C), the training is performed in the order of training network B first and then training network C. Therefore, after the computer device completes the training of network B, the model parameters of the third feature extractor in network C are kept consistent with the model parameters of the second feature extractor in the trained network B.
If network C is trained first and then network B is trained, then after the training of network C is completed, the model parameters of the second feature extractor in network B are kept consistent with the model parameters of the third feature extractor in the trained network C. In general, whichever auxiliary model is trained first, the model parameters of the feature extractor in the auxiliary model not yet trained are kept consistent with those of the auxiliary model trained first.
With this scheme, the second model is a model that assists the training of the first model. In addition, in this embodiment of the application, the second model includes a first auxiliary model and a second auxiliary model, and the first auxiliary model and the second auxiliary model are trained alternately, so that the finally trained second model can both recognize the position of a certain object in an image and recognize the feature similarity between images. Further, the training of the second model assists the training of the first model: since the first model and the second model share the model structure and the model parameters of the feature extractor, training the second model also updates the shared feature extractor. The feature extractor of the trained first model therefore has both the capability of extracting features for executing the target task and the capability of extracting features for executing the auxiliary task, so that the trained first model can execute the auxiliary task while executing the target task, which improves the capability of the first model.
In one possible implementation, the second model is a first auxiliary model, which is a location-based classification model. The training second model may be trained using the training data set of the auxiliary task. Wherein the training data set of the auxiliary task comprises a first set of sliced images and a second set of sliced images.
In specific implementation, the auxiliary image set comprises a first segmented image set and a second segmented image set; the first segmented image set comprises a first segmented image and a position label of the first segmented image, the first segmented image being obtained by performing image segmentation processing on the first sample image; the second segmented image set comprises a second segmented image and a position label of the second segmented image, the second segmented image being obtained by performing image segmentation processing on the second sample image. The position label may specifically be: upper, middle, lower; or top left, top right, bottom left, bottom right, etc. Alternatively, the position label may be a position coordinate, and the position coordinate may be an abscissa and an ordinate, or a longitude-latitude coordinate, and the like.
In one possible implementation mode, the computer equipment calls a first auxiliary model to identify a segmentation image in a training data set of an auxiliary task, and a position prediction label corresponding to the segmentation image is obtained; the segmented images in the training dataset of the auxiliary task comprise the first segmented image and/or the second segmented image. The computer device adjusts model parameters of the first auxiliary model according to a difference between the position label and the position prediction label of the segmented image.
In one possible implementation, the second model is trained with the set of auxiliary images (the first and second set of sliced images) until the second model reaches a model convergence condition, and training of the second model is stopped. The second model reaching the convergence condition may be any one of the following cases: the loss function of the second model is smaller than a set threshold; the loss function of the second model has already become stable and no longer changes as the training process continues; all training data used to train the second model (e.g., each image in the first and second cut-out image sets) is involved in the training; when the number of times of training of the second model reaches a reference training number threshold (where the reference training number threshold is much smaller than the training threshold that is usually required to be met when training the model), and so on.
After the training of the second model is stopped, the model parameters of the feature extractor in the first model may be adjusted to the model parameters of the feature extractor in the trained second model, and the adjusted first model may then be used as a new first model for training. Afterwards, the parameters of the second feature extractor in the second model are adjusted to the parameters of the first feature extractor of the trained first model. In the same way, the first model and the second model are trained alternately.
With this solution, the number and model structure of the second models are determined based on the actual auxiliary task, for example, in this embodiment, if the auxiliary task is to identify the position of an object in an image, the second model may be a first auxiliary model, and the first auxiliary model may be a classification model based on the position. Further, the training of the first model is assisted by the training of the second model, and the model structure and the model parameters of the feature extractor are updated by the training of the second model as the first model and the second model share the model structure and the model parameters; and the feature extractor of the trained first model has the capability of extracting the features for executing the target task and the capability of extracting the features for executing the auxiliary task, so that the trained first model can execute the auxiliary task while executing the target task, and the capability of the first model is improved.
In one possible implementation, the second model is a second auxiliary model, and the second auxiliary model is a discriminant model based on feature similarity. The second model may be trained using the training data set of the auxiliary task. The training data set of the auxiliary task includes a sample image set and an auxiliary image set.
In specific implementation, the auxiliary image set further includes a first augmented image set and a second augmented image set, the second augmented image set includes a second augmented image, and the second augmented image is obtained after image enhancement processing is performed on a second sample image. In particular, the auxiliary task may be to determine feature similarities between images in the sample set of images and images in the auxiliary set of images.
In a possible implementation manner, the computer device invokes the second auxiliary model to perform feature extraction on the sample images in the sample image set, so as to obtain image features corresponding to the sample images. The computer equipment calls a second auxiliary model to perform feature extraction on the augmented images in the auxiliary image set to obtain image features corresponding to the augmented images; the computer device trains a second auxiliary model according to the feature similarity between the image features corresponding to the sample images and the image features corresponding to the augmented images.
In one possible implementation, the computer device obtains a cosine distance between an image feature corresponding to the sample image and an image feature corresponding to the augmented image; the computer equipment determines the characteristic similarity between the sample image and the augmented image according to the cosine distance; the computer device determines a loss function of the second model according to the feature similarity sum between all images of the sample image set and all images of the auxiliary image set; the computer device adjusts model parameters of the second auxiliary model according to the loss function. Based on this, a new adjusted second auxiliary model is obtained, and training of the adjusted second auxiliary model is continued based on the training data set including the sample image set and the auxiliary image set until the adjusted second auxiliary model satisfies the model convergence condition.
Similarly, the second model is trained through the sample image set and the auxiliary image set, and the training of the second model is stopped when the second model reaches the model convergence condition. The second model reaching the convergence condition may be any one of the following cases: the loss function of the second model is smaller than a set threshold; the loss function of the second model has become stable and no longer changes as the training process continues; all training data used to train the second model (e.g., each image in the sample image set and the auxiliary image set) has been involved in the training; or the number of training iterations of the second model reaches a reference training number threshold (where the reference training number threshold is much smaller than the training threshold usually required when fully training a model), and so on.
After the training of the second model is stopped, the model parameters of the feature extractor in the first model may be adjusted to the model parameters of the feature extractor in the trained second model, and the adjusted first model may then be used as a new first model for training. Afterwards, the parameters of the second feature extractor in the second model are adjusted to the parameters of the first feature extractor of the trained first model. In the same way, the first model and the second model are trained alternately.
According to the image processing method provided by the embodiment of the application, by constructing an auxiliary task to assist the target task, training data can be selected from the sample image set and the auxiliary image set in a targeted manner for the first model and the second model; the training data can comprise labeled images or unlabeled images, and through this model structure and the self-supervised and semi-supervised training mode, the trained first model can have better performance, so that when the trained first model is adopted to execute the target task to recognize images in the target field, a better image recognition result can be obtained, and the accuracy of image recognition is further improved. Furthermore, because the scheme adopts semi-supervised and self-supervised methods, data collected online can be used to a great extent without a large amount of manual labeling, so the time for each iteration of this task can be greatly shortened. For example, when the trained target model is applied to the human image field to identify non-mainstream images, the target model can learn not only the feature differences of a non-mainstream image (such as a "shamate" image) compared with a mainstream image (such as an ordinary human image), but also the spatial position distribution and/or feature-similarity characteristics of non-mainstream images, and therefore achieves high accuracy and a high recall rate.
With this solution, the number and the model structure of the second models are determined based on the actual auxiliary tasks, for example, in this embodiment of the present application, if the auxiliary tasks are to identify feature similarities between images, the second model may be a second auxiliary model, and the second auxiliary model may be a discriminant model based on the feature similarities. Further, the training of the first model is assisted by the training of the second model, and the model structure and the model parameters of the feature extractor are updated by the training of the second model as the first model and the second model share the model structure and the model parameters; and the feature extractor of the trained first model has the capability of extracting the features for executing the target task and the capability of extracting the features for executing the auxiliary task, so that the trained first model can execute the auxiliary task while executing the target task, and the capability of the first model is improved.
Referring to fig. 7, fig. 7 is a schematic structural diagram of an image processing apparatus according to an embodiment of the present disclosure. The image processing apparatus can be applied to the computer device in the method embodiments corresponding to fig. 3 to 6b. The image processing apparatus may be a computer program (comprising program code) running on a computer device, for example application software; the apparatus may be used to perform the corresponding steps in the methods provided by the embodiments of the present application. The image processing apparatus may include:
an obtaining unit 710, configured to obtain a sample image set for model training, where the sample image set includes a first sample image set and a second sample image set, the first sample image set includes a first sample image and a category label of the first sample image, the second sample image set includes a second sample image, and the first sample image and the second sample image are both images in a target field corresponding to a target task;
a processing unit 720, configured to perform image conversion processing on the sample image set to obtain an auxiliary image set; the auxiliary image set comprises a first augmented image set, the first augmented image set comprises a first augmented image and a category label of the first augmented image, the first augmented image is obtained after the first sample image is subjected to image enhancement processing, and the category label of the first augmented image is consistent with the category label of the first sample image;
a training unit 730, configured to select a training data set of the target task from the sample image set and the auxiliary image set, and train a first model by using the training data set of the target task; the training data set of the target task includes at least the first sample image set and the first augmented image set;
the training unit 730 is further configured to select a training data set of an auxiliary task of the target task from the sample image set and the auxiliary image set, and train a second model by using the training data set of the auxiliary task; the second model shares a model structure and model parameters of a feature extractor with the first model;
a determining unit 740, configured to determine the trained first model as a target model for executing the target task, where the target model is used to identify an image in the target field.
In a possible implementation manner, the sample image set further includes a third sample image set, and the third sample image set includes a third sample image and a category label of the third sample image; the third sample image is an image in a non-target field corresponding to the target task; the training dataset of the target task further comprises the third sample image set;
the training unit 730 trains the first model using the training data set of the target task, including:
calling a first model to identify a first sample image in the first sample image set to obtain a category prediction label corresponding to the first sample image;
calling the first model to identify a first augmented image in the first augmented image set to obtain a category prediction label corresponding to the first augmented image;
calling the first model to identify a third sample image in the third sample image set to obtain a category prediction label corresponding to the third sample image;
and adjusting the model parameters of the first model according to the difference between the class label of the first sample image and the class prediction label corresponding to the first sample image, the difference between the class label of the first augmented image and the class prediction label corresponding to the first augmented image, and the difference between the class label of the third sample image and the class prediction label corresponding to the third sample image.
In one possible implementation, the set of auxiliary images further includes the first and second sets of sliced images, the first set of sliced images containing first sliced images and location labels of the first sliced images; the first segmentation image is obtained by carrying out image segmentation processing on the first sample image; the second segmentation image set comprises a second segmentation image and a position label of the second segmentation image, and the second segmentation image is obtained by carrying out image segmentation processing on the second sample image; the training data set of the auxiliary task comprises the first and second sliced image sets; the second model is a first auxiliary model;
the training unit 730 trains a second model using the training data set of the auxiliary task, including:
calling the second model to identify a segmentation image in the training data set of the auxiliary task to obtain a position prediction label corresponding to the segmentation image; the segmentation images in the training dataset of the auxiliary task comprise first segmentation images and/or second segmentation images;
and adjusting the model parameters of the second model according to the difference between the position label and the position prediction label of the segmented image.
In a possible implementation manner, the auxiliary image set further includes a second augmented image set, where the second augmented image set includes a second augmented image, and the second augmented image is obtained by performing image enhancement processing on the second sample image; the training dataset of the auxiliary task comprises the sample image set and the auxiliary image set; the second model is a second auxiliary model;
the training unit 730 trains a second model using the training data set of the auxiliary task, including:
calling the second model to perform feature extraction on the sample images in the sample image set to obtain image features corresponding to the sample images;
calling the second model to perform feature extraction on the augmented images in the auxiliary image set to obtain image features corresponding to the augmented images;
and training the second model according to the feature similarity between the image features corresponding to the sample image and the image features corresponding to the augmented image.
In one possible implementation, the second model includes a first auxiliary model and a second auxiliary model; the first auxiliary model is a location-based classification model; the second auxiliary model is a discriminant model based on feature similarity.
In one possible implementation, the first model is a classification model based on image features; the first model comprises a first feature extractor and a first classifier connected with the first feature extractor;
the second model is a first auxiliary model and comprises a second feature extractor and a second classifier connected with the second feature extractor; the first feature extractor shares a model structure and model parameters with the second feature extractor.
In one possible implementation, the first model is a classification model based on image features; the first model comprises a first feature extractor and a first classifier connected with the first feature extractor;
the second model is a second auxiliary model and comprises two third feature extractors and similar discriminators; the first feature extractor and the third feature extractor share a model structure and model parameters.
In one possible implementation, the first model is a classification model based on image features; the first model comprises a first feature extractor and a first classifier connected with the first feature extractor;
the second model comprises a first auxiliary model and a second auxiliary model, and the first auxiliary model comprises a second feature extractor and a second classifier connected with the second feature extractor;
the second auxiliary model comprises two third feature extractors and similar discriminators respectively connected with the two third feature extractors; the first feature extractor shares a model structure and model parameters with the second and third feature extractors.
In one possible implementation, the objective task includes identifying an image category; the processing unit 720 is further configured to perform the following operations:
displaying an image recognition interface, wherein the image recognition interface comprises an image recognition control and an image import control;
when the image import control is triggered, acquiring an image to be processed in a target field, and displaying the image to be processed on the image recognition interface;
when the image recognition control is triggered, calling the trained target model to perform feature extraction on the image to be processed to obtain image features corresponding to the image to be processed, and calling the trained target model to recognize the image features corresponding to the image to be processed to obtain the image category corresponding to the image to be processed;
and displaying the image category on the image recognition interface.
By the image processing device, the sample image set for model training can be subjected to image conversion processing to obtain an auxiliary image set; model training is carried out by utilizing the sample image set and the auxiliary image set, so that the problem of training data collection is solved, training data for model training is enriched, and more accurate models can be obtained by training. In addition, in the process of training the first model, a second model is selected for auxiliary training, and the second model and the first model can share the model structure and the model parameters of the feature extractor; and training data are selected from the sample image set and the auxiliary image set in a targeted manner according to the first model and the second model, wherein the training data can comprise images with labels or images without labels, and the trained first model can have better performance through the model structure and the self-supervision and semi-supervision training mode, so that when the trained first model is adopted to execute a target task to recognize images in a target field, a better image recognition result can be obtained, and the accuracy of image recognition is further improved.
Please refer to fig. 8; fig. 8 is a schematic structural diagram of a computer device according to an embodiment of the present application. The computer device in the embodiments corresponding to fig. 3 to fig. 6b may be the computer device 800. As shown in fig. 8, the computer device 800 may include: a user interface 802, a processor 804, an encoder 806, and a memory 808. The signal receiver 816 is used to receive or transmit data via the cellular interface 810 or the WIFI interface 812. The encoder 806 encodes the received data into a computer-processable data format. The memory 808 stores a computer program, by which the processor 804 is arranged to perform the steps of any of the method embodiments described above. The memory 808 may include a volatile memory (e.g., dynamic random access memory, DRAM) and may also include a non-volatile memory (e.g., one-time programmable read-only memory, OTPROM). In some examples, the memory 808 may further include memory located remotely from the processor 804, which may be connected to the computer device 800 via a network. The user interface 802 may include a keyboard 818 and a display 820.
In the computer device 800 shown in fig. 8, the processor 804 may be configured to invoke the computer program stored in the memory 808 to implement:
obtaining a sample image set for model training, wherein the sample image set comprises a first sample image set and a second sample image set, the first sample image set comprises a first sample image and a class label of the first sample image, the second sample image set comprises a second sample image, and the first sample image and the second sample image are both images in a target field corresponding to a target task;
carrying out image conversion processing on the sample image set to obtain an auxiliary image set; the auxiliary image set comprises a first augmented image set, the first augmented image set comprises a first augmented image and a category label of the first augmented image, the first augmented image is obtained after the first sample image is subjected to image enhancement processing, and the category label of the first augmented image is consistent with the category label of the first sample image;
selecting a training data set of the target task from the sample image set and the auxiliary image set, and training a first model by using the training data set of the target task; the training data set of the target task includes at least the first sample image set and the first augmented image set;
selecting a training data set of an auxiliary task of the target task from the sample image set and the auxiliary image set, and training a second model by adopting the training data set of the auxiliary task; the second model shares a model structure and model parameters of a feature extractor with the first model;
and determining the trained first model as a target model for executing the target task, wherein the target model is used for identifying images in the target field.
In a possible implementation manner, the sample image set further includes a third sample image set, and the third sample image set includes a third sample image and a category label of the third sample image; the third sample image is an image in a non-target field corresponding to the target task; the training dataset of the target task further comprises the third sample image set;
the processor 804 trains the first model using the training data set of the target task, including:
calling a first model to identify a first sample image in the first sample image set to obtain a category prediction label corresponding to the first sample image;
calling the first model to identify a first augmented image in the first augmented image set to obtain a category prediction label corresponding to the first augmented image;
calling the first model to identify a third sample image in the third sample image set to obtain a category prediction label corresponding to the third sample image;
and adjusting the model parameters of the first model according to the difference between the class label of the first sample image and the class prediction label corresponding to the first sample image, the difference between the class label of the first augmented image and the class prediction label corresponding to the first augmented image, and the difference between the class label of the third sample image and the class prediction label corresponding to the third sample image.
In one possible implementation, the set of auxiliary images further includes the first and second sets of sliced images, the first set of sliced images containing first sliced images and location labels of the first sliced images; the first segmentation image is obtained by carrying out image segmentation processing on the first sample image; the second segmentation image set comprises a second segmentation image and a position label of the second segmentation image, and the second segmentation image is obtained by carrying out image segmentation processing on the second sample image; the training data set of the auxiliary task comprises the first and second sliced image sets; the second model is a first auxiliary model;
the processor 804 trains a second model using the training data set of the auxiliary task, including:
calling the second model to identify a segmentation image in the training data set of the auxiliary task to obtain a position prediction label corresponding to the segmentation image; the segmentation images in the training dataset of the auxiliary task comprise first segmentation images and/or second segmentation images;
and adjusting the model parameters of the second model according to the difference between the position label and the position prediction label of the segmented image.
In a possible implementation manner, the auxiliary image set further includes a second augmented image set, where the second augmented image set includes a second augmented image, and the second augmented image is obtained by performing image enhancement processing on the second sample image; the training dataset of the auxiliary task comprises the sample image set and the auxiliary image set; the second model is a second auxiliary model;
the processor 804 trains a second model using the training data set of the auxiliary task, including:
calling the second model to perform feature extraction on the sample images in the sample image set to obtain image features corresponding to the sample images;
calling the second model to perform feature extraction on the augmented images in the auxiliary image set to obtain image features corresponding to the augmented images;
and training the second model according to the feature similarity between the image features corresponding to the sample image and the image features corresponding to the augmented image.
In one possible implementation, the second model includes a first auxiliary model and a second auxiliary model; the first auxiliary model is a location-based classification model; the second auxiliary model is a discriminant model based on feature similarity.
In one possible implementation, the first model is a classification model based on image features; the first model comprises a first feature extractor and a first classifier connected with the first feature extractor;
the second model is a first auxiliary model and comprises a second feature extractor and a second classifier connected with the second feature extractor; the first feature extractor and the second feature extractor share a model structure and model parameters.
In one possible implementation, the first model is a classification model based on image features; the first model comprises a first feature extractor and a first classifier connected with the first feature extractor;
the second model is a second auxiliary model and comprises two third feature extractors and a similarity discriminator; the first feature extractor and the third feature extractors share a model structure and model parameters.
In one possible implementation, the first model is a classification model based on image features; the first model comprises a first feature extractor and a first classifier connected with the first feature extractor;
the second model comprises a first auxiliary model and a second auxiliary model; the first auxiliary model comprises a second feature extractor and a second classifier connected with the second feature extractor, and the second auxiliary model comprises two third feature extractors and a similarity discriminator respectively connected with the two third feature extractors; the first feature extractor shares a model structure and model parameters with the second feature extractor and the third feature extractors.
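Purely as an illustration of how the first model and the auxiliary models can share the model structure and model parameters of one feature extractor, the following sketch wraps a single backbone with three heads; the ResNet-18 backbone, the head sizes and the class name SharedBackboneModels are assumptions of the sketch (PyTorch and torchvision assumed available).

    # Minimal sketch: one shared feature extractor serves the first model,
    # the first auxiliary model and the second auxiliary model.
    import torch.nn as nn
    import torchvision.models as models

    class SharedBackboneModels(nn.Module):
        def __init__(self, num_classes, num_positions=9, feat_dim=512):
            super().__init__()
            backbone = models.resnet18(weights=None)
            backbone.fc = nn.Identity()                  # shared feature extractor
            self.feature_extractor = backbone
            self.class_head = nn.Linear(feat_dim, num_classes)       # first classifier
            self.position_head = nn.Linear(feat_dim, num_positions)  # second classifier

        def forward_target(self, images):
            """First model: classification based on image features."""
            return self.class_head(self.feature_extractor(images))

        def forward_position(self, slices):
            """First auxiliary model: position-based classification."""
            return self.position_head(self.feature_extractor(slices))

        def forward_features(self, images):
            """Second auxiliary model: image features for the similarity discriminator."""
            return self.feature_extractor(images)

Because all three forward paths call the same feature_extractor, a gradient step taken for either auxiliary task also updates the parameters used by the first model, which is the sharing behaviour described above.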
In one possible implementation, the target task includes identifying an image category; the processor 804 is further configured to perform the following operations:
displaying an image recognition interface, wherein the image recognition interface comprises an image recognition control and an image import control;
when the image import control is triggered, acquiring an image to be processed in a target field, and displaying the image to be processed on the image recognition interface;
when the image recognition control is triggered, calling the trained target model to perform feature extraction on the image to be processed to obtain image features corresponding to the image to be processed, and calling the trained target model to recognize the image features corresponding to the image to be processed to obtain the image category corresponding to the image to be processed;
and displaying the image category on the image recognition interface.
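By way of illustration, the recognition flow triggered by the image recognition control can be sketched as follows; target_model, preprocess and class_names are hypothetical names standing in for the trained target model, its input preprocessing and its label set.

    # Minimal sketch, assuming PyTorch; returns the image category to be
    # displayed on the image recognition interface.
    import torch

    @torch.no_grad()
    def recognize_image(target_model, preprocess, class_names, image):
        """Extract image features from the image to be processed, classify them
        with the trained target model, and return the image category."""
        target_model.eval()
        inputs = preprocess(image).unsqueeze(0)   # the image to be processed, batched
        logits = target_model(inputs)
        predicted = logits.argmax(dim=1).item()
        return class_names[predicted]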
It should be understood that the computer device 800 described in this embodiment of the present application may perform the image processing method described in the embodiments corresponding to fig. 3 to fig. 6b, and may also implement the image processing apparatus described in the embodiment corresponding to fig. 7, which are not repeated here. The beneficial effects of the same method are likewise not described again.
In the several embodiments provided in the present application, it should be understood that the disclosed method, apparatus and system may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative; the division into units is only a division by logical function, and other division manners are possible in actual implementation; multiple units or components may be combined or integrated into another system, or some features may be omitted or not implemented. In addition, the mutual coupling, direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
Further, it should be noted that an embodiment of the present application also provides a computer storage medium, which stores the computer program executed by the aforementioned image processing apparatus. The computer program includes program instructions, and when a processor executes the program instructions, the method in the embodiments corresponding to fig. 3 to fig. 6b can be performed, which is not repeated here. The beneficial effects of the same method are likewise not described again. For technical details not disclosed in the computer storage medium embodiments of the present application, reference is made to the description of the method embodiments of the present application. By way of example, the program instructions may be deployed to be executed on one computer device, or on multiple computer devices located at one site, or on multiple computer devices distributed across multiple sites and interconnected by a communication network, which may comprise a blockchain system.
According to an aspect of the application, a computer program product or computer program is provided, comprising computer instructions stored in a computer-readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium and executes them, so that the computer device can perform the method in the embodiments corresponding to fig. 3 to fig. 6b, which is not described here again.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.
While the invention has been described with reference to a number of embodiments, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (10)

1. An image processing method, characterized in that the method comprises:
obtaining a sample image set for model training, wherein the sample image set comprises a first sample image set and a second sample image set, the first sample image set comprises a first sample image and a class label of the first sample image, the second sample image set comprises a second sample image, and the first sample image and the second sample image are both images in a target field corresponding to a target task;
carrying out image conversion processing on the sample image set to obtain an auxiliary image set; the auxiliary image set at least comprises a first augmented image set, the first augmented image set comprises a first augmented image and a category label of the first augmented image, the first augmented image is obtained after the first sample image is subjected to image enhancement processing, and the category label of the first augmented image is consistent with the category label of the first sample image;
selecting a training data set of the target task from the sample image set and the auxiliary image set, and training a first model by adopting the training data set of the target task; the training data set of the target task includes at least the first sample image set and the first augmented image set;
selecting a training data set of an auxiliary task of the target task from the sample image set and the auxiliary image set, and training a second model by adopting the training data set of the auxiliary task; the second model shares a model structure and model parameters of a feature extractor with the first model;
and determining the trained first model as a target model for executing the target task, wherein the target model is used for identifying images in the target field.
2. The method of claim 1, wherein the set of sample images further comprises a third set of sample images, the third set of sample images comprising a third sample image and a class label for the third sample image; the third sample image is an image in a non-target field corresponding to the target task; the training dataset of the target task further comprises the third sample image set;
the training of the first model with the training data set of the target task includes:
calling a first model to identify a first sample image in the first sample image set to obtain a category prediction label corresponding to the first sample image;
calling the first model to identify a first augmented image in the first augmented image set to obtain a category prediction label corresponding to the first augmented image;
calling the first model to identify a third sample image in the third sample image set to obtain a category prediction label corresponding to the third sample image;
and adjusting the model parameters of the first model according to the difference between the class label of the first sample image and the class prediction label corresponding to the first sample image, the difference between the class label of the first augmented image and the class prediction label corresponding to the first augmented image, and the difference between the class label of the third sample image and the class prediction label corresponding to the third sample image.
3. The method of claim 1, wherein the auxiliary image set further comprises a first segmented image set and a second segmented image set, the first segmented image set comprising a first segmented image and a position label of the first segmented image; the first segmented image is obtained by carrying out image segmentation processing on the first sample image; the second segmented image set comprises a second segmented image and a position label of the second segmented image, and the second segmented image is obtained by carrying out image segmentation processing on the second sample image; the training data set of the auxiliary task comprises the first segmented image set and the second segmented image set;
training a second model using the training data set of the auxiliary task, comprising:
calling the second model to identify a segmented image in the training data set of the auxiliary task to obtain a position prediction label corresponding to the segmented image; the segmented images in the training data set of the auxiliary task comprise the first segmented image and/or the second segmented image;
and adjusting the model parameters of the second model according to the difference between the position label and the position prediction label of the segmented image.
4. The method of claim 1, wherein the set of auxiliary images further comprises a second set of augmented images, the second set of augmented images comprising a second augmented image obtained by image enhancement processing of the second sample image; the training dataset of the auxiliary task comprises the sample image set and the auxiliary image set;
training a second model using the training data set of the auxiliary task, comprising:
calling the second model to perform feature extraction on the sample images in the sample image set to obtain image features corresponding to the sample images;
calling the second model to perform feature extraction on the augmented images in the auxiliary image set to obtain image features corresponding to the augmented images;
and training the second model according to the feature similarity between the image features corresponding to the sample image and the image features corresponding to the augmented image.
5. The method of claim 1, wherein the second model comprises a first auxiliary model and a second auxiliary model; the first auxiliary model is a location-based classification model; the second auxiliary model is a discriminant model based on feature similarity.
6. The method according to any of claims 1-3, wherein the first model is a classification model based on image features; the first model comprises a first feature extractor and a first classifier connected with the first feature extractor;
the second model is a first auxiliary model which comprises a second feature extractor and a second classifier connected with the second feature extractor; the first feature extractor shares a model structure and model parameters with the second feature extractor.
7. The method of claim 1, 2 or 4, wherein the first model is a classification model based on image features; the first model comprises a first feature extractor and a first classifier connected with the first feature extractor;
the second model is a second auxiliary model which comprises two third feature extractors and a similarity discriminator; the first feature extractor and the third feature extractors share a model structure and model parameters.
8. The method of claim 1 or 5, wherein the first model is a classification model based on image features; the first model comprises a first feature extractor and a first classifier connected with the first feature extractor;
the second model comprises a first auxiliary model and a second auxiliary model, the first auxiliary model comprises a second feature extractor and a second classifier connected with the second feature extractor, and the second auxiliary model comprises two third feature extractors and a similarity discriminator respectively connected with the two third feature extractors; the first feature extractor shares a model structure and model parameters with the second feature extractor and the third feature extractors.
9. The method of claim 1, wherein the target task comprises identifying a category of images; the method further comprises the following steps:
displaying an image recognition interface, wherein the image recognition interface comprises an image recognition control and an image import control;
when the image import control is triggered, acquiring an image to be processed in a target field, and displaying the image to be processed on the image recognition interface;
when the image recognition control is triggered, calling the trained target model to perform feature extraction on the image to be processed to obtain image features corresponding to the image to be processed, and calling the trained target model to recognize the image features corresponding to the image to be processed to obtain the image category corresponding to the image to be processed;
and displaying the image category on the image recognition interface.
10. An image processing apparatus characterized by comprising:
the apparatus comprises an acquisition unit, a processing unit, a training unit and a determining unit, wherein the acquisition unit is used for acquiring a sample image set for model training, the sample image set comprises a first sample image set and a second sample image set, the first sample image set comprises a first sample image and a class label of the first sample image, the second sample image set comprises a second sample image, and the first sample image and the second sample image are both images in a target field corresponding to a target task;
the processing unit is used for carrying out image conversion processing on the sample image set to obtain an auxiliary image set; the auxiliary image set comprises a first augmented image set, the first augmented image set comprises a first augmented image and a category label of the first augmented image, the first augmented image is obtained after the first sample image is subjected to image enhancement processing, and the category label of the first augmented image is consistent with the category label of the first sample image;
the training unit is used for selecting a training data set of the target task from the sample image set and the auxiliary image set and training a first model by adopting the training data set of the target task; the training data set of the target task includes at least the first sample image set and the first augmented image set;
the training unit is further used for selecting a training data set of an auxiliary task of the target task from the sample image set and the auxiliary image set, and training a second model by adopting the training data set of the auxiliary task; the second model shares a model structure and model parameters of a feature extractor with the first model;
and the determining unit is used for determining the trained first model as a target model for executing the target task, and the target model is used for identifying the image in the target field.
CN202110283292.4A 2021-03-16 2021-03-16 Image processing method and device Pending CN113705301A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110283292.4A CN113705301A (en) 2021-03-16 2021-03-16 Image processing method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110283292.4A CN113705301A (en) 2021-03-16 2021-03-16 Image processing method and device

Publications (1)

Publication Number Publication Date
CN113705301A true CN113705301A (en) 2021-11-26

Family

ID=78647822

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110283292.4A Pending CN113705301A (en) 2021-03-16 2021-03-16 Image processing method and device

Country Status (1)

Country Link
CN (1) CN113705301A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114627319A (en) * 2022-05-16 2022-06-14 杭州闪马智擎科技有限公司 Target data reporting method and device, storage medium and electronic device
WO2023159819A1 (en) * 2022-02-25 2023-08-31 北京百度网讯科技有限公司 Visual processing and model training methods, device, storage medium and program product

Similar Documents

Publication Publication Date Title
Cai et al. PiiGAN: generative adversarial networks for pluralistic image inpainting
Gao et al. Salient object detection in the distributed cloud-edge intelligent network
Zhang et al. Learning 3d human shape and pose from dense body parts
Yang et al. Detecting fake images by identifying potential texture difference
Wei et al. Deep group-wise fully convolutional network for co-saliency detection with graph propagation
CN111553267B (en) Image processing method, image processing model training method and device
CN111401216A (en) Image processing method, model training method, image processing device, model training device, computer equipment and storage medium
JP2021512446A (en) Image processing methods, electronic devices and storage media
Carlier et al. The 2d shape structure dataset: A user annotated open access database
CN111626126A (en) Face emotion recognition method, device, medium and electronic equipment
CN110796593A (en) Image processing method, device, medium and electronic equipment based on artificial intelligence
CN113570684A (en) Image processing method, image processing device, computer equipment and storage medium
CN115249306B (en) Image segmentation model training method, image processing device and storage medium
CN113705301A (en) Image processing method and device
CN115565238A (en) Face-changing model training method, face-changing model training device, face-changing model training apparatus, storage medium, and program product
CN114758362A (en) Clothing changing pedestrian re-identification method based on semantic perception attention and visual masking
Saxena et al. Comparison and analysis of image-to-image generative adversarial networks: a survey
CN115131698A (en) Video attribute determination method, device, equipment and storage medium
CN116701706B (en) Data processing method, device, equipment and medium based on artificial intelligence
Pang et al. Dance video motion recognition based on computer vision and image processing
CN115546361A (en) Three-dimensional cartoon image processing method and device, computer equipment and storage medium
Liu et al. A3GAN: An attribute-aware attentive generative adversarial network for face aging
CN110659576A (en) Pedestrian searching method and device based on joint judgment and generation learning
He et al. FA-GANs: Facial attractiveness enhancement with generative adversarial networks on frontal faces
CN113011320A (en) Video processing method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination