CN110728255A - Image processing method, image processing device, electronic equipment and storage medium - Google Patents
- Publication number
- CN110728255A (application number CN201911007790.5A)
- Authority
- CN
- China
- Prior art keywords
- image data
- network
- attribute
- specific
- image
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/16—Human faces, e.g. facial parts, sketches or expressions
- G06V40/172—Classification, e.g. identification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/16—Human faces, e.g. facial parts, sketches or expressions
- G06V40/161—Detection; Localisation; Normalisation
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- General Health & Medical Sciences (AREA)
- General Physics & Mathematics (AREA)
- Oral & Maxillofacial Surgery (AREA)
- Human Computer Interaction (AREA)
- Multimedia (AREA)
- Artificial Intelligence (AREA)
- Life Sciences & Earth Sciences (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- General Engineering & Computer Science (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Image Analysis (AREA)
Abstract
The application discloses an image processing method, an image processing apparatus, an electronic device and a storage medium, and relates to the technical field of image processing. The method comprises the following steps: acquiring image data to be processed; inputting the image data to be processed into a plurality of pre-trained specific networks to obtain attribute labels corresponding to the image data, wherein each specific network is used for determining attribute labels corresponding to the image data and the attribute labels determined by the specific networks are different from each other; inputting the attribute labels determined by each specific network into a pre-trained shared network to obtain an image recognition result; and outputting the image recognition result. Because the plurality of specific networks jointly analyze the image data to obtain the plurality of attribute labels, the attribute labels can be obtained faster, and because the shared network combines the correlations among the attribute labels to obtain the image recognition result, the accuracy of the recognition result and the overall performance are improved.
Description
Technical Field
The present application relates to the field of image processing technologies, and in particular, to an image processing method and apparatus, an electronic device, and a storage medium.
Background
Existing image attribute recognition schemes mainly include attribute recognition based on traditional machine learning, attribute recognition based on convolutional neural network models, and the like. However, most existing image attribute recognition technology is based on a single model that performs a single attribute judgment, and it is therefore inefficient for multi-attribute recognition.
Disclosure of Invention
The application provides an image processing method, an image processing device, an electronic device and a storage medium, so as to overcome the defects.
In a first aspect, an embodiment of the present application provides an image processing method, including: acquiring image data to be processed; inputting the image data to be processed into a plurality of pre-trained specific networks to obtain attribute labels corresponding to the image data, wherein each specific network is used for determining the attribute labels corresponding to the image data, and the attribute labels determined by each specific network are different from each other; inputting the attribute labels determined by each specific network into a pre-trained shared network to obtain an image recognition result, wherein the shared network is used for determining the image recognition result according to the attribute labels and the correlation of the attribute labels; and outputting the image recognition result.
In a second aspect, an embodiment of the present application further provides an image processing method, including: obtaining a plurality of sample image data, wherein each sample image data corresponds to a plurality of attribute labels; setting a sharing network and a plurality of specific networks, wherein each specific network can identify at least one attribute label, and the attribute labels which can be identified by each specific network are different from each other; inputting the sample image data into the shared network and the specific networks for training to obtain a trained shared network and specific networks; and acquiring image data to be processed, and processing the image data to be processed according to the trained shared network and the plurality of specific networks to obtain an image recognition result.
In a third aspect, an embodiment of the present application further provides an image processing apparatus, including: the device comprises a data acquisition unit, an attribute determination unit, a result acquisition unit and an output unit. And the data acquisition unit is used for acquiring the image data to be processed. The attribute determining unit is configured to input the image data to be processed into a plurality of pre-trained specific networks to obtain attribute tags corresponding to the image data, where each specific network is configured to determine the attribute tag corresponding to the image data, and the attribute tags determined by the specific networks are different from each other. And the result acquisition unit is used for inputting the attribute labels determined by each specific network into a pre-trained shared network so as to acquire an image recognition result, wherein the shared network is used for determining the image recognition result according to the attribute labels and the correlation of the attribute labels. And the output unit is used for outputting the image recognition result.
In a fourth aspect, an embodiment of the present application further provides an image processing apparatus, including: the device comprises a sample acquisition unit, a setting unit, a network training unit and a recognition unit. The system comprises a sample acquisition unit, a data processing unit and a data processing unit, wherein the sample acquisition unit is used for acquiring a plurality of sample image data, and each sample image data corresponds to a plurality of attribute labels. The device comprises a setting unit and a control unit, wherein the setting unit is used for setting a sharing network and a plurality of specific networks, each specific network can identify at least one attribute label, and the attribute labels which can be identified by each specific network are different from each other. And the training unit is used for inputting the sample image data into the shared network and the specific networks for training so as to obtain the trained shared network and the specific networks. And the identification unit is used for acquiring image data to be processed, and processing the image data to be processed according to the trained shared network and the plurality of specific networks to obtain an image identification result.
In a fifth aspect, an embodiment of the present application further provides an electronic device, including: one or more processors; a memory; one or more applications, wherein the one or more applications are stored in the memory and configured to be executed by the one or more processors, the one or more programs configured to perform the above-described methods.
In a sixth aspect, the present application also provides a computer-readable storage medium, where the readable storage medium stores program code executable by a processor, and a plurality of instructions in the program code, when executed by the processor, cause the processor to execute the above method.
According to the image processing method, the image processing apparatus, the electronic device and the storage medium provided by the application, a shared network and a plurality of specific networks are trained in advance, wherein each specific network is used for determining an attribute label corresponding to the image data and the attribute labels determined by the specific networks are different from each other. When image data to be processed is obtained, the image data to be processed is input into the plurality of specific networks; each specific network identifies the attributes it is able to recognize, so that the attribute labels corresponding to the image data to be processed are identified by the specific networks respectively, which speeds up the identification of the attribute labels of the whole image data and yields the attribute labels corresponding to the image data. The attribute labels corresponding to the image data are then input into the shared network, and the shared network determines the image recognition result according to the attribute labels and the correlations among them and outputs the image recognition result. Therefore, because the plurality of specific networks jointly analyze the image data to obtain the plurality of attribute labels, the attribute labels can be obtained faster, and because the shared network combines the correlations among the attribute labels to obtain the image recognition result, the accuracy of the recognition result and the overall performance are improved.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
FIG. 1 is a flow chart of a method of image processing according to an embodiment of the present application;
FIG. 2 is a flow chart of a method of image processing according to another embodiment of the present application;
FIG. 3 illustrates a method flow diagram of an image processing method provided by yet another embodiment of the present application;
fig. 4 illustrates a flowchart of a method of S310 in the image processing method illustrated in fig. 3 according to an embodiment of the present application;
fig. 5 illustrates a flowchart of a method of S310 in the image processing method illustrated in fig. 3 according to another embodiment of the present application;
FIG. 6 shows a schematic diagram of a measurement region provided by an embodiment of the present application;
FIG. 7 is a schematic diagram illustrating connection between a specific network and a shared network provided by an embodiment of the present application;
FIG. 8 is a schematic diagram of sub-image data provided by an embodiment of the present application;
FIG. 9 is a schematic diagram illustrating the orientation of a human face provided by an embodiment of the present application;
FIG. 10 is a flow chart of a method of image processing according to yet another embodiment of the present application;
fig. 11 shows a block diagram of an image processing apparatus according to an embodiment of the present application;
fig. 12 shows a block diagram of an image processing apparatus according to another embodiment of the present application;
fig. 13 shows a block diagram of an image processing apparatus according to still another embodiment of the present application;
FIG. 14 shows a block diagram of an electronic device provided by an embodiment of the present application;
fig. 15 shows a storage unit for storing or carrying program code for implementing the image processing method according to the embodiment of the present application.
Detailed Description
In order to make the technical solutions better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application.
Face recognition is a technology that identifies different people based on facial appearance characteristics; its application scenarios are broad, and related research and applications have been carried out for decades. With the development of related technologies such as big data and deep learning in recent years, the performance of face recognition has improved dramatically, and it is applied in scenarios such as identity authentication, video surveillance, and beauty and entertainment. The problem of person-ID comparison, that is, face recognition between a standard ID photo and a daily-life photo, has attracted increasing attention, because only the ID photo needs to be deployed in the database to recognize the target person, which avoids the trouble of having the target person register a life photo in the system.
The existing face attribute recognition technical scheme mainly comprises an attribute recognition scheme based on traditional machine learning, an attribute recognition scheme based on a Convolutional Neural Network (CNN) model and the like.
In some face recognition technologies, drawing on the concept of multi-task learning, a face region is first extracted from an image by applying a face detection algorithm, a convolutional neural network is then used to learn the convolutional layers of preset analysis tasks on a face library to obtain a face analysis model, and the prediction of facial emotion is completed.
In other face recognition technologies, based on the idea of multi-task learning, multi-task cascade learning is achieved by adding auxiliary information such as gender, smiling, glasses-wearing and pose during training; however, these technologies merely treat the various face attributes as labels and achieve face alignment through cascading.
The multi-task learning method not only performs single face attribute recognition but can also achieve multi-attribute prediction. For example, a multi-task learning method has been introduced into race and gender recognition from face images, treating different semantics as different tasks and proposing semantics-based multi-task feature selection applied to race and gender recognition; however, when the network structure is constructed, race and gender are still solved independently as two tasks, so the model contains a large amount of redundancy and real-time prediction cannot be achieved.
Therefore, the most common existing face attribute recognition technology is based on a single model and performs a single attribute judgment; that is, only one task is learned at a time under a unified model, a complex problem is first decomposed into theoretically independent sub-problems, and in each sub-problem the samples in the training set reflect the information of a single task only. However, a face image contains various attribute information such as race, gender and age, the recognition tasks corresponding to the different information are correlated, and certain correlated information is shared among the tasks during learning. Introducing the multi-task learning method into the recognition of race, gender and age from face images, treating different semantics as different tasks, and applying semantics-based multi-task feature selection to multi-attribute recognition can significantly improve the generalization capability and recognition effect of the learning system.
In view of the above problems, a face-image race and gender recognition method based on multi-task learning has appeared. This method introduces multi-task learning into race and gender recognition of face images, treats different semantics as different tasks and proposes semantics-based multi-task feature selection; although the generalization capability and recognition effect of the learning system are significantly improved, the method adopts the traditional machine learning approach, so its efficiency is greatly reduced.
Existing patents also propose combining deep learning with the attribute recognition task, but they only adopt a multi-task learning mode: a three-stage training process is proposed in which three kinds of features (face parts, facial action units and emotion-space values) are learned on a convolutional network respectively to complete the task of facial emotion analysis, without achieving a multi-attribute output result.
Therefore, in order to overcome the above defects, an embodiment of the present application provides an image processing method. As shown in fig. 1, the method includes S101 to S104.
S101: image data to be processed is acquired.
The image data may be an offline image file that has been downloaded in the electronic device in advance, or an online image file.
The online image data corresponds to one or more frames of images in a video file and is the part of the video file that has already been sent to the electronic device. For example, if the video file is a movie and the electronic device has received the data covering playing time 0 to 10 minutes of that movie, the online image data corresponding to the movie is the data for playing time 0 to 10 minutes. After acquiring each piece of online image data, the client can decode it to obtain the corresponding layer to be rendered, and then merge and display the layers, so that multiple video pictures can be displayed on the screen.
As an implementation manner, the electronic device includes a plurality of clients capable of playing video files. When a client of the electronic device plays a video, the electronic device acquires the video file to be played and then decodes it; specifically, soft decoding or hard decoding may be adopted to decode the video file. After decoding, multiple frames of image data to be rendered corresponding to the video file are obtained, and these frames then need to be rendered before being displayed on the display screen.
As another embodiment, the image data may also be an image captured by a camera of the electronic device at the request of a specific application program in the electronic device. Specifically, when the specific application program executes a certain function, it calls the camera to capture the image and requests the electronic device to determine an image recognition result by the method of the present application; the image recognition result is then sent to the specific application program, which executes a corresponding operation according to the image recognition result.
S102: and inputting the image data to be processed into a plurality of pre-trained specific networks to obtain attribute labels corresponding to the image data.
Each specific network is used for determining an attribute label corresponding to the image data, and the attribute labels determined by the specific networks are different from each other.
Specifically, when the specific networks are trained in advance, the sample image data input into the specific networks includes a plurality of attribute labels, for example, black hair in a face image or a white car in a vehicle image. The value of each attribute label is 0 or 1, where 0 indicates that the attribute is not present and 1 indicates that the attribute is present. The attribute labels are feature values of the image preset for obtaining an image recognition result, and the role of a specific network is to determine whether a preset attribute label is present in the image.
Specifically, each specific network is capable of determining an attribute label corresponding to the image data and, in some embodiments, may determine at least one attribute label corresponding to the image data. For example, suppose the plurality of specific networks includes a first specific network and a second specific network, and the attribute labels include label 1, label 2 and label 3. The first specific network is used to identify label 1: if the image data contains label 1, the first specific network associates label 1 with the image data, or gives the recognition result label 1 = 1; if label 1 is not present, it gives the recognition result label 1 = 0. The second specific network is used to determine label 2 and label 3. In this way, label 1, label 2 and label 3 are identified separately by different specific networks, which improves recognition efficiency and avoids the excessive computation that would result from identifying label 1, label 2 and label 3 with the same specific network; moreover, since the first specific network is only used to identify label 1, it does not need to be trained to recognize label 2 and label 3, which reduces the training cost.
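The following is a minimal sketch of this routing in Python, assuming each trained specific network is available as a callable that returns the 0/1 values of its own, disjoint subset of attribute labels; the network names and label names are illustrative assumptions rather than part of the disclosure. Because the outputs of the specific networks do not depend on one another, they are evaluated in parallel threads here.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict

import numpy as np

def run_specific_networks(
    image: np.ndarray,
    networks: Dict[str, Callable[[np.ndarray], Dict[str, int]]],
) -> Dict[str, int]:
    """Run every specific network on the same image in parallel threads and
    merge the disjoint attribute-label dictionaries they return."""
    merged: Dict[str, int] = {}
    with ThreadPoolExecutor(max_workers=len(networks)) as pool:
        futures = [pool.submit(net, image) for net in networks.values()]
        for future in futures:
            merged.update(future.result())  # label sets are disjoint by design
    return merged

# Toy stand-ins for two trained specific networks (hypothetical).
first_specific = lambda img: {"label_1": 1}
second_specific = lambda img: {"label_2": 0, "label_3": 1}

image = np.zeros((112, 112), dtype=np.float32)
print(run_specific_networks(image, {"first": first_specific, "second": second_specific}))
```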
It should be noted that the plurality of specific networks may be executed simultaneously, that is, multiple threads may run at the same time; the networks are not in a cascade relationship, meaning that the output of one specific network does not require the input of any other specific network. The structure of the specific networks is presented in the subsequent embodiments.
In the embodiment of the application, a specific network mainly serves to segment a target object from an image and identify it; that is, the specific network may also be a target detection network, which combines the segmentation and identification of the target object. Commonly used target detection networks include the GOTURN network, the MobileNet-SSD deep convolutional neural network, the Faster R-CNN neural network, the YOLO neural network, and the SPP-Net (Spatial Pyramid Pooling) neural network. The GOTURN network is a target detection algorithm that performs offline training with a convolutional neural network, extracting and identifying features with a CNN classification network pre-trained on existing large-scale classification datasets.
S103: and inputting the attribute label determined by each specific network into a pre-trained shared network to obtain an image recognition result.
The shared network is used for determining the image recognition result according to the attribute labels and the correlations among the attribute labels. Specifically, the shared network focuses on learning the information shared by all attribute labels; for example, when the attribute label for a raised mouth corner and the attribute label for upturned, white-showing eyes appear at the same time, a corresponding emotion is expressed, and this correlation between the two attribute labels is what the shared network recognizes and uses to obtain the recognition result. That is, after the shared network is trained in advance, it can identify the correlations among the attribute labels and obtain the image recognition result according to those correlations.
S104: and outputting the image recognition result.
The image recognition result may be output by displaying it on a screen or by sending it to a requesting end that requested the image recognition result. The requesting end may be a server communicating with the electronic device, another electronic device, or an application installed in the electronic device. The execution subject of the method may be an application in the electronic device capable of image recognition, or the operating system of the electronic device; after obtaining the image recognition result, it sends the result to the requesting end, and the requesting end executes a certain operation, such as transaction payment or screen unlocking, according to the image recognition result.
Referring to fig. 2, an image processing method provided in an embodiment of the present application is shown, where the method includes: s201 to S207.
S201: raw image data is acquired.
The original image data may be the gray-scale values corresponding to the image, that is, the value of each pixel in the image is a value in the interval [0, 255], i.e. a gray-scale value. As an embodiment, the image acquired by the electronic device may be a color image; the color image is then converted to a gray-scale map, and the gray-scale values of the pixels in the gray-scale map constitute the original image data.
In addition, it should be noted that the original image data may be, for example, data collected by a camera of the electronic device; in that case the image processing method is applied to real-time analysis of the image data collected by the camera, and the analysis is directed at the recognition of face attributes.
Specifically, the image data may also be an image captured by a camera of the electronic device at the request of a specific application program in the electronic device: when the specific application program executes a certain function, it calls the camera to capture the image, requests the electronic device to determine an image recognition result by the method of the present application, receives the image recognition result, and executes a corresponding operation according to it. The specific application program may be a screen-unlocking APP or a payment APP in the electronic device. For example, the screen-unlocking APP performs face recognition on a face image acquired by the camera to determine identity information: it judges whether the face image matches a preset face image; if they match, unlocking succeeds, and if they do not match, unlocking fails.
The preset face image may be a face image preset by the user and may be stored in the mobile terminal, in a server, or in some memory from which the mobile terminal can obtain it. Specifically, preset feature information of the face image may be stored: if the face image is a two-dimensional image, the preset feature information is the facial feature point information of the face image previously entered by the user; if the face image is a three-dimensional image, the preset feature information is the facial three-dimensional information previously entered by the user. Whether the face image satisfies the preset condition is judged by acquiring the feature information of the face image and comparing it with the preset feature information entered by the user in advance. If they match, the face image satisfies the preset condition and is determined to have the authority to unlock the screen of the mobile terminal; if they do not match, the face image does not satisfy the preset condition and the screen is not unlocked.
In this embodiment of the present application, the image data includes a face image. For the identification of the image data, after the image data to be processed is acquired, it may first be determined whether the image data includes a face, and the subsequent operations are performed only if it does. Specifically, when the image acquired by the camera is a two-dimensional image, whether a face image has been acquired can be determined by searching the image for facial feature points; if facial feature points are found, the acquired face image is sent to the processor of the mobile terminal so that the processor can analyze the face image and execute the screen-unlocking operation. As another embodiment, the camera includes structured light; whether three-dimensional face information exists is determined according to the three-dimensional information collected by the structured light, and if it exists, the collected image is sent to the processor of the mobile terminal.
In addition, if the image acquired by the camera does not include a face image, the operation returns to continue judging whether the image acquired by the camera includes a face image, and face-acquisition reminder information may be sent to remind the user to use the camera to acquire a face image. Specifically, the face-acquisition reminder information may be displayed on the current interface of the electronic device.
S202: and normalizing the original image data to obtain image data to be processed.
Each pixel value in the original image is normalized, that is, the original value in the range 0 to 255 is mapped to a value in the interval [0, 1]. This speeds up the subsequent computation of the specific networks and the shared network and therefore the whole image processing. Specifically, the normalization of the original image data may adopt mean-variance normalization or gray-level transformation normalization.
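As an illustration only, the following sketch shows the two normalization options mentioned above, assuming the raw image is an 8-bit grayscale array; the helper names are hypothetical.

```python
import numpy as np

def normalize_gray(raw: np.ndarray) -> np.ndarray:
    """Gray-level transformation normalization: map pixel values from [0, 255] to [0, 1]."""
    return raw.astype(np.float32) / 255.0

def normalize_mean_var(raw: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Mean-variance normalization: zero-mean, unit-variance gray values."""
    x = raw.astype(np.float32)
    return (x - x.mean()) / (x.std() + eps)

raw_image = np.random.randint(0, 256, size=(112, 112), dtype=np.uint8)
to_process = normalize_gray(raw_image)  # image data to be processed
```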
In addition, redundant information, which refers to the difference between the compressed distributions, may be removed after the normalization processing of the original image data.
The raw image data after the normalization process is used as the image data to be processed.
S203: and determining attribute labels corresponding to each specific network.
Specifically, the attribute tags that can be identified by the specific network are the attribute tags corresponding to the specific network. The attribute labels that can be identified by the specific network are set when the specific network is trained, and in particular, reference is made to the following embodiments.
S204: and dividing the image data into a plurality of sub-image data according to the attribute label corresponding to each specific network.
Each attribute label in the image data corresponds to a position in the image. Taking a face image as an example, the position corresponding to the attribute label for hair color is the hair region, and the position corresponding to the attribute label for eye color is the eye region, so the position corresponding to each attribute label can be determined in advance; for example, the region corresponding to each attribute label is set when the specific networks are trained. The image data can therefore be divided into a plurality of pieces of sub-image data, each corresponding to a region of the image, such that the attribute labels in that region all correspond to the same specific network; that is, the attribute labels that a specific network can identify are located in the region corresponding to its sub-image data.
In this way, the image sub-region corresponding to each specific network, i.e. the sub-image data corresponding to each specific network, can be obtained. For example, the image is divided into a first region, a second region and a third region; the attribute labels recognizable by the first specific network are distributed in the first region, those recognizable by the second specific network in the second region, and those recognizable by the third specific network in the third region. The image data is then divided into three pieces of sub-image data, namely first sub-image data corresponding to the first region, second sub-image data corresponding to the second region, and third sub-image data corresponding to the third region.
In addition, in order to divide the regions more conveniently based on pixel coordinates, the images may be adjusted to one direction so that the designated regions are located in the same positions. Taking a face image as an example, when the original image is acquired, the face region in the original image is cropped and the image rotation is adjusted to a certain direction so that the face has a fixed orientation; for example, the forehead of the face is always located in the upper part of the image and the chin in the lower part of the image.
S205: and inputting the sub-image data into a specific network corresponding to the sub-image data.
As an embodiment, after the region of the image in which the attribute labels corresponding to each specific network are located is determined, the image may be divided into a plurality of regions, each corresponding to one specific network. The image is divided into a plurality of sub-images according to the determined regions, and the pixel data corresponding to each sub-image, i.e. the pixel data processed in S202, constitutes the sub-image data.
For example, with the first region, the second region and the third region described above, the image is divided into three sub-images, namely a first sub-image, a second sub-image and a third sub-image. The pixel data corresponding to the pixel values in the first sub-image is input into the first specific network, the pixel data of the second sub-image is input into the second specific network, and the pixel data of the third sub-image is input into the third specific network.
Therefore, the whole image data does not need to be input into every specific network; only the sub-image data corresponding to the attribute labels that a specific network can identify is input into that network, which reduces the amount of computation of each specific network and improves the overall recognition speed.
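A minimal sketch of this region-based routing (S204 and S205), assuming each specific network's region is known in advance as a (top, bottom, left, right) box in pixel coordinates; the boxes and the toy network callables are illustrative assumptions.

```python
import numpy as np

REGIONS = {                      # region holding the labels each network identifies
    "first":  (0, 40, 0, 112),   # (top, bottom, left, right), e.g. the upper part
    "second": (40, 70, 0, 112),
    "third":  (70, 112, 0, 112),
}

def split_and_route(image, networks):
    """Crop one sub-image per specific network and feed only that crop to it,
    instead of sending the whole image to every specific network."""
    results = {}
    for name, (top, bottom, left, right) in REGIONS.items():
        sub_image = image[top:bottom, left:right]   # sub-image data for this network
        results.update(networks[name](sub_image))
    return results

networks = {                                        # toy stand-ins for trained networks
    "first":  lambda sub: {"label_1": int(sub.mean() > 0.5)},
    "second": lambda sub: {"label_2": 0},
    "third":  lambda sub: {"label_3": 1},
}
print(split_and_route(np.random.rand(112, 112).astype(np.float32), networks))
```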
S206: and inputting the attribute label determined by each specific network into a pre-trained shared network to obtain an image recognition result.
S207: and outputting the image recognition result.
It should be noted that, for the parts not described in detail in the above steps, reference may be made to the foregoing embodiments, and details are not described herein again.
In addition, before S102 or S204 is executed, the specific networks and the shared network need to be trained. Specifically, the training process may take place after S101 and before S102, after S201 and before S204, or before S101 and S201; in this embodiment, the specific networks and the shared network may be trained before the image recognition method is executed.
Specifically, referring to fig. 3, a training process of training a specific network and a shared network in an image processing method provided in an embodiment of the present application is shown, specifically, as shown in fig. 3, the method includes: s310 to S370.
S310: obtaining a plurality of sample image data, wherein each sample image data corresponds to a plurality of attribute labels.
Specifically, the sample image data is image data that has already been labeled; it may be images acquired in advance and labeled manually, with each labeled point corresponding to an attribute label. For example, the CelebA face attribute dataset may be used as the experimental dataset. This dataset contains about 200,000 face images, and each image provides 40 face attribute labels and the locations of 5 facial key points. According to the official CelebA split, about 100,000 face images are taken for training the network model, about 10,000 images are used for validation, and 10,000 images are used for testing the network model.
For this public face attribute dataset, the 40 attribute labels corresponding to each face picture can be obtained; the value of each label is 0 or 1, where 0 indicates that the attribute is not present and 1 indicates that the attribute is present.
It should be noted that both the sample image data and the image data to be processed include face images; that is, the image processing method provided in the embodiment of the present application is applied to face attribute recognition, and the specific networks and the shared network are likewise trained for face attribute recognition.
Further, in order to improve the accuracy of image recognition, the human face may be aligned, and specifically, referring to fig. 4, S310 may include: s311 to S314.
S311: a plurality of sample image data is acquired.
Specifically, the step can refer to the above description, and is not repeated herein.
S312: and identifying the position information of the face key points in each sample image data in the sample image.
Specifically, the face key points may be the facial features in the face image; for example, the face key points may be the eyes, the nose, the mouth and the like. The recognition method may be a facial-feature recognition method for face images, for example PCA (Principal Component Analysis), used to determine the facial features in the face image and thereby the position information, i.e. the pixel coordinates, of the face key points in the face image.
Specifically, after a sample image is acquired, the face region is cropped, face correction is performed on the sample data to be trained, and the position information of the face key points, such as the eyes, nose and mouth, is determined.
S313: and adjusting the face orientation in each sample image data to accord with a preset orientation according to the position information of the face key point of each sample image data.
Specifically, the preset orientation may be the face facing straight ahead, meaning that the forehead of the face is in the upper part of the image and the chin in the lower part. The orientation of the face in the image can be determined from the position information of the face key points. Specifically, the same pixel coordinate system is set for each sample image; for example, the pixel coordinate system may be established with the top-left vertex of the sample image as the origin, so that the pixel coordinates of the face key points in the face image can be obtained, and the positional relationship between the forehead and the chin can be determined from them. For instance, if it is determined that the eyes are on the left side of the image, the mouth is on the right side, and the difference between the vertical coordinates of the eyes and the mouth is smaller than a specified value, it can be concluded that the eyes and the mouth lie on the same horizontal line, and rotating the image clockwise by 90° makes the face orientation in the sample image data conform to the preset orientation. As an embodiment, the preset orientation may be the face oriented within 15 degrees of the front.
In addition, after the face orientation in each piece of sample image data is adjusted to conform to the preset orientation according to the position information of its face key points, the size of the sample image may also be adjusted in order to reduce the amount of computation. Specifically, the direction of each object to be predicted is adjusted to the preset direction standard through the positioning of face key points such as the eyes, nose and mouth, ensuring that the face of each object to be predicted is oriented within 15 degrees of the front and achieving face alignment, and a margin of a predetermined proportion is added around the face region. At the same time, to reduce the amount of computation, the image size is set to a specified size, for example 112 × 112. Specifically, the whole image may be scaled down to the specified size, or the sample image may be cut out with a window of the specified size; for example, the center point of the sample image may be used as the center point of the window, and the image area covered by the window is taken as the resized image. As an embodiment, the window size may be 112 × 112.
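A minimal sketch of the alignment and resizing described above, assuming the key-point coordinates are already available from a separate detector; the use of OpenCV for the rotation and resizing is an implementation assumption, not the method fixed by the application.

```python
import cv2
import numpy as np

def align_face(image: np.ndarray, left_eye, right_eye, size: int = 112) -> np.ndarray:
    """Rotate the image so the two eyes lie on a horizontal line (forehead above
    chin, face roughly frontal), then resize to the specified size."""
    (lx, ly), (rx, ry) = left_eye, right_eye
    angle = np.degrees(np.arctan2(ry - ly, rx - lx))       # tilt of the eye line
    center = ((lx + rx) / 2.0, (ly + ry) / 2.0)
    rotation = cv2.getRotationMatrix2D(center, angle, 1.0)
    upright = cv2.warpAffine(image, rotation, (image.shape[1], image.shape[0]))
    return cv2.resize(upright, (size, size))                # fixed 112 x 112 input

aligned = align_face(np.zeros((200, 160), dtype=np.uint8),
                     left_eye=(60, 80), right_eye=(100, 78))
```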
S314: and taking the adjusted sample image data as the sample image data used for training the shared network and the plurality of initial specific networks.
Specifically, if the adjustment consists of adjusting the face orientation in each piece of sample image data to conform to the preset orientation according to the position information of its face key points, then the sample image data after the orientation adjustment is used as the sample image data for training the shared network and the plurality of initial specific networks. If the adjustment includes both adjusting the face orientation to conform to the preset orientation and adjusting the sample image to the specified size, then the sample image data after both the orientation adjustment and the resizing is used as the sample image data for training the shared network and the plurality of initial specific networks.
Further, in order to increase the number of data samples and improve the generalization of the trained image processing model, i.e. the face recognition model consisting of the shared network and the plurality of specific networks, referring to fig. 5, S310 may include S311, S315, S316 and S317.
S311: a plurality of sample image data is acquired.
Specifically, the step can refer to the above description, and is not repeated herein.
S315: and performing data enhancement processing on the plurality of sample image data to ensure that the illumination intensity and the contrast of each sample image data are randomly distributed in a preset interval.
Specifically, the illumination intensity of the objects to be trained is transformed according to a preset illumination intensity interval, so as to obtain data in which the illumination intensity of each object to be trained is randomly distributed within the preset illumination intensity interval; and the contrast of the objects to be trained is transformed according to a preset contrast interval, so as to obtain data in which the contrast of each object to be trained is randomly distributed within the preset contrast interval.
Specifically, the preset illumination intensity interval may be a predefined range of illumination intensity. After the illumination intensity of each pixel in the sample image is obtained, the illumination intensity of the pixels can be adjusted into the preset interval. As an implementation, the distribution of illumination intensity over the pixels of the sample image may be counted, so that pixels with higher illumination intensity are mapped to higher values within the preset interval and pixels with lower illumination intensity to lower values. In addition, the continuity of the distribution of the pixel illumination intensities within the preset interval may be increased, that is, the difference in intensity between two adjacent sub-intervals of the illumination-intensity distribution is kept no greater than a specified value, so that the illumination intensity is randomly distributed within the preset interval, which increases the diversity of the data. Specifically, the illumination intensities of the pixels in each piece of sample image data may be randomly distributed within its corresponding preset illumination intensity interval, and the preset intervals corresponding to different pieces of sample image data need not all be the same, which further increases the diversity of the data and improves the generalization of the later model training.
Similarly, the contrast of the objects to be trained is transformed according to a preset contrast interval; obtaining data in which the contrast of each object to be trained is randomly distributed within the preset contrast interval can follow the adjustment process for illumination intensity. Thus the contrast of the pixels in each piece of sample image data can be randomly distributed within its corresponding preset contrast interval, and the preset contrast intervals corresponding to different pieces of sample image data need not all be the same, which further increases the diversity of the data and improves the generalization of the later model training.
Before the enhancement processing is performed, the sample data may first be normalized so that the pixel values are mapped from [0, 255] to [0, 1], and the redundant information contained in the sample data may be removed.
S316: and each piece of sample image data after enhancement processing is cut according to a preset random cutting proportion, and the size of each piece of cut sample image data is a preset size.
The object to be trained is cropped according to a preset random cropping proportion and adjusted to a preset size, where the preset size may be 112 x 112; the object to be trained is also flipped in the horizontal direction.
It should be noted that cropping the object to be trained according to the preset random cropping proportion and adjusting it to the preset size can follow the cropping manner of the object to be trained described above, so the preset size is the same as the specified size.
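The enhancement and cropping of S315 and S316 might look like the following sketch, assuming the sample is a float image already normalized to [0, 1]; the interval bounds and crop proportion are illustrative values, not figures taken from the application.

```python
import cv2
import numpy as np

rng = np.random.default_rng(0)

def augment(sample: np.ndarray, out_size: int = 112) -> np.ndarray:
    # Random illumination (brightness) and contrast within preset intervals.
    brightness = rng.uniform(-0.1, 0.1)
    contrast = rng.uniform(0.8, 1.2)
    sample = np.clip((sample - 0.5) * contrast + 0.5 + brightness, 0.0, 1.0)

    # Random crop at a random proportion of the shorter side.
    h, w = sample.shape[:2]
    crop = int(min(h, w) * rng.uniform(0.9, 1.0))
    top = int(rng.integers(0, h - crop + 1))
    left = int(rng.integers(0, w - crop + 1))
    sample = sample[top:top + crop, left:left + crop]

    # Horizontal flip with probability 0.5.
    if rng.random() < 0.5:
        sample = np.ascontiguousarray(sample[:, ::-1])

    # Back to the preset size (112 x 112).
    return cv2.resize(sample, (out_size, out_size))

augmented = augment(np.random.rand(128, 128).astype(np.float32))
```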
S317: and taking the clipped sample image data as the sample image data used for training the shared network and the plurality of initial specific networks.
It should be noted that steps S311 to S314 may replace S310, that is, S320 is executed after S311, S312, S313 and S314; steps S311, S315, S316 and S317 may replace S310, that is, S320 is executed after S311, S315, S316 and S317; or steps S311 to S317 may together replace S310, that is, S320 is executed after S311, S312, S313, S314, S315, S316 and S317.
S320: a shared network and a plurality of specific networks are set.
Each specific network is capable of identifying at least one attribute label, and the attribute labels that the specific networks can identify are different from each other. If one specific network were configured for each attribute label, there would be significant computational overhead. For example, taking face image recognition as an example and assuming the attribute labels total 40, directly treating the 40 face attributes as 40 independent tasks would incur an enormous computational overhead and would ignore the evident positional correlation between the face attributes. Therefore, the face may be divided into a plurality of regions, each corresponding to one specific network. Specifically, one implementation of setting up the shared network and the plurality of specific networks is to divide a plurality of measurement regions, each corresponding to a different region of the face, and to set up a plurality of specific networks according to the plurality of measurement regions, where each specific network corresponds to one measurement region and is used to determine the attribute labels in its measurement region.
Specifically, four specific networks may be provided, namely an upper specific network, a middle specific network, a lower specific network and a full-face specific network. Correspondingly, the attribute labels are divided into four groups, namely an upper group, a middle group, a lower group and a full-face group; each group corresponds to its own attribute labels, and the attribute labels of the groups differ from each other, that is, each specific network can identify the attribute labels of its corresponding group. The attribute classification of each group can be treated as a separate attribute learning task according to the attributes' respective locations. After the attributes are divided into 4 groups, the attribute classification problem of each attribute group is regarded as a subtask; specifically, the attribute labels are as shown in the following table:
as shown in fig. 6, the sample image is divided into four regions, i.e., an upper region m1, a middle region m2, a lower region m3, and a full-face region m4, and as an embodiment, the upper region m1 is a region between the top side of the image and the abscissa of the position of the eye positioned most downward in both eyes, and specifically, as shown in fig. 6, a straight line parallel to the abscissa axis is provided on the abscissa of the position of the eye positioned most downward in both eyes, and is referred to as a first straight line, and a region between the first straight line and the top side of the image is defined as the upper region. One position point is selected in the area between the nose and the upper lip, the position point can be a middle position point, a straight line parallel to the abscissa axis is arranged and is marked as a second straight line, the area between the first straight line and the second straight line is used as a middle area, the area between the bottom side edge of the second execution sum image and the bottom side edge of the second execution sum image is used as a lower area, and the area between the tail end of the chin of the face and the top end of the hair of the face is used as a full face area, wherein the full face area can enclose the face and the hair. Among them, the above-described regions are measurement regions, i.e., the upper region m1, the middle region m2, the lower region m3, and the full-face region m4 are four measurement regions.
In the embodiment of the application, there are four specific networks and one shared network in total, so each task has an independent specific network. Unlike a branched structure, the parameters of the specific networks are not shared between tasks, in order to better preserve the individual specificity of each task. The shared network serves as an independent network that does not correspond to a specific learning task; instead, it extracts complementary information between tasks in order to learn the correlations among them. The specific networks and the shared network are connected through simple connecting units, so as to maximize the information flow between them. Specifically, the connection relationship between the specific networks and the shared network is shown in fig. 7: the input of each layer of the shared network includes, in addition to the output features of its own previous layer, the output features of the previous layers of all specific networks, and these features are concatenated together to form that layer's input. Meanwhile, the input of each layer of a specific network includes, in addition to the output features of its own previous layer, the output features of the previous layer of the shared network; the two are concatenated to form the final input. Note that fig. 7 only shows the connection relationship between two specific networks and the shared network; the connection relationship for four specific networks can be derived by analogy with fig. 7.
The multi-task attribute recognition model constructed in the embodiment of the application comprises 4 specific networks and 1 shared network. The specific networks focus on learning the specific features of each feature group, while the shared network focuses on learning the information shared by all feature groups. The specific networks and the shared network are connected, and exchange information, through local sharing units, forming the whole locally shared multi-task face multi-attribute classification network. Each specific network and the shared network have the same network structure: each contains 5 convolutional layers and 2 fully connected layers, and each convolutional layer and fully connected layer is followed by a normalization layer and a ReLU (Rectified Linear Unit) activation layer. The number of output channels of each layer is the same across the specific networks, while the number of output channels of the shared network differs from that of the specific networks.
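The locally shared structure described above can be sketched in PyTorch as follows. The sketch keeps the stated skeleton (four specific networks plus one shared network, five convolutional layers per branch, each followed by a normalization layer and a ReLU, with the shared branch seeing the previous-stage features of all specific branches and each specific branch seeing the previous-stage features of the shared branch); the channel widths, the pooling, the per-group label counts and the use of the full-face image as the shared branch input are assumptions made for illustration, and for brevity the fully connected part of the shared branch is folded into the per-group heads.

```python
import torch
import torch.nn as nn

def conv_block(in_ch: int, out_ch: int) -> nn.Sequential:
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm2d(out_ch),          # normalization layer after each conv layer
        nn.ReLU(inplace=True),
        nn.MaxPool2d(2),
    )

class LocallySharedNet(nn.Module):
    def __init__(self, n_tasks=4, labels_per_task=(13, 6, 9, 12), in_ch=1):
        super().__init__()
        spec_ch = [32, 64, 96, 96, 96]   # same widths for every specific branch
        shar_ch = [16, 32, 48, 48, 48]   # shared branch uses different widths
        self.n_tasks, self.depth = n_tasks, len(spec_ch)
        self.spec_convs = nn.ModuleList()
        self.shar_convs = nn.ModuleList()
        sp, sh = in_ch, in_ch
        for d in range(self.depth):
            self.spec_convs.append(nn.ModuleList(
                [conv_block(sp + sh, spec_ch[d]) for _ in range(n_tasks)]))
            self.shar_convs.append(conv_block(sh + n_tasks * sp, shar_ch[d]))
            sp, sh = spec_ch[d], shar_ch[d]
        self.pool = nn.AdaptiveAvgPool2d(1)
        def head(in_f, out_f):           # 2 fully connected layers per branch
            return nn.Sequential(nn.Linear(in_f, 128), nn.BatchNorm1d(128),
                                 nn.ReLU(inplace=True), nn.Linear(128, out_f))
        self.spec_heads = nn.ModuleList(
            [head(sp + sh, labels_per_task[t]) for t in range(n_tasks)])

    def forward(self, sub_images, full_image):
        spec, shar = list(sub_images), full_image    # one crop per specific branch
        for d in range(self.depth):
            # Each specific branch sees the shared branch's previous features;
            # the shared branch sees every specific branch's previous features.
            new_spec = [self.spec_convs[d][t](torch.cat([spec[t], shar], dim=1))
                        for t in range(self.n_tasks)]
            shar = self.shar_convs[d](torch.cat([shar] + spec, dim=1))
            spec = new_spec
        shar_vec = self.pool(shar).flatten(1)
        outs = []
        for t in range(self.n_tasks):
            vec = torch.cat([self.pool(spec[t]).flatten(1), shar_vec], dim=1)
            outs.append(self.spec_heads[t](vec))     # logits of this group's labels
        return torch.cat(outs, dim=1)                # all 40 attribute logits

model = LocallySharedNet()
crops = [torch.randn(2, 1, 112, 112) for _ in range(4)]
logits = model(crops, torch.randn(2, 1, 112, 112))   # shape (2, 40)
```

In this sketch the channel-wise concatenation plays the role of the connecting unit between the specific networks and the shared network.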
S330: and inputting the plurality of sample image data into the shared network and the plurality of specific networks for training so as to obtain the trained shared network and the plurality of specific networks.
As an embodiment, the sample image data may be input in its entirety into each specific network for training, but each specific network can only recognize its corresponding attribute labels; for example, the upper specific network can recognize the attribute labels of the upper group and cannot recognize the other attribute labels.
As another embodiment, in order to reduce the amount of computation and increase the training speed, different parts of the sample image data may be input into different specific networks. Specifically, the sample image data is divided into a plurality of pieces of sub-sample image data according to the attribute labels corresponding to each specific network, and each piece of sub-sample image data is input into the specific network corresponding to it.
As shown in fig. 8, the same sample image is divided into four sub-sample images: the upper-left image is the sub-sample image data corresponding to the upper region m1 of the sample image shown in fig. 6, the upper-right image corresponds to the middle region m2, the lower-left image corresponds to the lower region m3, and the lower-right image corresponds to the full-face region m4. The sub-sample image data corresponding to the upper region m1 is input into the upper specific network for training the upper specific network, the sub-sample image data corresponding to the middle region m2 is input into the middle specific network, the sub-sample image data corresponding to the lower region m3 is input into the lower specific network, and the sub-sample image data corresponding to the full-face region m4 is input into the full-face specific network.
In addition, it should be noted that a part of the sample image data acquired in step S311 is used for training the network model, and another part is used for testing it. Specifically, the samples are randomly divided into a training set and a test set in a ratio of 8:2, such that data of the same person appears in only one of the two sets; the training set is used for training the face multi-attribute recognition model, and the test set is used for testing it.
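An identity-disjoint 8:2 split of this kind can be sketched, for example, with scikit-learn's GroupShuffleSplit (the variable names below are illustrative):

```python
from sklearn.model_selection import GroupShuffleSplit

def identity_disjoint_split(samples, person_ids, test_ratio=0.2, seed=0):
    # Group by person id so that all images of one person fall into a single set.
    splitter = GroupShuffleSplit(n_splits=1, test_size=test_ratio, random_state=seed)
    train_idx, test_idx = next(splitter.split(samples, groups=person_ids))
    return train_idx, test_idx
```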
The test set is sent to the trained specific networks and shared network for testing, to verify the accuracy of the network model; samples that are misjudged in the test set are fed back into the network model for fine-tuning, which improves the generalization of the model.
In the embodiment of the application, the Adam gradient descent algorithm is adopted; Adam is computationally efficient and improves the convergence speed of gradient descent. In the training process, the training set is input into the convolutional neural network model and iterated for a preset number of epochs, set to 90 in this method. In each iteration, the Adam gradient descent algorithm is used to optimize the objective function, and the batch_size is set to 64, i.e., 64 input images are fed in each training iteration. The specific networks and the shared network are trained based on a convolutional neural network model.
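A hypothetical training-loop sketch with these settings (Adam, 90 epochs, batch_size 64) is shown below; `model` stands for the local-sharing multi-task network and `multi_attr_loss` for the cross-entropy loss discussed next, both placeholders rather than names used in this application:

```python
import torch
from torch.utils.data import DataLoader

def train(model, train_set, multi_attr_loss, epochs=90, batch_size=64, lr=1e-3):
    loader = DataLoader(train_set, batch_size=batch_size, shuffle=True)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)  # Adam gradient descent
    for epoch in range(epochs):          # 90 epochs
        for images, labels in loader:    # 64 images per iteration
            optimizer.zero_grad()
            predictions = model(images)
            loss = multi_attr_loss(predictions, labels)
            loss.backward()
            optimizer.step()
```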
For the multi-attribute problem, the method uses cross entropy as the loss function for training; this function serves as a measure of the cross entropy between the target and the output, and the formula is as follows:
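(The formula itself is rendered as an image in the source; a standard per-attribute binary cross-entropy consistent with the symbol definitions below, and therefore only a plausible reconstruction rather than the verbatim disclosed formula, would be:)

```latex
L = -\frac{1}{m}\sum_{i=1}^{m}\frac{1}{n_i}\sum_{j=1}^{n_i}
    \left[ y_i^{j}\log\hat{y}_i^{j} + \left(1-y_i^{j}\right)\log\left(1-\hat{y}_i^{j}\right) \right]
```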
In the above formula, m represents the total number of attributes, n_i represents the total number of samples of the i-th attribute, y_i^j represents the label value of the j-th sample of the i-th attribute, and ŷ_i^j refers to the predicted value of the j-th sample of the i-th attribute.
S340: image data to be processed is acquired.
It should be noted that, in addition to the above steps, detection of the face orientation may be added after the image data to be processed is acquired. Specifically, after it is determined that the current image includes a face image, it may be determined whether the face orientation in the currently acquired face image satisfies a preset orientation. For example, a camera in the electronic device collects a face image: the electronic device responds to a face recognition request, calls the camera to collect the face image, and recognizes the position information of the face key points of the user in the face image, so that whether the face orientation is the preset orientation can be determined.
As shown in fig. 9, in the left image the user's face is turned to the right. Several key points can be determined on the image, namely the left eye a1, the right eye a2, the nose a3 and the lips a4. Relative to the vertical line of symmetry of the image, the right eye a2, the nose a3 and the lips a4 are all located on the left side of the line, so it can be determined that the user's face orientation is deviated to the right. In the face image on the right of fig. 9, the left eye b1, the right eye b2, the nose b3 and the lips b4 are located near the line of symmetry, with the left eye b1 and the right eye b2 on opposite sides of it, so it can be determined that the face is oriented toward the screen. If this is in line with the preset orientation, the face image can be acquired normally and recognized in the later stage.
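As a minimal sketch of this orientation check (the key-point names and the tolerance value are illustrative assumptions, not values from the disclosure):

```python
def face_is_frontal(keypoints, image_width, tolerance=0.08):
    """keypoints: dict mapping 'left_eye', 'right_eye', 'nose', 'lips' to (x, y)."""
    mid_x = image_width / 2.0
    # The eyes should lie on opposite sides of the vertical line of symmetry...
    eyes_straddle = (keypoints["left_eye"][0] - mid_x) * (keypoints["right_eye"][0] - mid_x) < 0
    # ...and the nose and lips should lie close to it.
    centred = all(abs(keypoints[k][0] - mid_x) <= tolerance * image_width
                  for k in ("nose", "lips"))
    return eyes_straddle and centred
```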
S350: and inputting the image data to be processed into a plurality of pre-trained specific networks to obtain attribute labels corresponding to the image data.
S360: and inputting the attribute label determined by each specific network into a pre-trained shared network to obtain an image recognition result.
S370: and outputting the image recognition result.
It should be noted that a face image not only contains face attribute information such as facial features, race, gender, age, and expression, but can also express the identity information of a person. Therefore, face attribute recognition has wide application prospects in fields such as age-based access control, face retrieval based on face attributes, security protection, and human-machine interaction.
Referring to fig. 10, an image processing method provided in an embodiment of the present application is shown, where the method includes: s1001 to S1004.
S1001: obtaining a plurality of sample image data, wherein each sample image data corresponds to a plurality of attribute labels.
S1002: setting a shared network and a plurality of specific networks, wherein each specific network can identify at least one attribute label, and the attribute labels identified by each specific network are different from each other.
S1003: and inputting the plurality of sample image data into the shared network and the plurality of specific networks for training so as to obtain the trained shared network and the plurality of specific networks.
S1001 to S1003 are training processes of a shared network and a plurality of specific networks, and reference may be made to the foregoing S310 to S330 for specific embodiments, which are not described herein again.
S1004: and acquiring image data to be processed, and processing the image data to be processed according to the trained shared network and the plurality of specific networks to obtain an image recognition result.
The image data to be processed is processed according to the trained shared network and the plurality of specific networks to obtain the image recognition result; for details, reference may be made to the foregoing embodiments, which are not described herein again.
Referring to fig. 11, a block diagram of an image processing apparatus 1100 according to an embodiment of the present disclosure is shown, where the apparatus may include: a data acquisition unit 1110, an attribute determination unit 1120, a result acquisition unit 1130, and an output unit 1140.
A data acquiring unit 1110 for acquiring image data to be processed.
An attribute determining unit 1120, configured to input the image data to be processed into a plurality of pre-trained specific networks to obtain attribute tags corresponding to the image data, where each specific network is used to determine the attribute tag corresponding to the image data, and the attribute tags determined by the specific networks are different from each other.
A result obtaining unit 1130, configured to input the attribute label determined by each of the specificity networks into a pre-trained shared network to obtain an image recognition result, where the shared network is configured to determine the image recognition result according to each attribute label and a correlation between each attribute label.
An output unit 1140, configured to output the image recognition result.
Referring to fig. 12, a block diagram of an image processing apparatus 1200 according to an embodiment of the present disclosure is shown, where the apparatus may include: a training unit 1210, a data acquisition unit 1220, an attribute determination unit 1230, a result acquisition unit 1240, and an output unit 1250.
The training unit 1210 is configured to train the shared network and the plurality of specific networks.
Specifically, the training unit 1210 includes a sample acquisition subunit 1211, a setting subunit 1212, and a training subunit 1213.
An obtaining subunit 1211, configured to obtain a plurality of sample image data, where each sample image data corresponds to a plurality of attribute tags.
A setting subunit 1212, configured to set a shared network and a plurality of specific networks, where each specific network is capable of identifying at least one attribute tag, and the attribute tags that each specific network is capable of identifying are different from each other.
A training subunit 1213, configured to input the plurality of sample image data into the shared network and the plurality of specific networks for training, so as to obtain a trained shared network and a plurality of specific networks.
Further, the sample image data and the image data to be processed both include a face image, and the obtaining subunit 1211 is further configured to obtain a plurality of sample image data; identifying position information of a face key point in each sample image data in the sample image; adjusting the face orientation in each sample image data to accord with a preset orientation according to the position information of the face key point of each sample image data; and taking the adjusted sample image data as the sample image data used for training the shared network and the plurality of initial specific networks.
Further, the obtaining subunit 1211 is further configured to obtain a plurality of sample image data; performing data enhancement processing on the plurality of sample image data to ensure that the illumination intensity and the contrast of each sample image data are randomly distributed in a preset interval; clipping each sample image data after enhancement processing according to a preset random clipping proportion, wherein the size of each clipped sample image data is a preset size; and taking the clipped sample image data as the sample image data used for training the shared network and the plurality of initial specific networks.
Further, the sample image data and the image data to be processed both contain face images, and a plurality of attribute labels corresponding to each sample image data correspond to different positions of a face; the setting subunit 1212 is further configured to divide a plurality of measurement regions, where each measurement region corresponds to a different region of the human face; and setting a plurality of specific networks according to the plurality of measurement areas, wherein each specific network corresponds to one measurement area and is used for confirming the attribute labels in the corresponding measurement area.
A data obtaining unit 1220, configured to obtain image data to be processed.
Further, the data obtaining unit 1220 is also configured to obtain original image data; and normalizing the original image data to obtain image data to be processed.
An attribute determining unit 1230, configured to input the image data to be processed into a plurality of pre-trained specific networks to obtain attribute labels corresponding to the image data, where each specific network is used to determine an attribute label corresponding to the image data, and the attribute labels determined by the specific networks are different from each other.
Further, the attribute determining unit 1230 is further configured to determine an attribute label corresponding to each specific network, where the attribute label that can be identified by the specific network is the attribute label corresponding to the specific network; dividing the image data into a plurality of sub-image data according to the attribute label corresponding to each specific network; and inputting the sub-image data into a specific network corresponding to the sub-image data.
The result obtaining unit 1240 is configured to input the attribute labels determined by each of the specific networks into a pre-trained shared network to obtain an image recognition result, where the shared network is configured to determine the image recognition result according to the attribute labels and the correlations between the attribute labels.
An output unit 1250 configured to output the image recognition result.
Referring to fig. 13, a block diagram of an image processing apparatus 1300 according to an embodiment of the present disclosure is shown, where the apparatus may include: a sample acquisition unit 1310, a setting unit 1320, a network training unit 1330, and a recognition unit 1340.
A sample acquiring unit 1310 configured to acquire a plurality of sample image data, where each sample image data corresponds to a plurality of attribute tags.
A setting unit 1320, configured to set a shared network and a plurality of specific networks, where each specific network is capable of identifying at least one attribute tag, and the attribute tags that each specific network is capable of identifying are different from each other.
A network training unit 1330, configured to input the sample image data into the shared network and the specific networks for training, so as to obtain a trained shared network and specific networks.
The sample acquiring unit 1310, the setting unit 1320, and the network training unit 1330 correspond to the training unit 1210. As an embodiment, the sample acquiring unit 1310 corresponds to an acquiring subunit, a specific embodiment of the sample acquiring unit 1310 may refer to the acquiring subunit, the setting unit 1320 corresponds to a setting subunit, a specific embodiment of the setting unit 1320 may refer to the setting subunit, the network training unit 1330 corresponds to a training subunit, and a specific embodiment of the network training unit 1330 may refer to the training subunit.
The recognition unit 1340 is configured to acquire image data to be processed, and to process the image data to be processed according to the trained shared network and the plurality of specific networks to obtain an image recognition result.
Specifically, the recognition unit 1340 is configured to: acquire image data to be processed; input the image data to be processed into a plurality of pre-trained specific networks to obtain attribute labels corresponding to the image data, wherein each specific network is used for determining the attribute labels corresponding to the image data, and the attribute labels determined by each specific network are different from each other; input the attribute labels determined by each specific network into a pre-trained shared network to obtain an image recognition result, wherein the shared network is used for determining the image recognition result according to the attribute labels and the correlation of the attribute labels; and output the image recognition result.
As an embodiment, the recognition unit 1340 corresponds to the data acquisition unit, the attribute determination unit, the result acquisition unit, and the output unit, and specific embodiments may refer to the data acquisition unit, the attribute determination unit, the result acquisition unit, and the output unit.
It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described apparatuses and modules may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, the coupling between the modules may be electrical, mechanical or other type of coupling.
In addition, functional modules in the embodiments of the present application may be integrated into one processing module, or each of the modules may exist alone physically, or two or more modules are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode.
Referring to fig. 14, a block diagram of an electronic device according to an embodiment of the present application is shown. The electronic device 100 may be a smart phone, a tablet computer, an electronic book, or other electronic devices capable of running an application. The electronic device 100 in the present application may include one or more of the following components: a processor 110, a memory 120, and one or more applications, wherein the one or more applications may be stored in the memory 120 and configured to be executed by the one or more processors 110, the one or more programs configured to perform a method as described in the aforementioned method embodiments.
Processor 110 may include one or more processing cores. The processor 110 connects various parts within the overall electronic device 100 using various interfaces and lines, and performs various functions of the electronic device 100 and processes data by running or executing instructions, programs, code sets, or instruction sets stored in the memory 120 and calling data stored in the memory 120. Alternatively, the processor 110 may be implemented in hardware using at least one of Digital Signal Processing (DSP), Field-Programmable Gate Array (FPGA), and Programmable Logic Array (PLA). The processor 110 may integrate one or more of a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a modem, and the like. The CPU mainly handles the operating system, the user interface, application programs and the like; the GPU is used for rendering and drawing display content; the modem is used to handle wireless communications. It is understood that the modem may not be integrated into the processor 110, but may instead be implemented by a communication chip.
The memory 120 may include a Random Access Memory (RAM) or a Read-Only Memory (ROM). The memory 120 may be used to store instructions, programs, code sets, or instruction sets. The memory 120 may include a program storage area and a data storage area, wherein the program storage area may store instructions for implementing an operating system, instructions for implementing at least one function (such as a touch function, a sound playing function, an image playing function, etc.), instructions for implementing the various method embodiments described above, and the like. The data storage area may store data created by the electronic device 100 during use (e.g., phone book, audio-video data, chat log data), and the like.
Referring to fig. 15, a block diagram of a computer-readable storage medium according to an embodiment of the present application is shown. The computer readable medium 1500 has stored therein program code that can be called by a processor to perform the method described in the above method embodiments.
The computer-readable storage medium 1500 may be an electronic memory such as a flash memory, an EEPROM (electrically erasable programmable read only memory), an EPROM, a hard disk, or a ROM. Alternatively, the computer-readable storage medium 1500 includes a non-volatile computer-readable storage medium. The computer readable storage medium 1500 has storage space for program code 1510 to perform any of the method steps of the method described above. The program code can be read from or written to one or more computer program products. The program code 1510 may be compressed, for example, in a suitable form.
To sum up, in the image processing method, image processing apparatus, electronic device, and storage medium provided by the present application, a shared network and a plurality of specific networks are trained in advance; each specific network is used to determine attribute tags corresponding to image data, and the attribute tags determined by the specific networks are different from each other. When image data to be processed is obtained, it is input into the plurality of specific networks, and each specific network identifies the attributes it is able to recognize, so that the plurality of attribute tags corresponding to the image data to be processed can be identified by the plurality of specific networks respectively, which speeds up the identification of the plurality of attribute tags of the entire image data. The attribute tags corresponding to the image data are then input into the shared network, which determines the image recognition result according to the attribute tags and the correlation between them, and the image recognition result is output. In this way, the plurality of specific networks jointly analyze the image data and obtain the plurality of attribute tags, which improves the speed of obtaining the attribute tags, and the shared network obtains the image recognition result by combining the correlations of the attribute tags, which improves the accuracy and overall performance of the recognition result.
The method divides 40 face attributes into 4 face attribute groups according to the image positions corresponding to the attributes, taking into account the positional correlation among the face attributes; the attribute classification problem of each attribute group is treated as a subtask, and a model containing 4 specific networks and 1 shared network is constructed.
Each specific network aims to learn the specificity of its task, so each attribute group is assigned its own specific network, while the shared network aims to learn complementary information among the tasks and promote interaction among them. The rich connections between the specific networks and the shared network promote mutual information exchange, help mine the correlation between tasks, and improve the overall performance.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solutions of the present application, and not to limit the same; although the present application has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not necessarily depart from the spirit and scope of the corresponding technical solutions in the embodiments of the present application.
Claims (12)
1. An image processing method, comprising:
acquiring image data to be processed;
inputting the image data to be processed into a plurality of pre-trained specific networks to obtain attribute labels corresponding to the image data, wherein each specific network is used for determining the attribute labels corresponding to the image data, and the attribute labels determined by each specific network are different from each other;
inputting the attribute labels determined by each specific network into a pre-trained shared network to obtain an image recognition result, wherein the shared network is used for determining the image recognition result according to the attribute labels and the correlation of the attribute labels;
and outputting the image recognition result.
2. The method according to claim 1, wherein the inputting the image data to be processed into a plurality of pre-trained specific networks to obtain attribute labels corresponding to the image data comprises:
determining an attribute label corresponding to each specific network, wherein the attribute label which can be identified by the specific network is the attribute label corresponding to the specific network;
dividing the image data into a plurality of sub-image data according to the attribute label corresponding to each specific network;
and inputting the sub-image data into a specific network corresponding to the sub-image data.
3. The method of claim 1, wherein the acquiring image data to be processed comprises:
acquiring original image data;
and normalizing the original image data to obtain image data to be processed.
4. The method of claim 1, wherein before inputting the image data to be processed into a plurality of pre-trained specific networks to obtain the attribute labels corresponding to the image data, the method further comprises:
obtaining a plurality of sample image data, wherein each sample image data corresponds to a plurality of attribute labels;
setting a shared network and a plurality of specific networks, wherein each specific network can identify at least one attribute label, and the attribute labels which can be identified by each specific network are different from each other;
and inputting the plurality of sample image data into the shared network and the plurality of specific networks for training so as to obtain the trained shared network and the plurality of specific networks.
5. The method according to claim 4, wherein the sample image data and the image data to be processed each contain a face image; the acquiring a plurality of sample image data includes:
acquiring a plurality of sample image data;
identifying position information of a face key point in each sample image data in the sample image;
adjusting the face orientation in each sample image data to accord with a preset orientation according to the position information of the face key point of each sample image data;
and taking the adjusted sample image data as the sample image data used for training the shared network and the plurality of initial specific networks.
6. The method of claim 4, wherein said acquiring a plurality of sample image data comprises:
acquiring a plurality of sample image data;
performing data enhancement processing on the plurality of sample image data to ensure that the illumination intensity and the contrast of each sample image data are randomly distributed in a preset interval;
clipping each sample image data after enhancement processing according to a preset random clipping proportion, wherein the size of each clipped sample image data is a preset size;
and taking the clipped sample image data as the sample image data used for training the shared network and the plurality of initial specific networks.
7. The method according to claim 4, wherein the sample image data and the image data to be processed each contain a face image, and the plurality of attribute labels corresponding to each sample image data correspond to different positions of a face; the setting shared network and a plurality of specific networks comprise:
dividing a plurality of measurement areas, wherein each measurement area corresponds to a different area of the human face;
and setting a plurality of specific networks according to the plurality of measurement areas, wherein each specific network corresponds to one measurement area and is used for confirming the attribute labels in the corresponding measurement area.
8. An image processing method, comprising:
obtaining a plurality of sample image data, wherein each sample image data corresponds to a plurality of attribute labels;
setting a shared network and a plurality of specific networks, wherein each specific network can identify at least one attribute label, and the attribute labels which can be identified by each specific network are different from each other;
inputting the sample image data into the shared network and the specific networks for training to obtain a trained shared network and specific networks;
and acquiring image data to be processed, and processing the image data to be processed according to the trained shared network and the plurality of specific networks to obtain an image recognition result.
9. An image processing apparatus, characterized in that the apparatus comprises:
the data acquisition unit is used for acquiring image data to be processed;
the attribute determining unit is used for inputting the image data to be processed into a plurality of pre-trained specific networks to obtain attribute labels corresponding to the image data, wherein each specific network is used for determining the attribute labels corresponding to the image data, and the attribute labels determined by the specific networks are different from each other;
the result acquisition unit is used for inputting the attribute labels determined by each specific network into a pre-trained shared network to acquire an image recognition result, wherein the shared network is used for determining the image recognition result according to the attribute labels and the correlation of the attribute labels;
and the output unit is used for outputting the image recognition result.
10. An image processing apparatus, characterized in that the apparatus comprises:
the system comprises a sample acquisition unit, a data processing unit and a data processing unit, wherein the sample acquisition unit is used for acquiring a plurality of sample image data, and each sample image data corresponds to a plurality of attribute labels;
a setting unit, configured to set a shared network and a plurality of specific networks, where each specific network is capable of identifying at least one attribute tag, and the attribute tags that each specific network is capable of identifying are different from each other;
the network training unit is used for inputting the sample image data into the shared network and the specific networks for training so as to obtain the trained shared network and the specific networks;
and the identification unit is used for acquiring image data to be processed, and processing the image data to be processed according to the trained shared network and the plurality of specific networks to obtain an image identification result.
11. An electronic device, comprising:
one or more processors;
a memory;
one or more applications, wherein the one or more applications are stored in the memory and configured to be executed by the one or more processors, the one or more programs configured to perform the method of any of claims 1-7.
12. A computer-readable storage medium storing program code executable by a processor, wherein a plurality of instructions in the program code, when executed by the processor, cause the processor to perform the method of any one of claims 1-7.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911007790.5A CN110728255B (en) | 2019-10-22 | 2019-10-22 | Image processing method, image processing device, electronic equipment and storage medium |
PCT/CN2020/122506 WO2021078157A1 (en) | 2019-10-22 | 2020-10-21 | Image processing method and apparatus, electronic device, and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911007790.5A CN110728255B (en) | 2019-10-22 | 2019-10-22 | Image processing method, image processing device, electronic equipment and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110728255A true CN110728255A (en) | 2020-01-24 |
CN110728255B CN110728255B (en) | 2022-12-16 |
Family
ID=69222737
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911007790.5A Active CN110728255B (en) | 2019-10-22 | 2019-10-22 | Image processing method, image processing device, electronic equipment and storage medium |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN110728255B (en) |
WO (1) | WO2021078157A1 (en) |
Families Citing this family (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113822185A (en) * | 2021-09-09 | 2021-12-21 | 安徽农业大学 | Method for detecting daily behavior of group health pigs |
CN113744161B (en) * | 2021-09-16 | 2024-03-29 | 北京顺势兄弟科技有限公司 | Enhanced data acquisition method and device, data enhancement method and electronic equipment |
CN113947146A (en) * | 2021-10-15 | 2022-01-18 | 北京百度网讯科技有限公司 | Sample data generation method, model training method, image detection method and device |
CN114004797A (en) * | 2021-10-27 | 2022-02-01 | 上海电机学院 | Fan blade surface defect detection method based on multi-feature map sharing |
CN114494673A (en) * | 2022-01-21 | 2022-05-13 | 南昌大学 | Standard certificate photo collection method based on digital image processing and deep learning |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105426404A (en) * | 2015-10-28 | 2016-03-23 | 广东欧珀移动通信有限公司 | Music information recommendation method and apparatus, and terminal |
JP6750854B2 (en) * | 2016-05-25 | 2020-09-02 | キヤノン株式会社 | Information processing apparatus and information processing method |
CN110728255B (en) * | 2019-10-22 | 2022-12-16 | Oppo广东移动通信有限公司 | Image processing method, image processing device, electronic equipment and storage medium |
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2014102562A (en) * | 2012-11-16 | 2014-06-05 | Nintendo Co Ltd | Program, information processing device, information processing system and information processing method |
CN105825191A (en) * | 2016-03-23 | 2016-08-03 | 厦门美图之家科技有限公司 | Face multi-attribute information-based gender recognition method and system and shooting terminal |
CN106503669A (en) * | 2016-11-02 | 2017-03-15 | 重庆中科云丛科技有限公司 | A kind of based on the training of multitask deep learning network, recognition methods and system |
JP2019079135A (en) * | 2017-10-20 | 2019-05-23 | パナソニック インテレクチュアル プロパティ コーポレーション オブ アメリカPanasonic Intellectual Property Corporation of America | Information processing method and information processing apparatus |
CN108596839A (en) * | 2018-03-22 | 2018-09-28 | 中山大学 | A kind of human-face cartoon generation method and its device based on deep learning |
Cited By (19)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2021078157A1 (en) * | 2019-10-22 | 2021-04-29 | Oppo广东移动通信有限公司 | Image processing method and apparatus, electronic device, and storage medium |
CN111400534A (en) * | 2020-03-05 | 2020-07-10 | 杭州海康威视系统技术有限公司 | Method and device for determining cover of image data and computer storage medium |
CN111400534B (en) * | 2020-03-05 | 2023-09-19 | 杭州海康威视系统技术有限公司 | Cover determination method and device for image data and computer storage medium |
CN112541517A (en) * | 2020-03-10 | 2021-03-23 | 深圳莱尔托特科技有限公司 | Clothing detail display method and system |
CN111539452A (en) * | 2020-03-26 | 2020-08-14 | 深圳云天励飞技术有限公司 | Image recognition method and device for multitask attributes, electronic equipment and storage medium |
CN111539452B (en) * | 2020-03-26 | 2024-03-26 | 深圳云天励飞技术有限公司 | Image recognition method and device for multi-task attribute, electronic equipment and storage medium |
CN111428671A (en) * | 2020-03-31 | 2020-07-17 | 杭州博雅鸿图视频技术有限公司 | Face structured information identification method, system, device and storage medium |
CN111507263B (en) * | 2020-04-17 | 2022-08-05 | 电子科技大学 | Face multi-attribute recognition method based on multi-source data |
CN111507263A (en) * | 2020-04-17 | 2020-08-07 | 电子科技大学 | Face multi-attribute recognition method based on multi-source data |
CN111611805A (en) * | 2020-04-24 | 2020-09-01 | 平安科技(深圳)有限公司 | Auxiliary writing method, device, medium and equipment based on image |
CN111611805B (en) * | 2020-04-24 | 2023-04-07 | 平安科技(深圳)有限公司 | Auxiliary writing method, device, medium and equipment based on image |
CN111738325A (en) * | 2020-06-16 | 2020-10-02 | 北京百度网讯科技有限公司 | Image recognition method, device, equipment and storage medium |
CN111738325B (en) * | 2020-06-16 | 2024-05-17 | 北京百度网讯科技有限公司 | Image recognition method, device, equipment and storage medium |
CN112861926A (en) * | 2021-01-18 | 2021-05-28 | 平安科技(深圳)有限公司 | Coupled multi-task feature extraction method and device, electronic equipment and storage medium |
CN112861926B (en) * | 2021-01-18 | 2023-10-31 | 平安科技(深圳)有限公司 | Coupled multi-task feature extraction method and device, electronic equipment and storage medium |
CN113407564A (en) * | 2021-06-18 | 2021-09-17 | 浙江非线数联科技股份有限公司 | Data processing method and system |
CN114170484A (en) * | 2022-02-11 | 2022-03-11 | 中科视语(北京)科技有限公司 | Picture attribute prediction method and device, electronic equipment and storage medium |
CN114581706A (en) * | 2022-03-02 | 2022-06-03 | 平安科技(深圳)有限公司 | Configuration method and device of certificate recognition model, electronic equipment and storage medium |
CN114581706B (en) * | 2022-03-02 | 2024-03-08 | 平安科技(深圳)有限公司 | Method and device for configuring certificate recognition model, electronic equipment and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN110728255B (en) | 2022-12-16 |
WO2021078157A1 (en) | 2021-04-29 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | PB01 | Publication | 
 | SE01 | Entry into force of request for substantive examination | 
 | GR01 | Patent grant | 