WO2022037541A1 - Image processing model training method, apparatus, device and storage medium - Google Patents

Image processing model training method, apparatus, device and storage medium

Info

Publication number
WO2022037541A1
Authority
WO
WIPO (PCT)
Prior art keywords
occlusion
image
face
indication information
image processing
Prior art date
Application number
PCT/CN2021/112829
Other languages
English (en)
French (fr)
Inventor
邱海波
龚迪洪
李志鋒
刘威
Original Assignee
腾讯科技(深圳)有限公司
Priority date
Filing date
Publication date
Application filed by 腾讯科技(深圳)有限公司
Priority to EP21857635.3A (published as EP4099217A4)
Publication of WO2022037541A1
Priority to US17/961,345 (published as US20230033052A1)

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition; G06F 18/20: Analysing; G06F 18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation; G06F 18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00: Arrangements for image or video recognition or understanding; G06V 10/20: Image preprocessing; G06V 10/26: Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion; G06V 10/273: removing elements interfering with the pattern to be recognised
    • G06V 10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning; G06V 10/82: using neural networks
    • G06V 40/00: Recognition of biometric, human-related or animal-related patterns in image or video data; G06V 40/10: Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands; G06V 40/16: Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/161: Detection; Localisation; Normalisation; G06V 40/162: using pixel segmentation or colour matching
    • G06V 40/168: Feature extraction; Face representation; G06V 40/171: Local features and components; Facial parts; Occluding parts, e.g. glasses; Geometrical relationships
    • G06V 40/172: Classification, e.g. identification

Definitions

  • the present application relates to the technical field of artificial intelligence, and in particular, to an image processing model training method, apparatus, device and storage medium.
  • the image processing model can be trained based on artificial intelligence technology to obtain a trained image processing model. After that, input the image to be recognized into the trained image processing model, and then the corresponding processing results can be obtained.
  • Embodiments of the present application provide an image processing model training method, apparatus, device, and storage medium.
  • the technical solution is as follows:
  • an image processing model training method comprising:
  • the model parameters of the image processing model are updated.
  • an image processing method comprising:
  • feature extraction is performed on the target face image to be identified to obtain the second overall image feature of the target face image
  • second occlusion indication information corresponding to the second overall image feature, where the second occlusion indication information is used to indicate the image feature of the face occlusion area of the target face image
  • according to the second occlusion indication information, the image feature of the face occlusion area in the second overall image feature is removed to obtain the second target image feature;
  • face recognition is performed on the target face image.
  • an image processing model training device comprising:
  • a first acquisition module configured to acquire, based on the image processing model, the predicted recognition result of the first sample face image and first occlusion indication information, where the first occlusion indication information is used to indicate the image features of the face occlusion area of the first sample face image;
  • a second obtaining module configured to obtain a recognition error based on the predicted recognition result and the target recognition result corresponding to the first sample face image
  • a third acquiring module configured to acquire a classification error based on the first occlusion indication information and a target occlusion pattern corresponding to the first sample face image, wherein the occlusion pattern of the first sample face image is used to indicate the position and size of the face occlusion area;
  • an update module configured to update the model parameters of the image processing model according to the recognition error and the classification error.
  • in another aspect, an electronic device includes one or more processors and one or more memories, the one or more memories store at least one piece of program code, and the at least one piece of program code is loaded and executed by the one or more processors to implement the above-mentioned image processing model training method.
  • a computer-readable storage medium wherein at least one piece of program code is stored in the storage medium, and the at least one piece of program code is loaded and executed by a processor to implement the above-mentioned image processing model training method.
  • a computer program product or computer program comprising one or more pieces of program codes stored in a computer-readable storage medium.
  • One or more processors of the electronic device can read the one or more pieces of program code from the computer-readable storage medium, and the one or more processors execute the one or more pieces of program code to enable the electronic device to perform the above-mentioned image processing model training method.
  • in another aspect, an electronic device includes one or more processors and one or more memories, the one or more memories store at least one piece of program code, and the at least one piece of program code is loaded and executed by the one or more processors to implement the above-mentioned image processing method.
  • a computer-readable storage medium wherein at least one piece of program code is stored in the storage medium, and the at least one piece of program code is loaded and executed by a processor to implement the above-mentioned image processing method.
  • a computer program product or computer program comprising one or more pieces of program codes stored in a computer-readable storage medium.
  • One or more processors of the electronic device can read the one or more pieces of program code from the computer-readable storage medium, and the one or more processors execute the one or more pieces of program code to enable the electronic device to perform the above-mentioned image processing method.
  • the embodiment of the present application introduces an occlusion mode: the predicted occlusion mode of the first sample face image is determined according to the first occlusion indication information generated in the face recognition process, and is compared with the target occlusion mode corresponding to the first sample face image.
  • in this way, the image processing model can be trained to determine more accurate first occlusion indication information; face recognition is then performed based on the accurate first occlusion indication information, so the obtained recognition result is more accurate, and the image processing model can more accurately process occluded face images, that is, the robustness of the image processing model is better.
  • the image processing model can directly process the first sample face image to obtain the recognition result, and can perform image processing end-to-end without the aid of an external network, thus significantly reducing the amount of calculation and improving the running speed of the device; it can also effectively reduce the number of models, and since the accuracy of image processing by the image processing model is not affected by external network factors, the accuracy is significantly improved.
  • FIG. 1 is a schematic diagram of an implementation environment of an image processing model training method provided by an embodiment of the present application
  • FIG. 2 is a schematic diagram of an attendance system provided by an embodiment of the present application.
  • FIG. 3 is a flowchart of an image processing model training method provided by an embodiment of the present application.
  • FIG. 4 is a schematic diagram of an image processing model training method provided by the related art
  • FIG. 5 is a schematic diagram of a method for using an image processing model provided by the related art
  • FIG. 6 is a flowchart of an image processing model training method provided by an embodiment of the present application.
  • FIG. 7 is a schematic diagram of a face image provided by an embodiment of the present application.
  • FIG. 8 is a schematic diagram of determining the number of face occlusion areas and occlusion modes provided by an embodiment of the present application.
  • FIG. 9 is a schematic structural diagram of an image processing model provided by an embodiment of the present application.
  • FIG. 10 is a schematic structural diagram of an image processing model provided by an embodiment of the present application.
  • FIG. 11 is a schematic structural diagram of a decoder provided by an embodiment of the present application.
  • FIG. 12 is a schematic diagram of a process of using an image processing model provided by an embodiment of the present application.
  • FIG. 13 is a schematic diagram of an image processing model training process provided by an embodiment of the present application.
  • FIG. 15 is a schematic structural diagram of an image processing model training apparatus provided by an embodiment of the present application.
  • FIG. 16 is a schematic structural diagram of a terminal provided by an embodiment of the present application.
  • FIG. 17 is a schematic structural diagram of a server provided by an embodiment of the present application.
  • The terms "first", "second" and so on are used to distinguish identical or similar items with basically the same function and effect. It should be understood that there are no logical or timing dependencies among "first", "second" and "nth", and no restrictions on the number or execution order. It will also be understood that, although the following description uses the terms first, second, etc. to describe various elements, these elements should not be limited by the terms; the terms are only used to distinguish one element from another. For example, a first image could be termed a second image, and, similarly, a second image could be termed a first image, without departing from the scope of the various described examples. The first image and the second image are both images and, in some cases, separate and distinct images.
  • the size of the sequence number of each process does not imply its order of execution; the execution order of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation of the embodiments of the present application.
  • determining B based on A does not mean determining B based only on A, but can also determine B based on A and/or other information.
  • the face occlusion area refers to the area in the image where the face is occluded.
  • Convolution feature: the output of a convolutional layer of a deep convolutional network, usually a three-dimensional tensor with C channels, height H and width W, that is, f(·) ∈ R^(C×H×W).
  • Convolutional feature elements: tensor elements of the convolution feature, indexed by coordinates (C, H, W).
  • the tensor concept is a generalization of the vector concept, and the vector is a first-order tensor.
  • a tensor is a multilinear function that can be used to represent linear relationships between some vectors, scalars, and other tensors.
  • Feature mask: a three-dimensional tensor with the same size as the convolution feature, in which the value of each element lies in [0, 1].
  • the function of the feature mask is to remove contaminated feature elements, that is, the feature elements affected by the face occlusion area, as illustrated by the sketch below.
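  • As an illustration only (not the patented implementation), the following sketch applies a feature mask to a C×H×W convolution feature by element-wise multiplication; the tensor shapes are assumed for the example.

```python
import torch

# Assumed example shapes: C = 512 channels, H = W = 7.
conv_feature = torch.randn(512, 7, 7)      # f(.) in R^(C*H*W)
feature_mask = torch.rand(512, 7, 7)        # same size as the feature, values in [0, 1]

# Elements whose mask value is close to 0 are treated as contaminated
# (affected by the face occlusion area) and are suppressed by the multiplication.
clean_feature = conv_feature * feature_mask
print(clean_feature.shape)                  # torch.Size([512, 7, 7])
```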
  • End-to-end system: a system that does not need the help of an external network or system, but relies only on itself to obtain the expected output from the input. "End-to-end" also refers to this way of relying only on the system itself to obtain the expected output from the input.
  • Robustness: a transliteration of "robust", meaning sturdy and strong. In computing, it refers to the ability of a system to survive abnormal and dangerous situations; for example, whether computer software can avoid crashing or hanging under input errors, disk failures, network overload or deliberate attacks reflects its robustness. "Robustness" also refers to the characteristic that a control system maintains certain other performance under perturbation of its parameters (e.g., structure or size).
  • FIG. 1 is a schematic diagram of an implementation environment of an image processing model training method provided by an embodiment of the present application.
  • the implementation environment includes the terminal 101 , or the implementation environment includes the terminal 101 and the image processing platform 102 .
  • the terminal 101 is connected to the image processing platform 102 through a wireless network or a wired network.
  • the terminal 101 can be at least one of a smart phone, a game console, a desktop computer, a tablet computer, an e-book reader, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (Moving Picture Experts Group Audio Layer IV) player, a laptop computer, security inspection equipment, and attendance equipment.
  • the terminal 101 installs and runs an application program supporting image processing model training, for example, the application program can be a security check application, an attendance application, a system application, an instant messaging application, a shopping application, an online video application, and a social networking application.
  • the terminal 101 has an image acquisition function and an image processing function, can perform image processing on the acquired image, and execute corresponding functions according to the processing result.
  • the terminal 101 can complete the work independently, or the image processing platform 102 can provide it with data services or image processing services.
  • the image processing platform 102 can obtain sample face images to train an image processing model; after the terminal 101 collects an image, the collected image is sent to the image processing platform 102, and the image processing platform 102 provides the terminal 101 with image processing services based on the trained image processing model.
  • the image processing platform 102 includes at least one of a server, multiple servers, a cloud computing platform and a virtualization center.
  • the image processing platform 102 is used to provide background services for applications that support image processing model training.
  • the image processing platform 102 undertakes the main processing work, and the terminal 101 undertakes the secondary processing work; or, the image processing platform 102 undertakes the secondary processing work, and the terminal 101 undertakes the main processing work; or, the image processing platform 102 or the terminal 101 can individually undertake processing work.
  • a distributed computing architecture is used for collaborative computing between the image processing platform 102 and the terminal 101 .
  • the image processing platform 102 includes at least one server 1021 and a database 1022, and the database 1022 is used for storing data.
  • the database 1022 can store sample face images and provides data services for the at least one server 1021.
  • the server 1021 can be an independent physical server, a server cluster or a distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, CDN (Content Delivery Network), and big data and artificial intelligence platforms.
  • the terminal can be a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, a smart watch, etc., but is not limited thereto.
  • the number of the above-mentioned terminals 101 and servers 1021 can be more or less.
  • there may be only one terminal 101 and one server 1021, or there may be tens, hundreds, or more terminals and servers; the embodiments of the present application do not limit the number of terminals or servers.
  • the embodiments of the present application also do not limit the device type of the terminal or the server.
  • the image processing model can provide an image processing service, and the image processing service can be applied to any face recognition scene; in such a scene, regardless of whether the face in the collected face image is occluded, it can be accurately identified by the image processing model.
  • the image processing model can be applied to face recognition scenarios such as attendance systems, security inspection systems, face unlocking of mobile phones or computers, or face recognition payment. Users only need to upload an unobstructed face image when the system is initially established, and store it in the system database as an image to be recognized. During recognition, only the user's image to be recognized is required, and no other redundant operations are required.
  • the user 201 only needs to stand in front of the camera of the attendance device 202; the attendance device 202 collects the face image 203 of the user 201, performs face recognition on the face image 203, determines the identity information 204 of the user 201, and can then record that the user with the identity information 204 has clocked in.
  • the attendance device 202 can be replaced by other devices.
  • for example, the attendance device can be replaced by a security inspection device; after identifying the identity information, the device can display "identity verification passed", or the security inspection facility can release the user.
  • FIG. 3 is a flowchart of an image processing model training method provided by an embodiment of the present application. The method is applied to an electronic device, where the electronic device is a terminal or a server. Referring to FIG. 3 , the method includes the following steps.
  • the electronic device obtains, based on an image processing model, a predicted recognition result of a first sample face image and first occlusion indication information, where the first occlusion indication information is used to indicate the image features of the face occlusion area of the first sample face image.
  • the first sample face image is an image including a human face, and the image including a human face is used as a sample to train an image processing model.
  • a sample is a part of the individuals that are observed or investigated.
  • the image processing model is used to process the input first sample face image and output a predicted recognition result.
  • the first sample face image may include a face image without occlusion or a face image with occlusion.
  • the image processing model is trained based on the first sample face image.
  • after training, the image processing model can accurately recognize face images.
  • the image processing model can perform feature extraction on the first sample face image, and then determine which image features are affected by the occluded area of the face.
  • the image features of the face area are used for face recognition.
  • the above-mentioned first occlusion indication information is used to identify the image features affected by the face occlusion area, and these image features can also be referred to as contaminated image features.
  • the model parameters of the image processing model may be initial values, which may be obtained by initialization, or may be initial values obtained by pre-training on other first sample face images; this is not limited in the embodiments of the present application.
  • the electronic device acquires a recognition error based on the predicted recognition result and the target recognition result corresponding to the first sample face image.
  • the recognition error is used to determine whether the model parameters of the image processing model need to be adjusted and how to adjust them, so as to improve the accuracy of the image processing model in processing images.
  • the predicted recognition result output by the image processing model is the recognition result predicted by the image processing model, which may also be referred to as a "predicted value”.
  • the accuracy of the predicted recognition result is consistent with the image processing accuracy of the image processing model.
  • the target recognition result marked for each first sample face image is the real, correct recognition result, which can also be called the "true value". The recognition error obtained by comparing the "predicted value" with the "true value" can measure the accuracy of the predicted recognition result, and therefore also the accuracy of the image processing model.
  • if the recognition error is relatively large, the image processing model has relatively poor accuracy in processing images; if the recognition error is relatively small, the image processing model has relatively good accuracy in processing images.
  • the electronic device acquires a classification error based on the first occlusion indication information and the target occlusion mode corresponding to the first sample face image.
  • the occlusion mode of the first sample face image is used to indicate the position and size of the occlusion area of the face.
  • obtaining a classification error based on the first occlusion indication information and a target occlusion pattern corresponding to the first sample face image includes: determining a predicted occlusion mode based on the first occlusion indication information, and then obtaining the classification error based on the predicted occlusion pattern and the target occlusion pattern. Determining the predicted occlusion mode of the first sample face image means classifying the occlusion mode of the first sample face image to obtain the predicted occlusion mode.
  • the positions or sizes of the face occlusion regions in different first sample face images may be different. Naturally, the image features affected by the face occlusion regions are different, that is, the first occlusion indication information is different. According to the different positions and sizes of the face occlusion areas, different occlusion modes are set, and each occlusion mode corresponds to the position and size of a face occlusion area.
  • the predicted occlusion mode is the "predicted value”
  • the target occlusion mode is the real and correct occlusion mode, that is, a "true value”.
  • the classification error obtained can measure the accuracy of the predicted occlusion mode; since the predicted occlusion mode is determined based on the first occlusion indication information, it can also measure the accuracy of the first occlusion indication information.
  • supervised learning of the occlusion mode is added, which supervises the ability of the image processing model to learn accurate first occlusion indication information; face recognition is then performed according to the accurate first occlusion indication information and image features, and the obtained recognition result is also more accurate.
  • the electronic device updates the model parameters of the image processing model according to the recognition error and the classification error.
  • both the recognition error and the classification error are considered: the recognition error trains the image processing model to have good face recognition ability, and the classification error trains the image processing model to output more accurate first occlusion indication information, thereby improving the accuracy of face recognition; a minimal sketch of such a combined objective is given below.
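  • As a hedged illustration only (the function names, loss choices and weighting factor below are assumptions, not the claimed method), the combined objective could be sketched as follows:

```python
import torch.nn.functional as F

def training_loss(id_logits, target_id, pattern_logits, target_pattern, lambda_cls=1.0):
    """Combine the recognition error and the occlusion-pattern classification error.

    id_logits      : predicted recognition result (identity scores)
    target_id      : target recognition result annotated for the first sample face image
    pattern_logits : predicted occlusion mode scores derived from the occlusion indication information
    target_pattern : target occlusion mode annotated for the first sample face image
    lambda_cls     : assumed weighting factor between the two errors
    """
    recognition_error = F.cross_entropy(id_logits, target_id)      # a margin-based loss such as CosFace may be used instead
    classification_error = F.cross_entropy(pattern_logits, target_pattern)
    return recognition_error + lambda_cls * classification_error
```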
  • the embodiment of the present application introduces an occlusion mode: the predicted occlusion mode of the first sample face image is determined through the first occlusion indication information generated in the face recognition process and is compared with the target occlusion mode marked for the first sample face image.
  • in this way, the trained image processing model can output more accurate occlusion indication information, face recognition is then performed based on the accurate occlusion indication information, and the obtained recognition result is more accurate.
  • the image processing model can thus more accurately process occluded face images, that is, the robustness of the image processing model is better.
  • the image processing model can directly process the first sample face image to obtain the recognition result, and can perform image processing end-to-end without the aid of an external network, thus significantly reducing the amount of calculation and improving the running speed of the device; it can also effectively reduce the number of models, and since the accuracy of the image processing model in processing images is not affected by external network factors, the accuracy is significantly improved.
  • the following provides a training method of an image processing model in the related art, and compares and analyzes the related art with the method provided in this application.
  • Fig. 4 and Fig. 5 are respectively an image processing model training method and an image processing model usage method provided by the related art.
  • FIG. 4 shows a pairwise differential siamese network, which can explicitly learn the mapping relationship between the face occlusion area and the image features affected by the face occlusion area (Learn Mask Generators).
  • based on the mapping relationship, a dictionary from occlusion blocks (face occlusion areas) to masks is established (Establish Mask Dictionary), where the mask is used to indicate the image features that are greatly affected by the face occlusion area.
  • each index item in the mask dictionary represents the image features that are greatly affected when a certain area of the face is occluded; if each element in the image feature is called a convolutional feature element, then each index item in the dictionary represents the most affected top-level convolutional feature elements when a certain area of the face is occluded.
  • the fully convolutional network (FCN) is used to detect the face occlusion area of the input face image, and then, according to the above dictionary, the mask corresponding to that occlusion condition (the feature elements that should be removed) can be obtained.
  • the mapping relationship is learned through an external network (the above-mentioned pairwise differential siamese network) and is stored as a dictionary.
  • the mapped mask is obtained by querying the dictionary, and the image processing model performs face recognition based on the mapped mask.
  • the image processing model relies on an external network, and the external network is trained separately from the image processing model.
  • due to the presence of the external network, the amount of computation increases significantly and the device runs more slowly.
  • the accuracy with which the external network detects the face occlusion area also greatly affects the determination of the feature elements that should be removed: if the detection of the face occlusion area is inaccurate, the subsequent removal of the contaminated feature elements will also be inaccurate, which interferes with the final face recognition.
  • the image processing model provided by the embodiment of the present application is an end-to-end system: without the aid of an external network, the image processing model can dynamically learn, based on the input face image, the mapping relationship between the face occlusion area and the affected image features, so that the image processing model can output recognition results directly based on the input face image; this significantly reduces the amount of calculation, improves the running speed of the device, and effectively reduces the number of models, and since the accuracy of the image processing model in processing images is not affected by external network factors, the accuracy is significantly improved.
  • the trained image processing model can output more accurate first occlusion indication information, thereby improving the accuracy of image processing; the image processing model can more accurately process occluded face images, that is, the robustness of the image processing model is better.
  • in the related art, the face is divided into 9 different regions, and for each region a network needs to be trained independently to learn the mapping for that region; that is to say, 9 different models need to be trained when building the dictionary, which greatly increases the training time and training cost, the storage space occupied by this large number of models is very large, and it is not easy to deploy them in practical applications.
  • in contrast, the image processing model provided here can dynamically determine the first occlusion indication information according to the face image, which greatly reduces model training time and training cost; compared with the nine models in the related art, a single image processing model is also easier to deploy on various types of devices to realize the corresponding image processing functions, so the applicability and practicability of the image processing model are better.
  • FIG. 6 is a flowchart of an image processing model training method provided by an embodiment of the present application. The method is applied to an electronic device, where the electronic device is a terminal or a server. Referring to FIG. 6 , the method includes the following steps.
  • the electronic device acquires a first sample face image, where the first sample face image is marked with a target recognition result and a target occlusion mode.
  • the first sample face image may include an unoccluded face image or an occluded face image.
  • the unoccluded face image refers to an image with an unoccluded face, which can be called a clean face image.
  • An occluded face image refers to an image in which a face is occluded, which can be called an occluded face image.
  • the face in the image 701 is completely visible and not occluded, so the image 701 is an unoccluded face image, that is, a clean face image.
  • the face in the image 702 is occluded, so the image 702 is an occluded face image, that is, a face image with occlusion.
  • the electronic device can acquire the first sample face image in various ways.
  • the first sample face image may be stored in an image database, and when the electronic device needs to train an image processing model, the first sample face image may be extracted from the database.
  • the first sample face image may be a resource in a website, and the electronic device can download the first sample face image from the target website.
  • the first sample face image may be stored in the electronic device, for example, a historical image sent to the electronic device by another device or an image generated by the electronic device, and the electronic device can extract the first sample face image from its local storage space.
  • the above provides several possible implementations for obtaining the first sample face image.
  • the electronic device may also obtain the first sample face image in other ways.
  • the embodiment of the present application does not specifically limit the obtaining method of the first sample face image.
  • depending on the recognition function of the image processing model, the target recognition result also differs.
  • the image processing model is used to perform identity authentication on the face in the image, and accordingly, the target recognition result is identity authentication information.
  • the image processing model is used to identify the face attribute or face type of the face in the image, such as judging whether the face wears glasses or judging the gender of the face; accordingly, the target recognition result is a face attribute or face type.
  • the target recognition result may be stored with the first sample face image, and the first sample face image is marked with the target recognition result.
  • when the electronic device obtains the first sample face image, it can obtain the first sample face image together with the corresponding target recognition result.
  • the target recognition result is determined based on an annotation operation. For example, relevant technical personnel may mark the first sample face image, and mark the target recognition result corresponding to each first sample face image.
  • the target occlusion mode is used to indicate the position and size of the face occlusion area in the first sample face image.
  • the first sample face image may include multiple occlusion modes, and the target occlusion mode is one of the multiple occlusion modes.
  • the first sample face image or the target face image to be identified may include at least two regions, and each occlusion mode corresponds to occlusion region information, where the occlusion region information is used to indicate whether each of the at least two regions is occluded; according to the different face occlusion areas, various occlusion modes can be divided.
  • the first sample face image or the target face image to be identified can be divided into K*K regions, and each region represents a small block of the face that may be occluded (that is, an image block).
  • K is an integer greater than 1.
  • the electronic device may obtain the number of occlusion modes according to the number of divided regions.
  • there are two possibilities for each area, namely occluded and not occluded, so the occlusion situation can be divided into 2^(K*K) different occlusion modes; for example, when K is 4, there are 65536 different occlusion modes.
  • an occlusion mode determination mechanism is therefore proposed here. Observation of face images shows that adjacent regions usually have similar occlusion states, that is, when a region is occluded, the probability that adjacent regions are also occluded is relatively high; for example, when the left eye region is occluded, the right eye region also has a relatively high probability of being occluded. This property is called proximity here. Based on proximity, the face occlusion area can be constrained, and a much smaller number of occlusion patterns can then be determined.
  • the constrained occlusion mode covers m*n regions, where m, n have a value range of [1, K], and m and n are the width and height of the face occlusion region, respectively.
  • as shown in FIG. 8, the figure shows the position and size of the face occlusion area 801 (marked with a bold frame) in several occlusion modes when K is 4.
  • a face occlusion area in a face image is a connected domain, and the face occlusion area is a quadrilateral area.
  • the electronic device can obtain the number of occlusion modes according to the number of regions into which the face is divided; for example, (b) of FIG. 8 shows, in matrix form, the number of corresponding occlusion patterns 802 as the size of the face occlusion area changes.
  • the value at the (i,j)th position in the matrix represents the number of occlusion patterns when the size of the face occlusion area is (i*j).
  • the value 16 at the (1,1) position means that when the size of the face occlusion area is 1*1, there are 16 occlusion modes, that is, the face occlusion area can be located in any of 16 regions, corresponding to 16 occlusion modes. The values at the other positions in the matrix are calculated in the same way and are not listed here. When K is 4, 101 occlusion modes can therefore be determined, as counted in the sketch below.
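  • The counting can be checked with a few lines of code; the sketch below reproduces the matrix in (b) of FIG. 8 for K = 4 and assumes that the 101st pattern is the "no occlusion" case (an assumption consistent with the 100 rectangular patterns counted here).

```python
def occlusion_pattern_counts(K=4):
    # counts[i][j]: number of positions an (i+1) x (j+1) rectangle can occupy in a K x K grid
    return [[(K - i) * (K - j) for j in range(K)] for i in range(K)]

counts = occlusion_pattern_counts(4)
print(counts[0][0])                        # 16 positions for a 1*1 face occlusion area
total = sum(sum(row) for row in counts)    # 100 rectangular occlusion patterns when K = 4
print(total + 1)                           # 101, assuming one extra "no occlusion" pattern
```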
  • the electronic device performs feature extraction on a first sample face image based on an image processing model, to obtain image features of the first sample face image.
  • the image features of the first sample face image are also referred to herein as the first overall image feature.
  • the image processing model is an initial model, and the electronic device can input the first sample face image into the image processing model, and the image processing model processes the first sample face image.
  • the electronic device may perform feature extraction on the first sample face image, and use image features to express pixel characteristics or the relationship between pixels of the first sample face image.
  • the electronic device may preprocess the first sample face image, and then perform feature extraction on the preprocessed first sample face image.
  • in this way, the first sample face image used during feature extraction better conforms to the feature extraction specification, which improves processing efficiency, reduces computational complexity and the amount of computation, and improves the accuracy of the extracted image features.
  • this step 601 can be implemented by the following 6011 and 6012.
  • steps 601 to 602 are the process of preprocessing the first sample face image based on the image processing model, and obtaining, based on the preprocessed first sample face image, the image feature of the first sample face image and the first occlusion indication information corresponding to the image feature.
  • the image processing model can preprocess the first sample face image, remove information irrelevant to face recognition, or repair or correct some missing or wrong information.
  • the preprocessing process includes a face detection and alignment process.
  • the electronic device performs face detection on the first sample face image based on the image processing model, and crops the first sample face image based on the face detection result to obtain a preprocessed first sample face image.
  • the background content in the first sample face image has little effect on face recognition, and what face recognition requires is the image features of the face region.
  • the electronic device can determine the position of the key points of the face, and through the positions of the key points of the face, the face region can be cut out as the first sample face image after preprocessing.
  • in this way, the first sample face image used during feature extraction has redundant information removed, which reduces the amount of calculation when extracting features; the image features of the face region are highlighted in the extracted image features, and using such image features for face recognition can effectively improve the recognition accuracy.
  • a face image template may be provided in which the position of each person's face is identified.
  • the electronic device can detect the coordinate positions of the left eye, right eye, nose, left mouth corner and right mouth corner in the face image, and then, according to the coordinate positions of these five key points and the face position in the face image template, crop the first sample face image to obtain the preprocessed first sample face image.
  • the cropping process can be understood as aligning the face in the first sample face image to a uniform template position through affine transformation, and cropping to a fixed size.
  • the preprocessing process can be implemented by a face preprocessing algorithm; for example, the preprocessing process can be completed using the MTCNN (Multi-Task Convolutional Neural Network) algorithm, and a hedged sketch of the alignment and cropping step follows.
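  • As an illustration only (not the exact preprocessing of the embodiment), the sketch below estimates a similarity transform from the five detected key points to assumed template coordinates and warps the face to a fixed size; the template coordinates and the 112*112 output size are assumptions.

```python
import cv2
import numpy as np
from skimage import transform as trans

# Assumed 112x112 template positions for left eye, right eye, nose, left/right mouth corners.
TEMPLATE_5PTS = np.array([[38.3, 51.7], [73.5, 51.5], [56.0, 71.7],
                          [41.5, 92.4], [70.7, 92.2]], dtype=np.float32)

def align_face(image, keypoints_5, size=112):
    """Warp the image so the 5 detected key points match the template positions."""
    tform = trans.SimilarityTransform()
    tform.estimate(np.asarray(keypoints_5, dtype=np.float32), TEMPLATE_5PTS)
    matrix = tform.params[:2, :]                        # 2x3 affine matrix
    return cv2.warpAffine(image, matrix, (size, size))  # aligned, cropped face image
```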
  • the preprocessing process may also include other methods.
  • the electronic device may also perform abnormal value processing, grayscale transformation, etc. on the first sample face image, which is not limited in this embodiment of the present application.
  • after the electronic device preprocesses the first sample face image, it converts the image pixel information in the preprocessed first sample face image into image features, which express pixel characteristics, relationships between pixels, and so on.
  • the feature extraction process can be implemented by a convolutional neural network (CNN): the electronic device can input the preprocessed first sample face image into the convolutional neural network and obtain the first overall image feature through convolution processing.
  • the convolutional neural network can also perform the above-mentioned preprocessing process.
  • the above-mentioned preprocessing process is performed by another convolutional neural network, which is not limited in this embodiment of the present application.
  • the image feature is expressed as (C, H, W), where C is the channel, H is the height, and W is the width.
  • the image features can be called convolutional features.
  • the convolutional neural network may include multiple convolutional layers; for the preprocessed first sample face image, multi-layer convolution operations can be used to obtain convolutional features (that is, image features) with very strong expressive ability.
  • C is consistent with the number of output channels of the last convolutional layer of the convolutional neural network.
  • the convolutional neural network can use any framework capable of accurate feature extraction; for example, the LResnet50E-IR framework can be used, and of course other frameworks, such as the GoogLeNet framework, can also be used. This embodiment of the present application does not limit the framework of the convolutional neural network.
  • the electronic device may first pre-train the convolutional neural network based on unoccluded face images, and then, after the pre-training, fine-tune the model parameters of the image processing model based on the first sample face image acquired in step 600.
  • the electronic device may train the convolutional neural network based on a second sample face image, where the face in the second sample face image is not occluded.
  • the convolutional neural network is pre-trained with clean face images so that it has prior knowledge of processing unoccluded face images; the model parameters of the image processing model are then fine-tuned with unoccluded and occluded face images, so the image processing model performs image processing better. A minimal sketch of this two-stage schedule is given below.
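  • A minimal sketch of the two-stage schedule, assuming a temporary linear head for the pre-training stage and assumed feature and identity dimensions; it is illustrative only:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def pretrain_backbone(backbone, clean_loader, feat_dim=512, num_ids=1000, epochs=1):
    """Stage 1: pre-train the convolutional network on unoccluded (clean) face images."""
    head = nn.Linear(feat_dim, num_ids)    # temporary identity-classification head (assumed)
    params = list(backbone.parameters()) + list(head.parameters())
    opt = torch.optim.SGD(params, lr=0.1, momentum=0.9)
    for _ in range(epochs):
        for images, labels in clean_loader:
            feats = backbone(images).flatten(1)           # (N, feat_dim) assumed
            loss = F.cross_entropy(head(feats), labels)
            opt.zero_grad()
            loss.backward()
            opt.step()
    # Stage 2: fine-tune the whole image processing model (backbone + decoder + recognition
    # and occlusion-pattern networks) on occluded and unoccluded first sample face images,
    # typically with a smaller learning rate and the combined loss sketched earlier.
```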
  • the structure of the image processing model may be as shown in FIG. 9 , and the image processing model includes a convolutional neural network 901 , a decoder 902 , a recognition network 903 and an occlusion pattern prediction network 904 .
  • the convolutional neural network 901 is used to perform the step 601 .
  • the decoder 902 is configured to perform the following step 602, that is, the step of obtaining the first occlusion indication information.
  • the recognition network 903 is used to perform the following step 603, that is, to perform face recognition based on the image features obtained in step 601 and the first occlusion indication information obtained in step 602, to obtain a predicted recognition result of the first sample face image.
  • the occlusion pattern prediction network 904 is configured to perform the following step 605, that is, based on the first occlusion indication information obtained in step 602, classify the occlusion pattern of the first sample face image to obtain a predicted occlusion pattern.
  • the electronic device determines corresponding first occlusion indication information based on the image feature of the first sample face image, where the first occlusion indication information is used to indicate the image feature of the face occlusion area of the first sample face image.
  • in the subsequent step 603, the influence of this part of the image features is removed to improve the accuracy of face recognition.
  • the first occlusion indication information may be in the form of a feature vector, and the value of each bit element in the feature vector is used to indicate whether each image feature element is affected by a face occlusion area. For example, the value of each bit element is used to represent the probability that the corresponding image feature element is affected by the face occlusion area.
  • the first occlusion indication information may take the form of a mask, and the first occlusion indication information may be referred to as a feature mask.
  • determining the first occlusion indication information may be a classification process, further processing the image features, and then classifying the processed image features to obtain the first occlusion indication information.
  • the electronic device performs convolution processing on the first overall image feature, classifies the image features after the convolution processing, and determines the first occlusion indication information corresponding to the first overall image feature.
  • the process of determining the first occlusion indication information is implemented by a decoder.
  • the decoder may also be referred to as a mask decoder, and the mask decoder is used to map image features (also known as convolutional features) to corresponding feature masks.
  • the structure of the image processing model 1000 may be as shown in FIG. 10 , wherein the decoder (Decoder) 1001 includes a Conv (Convolution, convolution) layer, a PRelu (Parametric Rectified Linear Unit, linear rectification function) layer , BN (Batch Normalization, batch normalization) layer and Sigmoid (S-shaped growth curve) layer.
  • the decoder 1001 can first perform convolution processing on the image features, then perform linear rectification processing on the convolution results, and then perform batch normalization processing.
  • through the Sigmoid layer, the probability that each image feature element is retained (that is, not removed and not affected by the face occlusion area) is predicted, and the first occlusion indication information (that is, the feature mask) is obtained. Understandably, the Sigmoid layer maps the values to [0, 1]. The probability that an image feature element is retained is negatively correlated with the probability that it is affected by the face occlusion area; prediction through the Sigmoid layer is essentially predicting the probability that each image feature element is affected by the face occlusion area, and the more strongly an element is affected, the smaller (closer to 0) the value of the corresponding bit in the first occlusion indication information.
  • the specific structure of the decoder 1001 can be shown in FIG. 11 .
  • the decoder 1001 decodes the corresponding feature mask M1 from the feature X1 generated by the preceding convolutional network.
  • the function of M1 is to find the contaminated feature elements in X1 and remove them by multiplying the two, obtaining a clean feature X'1 that is used for the subsequent recognition task; a hedged sketch of such a decoder is given below.
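  • Reading the Conv-PRelu-BN-Sigmoid description literally, the mask decoder might be sketched as follows in PyTorch; the channel count and kernel size are assumptions, since the text does not specify the layer hyper-parameters.

```python
import torch
import torch.nn as nn

class MaskDecoder(nn.Module):
    """Maps a convolution feature X1 (C*H*W) to a feature mask M1 of the same size."""

    def __init__(self, channels=512):                  # channel count assumed
        super().__init__()
        self.decode = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),  # Conv
            nn.PReLU(channels),                                       # PRelu
            nn.BatchNorm2d(channels),                                 # BN
            nn.Sigmoid(),                                             # values in [0, 1]
        )

    def forward(self, x1):
        m1 = self.decode(x1)    # probability that each feature element is retained
        return x1 * m1, m1      # X'1 = X1 * M1 suppresses contaminated feature elements

# Example: clean_feat, mask = MaskDecoder()(torch.randn(8, 512, 7, 7))
```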
  • Steps 601 and 602 are based on the image processing model to obtain the image feature of the first sample face image and the first occlusion indication information corresponding to the image feature.
  • alternatively, the first occlusion indication information may not be determined based on the image feature; instead, the first sample face image may be processed directly to determine the corresponding first occlusion indication information. This embodiment of the present application does not limit this.
  • the electronic device performs face recognition based on the image feature of the first sample face image and the first occlusion indication information, and obtains a predicted recognition result of the first sample face image.
  • after the electronic device determines the first occlusion indication information, it knows which image features in the first overall image feature are affected by the face occlusion area, so this part of the image features can be removed before face recognition is performed; the recognition result is then not affected by the face occlusion area and is more accurate.
  • step 603 is implemented by 6031 and 6032.
  • since the first occlusion indication information already indicates the image features affected by the face occlusion area, the first overall image feature can be processed through the first occlusion indication information and the affected image features removed; the influence of face occlusion is thereby removed, enabling an accurate face recognition process.
  • the removal process may be: the electronic device multiplies the first overall image feature and the first occlusion indication information to obtain the first target image feature.
  • the first occlusion indication information can be in the form of a matrix or a vector. If a certain image feature element is greatly affected, the value of the corresponding bit in the first occlusion indication information is relatively small, so after the multiplication the value of that element is reduced; the image features affected by the face occlusion area are thus weakened in the first target image feature and can hardly be reflected, achieving the removal effect.
  • the electronic device can perform face recognition to determine the recognition result of the first sample face image.
  • the face recognition process may be a classification process, and the identity of the face is determined through classification, or the attribute or type of face is determined through classification.
  • the classification process is to obtain the matching degree between the first target image feature and the candidate face image feature, and determine the recognition result corresponding to the candidate face image feature with the largest matching degree as the predicted recognition result.
  • during classification, the similarity between the feature vector f_p of the test face (that is, the first target image feature) and the feature vector of each face in the database is calculated.
  • there are two scenarios for face recognition: one is the face identification scenario and the other is the face authentication scenario.
  • the recognition process can be different.
  • face identification scenarios it is necessary to identify which face category in the database the test face belongs to.
  • This scheme uses the nearest neighbor classifier, that is, the category of the face with the highest similarity to the test face in the database, which is the category to which the test face belongs.
  • other classifiers, such as Support Vector Machines (SVM), can also be used.
  • in the face authentication scenario, this scheme uses threshold judgment: when the similarity between the two feature vectors is higher than a certain threshold, they are considered to be the same person; otherwise, they are considered not to be the same person. It is also possible to learn a dedicated classifier for face authentication based on the feature vectors. Both scenarios are illustrated in the sketch below.
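  • A hedged numpy sketch of the two scenarios, using cosine similarity; the threshold value is an assumption.

```python
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def identify(f_p, database_feats, database_ids):
    """Face identification: nearest-neighbour classifier over the database feature vectors."""
    sims = [cosine_similarity(f_p, f) for f in database_feats]
    return database_ids[int(np.argmax(sims))]

def authenticate(f_p, f_claimed, threshold=0.4):        # threshold value assumed
    """Face authentication: same person iff the similarity exceeds the threshold."""
    return cosine_similarity(f_p, f_claimed) >= threshold
```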
  • the above-mentioned step 603 is the process of obtaining the predicted recognition result of the first sample face image.
  • the image processing model can also obtain the predicted recognition result through other processing methods for face recognition, which is not specifically limited in this embodiment of the present application.
  • the electronic device acquires a recognition error based on the predicted recognition result and the target recognition result corresponding to the first sample face image.
  • after the electronic device determines the predicted recognition result, it can compare the predicted recognition result with the target recognition result to determine the gap between the two, and this gap is the recognition error.
  • the recognition error can be obtained through a loss function, and the loss function can be any loss function, for example, the CosFace classification loss function, the cross-entropy loss function, regression loss functions such as L1 and L2, or the exponential loss function.
  • for example, the recognition error can be obtained through the CosFace classification loss function, sketched below.
  • the embodiment of the present application does not specifically limit the method of obtaining the recognition error.
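  • The following is a minimal sketch of the CosFace (large margin cosine) loss; the scale s and margin m are typical values from the CosFace paper, not values stated in this text.

```python
import torch
import torch.nn.functional as F

def cosface_loss(features, weight, labels, s=64.0, m=0.35):
    """Large margin cosine loss.

    features : (N, D) first target image features
    weight   : (num_identities, D) classifier weight vectors
    labels   : (N,) target identities
    """
    cos = F.linear(F.normalize(features), F.normalize(weight))  # (N, num_identities) cosines
    margin = torch.zeros_like(cos)
    margin.scatter_(1, labels.unsqueeze(1), m)                  # subtract m only at the target class
    return F.cross_entropy(s * (cos - margin), labels)
```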
  • the electronic device acquires first occlusion area information of the first sample face image based on the first occlusion indication information of the first sample face image.
  • the electronic device can also predict the occlusion mode of the first sample face image according to the first occlusion indication information.
  • The first occlusion area information is used to indicate whether each area in the first sample face image is occluded; in other words, it indicates the occlusion status of multiple image blocks in the first sample face image.
  • the electronic device matches the first occlusion area information with the occlusion area information corresponding to the at least two candidate occlusion modes to obtain at least two matching degrees, and determines the predicted occlusion mode among the at least two candidate occlusion modes according to the at least two matching degrees.
  • step 606 may determine the candidate occlusion pattern with the highest matching degree as the predicted occlusion pattern of the first sample face image.
  • At least two candidate occlusion modes may be set, that is, there are multiple candidate occlusion modes, and each candidate occlusion mode corresponds to occlusion area information.
  • the occlusion area information can be established when the occlusion modes are divided. For example, if an area is occluded, the value of the corresponding bit of the area may be set to 0; if the area is not occluded, the value of the corresponding bit may be set to 1. As shown in (a) of FIG. 7, the value of each bit element in the occlusion area information of the unoccluded face image 701 may be 1, which is represented in black here. As shown in (b) of FIG. 7, in the occlusion area information of the occluded face image 702, the value of a position belonging to the occluded area is 1, represented in black here, and the value of a position belonging to the unoccluded area is 0, represented in white here.
  • the at least two candidate occlusion patterns may be stored in an occlusion pattern library, and during matching, the electronic device can match the occlusion area information to be matched this time with the data in the occlusion pattern library.
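  • The matching step might be sketched as follows; representing each candidate occlusion mode as a binary vector and using the fraction of identical bits as the matching degree are assumptions, since the application does not fix the matching metric.

    import numpy as np

    def predict_occlusion_pattern(area_info, pattern_library):
        # area_info: (K*K,) binary vector for the current image, following the
        # 0/1 convention described above.
        # pattern_library: (P, K*K) binary vectors, one per candidate occlusion mode.
        matches = (pattern_library == area_info).mean(axis=1)  # (P,) matching degrees
        best = int(np.argmax(matches))  # candidate with the highest matching degree
        return best, matches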
  • the determination process of the occlusion pattern can be realized by an occlusion pattern predictor (Occlusion Pattern Predictor), that is, the occlusion pattern prediction network shown in the above-mentioned FIG. 9 or the occlusion pattern predictor 1002 shown in FIG. 10 .
  • the occlusion pattern predictor 1002 is also an occlusion pattern prediction network.
  • the occlusion pattern prediction network may adopt a sequential "BN-FC-BN" structure, that is, the occlusion pattern prediction network can first normalize the first occlusion indication information, then perform convolution processing on the normalized information, and normalize again after the convolution processing to obtain the predicted occlusion mode.
  • The dimension of the data output by this occlusion pattern prediction network is the same as the number of occlusion patterns; that is, the predicted occlusion mode can take the form of a multidimensional vector whose dimension equals the number of occlusion modes. For example, if the number of occlusion modes is 101, the predicted occlusion mode may be represented by a 101-dimensional vector. The value of each element in the vector represents the probability that the occlusion mode of the first sample face image is the candidate occlusion mode corresponding to that element.
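  • A possible realization of this prediction head is sketched below; treating the middle layer as a fully connected (linear) layer acting on the flattened occlusion indication information, and the input dimension, are assumptions, with only the "BN-FC-BN" ordering taken from the description above.

    import torch
    import torch.nn as nn

    class OcclusionPatternPredictor(nn.Module):
        def __init__(self, mask_dim, num_patterns=101):
            super().__init__()
            self.bn1 = nn.BatchNorm1d(mask_dim)      # first normalization
            self.fc = nn.Linear(mask_dim, num_patterns)
            self.bn2 = nn.BatchNorm1d(num_patterns)  # normalization after the FC layer

        def forward(self, mask):
            # mask: (B, C*H*W) flattened first occlusion indication information
            x = self.bn1(mask)
            x = self.fc(x)
            return self.bn2(x)  # (B, num_patterns) scores, one per candidate pattern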
  • Step 606 is a process of classifying the occlusion pattern of the first sample face image based on the first occlusion indication information of the first sample face image to obtain a predicted occlusion pattern.
  • In the above process, the first occlusion indication information is converted into the first occlusion area information, and the predicted occlusion mode is then determined by matching occlusion area information.
  • the occlusion indication information of at least two candidate occlusion modes may also be set in the electronic device, and the first occlusion indication information is directly matched with the occlusion indication information of the candidate occlusion modes.
  • This embodiment of the present application does not specifically limit which matching method is adopted.
  • the electronic device obtains a classification error based on the predicted occlusion pattern and the target occlusion pattern corresponding to the first sample face image.
  • the classification error is used to measure the difference between the predicted occlusion mode and the target occlusion mode.
  • the process of obtaining the classification error is the same as the above step 605, and can be obtained through a loss function.
  • the classification error L_pred can be determined using a cross-entropy loss function.
  • In some embodiments, the classification error is obtained by formula (1), where N is the total number of first sample face images participating in training, C is the total number of occlusion patterns, p_i is the probability that the first sample face image x_i is correctly classified, and f_i is the feature vector corresponding to the first sample face image x_i; i and j are indices, and the values of i and j are both positive integers.
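  • Formula (1) is referenced above but not reproduced in the text; under the stated definitions it is consistent with a standard softmax cross-entropy averaged over the N training images, which might be computed as in the following sketch (the use of nn.CrossEntropyLoss and the tensor shapes are assumptions).

    import torch
    import torch.nn as nn

    ce = nn.CrossEntropyLoss()  # averages -log p_i over the batch

    def classification_error(pattern_scores, target_pattern_ids):
        # pattern_scores: (N, C) outputs of the occlusion pattern predictor
        # target_pattern_ids: (N,) indices of the annotated target occlusion patterns
        return ce(pattern_scores, target_pattern_ids)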
  • the electronic device updates the model parameters of the image processing model according to the recognition error and the classification error.
  • After the electronic device obtains the two errors, it can combine them to update the model parameters, which takes into account both the robustness and accuracy of the image processing model's face recognition and the robustness and accuracy with which the image processing model determines the first occlusion indication information, so that the performance of the trained model improves in both respects.
  • the update process combining the two errors may include two ways, and the embodiment of the present application can adopt any way to implement the update step. Two options are provided below.
  • Manner 1: The electronic device obtains the product of the classification error and the weight of the classification error, takes the sum of the product and the recognition error as the target error, and updates the model parameters of the image processing model based on the target error.
  • In Manner 1, a weight can be set for the classification error, and the weight of the classification error can be set by the relevant technical personnel according to requirements.
  • The weight of the classification error can be a hyperparameter of the model or an empirical value obtained from previous model training; for example, the weight may be set to 1. In other embodiments, the weight may also be updated together with the model parameters during this model training, which is not limited in this embodiment of the present application.
  • the acquisition process of the target error L_total is realized by formula (2): L_total = L_cls + w * L_pred.
  • L_cls is the loss function of face recognition (for example, the CosFace classification loss function), and L_pred is the loss function for predicting the occlusion mode defined by formula (1).
  • w is a weight coefficient used to balance the importance of the two loss functions during training. Through cross-validation, it is found that the recognition effect is best when the value of w is 1.0.
  • Manner 2: Based on the respective weights of the classification error and the recognition error, perform a weighted sum of the classification error and the recognition error to obtain a target error, and update the model parameters of the image processing model based on the target error.
  • each error is set with a weight, and the setting of the weight is the same as that in the first method, and will not be repeated here.
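  • A single parameter-update step combining the two errors as in formula (2) might look like the sketch below; the functions recognition_loss and pattern_loss, the model interface returning both outputs, and the optimizer are assumed placeholders rather than components specified by this application.

    def training_step(model, images, id_labels, pattern_labels, optimizer, w=1.0):
        # model is assumed to return identity logits and occlusion pattern scores;
        # recognition_loss (e.g. a CosFace-style loss) and pattern_loss (the
        # cross-entropy of formula (1)) are assumed to be defined elsewhere.
        id_logits, pattern_scores = model(images)
        l_cls = recognition_loss(id_logits, id_labels)        # recognition error
        l_pred = pattern_loss(pattern_scores, pattern_labels) # classification error
        l_total = l_cls + w * l_pred                          # target error, formula (2)
        optimizer.zero_grad()
        l_total.backward()
        optimizer.step()
        return l_total.item()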
  • the image processing model can provide image processing functions.
  • In some embodiments, in response to an image processing instruction, the electronic device performs feature extraction on the target face image to be recognized based on the image processing model, performs face recognition based on the extracted second overall image feature and the second occlusion indication information corresponding to the second overall image feature, and obtains the image recognition result of the target face image.
  • the specific flow of image processing by the image processing model may refer to the following embodiment shown in FIG. 14 .
  • the process of using the model can be shown in Figure 12.
  • the electronic device can perform step 1201 of inputting the picture to be recognized, and then perform face detection and alignment based on the face preprocessing module, that is, step 1202.
  • Through step 1202, the preprocessed face image can be obtained, and the electronic device can continue to extract convolutional features (that is, image features) based on a deep convolutional network (CNN), that is, step 1203.
  • After the convolutional features are extracted, two steps can be performed based on them: in step 1204, the electronic device can generate a corresponding mask based on a mask decoder (Mask Decoder), and in step 1205, the generated mask and the convolutional features are multiplied to remove the contaminated feature elements.
  • In step 1206, the electronic device obtains the final face feature for recognition based on a fully connected network (FC), and finally, in step 1207, the electronic device can output the test face category or whether two faces belong to the same category.
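  • The flow of steps 1201-1207 might be orchestrated as in the sketch below; the component interfaces (preprocessor, cnn, mask_decoder, fc) and the assumption that the gallery features are already L2-normalized are illustrative, not fixed by the application.

    import numpy as np

    def recognize(image, preprocessor, cnn, mask_decoder, fc,
                  gallery_feats, gallery_labels):
        face = preprocessor(image)         # step 1202: detection and alignment
        feat = cnn(face)                   # step 1203: convolutional features (C, H, W)
        mask = mask_decoder(feat)          # step 1204: feature mask with values in [0, 1]
        clean = mask * feat                # step 1205: suppress contaminated elements
        embedding = fc(clean.reshape(-1))  # step 1206: final face feature
        embedding = embedding / np.linalg.norm(embedding)
        sims = gallery_feats @ embedding   # step 1207: compare against the database
        return gallery_labels[int(np.argmax(sims))]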
  • the model training process can also be shown in Figure 13.
  • The training process can include two steps: in step 1, the deep convolutional network is trained on ordinary face data; in step 2, on the basis of the model trained in step 1, mixed face data is used again to fine-tune the parameters of the entire network.
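  • The two-step procedure might be expressed as in the following sketch; the helper functions pretrain_epoch and finetune_epoch and the epoch counts are assumptions used only to show the ordering of the two stages.

    def train_two_stage(cnn, full_model, clean_loader, mixed_loader, epochs=(10, 5)):
        # Step 1: train the deep convolutional network on ordinary face data.
        for _ in range(epochs[0]):
            pretrain_epoch(cnn, clean_loader)
        # Step 2: fine-tune all parameters of the full network on mixed face data.
        for _ in range(epochs[1]):
            finetune_epoch(full_model, mixed_loader)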
  • The embodiment of the present application introduces an occlusion mode: the predicted occlusion mode of the first sample face image is determined according to the first occlusion indication information generated in the face recognition process, and is compared with the target occlusion mode annotated on the first sample face image.
  • the image processing model can be trained to output more accurate first occlusion indication information, and then face recognition is performed based on the accurate first occlusion indication information, and the obtained recognition result is more accurate.
  • In other words, the image processing model can process occluded face images more accurately, that is, the robustness of the image processing model is better.
  • In addition, the image processing model can directly perform feature extraction on the first sample face image and then perform face recognition based on the extracted image features and the corresponding first occlusion indication information.
  • Without the aid of an external network, image processing can be performed end-to-end, which significantly reduces the amount of calculation, improves the running speed of the device, and can also reduce the number of models; and because the accuracy of image processing by the image processing model is not affected by external network factors, the accuracy is significantly improved.
  • FIG. 14 is a flowchart of an image processing method provided by an embodiment of the present application. Referring to FIG. 14 , the method includes:
  • the electronic device performs feature extraction on the target face image to be recognized to obtain a second overall image feature of the target face image.
  • This step 1401 is similar to the process of acquiring the first overall image feature in the above-mentioned step 601, and details are not described here.
  • the electronic device first preprocesses the target face image, and then performs feature extraction on the preprocessed target face image to obtain the second overall image feature of the target face image.
  • the preprocessing process may be: the electronic device performs face detection on the target face image and, based on the face detection result, crops the target face image to obtain a preprocessed target face image.
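  • A sketch of such detection-and-alignment preprocessing is given below; the five-point landmark template, the 112x112 crop size, and the detector detect_five_landmarks are assumptions standing in for whatever face detector and template the system actually uses.

    import cv2
    import numpy as np

    # Assumed template positions of the five landmarks for a 112x112 aligned crop.
    TEMPLATE = np.float32([[38.3, 51.7], [73.5, 51.5], [56.0, 71.7],
                           [41.5, 92.4], [70.7, 92.2]])

    def preprocess(image):
        landmarks = detect_five_landmarks(image)  # assumed detector, returns (5, 2) points
        matrix, _ = cv2.estimateAffinePartial2D(np.float32(landmarks), TEMPLATE)
        return cv2.warpAffine(image, matrix, (112, 112))  # aligned, fixed-size face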
  • the electronic device determines second occlusion indication information corresponding to the second overall image feature, where the second occlusion indication information is used to indicate the image feature of the face occlusion area of the target face image.
  • This step 1402 is the same as the above-mentioned step 602, and is not repeated here.
  • the process of determining the second occlusion indication information may be as follows: the electronic device performs convolution processing on the second overall image feature, and classifies the image features after convolution processing to obtain the second occlusion indication information .
  • the electronic device acquires the second target image feature according to the second overall image feature of the target face image and the second occlusion indication information.
  • the electronic device recognizes the face in the target face image based on the second target image feature.
  • the steps 1403 and 1404 are the same as the steps 6011 and 6012 in the above-mentioned step 603, and will not be repeated here.
  • the above step 1403 may be: based on the second occlusion indication information, the electronic device removes the image feature of the face occlusion area in the second overall image feature to obtain the second target image feature.
  • the removal method may be implemented by multiplication, and the electronic device multiplies the second overall image feature and the second occlusion indication information to obtain the second target image feature.
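  • Concretely, this multiplication is an element-wise operation over the convolutional feature layout; the (512, 7, 7) shape below is only an example, and soft mask values near 0 suppress the contaminated elements while values near 1 keep them.

    import numpy as np

    feat = np.random.rand(512, 7, 7)   # second overall image feature (example shape)
    mask = np.random.rand(512, 7, 7)   # second occlusion indication information in [0, 1]
    target_feat = mask * feat          # second target image feature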
  • the above-mentioned image processing method can be implemented by an image processing model, and the electronic device can input the target face image into the image processing model, and the image processing model can perform the feature extraction, determine the second occlusion indication information, obtain the The second target image feature and the face recognition process output the recognition result.
  • FIG. 15 is a schematic structural diagram of an image processing model training apparatus provided by an embodiment of the present application. Referring to FIG. 15 , the apparatus includes:
  • the first obtaining module 1501 is configured to obtain, based on the image processing model, a predicted recognition result of the first sample face image and first occlusion indication information, where the first occlusion indication information is used to indicate the image feature of the face occlusion area of the first sample face image;
  • a second obtaining module 1502 configured to obtain a recognition error based on the predicted recognition result and the target recognition result corresponding to the first sample face image
  • the third obtaining module 1503 is configured to obtain a classification error based on the first occlusion indication information and the target occlusion mode corresponding to the first sample face image, where the occlusion mode of the first sample face image is used to indicate the location and size of the face occlusion area;
  • the updating module 1504 is configured to update the model parameters of the image processing model according to the recognition error and the classification error.
  • the third obtaining module 1503 is used to:
  • determine the predicted occlusion pattern of the first sample face image based on the first occlusion indication information; and obtain the classification error based on the predicted occlusion pattern and the target occlusion pattern.
  • the third obtaining module 1503 is used to:
  • obtain first occlusion area information based on the first occlusion indication information, where the first occlusion area information is used to indicate the occlusion status of multiple image blocks in the first sample face image;
  • match the first occlusion area information with the occlusion area information corresponding to at least two candidate occlusion modes to obtain at least two matching degrees; and
  • determine the predicted occlusion mode among the at least two candidate occlusion modes according to the at least two matching degrees.
  • the update module 1504 is configured to perform any of the following:
  • obtain the product of the classification error and the weight of the classification error, take the sum of the product and the recognition error as the target error, and update the model parameters of the image processing model based on the target error; or
  • based on the respective weights of the classification error and the recognition error, perform a weighted sum of the classification error and the recognition error to obtain a target error, and update the model parameters of the image processing model based on the target error.
  • the first obtaining module 1501 includes: a first obtaining unit and an identifying unit;
  • the first obtaining unit is configured to obtain, based on the image processing model, a first overall image feature of the first sample face image and first occlusion indication information corresponding to the first overall image feature;
  • the recognition unit is configured to perform face recognition based on the first overall image feature and the first occlusion indication information, and obtain a predicted recognition result of the first sample face image.
  • the first acquisition unit includes: a feature extraction subunit and a determination subunit;
  • the feature extraction subunit is configured to perform feature extraction on the first sample face image based on the image processing model to obtain the first overall image feature;
  • the determining subunit is configured to determine first occlusion indication information corresponding to the first overall image feature.
  • the feature extraction subunit is used to:
  • perform convolution on the first overall image feature, and classify the convolved image features to obtain the first occlusion indication information.
  • the identifying unit includes: removing a subunit and identifying a subunit;
  • the removing subunit is configured to remove the image feature of the face occlusion area in the first overall image feature based on the first occlusion indication information to obtain the first target image feature;
  • the identifying subunit is configured to perform face recognition on the first sample face image according to the first target image feature to obtain the predicted recognition result.
  • the removal subunit is configured to multiply the first overall image feature and the first occlusion indication information to obtain the first target image feature.
  • the first obtaining unit further includes a preprocessing subunit
  • the preprocessing subunit is configured to preprocess the first sample face image based on the image processing model
  • the feature extraction subunit and the determination subunit are configured to acquire the first overall image feature and the first occlusion indication information based on the preprocessed first sample face image.
  • the preprocessing subunit is used to:
  • perform face detection on the first sample face image based on the image processing model, and crop the first sample face image based on the face detection result to obtain a preprocessed first sample face image.
  • the image processing model includes a convolutional neural network, a decoder, a recognition network, and an occlusion pattern prediction network;
  • the convolutional neural network is used to perform the preprocessing step and the step of acquiring the first overall image feature;
  • the decoder is configured to perform the step of obtaining the first occlusion indication information
  • the recognition network is configured to perform face recognition based on the first overall image feature and the first occlusion indication information, and obtain a predicted recognition result of the first sample face image;
  • the occlusion mode prediction network is configured to determine the predicted occlusion mode of the first sample face image based on the first occlusion indication information.
  • the apparatus further includes a training module for training the convolutional neural network based on a second sample face image, where the face is not occluded in the second sample face image .
  • In some embodiments, the apparatus further includes a recognition module, which is configured to, in response to an image processing instruction, perform feature extraction on the target face image to be recognized based on the image processing model, perform face recognition based on the extracted second overall image feature and the second occlusion indication information corresponding to the second overall image feature, and obtain an image recognition result of the target face image.
  • The embodiment of the present application introduces an occlusion mode: the predicted occlusion mode of the first sample face image is determined according to the first occlusion indication information generated in the face recognition process and is compared with the target occlusion mode corresponding to the first sample face image.
  • In this way, the image processing model can be trained to determine more accurate first occlusion indication information; face recognition is then performed based on the accurate first occlusion indication information, the obtained recognition result is more accurate, and the image processing model can process occluded face images more accurately, that is, the robustness of the image processing model is better.
  • the image processing model can directly process the first sample face image to obtain the recognition result, and can perform image processing end-to-end without the aid of an external network, thus significantly reducing the amount of calculation and improving the running speed of the device. It can also effectively reduce the number of models, and since the accuracy of the image processing model for processing images is not affected by external network factors, the accuracy is significantly improved.
  • It should be noted that when the image processing model training apparatus provided in the above embodiments trains an image processing model, the division into the above functional modules is used only as an example for illustration.
  • In practical applications, the above functions can be assigned to different functional modules as required, that is, the internal structure of the image processing model training apparatus is divided into different functional modules to complete all or part of the functions described above.
  • the image processing model training apparatus provided in the above embodiments and the image processing model training method embodiments belong to the same concept, and the specific implementation process thereof is detailed in the method embodiments, which will not be repeated here.
  • FIG. 16 is a schematic structural diagram of a terminal provided by an embodiment of the present application.
  • the terminal 1600 may be a portable mobile terminal, such as a smart phone, a tablet computer, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (Moving Picture Experts Group Audio Layer IV) player, a laptop computer, or a desktop computer.
  • Terminal 1600 may also be called user equipment, portable terminal, laptop terminal, desktop terminal, and the like by other names.
  • the terminal 1600 includes: a processor 1601 and a memory 1602 .
  • the processor 1601 may include one or more processing cores, such as a 4-core processor, an 8-core processor, and the like.
  • the processor 1601 may be implemented in at least one hardware form of DSP (Digital Signal Processing), FPGA (Field-Programmable Gate Array), and PLA (Programmable Logic Array).
  • the processor 1601 may also include a main processor and a coprocessor.
  • the main processor is a processor used to process data in the wake-up state, also called a CPU (Central Processing Unit); the coprocessor is a low-power processor used to process data in the standby state.
  • the processor 1601 may be integrated with a GPU (Graphics Processing Unit), and the GPU is used for rendering and drawing the content that needs to be displayed on the display screen.
  • the processor 1601 may further include an AI (Artificial Intelligence, artificial intelligence) processor, where the AI processor is used to process computing operations related to machine learning.
  • Memory 1602 may include one or more computer-readable storage media, which may be non-transitory. Memory 1602 may also include high-speed random access memory as well as non-volatile memory, such as one or more disk storage devices or flash storage devices. In some embodiments, the non-transitory computer-readable storage medium in the memory 1602 is used to store at least one instruction, and the at least one instruction is to be executed by the processor 1601 to implement the image processing model training method or the image processing method provided by the method embodiments in this application.
  • The structure shown in FIG. 16 does not constitute a limitation on the terminal 1600, which may include more or fewer components than shown, combine some components, or adopt a different component arrangement.
  • FIG. 17 is a schematic structural diagram of a server provided by an embodiment of the present application.
  • the server 1700 may vary greatly due to different configurations or performance, and may include one or more processors (Central Processing Units, CPU) 1701 and one or more memories 1702, wherein at least one piece of program code is stored in the memory 1702, and the at least one piece of program code is loaded and executed by the processor 1701 to realize the image processing model training method provided by the above-mentioned various method embodiments or image processing method.
  • the server can also have components such as wired or wireless network interfaces and input/output interfaces, and the server can also include other components for implementing device functions, which are not described here.
  • a computer-readable storage medium, such as a memory including at least one piece of program code, is also provided, and the above-mentioned at least one piece of program code is executable by a processor to complete the image processing model training method or the image processing method in the above-mentioned embodiments.
  • the computer-readable storage medium can be a read-only memory (Read-Only Memory, referred to as: ROM), a random access memory (Random Access Memory, referred to as: RAM), a read-only optical disk (Compact Disc Read-Only Memory, referred to as: CD-ROM), magnetic tapes, floppy disks, and optical data storage devices, etc.
  • a computer program product or computer program is also provided, the computer program product or the computer program including one or more pieces of program codes stored in a computer-readable storage medium.
  • One or more processors of the electronic device can read the one or more pieces of program code from the computer-readable storage medium, and the one or more processors execute the one or more pieces of program code, so that the electronic device can execute the above image processing model training method or image processing method.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • General Health & Medical Sciences (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Human Computer Interaction (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • Computing Systems (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

This application discloses an image processing model training method, apparatus, device, and storage medium, belonging to the field of artificial intelligence technology. The method includes: based on an image processing model, obtaining a predicted recognition result and first occlusion indication information of a first sample face image, the first occlusion indication information being used to indicate image features of a face occlusion area of the first sample face image; obtaining a recognition error based on the predicted recognition result and a target recognition result corresponding to the first sample face image; obtaining a classification error based on the first occlusion indication information and a target occlusion mode corresponding to the first sample face image, where the occlusion mode of the first sample face image is used to indicate the position and size of the face occlusion area; and updating model parameters of the image processing model according to the recognition error and the classification error.

Description

图像处理模型训练方法、装置、设备及存储介质
本申请要求于2020年08月20日提交的申请号为202010845864.9、发明名称为“图像处理模型训练方法、装置、设备及存储介质”的中国专利申请的优先权,其全部内容通过引用结合在本申请实施例中。
技术领域
本申请涉及人工智能技术领域,特别涉及一种图像处理模型训练方法、装置、设备及存储介质。
背景技术
随着计算机技术的发展,人工智能应用在各个领域,基于人工智能来代替人的工作,能够大大提高业务处理效率。在图像处理方面,基于人工智能技术能够对图像处理模型进行训练,得到训练好的图像处理模型,之后,将待识别的图像输入训练好的图像处理模型,即可得到相应的处理结果。
发明内容
本申请实施例提供了一种图像处理模型训练方法、装置、设备及存储介质。所述技术方案如下:
一方面,提供了一种图像处理模型训练方法,所述方法包括:
基于图像处理模型,获取第一样本人脸图像的预测识别结果和第一遮挡指示信息,所述第一遮挡指示信息用于指示所述第一样本人脸图像的人脸遮挡区域的图像特征;
基于所述预测识别结果和所述第一样本人脸图像对应的目标识别结果,获取识别误差;
基于所述第一遮挡指示信息和所述第一样本人脸图像对应的目标遮挡模式,获取分类误差,其中,所述第一样本人脸图像的遮挡模式用于指示所述人脸遮挡区域的位置以及尺寸;
根据所述识别误差和所述分类误差,对所述图像处理模型的模型参数进行更新。
另一方面,提供了一种图像处理方法,所述方法包括:
响应于图像处理指令,对待识别的目标人脸图像进行特征提取,得到所述目标人脸图像的第二整体图像特征;
确定所述第二整体图像特征对应的第二遮挡指示信息,所述第二遮挡指示信息用于指示所述目标人脸图像的人脸遮挡区域的图像特征;
根据所述第二遮挡指示信息,去除所述第二整体图像特征中所述人脸遮挡区域的图像特征,得到第二目标图像特征;
基于所述第二目标图像特征,对所述目标人脸图像进行人脸识别。
另一方面,提供了一种图像处理模型训练装置,所述装置包括:
第一获取模块,用于基于图像处理模型,获取第一样本人脸图像的预测识别结果和第一遮挡指示信息,所述第一遮挡指示信息用于指示所述第一样本人脸图像的人脸遮挡区域的图 像特征;
第二获取模块,用于基于所述预测识别结果和所述第一样本人脸图像对应的目标识别结果,获取识别误差;
第三获取模块,用于基于所述第一遮挡指示信息和所述第一样本人脸图像对应的目标遮挡模式,获取分类误差,其中,所述第一样本人脸图像的遮挡模式用于指示所述人脸遮挡区域的位置以及尺寸;
更新模块,用于根据所述识别误差和所述分类误差,对所述图像处理模型的模型参数进行更新。
另一方面,提供了一种电子设备,所述电子设备包括一个或多个处理器和一个或多个存储器,所述一个或多个存储器中存储有至少一条程序代码,所述至少一条程序代码由所述一个或多个处理器加载并执行以实现上述的图像处理模型训练方法。
另一方面,提供了一种计算机可读存储介质,所述存储介质中存储有至少一条程序代码,所述至少一条程序代码由处理器加载并执行以实现上述的图像处理模型训练方法。
另一方面,提供了一种计算机程序产品或计算机程序,所述计算机程序产品或所述计算机程序包括一条或多条程序代码,所述一条或多条程序代码存储在计算机可读存储介质中。电子设备的一个或多个处理器能够从计算机可读存储介质中读取所述一条或多条程序代码,所述一个或多个处理器执行所述一条或多条程序代码,使得电子设备能够执行上述的图像处理模型训练方法。
另一方面,提供了一种电子设备,所述电子设备包括一个或多个处理器和一个或多个存储器,所述一个或多个存储器中存储有至少一条程序代码,所述至少一条程序代码由所述一个或多个处理器加载并执行以实现上述的图像处理方法。
另一方面,提供了一种计算机可读存储介质,所述存储介质中存储有至少一条程序代码,所述至少一条程序代码由处理器加载并执行以实现上述的图像处理方法。
另一方面,提供了一种计算机程序产品或计算机程序,所述计算机程序产品或所述计算机程序包括一条或多条程序代码,所述一条或多条程序代码存储在计算机可读存储介质中。电子设备的一个或多个处理器能够从计算机可读存储介质中读取所述一条或多条程序代码,所述一个或多个处理器执行所述一条或多条程序代码,使得电子设备能够执行上述的图像处理方法。
本申请实施例引入了遮挡模式,通过人脸识别过程中产生的第一遮挡指示信息,确定该第一样本人脸图像的预测遮挡模式,并与该第一样本人脸图像对应的目标遮挡模式做对比,以此能够训练图像处理模型确定出更准确的第一遮挡指示信息,进而基于准确的第一遮挡指示信息进行人脸识别,得到的识别结果也就更准确,该图像处理模型能够更准确地处理存在遮挡的人脸图像,也即是该图像处理模型的鲁棒性更好。换言之,该图像处理模型能够直接对第一样本人脸图像进行处理得到识别结果,无需借助外部网络,能够端到端地进行图像处理,因此显著地减少了计算量,提升了设备的运行速度,也能够有效减少模型的个数,且由于该图像处理模型处理图像的准确性不受外部网络因素影响,因此准确性得到了显著提升。
附图说明
为了更清楚地说明本申请实施例中的技术方案,下面将对实施例描述中所需要使用的附图作简单地介绍,显而易见地,下面描述中的附图仅仅是本申请的一些实施例,对于本领域普通技术人员来讲,在不付出创造性劳动的前提下,还能够根据这些附图获得其他的附图。
图1是本申请实施例提供的一种图像处理模型训练方法的实施环境的示意图;
图2是本申请实施例提供的一种考勤系统的示意图;
图3是本申请实施例提供的一种图像处理模型训练方法的流程图;
图4是相关技术提供的一种图像处理模型训练方法的示意图;
图5是相关技术提供的一种图像处理模型使用方法的示意图;
图6是本申请实施例提供的一种图像处理模型训练方法的流程图;
图7是本申请实施例提供的一种人脸图像的示意图;
图8是本申请实施例提供的一种人脸遮挡区域和遮挡模式数量确定的示意图;
图9是本申请实施例提供的一种图像处理模型的结构示意图;
图10是本申请实施例提供的一种图像处理模型的结构示意图;
图11是本申请实施例提供的一种解码器的结构示意图;
图12是本申请实施例提供的一种图像处理模型使用过程的示意图;
图13是本申请实施例提供的一种图像处理模型训练过程的示意图;
图14是本申请实施例提供的一种图像处理方法的流程图;
图15是本申请实施例提供的一种图像处理模型训练装置的结构示意图;
图16是本申请实施例提供的一种终端的结构示意图;
图17是本申请实施例提供的一种服务器的结构示意图。
具体实施方式
为使本申请的目的、技术方案和优点更加清楚,下面将结合附图对本申请实施方式作进一步地详细描述。
本申请中术语“第一”“第二”等字样用于对作用和功能基本相同的相同项或相似项进行区分,应理解,“第一”、“第二”、“第n”之间不具有逻辑或时序上的依赖关系,也不对数量和执行顺序进行限定。还应理解,尽管以下描述使用术语第一、第二等来描述各种元素,但这些元素不应受术语的限制。这些术语只是用于将一元素与另一元素区别分开。例如,在不脱离各种所述示例的范围的情况下,第一图像能够被称为第二图像,并且类似地,第二图像能够被称为第一图像。第一图像和第二图像都是图像,并且在某些情况下,是单独且不同的图像。
本申请中术语“至少一个”的含义是指一个或多个,本申请中术语“多个”的含义是指两个或两个以上,例如,多个数据包是指两个或两个以上的数据包。
还应理解,本文中所使用的术语“和/或”是指并且涵盖相关联的所列出的项目中的一个或多个项目的任何和全部可能的组合。术语“和/或”,是一种描述关联对象的关联关系,表示能够存在三种关系,例如,A和/或B,能够表示:单独存在A,同时存在A和B,单独存在B这三种情况。另外,本申请中的字符“/”,一般表示前后关联对象是一种“或”的关系。
还应理解,在本申请的各个实施例中,各个过程的序号的大小并不意味着执行顺序的先后,各过程的执行顺序应以其功能和内在逻辑确定,而不应对本申请实施例的实施过程构成任何限定。
还应理解,根据A确定B并不意味着仅仅根据A确定B,还能够根据A和/或其它信息确定B。
下面对本申请涉及到的名词进行说明。
人脸遮挡区域是指图像中人脸被遮挡的区域。
卷积特征:是深度卷积网络的卷积层输出,通常是具有C个通道,高为H,宽为W的三维张量,即f(·)∈R C*H*W。卷积特征元素指的是坐标为(C,H,W)的张量元素。其中,张量概念是矢量概念的推广,矢量是一阶张量。在一些实施例中,张量是一个多线性函数,可用来表示一些矢量、标量和其他张量之间的线性关系。
特征掩码:是与卷积特征大小相同的三维张量,特征掩码中每位元素的取值在[0,1]之间。在一些实施例中,该特征掩码的作用是去除被污染的特征元素,该被污染的特征元素也就是指人脸遮挡区域的特征元素。
端到端系统:是指该系统无需借助外部网络或系统的帮助,只依靠自身系统,从输入得到预期的输出。端到端也即是指上述仅依靠自身,从输入得到预期的输出的方式。
鲁棒性:鲁棒是Robust的音译,是健壮和强壮的意思。在计算机方面,它是指在异常和危险情况下系统生存的能力。比如说,计算机软件在输入错误、磁盘故障、网络过载或有意攻击情况下,能否不死机、不崩溃,就是该计算机软件的鲁棒性。所谓“鲁棒性”,也是指控制系统在参数(例如,结构或大小)摄动下,维持其它某些性能的特性。
下面对本申请的实施环境进行说明。
图1是本申请实施例提供的一种图像处理模型训练方法的实施环境的示意图。该实施环境包括终端101,或者,该实施环境包括终端101和图像处理平台102。终端101通过无线网络或有线网络与图像处理平台102相连。
终端101能够是智能手机、游戏主机、台式计算机、平板电脑、电子书阅读器、MP3(Moving Picture Experts Group Audio Layer III,动态影像专家压缩标准音频层面3)播放器或MP4(Moving Picture Experts Group Audio Layer IV,动态影像专家压缩标准音频层面4)播放器,膝上型便携计算机,安检设备,考勤设备中的至少一种。终端101安装和运行有支持图像处理模型训练的应用程序,例如,该应用程序能够是安检应用、考勤应用、系统应用、即时通讯应用、购物应用、在线视频应用、社交应用。
在一些实施例中,该终端101具有图像采集功能和图像处理功能,能够对采集到的图像进行图像处理,并根据处理结果执行相应的功能。该终端101能够独立完成该工作,也能够通过图像处理平台102为其提供数据服务或图像处理服务。
在一些实施例中,图像处理平台102能够获取样本人脸图像训练图像处理模型,在终端101采集到图像后,将采集到的图像发送至图像处理平台102,由图像处理平台102基于训练好的图像处理模型为终端101提供图像处理服务。
图像处理平台102包括一台服务器、多台服务器、云计算平台和虚拟化中心中的至少一种。图像处理平台102用于为支图像处理模型训练的应用程序提供后台服务。在一些实施例中,图像处理平台102承担主要处理工作,终端101承担次要处理工作;或者,图像处理平台102承担次要处理工作,终端101承担主要处理工作;或者,图像处理平台102或终端101分别能够单独承担处理工作。或者,图像处理平台102和终端101两者之间采用分布式计算架构进行协同计算。
在一些实施例中,该图像处理平台102包括至少一台服务器1021以及数据库1022,该数据库1022用于存储数据,在本申请实施例中,该数据库1022中能够存储样本人脸图像,为至少一台服务器1021提供数据服务。
服务器1021能够是独立的物理服务器,也能够是多个物理服务器构成的服务器集群或者分布式系统,还能够是提供云服务、云数据库、云计算、云函数、云存储、网络服务、云通信、中间件服务、域名服务、安全服务、CDN(Content Delivery Network,内容分发网络)、以及大数据和人工智能平台等基础云计算服务的云服务器。终端能够是智能手机、平板电脑、笔记本电脑、台式计算机、智能音箱、智能手表等,但并不局限于此。
本领域技术人员能够知晓,上述终端101、服务器1021的数量能够更多或更少。比如上 述终端101、服务器1021仅为一个,或者上述终端101、服务器1021为几十个或几百个,或者更多数量,本申请实施例对终端或服务器的数量不加以限定。另外,本申请实施例对终端或服务器的设备类型也不加以限定。
下面对本申请的应用场景进行说明。
利用本申请实施例提供的图像处理模型训练方法,训练得到图像处理模型后,该图像处理模型能够提供图像处理服务,该图像处理服务能够应用于任意的人脸识别场景,在任意的人脸识别场景中,无论采集到的人脸图像中人脸是否被遮挡,均可以由该图像处理模型准确地进行识别。例如,该图像处理模型可以应用于考勤系统、安检系统、手机或电脑的人脸解锁、或人脸识别支付等人脸识别场景。用户只需要在系统初始建立时上传一张正面无遮挡人脸图像,作为待识别图像存储在系统数据库中,识别时只需获取用户的待识别图像即可,无需其他多余操作。
例如,在考勤这个应用场景中,如图2所示,在考勤系统200中,用户201只需要位于考勤设备202的摄像头前面,由考勤设备202为该用户201采集人脸图像203,然后对该人脸图像203进行人脸识别,确定该用户201的身份信息204,即可记录该身份信息204的用户已进行考勤打卡。当然,在其他应用场景中,可以将考勤设备202替换为其他设备,比如安检系统中,该考勤设备能够被替换为安检设备,在识别得到身份信息后,能够显示“身份验证通过”,或者,安检设施放行。
图3是本申请实施例提供的一种图像处理模型训练方法的流程图,该方法应用于电子设备中,该电子设备为终端或服务器,参见图3,该方法包括以下步骤。
301、电子设备基于图像处理模型,获取第一样本人脸图像的预测识别结果和第一遮挡指示信息,该第一遮挡指示信息用于指示该第一样本人脸图像的人脸遮挡区域的图像特征。
第一样本人脸图像是包括有人脸的图像,在此将包括有人脸的图像作为样本来训练图像处理模型。样本(specimen)是指观测或调查的一部分个体。
在本申请实施例中,图像处理模型用于对输入的第一样本人脸图像进行处理,输出预测识别结果。该第一样本人脸图像可以包括不存在遮挡的人脸图像,也可以包括存在遮挡的人脸图像。在训练过程中,基于该第一样本人脸图像对图像处理模型进行训练,相应地,在使用过程中,无论待识别的目标人脸图像是否存在遮挡,该图像处理模型均能够准确进行人脸识别。
在图像处理过程中,图像处理模型能够对第一样本人脸图像进行特征提取,再确定出哪些图像特征是受人脸遮挡区域影响的图像特征,将其去除后,即可基于未被遮挡人脸区域的图像特征进行人脸识别。上述第一遮挡指示信息就是用来标识出这些受人脸遮挡区域影响的图像特征,也可以将这些受人脸遮挡区域影响的图像特征作为被污染的图像特征。
该步骤301中,该图像处理模型的模型参数可以为初始值,该初始值可以通过初始化得到,或者,也可以是通过其他的第一样本人脸图像进行预训练得到的初始值,本申请实施例对此不作限定。
302、电子设备基于该预测识别结果和该第一样本人脸图像对应的目标识别结果,获取识别误差。
该识别误差用于确定是否需要调整图像处理模型的模型参数以及如何调整模型参数,以提高图像处理模型处理图像的准确性。
在训练过程中,该图像处理模型输出的预测识别结果为该图像处理模型预测的识别结果,也可以将其称为“预测值”。该预测识别结果的准确性与该图像处理模型处理图像的准确性一致。每个第一样本人脸图像所标注的目标识别结果是真实的、正确的识别结果,也可以将其称为“真值”。通过对比“预测值”和“真值”得到的识别误差,能够衡量该预测识别结果的准确性,也就能够衡量该图像处理模型处理图像的准确性。
可以理解地,如果该识别误差比较大,则该图像处理模型处理图像的准确性比较差;如果该识别误差比较小,则该图像处理模型处理图像的准确性比较好。
303、电子设备基于该第一遮挡指示信息和该第一样本人脸图像对应的目标遮挡模式,获取分类误差。
其中,该第一样本人脸图像的遮挡模式用于指示该人脸遮挡区域的位置以及尺寸。
在一些实施例中,基于该第一遮挡指示信息和该第一样本人脸图像对应的目标遮挡模式,获取分类误差,包括:基于该第一遮挡指示信息,确定该第一样本人脸图像的预测遮挡模式。之后,基于该预测遮挡模式和该目标遮挡模式,获取所分类误差。其中,确定该第一样本人脸图像的预测遮挡模式,也即是对该第一样本人脸图像的遮挡模式进行分类,得到该预测遮挡模式。
不同的第一样本人脸图像中人脸遮挡区域的位置或尺寸可能不同,自然地,受人脸遮挡区域影响的图像特征不同,也即是第一遮挡指示信息不同。根据人脸遮挡区域的位置和尺寸的不同,设置有不同的遮挡模式,每种遮挡模式对应一种人脸遮挡区域的位置和尺寸。
预测遮挡模式是“预测值”,目标遮挡模式是真实的、正确的遮挡模式,也即是一种“真值”,通过对比该遮挡模式的“预测值”和“真值”,得到的分类误差能够衡量预测遮挡模式的准确性,由于该预测遮挡模式是基于第一遮挡指示信息确定的,因此也就能够衡量第一遮挡指示信息的准确性。
在训练过程中,加入遮挡模式的监督学习,能够监督图像处理模型学习到准确的第一遮挡指示信息的能力,进而根据准确的第一遮挡指示信息与图像特征进行人脸识别,得到的识别结果也就更准确。
304、电子设备根据该识别误差和该分类误差,对该图像处理模型的模型参数进行更新。
在训练过程中,既考虑到了识别误差,又考虑到了分类误差,其中,该识别误差训练该图像处理模型具备良好的人脸识别能力,该分类误差训练该图像处理模型输出更准确的第一遮挡指示信息,进而提升人脸识别的准确性。
一方面,本申请实施例引入了遮挡模式,通过人脸识别过程中产生的第一遮挡指示信息,确定该第一样本人脸图像的预测遮挡模式,并与该第一样本人脸图像所标注的目标遮挡模式做对比,以此训练图像处理模型输出更准确的遮挡指示信息,进而基于准确的遮挡指示信息进行人脸识别,得到的识别结果也就更准确,换言之,该图像处理模型能够更准确地处理存在遮挡的人脸图像,也即是该图像处理模型的鲁棒性更好。另一方面,该图像处理模型能够直接对第一样本人脸图像进行处理得到识别结果,无需借助外部网络,能够端到端地进行图像处理,因此显著地减少了计算量,提升了设备的运行速度,也能够有效减少模型的个数,且由于该图像处理模型处理图像的准确性不受外部网络因素影响,因此准确性得到了显著提升。
下面提供一种相关技术中图像处理模型的训练方法,并对该相关技术与本申请提供的方法进行对比分析。
图4和图5分别是相关技术提供的一种图像处理模型训练方法和图像处理模型使用方法,图4示出了一种成对差分孪生网络,该成对差分孪生网络能够显式地学习人脸遮挡区域与受该人脸遮挡区域影响的图像特征之间的映射关系,可以将其称为学习掩码生成器(Learn Mask Generators)。基于该映射关系,建立一个遮挡块(人脸遮挡区域)-掩码对应的字典,该掩码即用于指示人脸遮挡区域受影响大的图像特征,该过程也即是建立掩码字典(Establish Mask Dictionary)的过程。该掩码字典(Mask Dictionary)中的每个索引项表示人脸上某块区域发生遮挡时,受影响大的图像特征;如果将图像特征中的每个元素称之为卷积特征元素,该字典中的每个索引项表示人脸上某块区域发生遮挡时,受影响大的顶层卷积特征元素。如图5所示,在测试时,先利用全卷积网络(Fully Convolutional Network,FCN)检测输入的人脸 图像的人脸遮挡区域,然后根据上述字典,能够得到该遮挡条件下应该被去除的卷积特征元素,利用掩码去除这些卷积特征元素后再进行识别。
相关技术中通过外部网络(上述成对差分孪生网络)学习映射关系,将其建立为字典,后续在对图像处理模型进行训练以及使用图像处理模型时,均需要通过该外部网络检测人脸遮挡区域,通过查询字典得到映射后的掩码,该图像处理模型基于映射后的掩码再进行人脸识别。图像处理模型依赖于外部网络,且外部网络与图像处理模型分开训练。由于外部网络的存在导致计算量显著增加,设备的运行速度较慢。该外部网络检测人脸遮挡区域的精度,也极大地影响应该被去除的特征元素的确定,即如果人脸遮挡区域检测不准确,则后续对被污染的特征元素的去除也会不准确,从而干扰到最后的人脸识别。
一方面,本申请实施例提供的图像处理模型为一种端到端系统,无需借助外部网络,该图像处理模型能够基于输入的人脸图像,动态学习人脸遮挡区域与受该人脸遮挡区域影响的图像特征之间的映射关系,这样该图像处理模型能够直接基于输入的人脸图像输出识别结果,能够显著地减少计算量,提升设备的运行速度,也能够有效减少模型的个数,且该图像处理模型处理图像的准确性不受外部网络因素影响,准确性得到了显著提升。
另一方面,本申请实施例通过引入遮挡模式,训练图像处理模型能够输出更准确的第一遮挡指示信息,进而提高了该图像处理模型处理图像的准确性,该图像处理模型能够更准确地处理存在遮挡的人脸图像,也即是该图像处理模型的鲁棒性更好。
另外,相关技术中建立遮挡块-掩码对应字典的过程中,将人脸分成了9个不同区域。对于每一个区域均需要独立训练一个网络来学习该区域的映射。也即是,在建立字典时需要训练9个不同的模型,大大增加了训练时长和训练成本,且由于模型个数较多进而占据的存储空间会非常大,不容易将其部署到实际应用上去。
本申请实施例仅需要训练一个图像处理模型,该图像处理模型能够动态地根据人脸图像确定出它的第一遮挡指示信息,能够大大降低模型训练时长和训练成本,该图像处理模型相对于相关技术中的9个模型,更容易部署到各种类型的设备上以实现相应的图像处理功能,因而,该图像处理模型的适用性和实用性更好。
图6是本申请实施例提供的一种图像处理模型训练方法的流程图,该方法应用于电子设备中,该电子设备为终端或服务器,参见图6,该方法包括以下步骤。
600、电子设备获取第一样本人脸图像,该第一样本人脸图像标注有目标识别结果和目标遮挡模式。
该第一样本人脸图像可以包括未遮挡人脸图像,也可以包括遮挡人脸图像。其中,未遮挡人脸图像是指人脸未被遮挡的图像,可以称其为干净的人脸图像。遮挡人脸图像是指人脸被遮挡的图像,可以称其为带遮挡的人脸图像。
例如,如图7中的(a)所示,图像701中人脸完整体现,并不存在遮挡,该图像701也即是未遮挡人脸图像、干净的人脸图像。如图7中的(b)所示,图像702中人脸的一部分被另一张图片或其他图案遮挡,从图像702中仅能够清楚看到部分人脸,因而,该图像702也即是遮挡人脸图像、带遮挡的人脸图像。
对于该第一样本人脸图像的获取过程,根据该第一样本人脸图像的存储地址不同,电子设备可以通过多种方式获取第一样本人脸图像。在一些实施例中,该第一样本人脸图像可以存储于图像数据库中,电子设备需要对图像处理模型进行训练时,可以从该数据库中提取该第一样本人脸图像。
在另一些实施例中,该第一样本人脸图像可以为网站中的资源,电子设备能够从目标网站下载第一样本人脸图像。
在另一些实施例中,该第一样本人脸图像可以存储于该电子设备中,例如,该第一样本人脸图像为其他设备发送至该电子设备的历史图像,或者,该电子设备生成的图像,电子设 备可以从本地存储空间中提取该第一样本人脸图像。
上述提供了获取第一样本人脸图像的几种可能实现方式,电子设备还可以通过其他方式获取第一样本人脸图像,本申请实施例对第一样本人脸图像的获取方式不作具体限定。
对于目标识别结果,该图像处理模型的识别功能不同,该目标识别结果也不同。例如,该图像处理模型用于对图像中人脸进行身份认证,相应地,该目标识别结果为身份认证信息。又例如,该图像处理模型用于对图像中人脸的人脸属性或人脸类型进行识别,比如判断人脸是否带有眼镜,又比如判断人脸的性别等,相应地,该目标识别结果为人脸属性或人脸类型。
在一些实施例中,该目标识别结果可以与该第一样本人脸图像存储在一起,该第一样本人脸图像标注有该目标识别结果。例如,电子设备获取第一样本人脸图像时,可以获取第一样本人脸图像以及对应的目标识别结果。
在另一些实施例中,该目标识别结果基于标注操作确定。例如,可以由相关技术人员对第一样本人脸图像进行标注,标注出每个第一样本人脸图像对应的目标识别结果。
对于目标遮挡模式,该目标遮挡模式用于指示该第一样本人脸图像中人脸遮挡区域的位置以及尺寸。第一样本人脸图像可以包括多种遮挡模式,该目标遮挡模式即为多种遮挡模式中的一种。
在一些实施例中,第一样本人脸图像或待识别的目标人脸图像可以包括至少两个区域,每种遮挡模式对应有遮挡区域信息,该遮挡区域信息用于指示上述至少两个区域中每个区域是否被遮挡。根据人脸遮挡区域的不同,可以划分得到多种遮挡模式。
在一些实施例中,该第一样本人脸图像或待识别的目标人脸图像包括能够分成K*K个区域,每个区域都代表了人脸中一个可能被遮挡的小块(也即是图像块)。其中,K为大于1的整数。这样在不同区域被遮挡时,能够得到不同的遮挡模式。
电子设备可以根据划分出的区域的数量,获取遮挡模式的数量。在一些实施例中,每个区域都存在两种可能性,即遮挡与不被遮挡,则可以将遮挡情况划分为2 K*K种不同的遮挡模式。例如,当K取4的时候,会有65536中不同的遮挡模式。
在一些实施例中,考虑到遮挡模式与区域数量之间的指数关系,如果增加K以提高人脸划分的精细度,则会带来指数型的遮挡模式数量的增长。这样可能会影响到图像处理速度。在此提出一种新的遮挡模式确定机制。通过对人脸图像的观察发现,邻近的区域通常具有相似的遮挡状态,也即是,一个区域被遮挡时,与该区域邻近的区域被遮挡的可能性比较大。比如当左眼区域被遮挡时,右眼区域也有比较大的概率被遮挡,在此称这个特性为邻近性。基于该邻近性,可以对人脸遮挡区域进行约束,进而能够确定出数量少的遮挡模式。
在一些实施例中,在此约束遮挡模式覆盖m*n个区域,其中,m,n的取值范围为[1,K],m和n分别为人脸遮挡区域的宽和高。例如,如图8中的(a)所示,该图中示出了当K取4时的几种遮挡模式下人脸遮挡区域801(以加粗框线标出)的位置和尺寸。一个人脸图像中的人脸遮挡区域为连通域,且该人脸遮挡区域为四边形区域。由此电子设备可以根据人脸划分的区域的数量,获取到遮挡模式的数量。例如,如图8中的(b)所示,当K取4时,人脸遮挡区域的尺寸发生变化时对应的遮挡模式的数量802。该矩阵中第(i,j)位置上的值代表的是,人脸遮挡区域的尺寸为(i*j)时遮挡模式的数量。比如,(1,1)位置上的值16,该值是指人脸遮挡区域的尺寸为1*1时,遮挡模式可以包括16种,也即是,该人脸遮挡区域位于16个区域时对应16种遮挡模式。矩阵中其它位置上的值的计算方式同理,在此不一一列举。则当K取4时,能够确定出101种遮挡模式。在一些实施例中,图8中的(b)示出了人脸存在遮挡的情况,共16+12+12+8+8+9+4+6+6+4+3+4+3+2+2+1=100种,还有一种人脸不存在遮挡的情况。
601、电子设备基于图像处理模型,对第一样本人脸图像进行特征提取,得到该第一样本人脸图像的图像特征。
其中,该第一样本图像的图像特征,在本文中也称为第一整体图像特征。
该图像处理模型为初始模型,电子设备可以将第一样本人脸图像输入图像处理模型中,由图像处理模型对第一样本人脸图像进行处理。在一些实施例中,电子设备可以对第一样本人脸图像进行特征提取,以图像特征来对第一样本人脸图像的像素特点或像素之间关系进行表达。
在一些实施例中,电子设备在对第一样本人脸图像进行特征提取之前,可以先对第一样本人脸图像进行预处理,再对预处理后的第一样本人脸图像进行特征提取。通过预处理过程使得特征提取时的第一样本人脸图像更符合特征提取规范,以提高处理效率,降低计算复杂度和计算量,提高提取到的图像特征的准确性。在一些实施例中,该步骤601可以通过下述6011和6012实现。在该实现方式中,该步骤601至步骤602也即是基于该图像处理模型,对该第一样本人脸图像进行预处理,基于预处理后的第一样本人脸图像,获取该第一样本人脸图像的图像特征以及该图像特征对应的第一遮挡指示信息的过程。
6011、基于该图像处理模型,对该第一样本人脸图像进行预处理。
该图像处理模型能够对第一样本人脸图像进行预处理,将与人脸识别无关的信息去除,或者,将一些缺失的信息或有误的信息进行修补或修正。
在一些实施例中,该预处理过程包括人脸检测和对齐过程。
在一些实施例中,电子设备基于该图像处理模型,对该第一样本人脸图像进行人脸检测,基于人脸检测结果,对该第一样本人脸图像进行裁剪,得到预处理后的第一样本人脸图像。
可以理解地,第一样本人脸图像中的背景内容对人脸识别几乎没有影响,人脸识别所需的是人脸区域的图像特征。通过人脸检测,电子设备能够确定出人脸关键点的位置,通过人脸关键点的位置,可以将人脸区域裁剪出来作为预处理后的第一样本人脸图像。这样特征提取时的第一样本人脸图像去除了冗余信息,减小了提取特征时的计算量,且提取到的图像特征中人脸区域的图像特征被突出,进而通过这样的图像特征进行人脸识别,能够有效提高识别准确率。
在一些实施例中,可以提供人脸图像模板,该人脸图像模板中标识有各人脸部位的位置。对于人脸检测,电子设备能够检测出人脸图像中左眼、右眼、鼻子、左嘴角和右嘴角的坐标位置,然后根据五个关键点的坐标位置与人脸图像模板中人脸部位的位置之间的映射关系,对第一样本人脸图像进行裁剪(crop),得到预处理后的第一样本人脸图像。该裁剪过程可以理解为,将第一样本人脸图像中的人脸通过仿射变换对齐到统一的模板位置,并裁剪成固定大小。
该预处理过程可以通过人脸预处理的相关算法实现,例如,该预处理过程可以采用MTCNN(Multi-Task Convolutional Neural Network,多任务卷积神经网络)算法完成。
当然,该预处理过程还可以包括其他方式,例如,电子设备还可以对第一样本人脸图像进行异常值处理、灰度级变换等,本申请实施例对此不作限定。
6012、对预处理后的第一样本人脸图像进行特征提取。
电子设备对第一样本人脸图像进行预处理后,将预处理后的第一样本人脸图像中的图像像素信息转换为图像特征,该图像特征用于表示图像像素信息,图像相邻像素之间的关系等。
在一些实施例中,该特征提取过程可以通过卷积神经网络(Convolutional Neural Networks,CNN)实现,电子设备可以将预处理后的第一样本人脸图像输入卷积神经网络中,通过卷积处理,得到该第一整体图像特征。在一些实施例中,该卷积神经网络也能够进行上述预处理过程。在另一些实施例中,上述预处理过程通过另一卷积神经网络进行,本申请实施例对此不作限定。
在一些实施例中,该图像特征表达为(C,H,W),其中,C为通道,H为高,W为宽。响应于该图像特征是通过卷积神经网络得到的,可以称该图像特征为卷积特征。其中,该卷积神经网络可以包括多个卷积层,对于预处理后的第一样本人脸图像,能够通过多层卷积操作,得到表达能力十分强大的卷积特征(也即是图像特征)。C与该卷积神经网络的最后一 个卷积层的输出通道数一致。在一些实施例中,该卷积神经网络可以采用任一种能够进行准确特征提取的框架,例如,可以采用LResnet50E-IR框架,当然,也可以采用其他框架,比如GoogLeNet框架。本申请实施例对该卷积神经网络的框架不作限定。
在一些实施例中,以预处理和特征提取采用该卷积神经网络实现为例,电子设备可以先基于未遮挡人脸图像对该卷积神经网络进行预训练,预训练后,再基于步骤600获取到的第一样本人脸图像对图像处理模型的模型参数进行微调。在一些实施例中,该步骤601之前,电子设备可以基于第二样本人脸图像对该卷积神经网络进行训练,该第二样本人脸图像中人脸未被遮挡。
通过干净的人脸图像对该卷积神经网络进行预训练,这样该卷积神经网络具备对未遮挡人脸图像进行处理的先验知识,后续再针对未遮挡和带遮挡的人脸图像微调图像处理模型的模型参数,该图像处理模型对图像处理的效果会更好。
在一些实施例中,该图像处理模型的结构可以如图9所示,该图像处理模型包括卷积神经网络901、解码器902、识别网络903和遮挡模式预测网络904。该卷积神经网络901用于执行该步骤601。该解码器902用于执行下述步骤602,也即是第一遮挡指示信息的获取步骤。该识别网络903用于执行下述步骤603,也即是基于步骤601得到的图像特征与步骤602得到的第一遮挡指示信息进行人脸识别,得到该第一样本人脸图像的预测识别结果。该遮挡模式预测网络904用于执行下述步骤605,也即是基于步骤602得到的第一遮挡指示信息,对该第一样本人脸图像的遮挡模式进行分类,得到预测遮挡模式。
602、电子设备基于该第一样本人脸图像的图像特征,确定对应的第一遮挡指示信息,该第一遮挡指示信息用于指示该第一样本人脸图像的人脸遮挡区域的图像特征。
电子设备获取到第一整体图像特征后,该第一整体图像特征中有些图像特征受人脸遮挡区域影响可能会对人脸识别造成干扰,因而,电子设备需要分析哪些图像特征受到了人脸遮挡区域影响,从而执行下述步骤603,将这部分图像特征的影响去除,提高人脸识别的准确率。
在一些实施例中,第一遮挡指示信息可以为特征向量的形式,特征向量中的每一位元素的值,用于指示每个图像特征元素是否受人脸遮挡区域的影响。例如,该每一位元素的值,用于表示对应的图像特征元素受人脸遮挡区域影响的概率。在一些实施例中,该第一遮挡指示信息可以采用掩码的形式,该第一遮挡指示信息可以称为特征掩码。
在一些实施例中,确定该第一遮挡指示信息可以为一个分类过程,对图像特征进行进一步处理,再对处理后的图像特征进行分类,得到第一遮挡指示信息。在一些实施例中,电子设备对该第一整体图像特征进行卷积处理,对卷积处理后的图像特征进行分类,确定该第一整体图像特征对应的第一遮挡指示信息。
在一些实施例中,该第一遮挡指示信息的确定过程通过解码器实现,在该第一遮挡指示信息采用掩码的形式时,该解码器还可以称为掩码解码器,该掩码解码器用于将图像特征(也可以称为卷积特征)映射为对应的特征掩码。
在一些实施例中,该图像处理模型1000的结构可以如图10所示,其中,解码器(Decoder)1001包括Conv(Convolution,卷积)层、PRelu(Parametric Rectified Linear Unit,线性整流函数)层、BN(Batch Normalization,批标准化)层和Sigmoid(S型生长曲线)层。该解码器1001能够先对图像特征进行卷积处理,然后对卷积结果进行线性整流处理,再进行批量标准化处理,通过Sigmoid层,预测每个图像特征保留(也即是不被去除,不受人脸遮挡区域影响)的概率,得到第一遮挡指示信息(也即是特征掩码)。可以理解地,通过Sigmoid层能够将图像特征映射到[0,1]之间。每个图像特征保留的概率与每个图像特征受人脸遮挡区域影响的概率负相关。通过Sigmoid层预测的过程,实质为预测每个图像特征受人脸遮挡区域影响的概率,其中,每个图像特征受人脸遮挡区域影响的概率越大,其保留的概率越小,其在第一遮挡指示信息中对应位的数值越小,越接近于0。相反地,每个图像特征受人脸遮挡区域 影响的概率越小,其保留的概率越大,其在第一遮挡指示信息中对应位的数值越大,越接近于1。该解码器1001的具体结构可以如图11所示。
该解码器1001从前面卷积网络生成的特征X 1中解码出相应的特征掩码M 1。M 1的功能就是找到X 1中被污染的特征元素,通过两者相乘将这些元素去除得到干净的特征X′ 1,该特征用于后续的识别任务。
该步骤601和步骤602是基于图像处理模型,获取第一样本人脸图像的图像特征以及该图像特征对应的第一遮挡指示信息的过程,上述过程中对特征提取和确定第一遮挡指示信息的方式进行了说明,在一些实施例中,该第一遮挡指示信息也可以不基于该图像特征确定,而是直接对第一样本人脸图像进行处理,确定对应的第一遮挡指示信息。本申请实施例对此不作限定。
603、电子设备基于该第一样本人脸图像的图像特征和该第一遮挡指示信息进行人脸识别,得到该第一样本人脸图像的预测识别结果。
电子设备在确定出第一遮挡指示信息后,即获知该第一整体图像特征中哪些图像特征受人脸遮挡区域影响,从而可以将这部分图像特征去除后再进行人脸识别,这样识别结果不受人脸遮挡区域的影响,则会更加准确。
在一些实施例中,步骤603通过6031和6032实现。
6031、基于该第一遮挡指示信息,去除该第一整体图像特征中人脸遮挡区域的图像特征,得到第一目标图像特征。
该第一遮挡指示信息已经指示受人脸遮挡区域影响的图像特征,通过第一遮挡指示信息能够对该第一整体图像特征进行处理,将受影响的图像特征去除,这样可以去除人脸遮挡影响,从而进行准确的人脸识别过程。
在一些实施例中,该去除过程可以为:电子设备将该第一整体图像特征与该第一遮挡指示信息相乘,得到该第一目标图像特征。该第一遮挡指示信息可以采用矩阵或向量的形式,如果某个图像特征受影响大,则该第一遮挡指示信息中该图像特征对应位的数值比较小,在相乘后该图像特征对应的数值则被变小,这样第一目标图像特征中受人脸遮挡区域影响的图像特征即被削弱,几乎无法体现,达到了去除效果。
6032、根据该第一目标图像特征,对该第一样本人脸图像中的人脸进行识别,得到预测识别结果。
电子设备在得到第一目标图像特征后,即可进行人脸识别,确定第一样本人脸图像的识别结果。在一些实施例中,该人脸识别过程可以为分类过程,通过分类确定该人脸的身份,或者,通过分类确定人脸属性或者人脸类型等。
在一些实施例中,该分类过程为获取第一目标图像特征与候选人脸图像特征之间的匹配度,将匹配度最大的候选人脸图像特征对应的识别结果确定为该预测识别结果。
例如,通过全连接层提取特征向量后,计算测试人脸的特征向量f p(也即是第一目标图像特征)与数据库中各人脸特征向量
Figure PCTCN2021112829-appb-000001
的余弦相似度:
Figure PCTCN2021112829-appb-000002
一般来说,人脸识别有两个场景,一个是人脸鉴别场景,一个是人脸认证场景。
根据不同的识别场景,该识别过程可以不同。对于人脸鉴别场景,需要识别出测试人脸属于数据库中哪个人脸类别。本方案采用最近邻分类器,即数据库中与测试人脸相似度最高的人脸的类别,即为该测试人脸所属类别。也可以采用其他分类器,例如支持向量机(Support Vector Machines,SVM)等。
对于人脸认证场景,需要识别出测试人脸与数据库中人脸或者说另一比对人脸是否属于同一类。本方案采用阈值判断,即二者的相似度高于某阈值时认为是同一个人,反之认为不是同一个人。也可以根据特征向量专门学习一个用于人脸认证的分类器。
上述步骤603为获取第一样本人脸图像的预测识别结果的过程,除此之外,该图像处理 模型还可以通过其他处理方式获取该预测识别结果,例如,基于提取到的图像特征直接进行人脸识别,本申请实施例对此不作具体限定。
604、电子设备基于该预测识别结果和该第一样本人脸图像对应的目标识别结果,获取识别误差。
电子设备确定出预测识别结果后,可以与目标识别结果对比,确定二者之间的差距,该差距即为该识别误差。
在一些实施例中,该识别误差可以通过损失函数获取,该损失函数可以为任一种损失函数,例如,CosFace分类损失函数、交叉熵损失函数、L1、L2等距离回归损失函数或指数损失函数等。在一些实施例中,该识别误差可以通过CosFace分类损失函数获取得到。本申请实施例对识别误差的获取方式不作具体限定。
605、电子设备基于该第一样本人脸图像的第一遮挡指示信息,获取该第一样本人脸图像的第一遮挡区域信息。
上述步骤602之后,电子设备还可以根据该第一遮挡指示信息,预测第一样本人脸图像的遮挡模式,在预测遮挡模式时,需要通过遮挡区域信息匹配实现。第一遮挡区域信息用于指示第一样本人脸图像中每个区域是否被遮挡。换一种表达方式,第一遮挡区域信息用于指示第一样本人脸图像中多个图像块的被遮挡情况。
606、电子设备将第一遮挡区域信息与至少两个候选遮挡模式对应的遮挡区域信息进行匹配,得到至少两个匹配度;根据该至少两个匹配度,在至少两个候选遮挡模式中确定预测遮挡模式。
在一些实施例中,步骤606可以将匹配度最大的候选遮挡模式,确定为该第一样本人脸图像的预测遮挡模式。
通过对比该第一样本人脸图像的遮挡区域信息与候选遮挡模式的遮挡区域信息,确定出该第一遮挡指示信息更符合哪一种候选遮挡模式,即可将其作为该预测遮挡模式。
可以设置至少两个候选遮挡模式,也即是有多个候选遮挡模式,每个候选遮挡模式对应有遮挡区域信息。该遮挡区域信息可以在划分遮挡模式时建立。例如,针对遮挡区域信息,如果一个区域被遮挡,则该区域对应位的值可以设置为0;如果该区域没有被遮挡,则该区域对应位的值可以设置为1。如图7中的(a)所示,未遮挡人脸图像701的遮挡区域信息中每一位元素的值均可以为1,在此以黑色表示。如图7中的(b)所示,遮挡人脸图像702的遮挡区域信息中属于遮挡区域的位置上数值为1,在此以黑色表示,属于未遮挡区域的位置上数值为0,在此以白色表示。
在一些实施例中,该至少两个候选遮挡模式可以存储于遮挡模式库中,在匹配时,电子设备能够将本次需要匹配的遮挡区域信息与该遮挡模式库中的数据进行匹配。
在一些实施例中,该遮挡模式的确定过程可以通过遮挡模式预测器(Occlusion Pattern Predictor)实现,也即是上述图9所示的遮挡模式预测网络或图10所示的遮挡模式预测器1002,该遮挡模式预测器1002也即是遮挡模式预测网络。在一些实施例中,该遮挡模式预测网络可以采用“BN-FC-BN”的顺序结构,也即是,该遮挡模式预测网络能够先对第一遮挡指示信息进行标准化处理,再对标准化处理后的信息进行卷积处理,卷积处理后再进行标准化,得到预测遮挡模式。该遮挡模式预测网络输出的数据的维度与遮挡模式的数量相同。也即是,该预测遮挡模式能够采用多维向量的形式。维度与遮挡模式的数量相同。例如,遮挡模式的数量为101,则该预测遮挡模式可以采用101维度的向量表示。向量中每一位元素的数值用于表征该第一样本人脸图像的遮挡模式为该元素对应的候选遮挡模式的概率。
该步骤606为基于该第一样本人脸图像的第一遮挡指示信息,对该第一样本人脸图像的遮挡模式进行分类,得到预测遮挡模式的过程,上述过程中通过将第一遮挡指示信息转换为第一遮挡区域信息,进而通过遮挡区域信息匹配,来确定预测遮挡模式。在一些实施例中,电子设备中也可以设置有至少两个候选遮挡模式的遮挡指示信息,直接将该第一遮挡指示信 息与候选遮挡模式的遮挡指示信息进行匹配,本申请实施例对采用哪种方式不作具体限定。
607、电子设备基于该预测遮挡模式和该第一样本人脸图像对应的目标遮挡模式,获取分类误差。
该分类误差用于衡量预测遮挡模式和该目标遮挡模式之间的差距,该分类误差的获取过程与上述步骤605同理,可以通过损失函数获取。
在一些实施例中,该分类误差L pred可以采用交叉熵损失函数确定。例如,通过下述公式(1)获取分类误差:
Figure PCTCN2021112829-appb-000003
其中,N是参与训练的第一样本人脸图像的总数量,C是遮挡模式的总数量,p i是第一样本人脸图像x i被正确分类的概率,f i是第一样本人脸图像x i相应的特征向量。i和j是标识,i和j的取值均为正整数。
608、电子设备根据该识别误差和该分类误差,对该图像处理模型的模型参数进行更新。
电子设备获取到两种误差后,可以综合两种误差对模型参数进行更新,能够既考虑到该图像处理模型人脸识别的鲁棒性和准确性,也考虑到该图像处理模型确定第一遮挡指示信息的鲁棒性和准确性,这样训练出来的模型在两方面的性能均能有所提升。
结合该两种误差的更新过程可以包括两种方式,本申请实施例能够采用任一种方式实现更新步骤。下面提供两种可选方式。
方式一、电子设备获取该分类误差和该分类误差的权重的乘积,将该乘积与该识别误差之和作为目标误差,基于该目标误差,对该图像处理模型的模型参数进行更新。
在方式一中,可以为分类误差设置权重,该分类误差的权重可以由相关技术人员根据需求进行设置,该分类误差的权重可以为该模型的超参数,还可以为之前训练模型得到的经验值,例如,该权重可以设置为1,在另一些实施例中,该权重还可以在本次模型训练中与模型参数一起进行更新得到,本申请实施例对此不作限定。
例如,该目标误差L total的获取过程通过下述公式(2)实现:
L total=L cls+w*L pred  (2)
其中,L cls是人脸识别的损失函数(比如采用的是CosFace分类损失函数);而L pred是公式(1)所定义的预测遮挡模式的损失函数。w是一个权重系数,该权重系数用于平衡两种损失函数在训练过程中重要性。通过交叉验证发现,w的取值为1.0时识别效果最好。
方式二、基于该分类误差和该识别误差各自的权重,对该分类误差和该识别误差进行加权求和,得到目标误差,基于该目标误差,对该图像处理模型的模型参数进行更新。
在方式二中,每种误差均设置有权重,该权重的设置与方式一中同理,在此不多做赘述。
通过上述方法训练得到图像处理模型后,该图像处理模型能够提供图像处理功能。在一些实施例中,电子设备响应于图像处理指令,基于该图像处理模型,对待识别的目标人脸图像进行特征提取,基于提取到的第二整体图像特征与该第二整体图像特征对应的第二遮挡指示信息进行人脸识别,得到该目标人脸图像的图像识别结果。在一些实施例中,该图像处理模型处理图像的具体流程可以参见下述图14所示的实施例。
下面提供一个具体示例,模型使用过程可以如图12所示,电子设备能够执行输入待识别图片的步骤1201,然后基于人脸预处理模块进行人脸检测和对齐,即步骤1202,通过上述步骤1202能够得到预处理后的人脸图像,电子设备可以继续基于深度卷积网络(CNN)提取卷积特征(也即是图像特征),即步骤1203。提取到卷积特征后,基于该卷积特征可以进行两个步骤,在步骤1204中,电子设备能够基于掩码解码器(Mask Decoder)生成相应掩码,再在步骤1205中,基于生成的掩码和卷积特征,通过相乘运算,去除被污染的特征元素。在步骤1206中,电子设备再基于全连接网络(FC)来得到最终人脸特征用于识别,最终在步骤 1207中,电子设备能够输出测试人脸类别或是否属于同一类。对于模型训练过程还可以如图13所示,训练过程可以包括两个步骤,步骤1中,通过普通的人脸数据训练深度卷积网络,步骤2中,在第一步训练好模型的基础上,再次利用混合的人脸数据微调整个网络参数。
本申请实施例引入了遮挡模式,通过人脸识别过程中产生的第一遮挡指示信息,确定该第一样本人脸图像的预测遮挡模式,并与该第一样本人脸图像所标注的目标遮挡模式做对比,以此能够训练图像处理模型输出更准确的第一遮挡指示信息,进而基于准确的第一遮挡指示信息进行人脸识别,得到的识别结果也就更准确,换言之,该图像处理模型能够更准确地处理存在遮挡的人脸图像,也即是该图像处理模型的鲁棒性更好。另一方面,该图像处理模型能够直接对第一样本人脸图像进行特征提取,再基于提取到的图像特征以及对应的第一遮挡指示信息进行人脸识别,无需借助外部网络,能够端到端地进行图像处理,因此显著地减少了计算量,提升了设备的运行速度,也能够减少模型的个数,且由于该图像处理模型处理图像的准确性不受外部网络因素影响,因此准确性得到了显著提升。
上述图3和图6所示实施例对图像处理模型的训练过程进行了说明,该图像处理模型在训练完成后,能够用于图像处理,图像处理流程可以如下述图14所示。图14是本申请实施例提供的一种图像处理方法的流程图,参见图14,该方法包括:
1401、电子设备对待识别的目标人脸图像进行特征提取,得到该目标人脸图像的第二整体图像特征。
该步骤1401与上述步骤601中获取第一整体图像特征的过程同理,在此不多做赘述。
在一些实施例中,电子设备首先对该目标人脸图像进行预处理,再对预处理后的目标人脸图像进行特征提取,得到该目标人脸图像的第二整体图像特征。
在一些实施例中,该预处理过程可以为:电子设备对该目标人脸图像进行人脸检测,基于人脸检测结果,对该目标人脸图像进行裁剪,得到预处理后的目标人脸图像。
1402、电子设备确定该第二整体图像特征对应的第二遮挡指示信息,该第二遮挡指示信息用于指示目标人脸图像的人脸遮挡区域的图像特征。
该步骤1402与上述步骤602同理,在此不多做赘述。
在一些实施例中,该第二遮挡指示信息的确定过程可以为:电子设备对该第二整体图像特征进行卷积处理,对卷积处理后的图像特征进行分类,得到该第二遮挡指示信息。
1403、电子设备根据该目标人脸图像的第二整体图像特征和该第二遮挡指示信息,获取第二目标图像特征。
1404、电子设备基于该第二目标图像特征,对该目标人脸图像中的人脸进行识别。
该步骤1403和步骤1404与上述步骤603中的6011和6012同理,在此不多做赘述。
在一些实施例中,上述步骤1403可以为:电子设备基于该第二遮挡指示信息,去除该第二整体图像特征中该人脸遮挡区域的图像特征,得到第二目标图像特征。
在一些实施例中,去除的方式可以通过相乘实现,电子设备将该第二整体图像特征与该第二遮挡指示信息相乘,得到该第二目标图像特征。
在一些实施例中,上述图像处理方法可以通过图像处理模型实现,电子设备可以将该目标人脸图像输入图像处理模型中,由该图像处理模型执行该特征提取、确定第二遮挡指示信息、获取第二目标图像特征以及人脸识别过程,输出识别结果。
上述所有可选技术方案,能够采用任意结合形成本申请的可选实施例,在此不再一一赘述。
图15是本申请实施例提供的一种图像处理模型训练装置的结构示意图,参见图15,该装置包括:
第一获取模块1501,用于基于图像处理模型,获取第一样本人脸图像的预测识别结果和第一遮挡指示信息,所述第一遮挡指示信息用于指示所述第一样本人脸图像的人脸遮挡区域 的图像特征;
第二获取模块1502,用于基于所述预测识别结果和所述第一样本人脸图像对应的目标识别结果,获取识别误差;
第三获取模块1503,用于基于所述第一遮挡指示信息和所述第一样本人脸图像对应的目标遮挡模式,获取分类误差,其中,所述第一样本人脸图像的遮挡模式用于指示所述人脸遮挡区域的位置以及尺寸;
更新模块1504,用于根据该识别误差和该分类误差,对该图像处理模型的模型参数进行更新。
在一些实施例中,该第三获取模块1503用于:
基于所述第一遮挡指示信息,确定所述第一样本人脸图像的预测遮挡模式;
基于所述预测遮挡模式和所述目标遮挡模式,获取所述分类误差。
在一些实施例中,该第三获取模块1503用于:
基于所述第一遮挡指示信息,获取第一遮挡区域信息,所述第一遮挡区域信息用于指示所述第一样本人脸图像中多个图像块的被遮挡情况;
将所述第一遮挡区域信息与至少两个候选遮挡模式对应的遮挡区域信息进行匹配,得到至少两个匹配度;
根据所述至少两个匹配度,在所述至少两个候选遮挡模式中确定所述预测遮挡模式。
在一些实施例中,该更新模块1504用于执行下述任一项:
获取所述分类误差和所述分类误差的权重的乘积,将所述乘积与所述识别误差之和作为目标误差,基于所述目标误差,对所述图像处理模型的模型参数进行更新;
基于所述分类误差和所述识别误差各自的权重,对所述分类误差和所述识别误差进行加权求和,得到目标误差,基于所述目标误差,对所述图像处理模型的模型参数进行更新。
在一些实施例中,该第一获取模块1501包括:第一获取单元和识别单元;
该第一获取单元,用于基于所述图像处理模型,获取所述第一样本人脸图像的第一整体图像特征以及所述第一整体图像特征对应的第一遮挡指示信息;
该识别单元,用于基于所述第一整体图像特征和所述第一遮挡指示信息进行人脸识别,得到所述第一样本人脸图像的预测识别结果。
在一些实施例中,该第一获取单元包括:特征提取子单元和确定子单元;
该特征提取子单元,用于基于所述图像处理模型,对第一样本人脸图像进行特征提取,得到所述第一整体图像特征;
该确定子单元,用于确定所述第一整体图像特征对应的第一遮挡指示信息。
在一些实施例中,该特征提取子单元用于:
对所述第一整体图像特征进行卷积;
对卷积后的图像特征进行分类,得到所述第一遮挡指示信息。
在一些实施例中,该识别单元包括:去除子单元和识别子单元;
该去除子单元,用于基于所述第一遮挡指示信息,去除所述第一整体图像特征中所述人脸遮挡区域的图像特征,得到第一目标图像特征;
该识别子单元,用于根据所述第一目标图像特征,对所述第一样本人脸图像进行人脸识别,得到所述预测识别结果。
在一些实施例中,该去除子单元,用于将所述第一整体图像特征与所述第一遮挡指示信息进行相乘,得到所述第一目标图像特征。
在一些实施例中,该第一获取单元还包括预处理子单元;
该预处理子单元,用于基于所述图像处理模型,对所述第一样本人脸图像进行预处理;
该特征提取子单元和该确定子单元,用于基于预处理后的第一样本人脸图像,获取所述第一整体图像特征以及所述第一遮挡指示信息。
在一些实施例中,该预处理子单元用于:
基于所述图像处理模型,对所述第一样本人脸图像进行人脸检测;
基于人脸检测结果,对所述第一样本人脸图像进行裁剪,得到预处理后的第一样本人脸图像像。
在一些实施例中,该图像处理模型包括卷积神经网络、解码器、识别网络和遮挡模式预测网络;
所述卷积神经网络用于执行所述预处理和所述第一整体图像特征的获取步骤;
所述解码器用于执行所述第一遮挡指示信息的获取步骤;
所述识别网络用于基于所述第一整体图像特征和所述第一遮挡指示信息进行人脸识别,得到所述第一样本人脸图像的预测识别结果;
所述遮挡模式预测网络用于基于所述第一遮挡指示信息,确定所述第一样本人脸图像的预测遮挡模式。
在一些实施例中,该装置还包括训练模块,该训练模块,用于基于第二样本人脸图像对所述卷积神经网络进行训练,所述第二样本人脸图像中人脸未被遮挡。
在一些实施例中,该装置还包括识别模块,该识别模块,用于响应于图像处理指令,基于该图像处理模型,对待识别的目标人脸图像进行特征提取,基于提取到的第二整体图像特征与该第二整体图像特征对应的第二遮挡指示信息进行人脸识别,得到该目标人脸图像的图像识别结果。
本申请实施例引入了遮挡模式,通过人脸识别过程中产生的第一遮挡指示信息,确定该第一样本人脸图像的预测遮挡模式,并与该第一样本人脸图像对应的目标遮挡模式做对比,以此能够训练图像处理模型确定出更准确的第一遮挡指示信息,进而基于准确的第一遮挡指示信息进行人脸识别,得到的识别结果也就更准确,该图像处理模型能够更准确地处理存在遮挡的人脸图像,也即是该图像处理模型的鲁棒性更好。换言之,该图像处理模型能够直接对第一样本人脸图像进行处理得到识别结果,无需借助外部网络,能够端到端地进行图像处理,因此显著地减少了计算量,提升了设备的运行速度,也能够有效减少模型的个数,且由于该图像处理模型处理图像的准确性不受外部网络因素影响,因此准确性得到了显著提升。
需要说明的是:上述实施例提供的图像处理模型训练装置在图像处理模型训练时,仅以上述各功能模块的划分进行举例说明,实际应用中,能够根据需要而将上述功能分配由不同的功能模块完成,即将图像处理模型训练装置的内部结构划分成不同的功能模块,以完成以上描述的全部或者部分功能。另外,上述实施例提供的图像处理模型训练装置与图像处理模型训练方法实施例属于同一构思,其具体实现过程详见方法实施例,这里不再赘述。
上述方法实施例中的电子设备能够实现为终端。例如,图16是本申请实施例提供的一种终端的结构示意图。该终端1600可以是便携式移动终端,比如:智能手机、平板电脑、MP3(Moving Picture Experts Group Audio Layer III,动态影像专家压缩标准音频层面3)播放器、MP4(Moving Picture Experts Group Audio Layer IV,动态影像专家压缩标准音频层面4)播放器、笔记本电脑或台式电脑。终端1600还可能被称为用户设备、便携式终端、膝上型终端、台式终端等其他名称。
通常,终端1600包括有:处理器1601和存储器1602。
处理器1601可以包括一个或多个处理核心,比如4核心处理器、8核心处理器等。处理器1601可以采用DSP(Digital Signal Processing,数字信号处理)、FPGA(Field-Programmable Gate Array,现场可编程门阵列)、PLA(Programmable Logic Array,可编程逻辑阵列)中的至少一种硬件形式来实现。处理器1601也可以包括主处理器和协处理器,主处理器是用于对在唤醒状态下的数据进行处理的处理器,也称CPU(Central Processing Unit,中央处理器);协处理器是用于对在待机状态下的数据进行处理的低功耗处理器。在一些实施例中,处理器 1601可以集成有GPU(Graphics Processing Unit,图像处理器),GPU用于负责显示屏所需要显示的内容的渲染和绘制。一些实施例中,处理器1601还可以包括AI(Artificial Intelligence,人工智能)处理器,该AI处理器用于处理有关机器学习的计算操作。
存储器1602可以包括一个或多个计算机可读存储介质,该计算机可读存储介质可以是非暂态的。存储器1602还可包括高速随机存取存储器,以及非易失性存储器,比如一个或多个磁盘存储设备、闪存存储设备。在一些实施例中,存储器1602中的非暂态的计算机可读存储介质用于存储至少一个指令,该至少一个指令用于被处理器1601所执行以实现本申请中方法实施例提供的图像处理模型训练方法或图像处理方法。
本领域技术人员可以理解,图16中示出的结构并不构成对终端1600的限定,可以包括比图示更多或更少的组件,或者组合某些组件,或者采用不同的组件布置。
上述方法实施例中的电子设备能够实现为服务器。例如,图17是本申请实施例提供的一种服务器的结构示意图,该服务器1700可因配置或性能不同而产生比较大的差异,能够包括一个或一个以上处理器(Central Processing Units,CPU)1701和一个或一个以上的存储器1702,其中,该存储器1702中存储有至少一条程序代码,该至少一条程序代码由该处理器1701加载并执行以实现上述各个方法实施例提供的图像处理模型训练方法或图像处理方法。当然,该服务器还能够具有有线或无线网络接口以及输入输出接口等部件,以便进行输入输出,该服务器还能够包括其他用于实现设备功能的部件,在此不做赘述。
在一些实施例中,还提供了一种计算机可读存储介质,例如包括至少一条程序代码的存储器,上述至少一条程序代码由可由处理器执行以完成上述实施例中的图像处理模型训练方法或图像处理方法。例如,计算机可读存储介质能够是只读存储器(Read-Only Memory,简称:ROM)、随机存取存储器(Random Access Memory,简称:RAM)、只读光盘(Compact Disc Read-Only Memory,简称:CD-ROM)、磁带、软盘和光数据存储设备等。
在一些实施例中,还提供一种计算机程序产品或计算机程序,该计算机程序产品或该计算机程序包括一条或多条程序代码,该一条或多条程序代码存储在计算机可读存储介质中。电子设备的一个或多个处理器能够从计算机可读存储介质中读取该一条或多条程序代码,该一个或多个处理器执行该一条或多条程序代码,使得电子设备能够执行上述图像处理模型训练方法或图像处理方法。

Claims (20)

  1. 一种图像处理模型训练方法,由电子设备执行,所述方法包括:
    基于图像处理模型,获取第一样本人脸图像的预测识别结果和第一遮挡指示信息,所述第一遮挡指示信息用于指示所述第一样本人脸图像的人脸遮挡区域的图像特征;
    基于所述预测识别结果和所述第一样本人脸图像对应的目标识别结果,获取识别误差;
    基于所述第一遮挡指示信息和所述第一样本人脸图像对应的目标遮挡模式,获取分类误差,其中,所述第一样本人脸图像的遮挡模式用于指示所述人脸遮挡区域的位置以及尺寸;
    根据所述识别误差和所述分类误差,对所述图像处理模型的模型参数进行更新。
  2. 根据权利要求1所述的方法,其中,所述基于所述第一遮挡指示信息和所述第一样本人脸图像对应的目标遮挡模式,获取分类误差,包括:
    基于所述第一遮挡指示信息,确定所述第一样本人脸图像的预测遮挡模式;
    基于所述预测遮挡模式和所述目标遮挡模式,获取所述分类误差。
  3. 根据权利要求2所述的方法,其中,所述基于所述第一遮挡指示信息,确定所述第一样本人脸图像的预测遮挡模式,包括:
    基于所述第一遮挡指示信息,获取第一遮挡区域信息,所述第一遮挡区域信息用于指示所述第一样本人脸图像中多个图像块的被遮挡情况;
    将所述第一遮挡区域信息与至少两个候选遮挡模式对应的遮挡区域信息进行匹配,得到至少两个匹配度;
    根据所述至少两个匹配度,在所述至少两个候选遮挡模式中确定所述预测遮挡模式。
  4. 根据权利要求1所述的方法,其中,所述根据所述识别误差和所述分类误差,对所述图像处理模型的模型参数进行更新,包括下述任一项:
    获取所述分类误差和所述分类误差的权重的乘积,将所述乘积与所述识别误差之和作为目标误差,基于所述目标误差,对所述图像处理模型的模型参数进行更新;
    基于所述分类误差和所述识别误差各自的权重,对所述分类误差和所述识别误差进行加权求和,得到目标误差,基于所述目标误差,对所述图像处理模型的模型参数进行更新。
  5. 根据权利要求1所述的方法,其中,所述基于图像处理模型,获取第一样本人脸图像的预测识别结果和第一遮挡指示信息,包括:
    基于所述图像处理模型,获取所述第一样本人脸图像的第一整体图像特征以及所述第一整体图像特征对应的第一遮挡指示信息;
    基于所述第一整体图像特征和所述第一遮挡指示信息进行人脸识别,得到所述第一样本人脸图像的预测识别结果。
  6. 根据权利要求5所述的方法,其中,所述基于所述图像处理模型,获取第一样本人脸图像的第一整体图像特征以及所述第一整体图像特征对应的第一遮挡指示信息,包括:
    基于所述图像处理模型,对第一样本人脸图像进行特征提取,得到所述第一整体图像特征;
    确定所述第一整体图像特征对应的第一遮挡指示信息。
  7. 根据权利要求6所述的方法,其中,所述确定所述第一整体图像特征对应的第一遮挡 指示信息,包括:
    对所述第一整体图像特征进行卷积;
    对卷积后的图像特征进行分类,得到所述第一遮挡指示信息。
  8. 根据权利要求5所述的方法,其中,所述基于所述第一整体图像特征和所述第一遮挡指示信息进行人脸识别,得到所述第一样本人脸图像的预测识别结果,包括:
    基于所述第一遮挡指示信息,去除所述第一整体图像特征中所述人脸遮挡区域的图像特征,得到第一目标图像特征;
    根据所述第一目标图像特征,对所述第一样本人脸图像进行人脸识别,得到所述预测识别结果。
  9. 根据权利要求5所述的方法,其中,所述基于所述图像处理模型,获取所述第一样本人脸图像的第一整体图像特征以及所述第一整体图像特征对应的第一遮挡指示信息,包括:
    基于所述图像处理模型,对所述第一样本人脸图像进行预处理;
    基于预处理后的第一样本人脸图像,获取所述第一整体图像特征以及所述第一遮挡指示信息。
  10. 根据权利要求9所述的方法,其中,所述基于所述图像处理模型,对所述第一样本人脸图像进行预处理,包括:
    基于所述图像处理模型,对所述第一样本人脸图像进行人脸检测;
    基于人脸检测结果,对所述第一样本人脸图像进行裁剪,得到预处理后的第一样本人脸图像。
  11. 根据权利要求9所述的方法,其中,所述图像处理模型包括卷积神经网络、解码器、识别网络和遮挡模式预测网络;
    所述卷积神经网络用于执行所述预处理和所述第一整体图像特征的获取步骤;
    所述解码器用于执行所述第一遮挡指示信息的获取步骤;
    所述识别网络用于基于所述第一整体图像特征和所述第一遮挡指示信息进行人脸识别,得到所述第一样本人脸图像的预测识别结果;
    所述遮挡模式预测网络用于基于所述第一遮挡指示信息,确定所述第一样本人脸图像的预测遮挡模式。
  12. 根据权利要求11所述的方法,其中,所述方法还包括:
    基于第二样本人脸图像对所述卷积神经网络进行训练,所述第二样本人脸图像中人脸未被遮挡。
  13. 根据权利要求8所述的方法,其中,所述基于所述第一遮挡指示信息,去除所述第一整体图像特征中所述人脸遮挡区域的图像特征,得到第一目标图像特征,包括:
    将所述第一整体图像特征与所述第一遮挡指示信息进行相乘,得到所述第一目标图像特征。
  14. 一种图像处理方法,由电子设备执行,所述方法包括:
    响应于图像处理指令,对待识别的目标人脸图像进行特征提取,得到所述目标人脸图像的第二整体图像特征;
    确定所述第二整体图像特征对应的第二遮挡指示信息,所述第二遮挡指示信息用于指示所述目标人脸图像的人脸遮挡区域的图像特征;
    根据所述第二遮挡指示信息,去除所述第二整体图像特征中所述人脸遮挡区域的图像特征,得到第二目标图像特征;
    基于所述第二目标图像特征,对所述目标人脸图像进行人脸识别。
  15. 根据权利要求14所述的方法,其中,所述方法包括:
    将所述目标人脸图像输入图像处理模型中,由所述图像处理模型执行所述特征提取、确定第二遮挡指示信息、获取第二目标图像特征以及人脸识别过程,输出识别结果。
  16. 一种图像处理模型训练装置,其中,所述装置包括:
    第一获取模块,用于基于图像处理模型,获取第一样本人脸图像的预测识别结果和第一遮挡指示信息,所述第一遮挡指示信息用于指示所述第一样本人脸图像的人脸遮挡区域的图像特征;
    第二获取模块,用于基于所述预测识别结果和所述第一样本人脸图像对应的目标识别结果,获取识别误差;
    第三获取模块,用于基于所述第一遮挡指示信息和所述第一样本人脸图像对应的目标遮挡模式,获取分类误差,其中,所述第一样本人脸图像的遮挡模式用于指示所述人脸遮挡区域的位置以及尺寸;
    更新模块,用于根据所述识别误差和所述分类误差,对所述图像处理模型的模型参数进行更新。
  17. 一种电子设备,所述电子设备包括一个或多个处理器和一个或多个存储器,所述一个或多个存储器中存储有至少一条程序代码,所述至少一条程序代码由所述一个或多个处理器加载并执行以实现如权利要求1至13中任一项所述的图像处理模型训练方法。
  18. 一种电子设备,所述电子设备包括一个或多个处理器和一个或多个存储器,所述一个或多个存储器中存储有至少一条程序代码,所述至少一条程序代码由所述一个或多个处理器加载并执行以实现如权利要求14或15所述的图像处理方法。
  19. 一种计算机可读存储介质,所述存储介质中存储有至少一条程序代码,所述至少一条程序代码由处理器加载并执行以实现如权利要求1至13中任一项所述的图像处理模型训练方法。
  20. 一种计算机可读存储介质,所述存储介质中存储有至少一条程序代码,所述至少一条程序代码由处理器加载并执行以实现如权利要求14或15所述的图像处理方法。
PCT/CN2021/112829 2020-08-20 2021-08-16 图像处理模型训练方法、装置、设备及存储介质 WO2022037541A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP21857635.3A EP4099217A4 (en) 2020-08-20 2021-08-16 IMAGE PROCESSING MODEL TRAINING METHOD AND DEVICE, DEVICE AND STORAGE MEDIUM
US17/961,345 US20230033052A1 (en) 2020-08-20 2022-10-06 Method, apparatus, device, and storage medium for training image processing model

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010845864.9A CN111914812B (zh) 2020-08-20 2020-08-20 图像处理模型训练方法、装置、设备及存储介质
CN202010845864.9 2020-08-20

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/961,345 Continuation US20230033052A1 (en) 2020-08-20 2022-10-06 Method, apparatus, device, and storage medium for training image processing model

Publications (1)

Publication Number Publication Date
WO2022037541A1 true WO2022037541A1 (zh) 2022-02-24

Family

ID=73278486

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/112829 WO2022037541A1 (zh) 2020-08-20 2021-08-16 图像处理模型训练方法、装置、设备及存储介质

Country Status (4)

Country Link
US (1) US20230033052A1 (zh)
EP (1) EP4099217A4 (zh)
CN (1) CN111914812B (zh)
WO (1) WO2022037541A1 (zh)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111914812B * 2020-08-20 2022-09-16 腾讯科技(深圳)有限公司 Image processing model training method, apparatus, device, and storage medium
CN112434807B * 2020-11-24 2023-04-07 上海鹰瞳医疗科技有限公司 Deep learning model performance verification method and device based on fundus images
CN112633183B * 2020-12-25 2023-11-14 平安银行股份有限公司 Automatic image occlusion region detection method and apparatus, and storage medium
CN113486785A * 2021-07-01 2021-10-08 深圳市英威诺科技有限公司 Deep-learning-based video face swapping method, apparatus, device, and storage medium
CN113657462A * 2021-07-28 2021-11-16 讯飞智元信息科技有限公司 Method for training a vehicle recognition model, vehicle recognition method, and computing device
CN115249281B * 2022-01-29 2023-11-24 北京百度网讯科技有限公司 Image occlusion and model training method, apparatus, device, and storage medium
CN117372705A * 2022-06-28 2024-01-09 脸萌有限公司 Model training method and apparatus, and electronic device
CN115810214B * 2023-02-06 2023-05-12 广州市森锐科技股份有限公司 AI face recognition-based verification management method, system, device, and storage medium

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107292287B * 2017-07-14 2018-09-21 深圳云天励飞技术有限公司 Face recognition method and apparatus, electronic device, and storage medium
CN108038474B * 2017-12-28 2020-04-14 深圳励飞科技有限公司 Face detection method, method for training convolutional neural network parameters, apparatus, and medium
CN107909065B * 2017-12-29 2020-06-16 百度在线网络技术(北京)有限公司 Method and apparatus for detecting face occlusion
CN109063604A * 2018-07-16 2018-12-21 阿里巴巴集团控股有限公司 Face recognition method and terminal device
CN111191616A * 2020-01-02 2020-05-22 广州织点智能科技有限公司 Face occlusion detection method, apparatus, device, and storage medium
CN111488811B * 2020-03-31 2023-08-22 长沙千视通智能科技有限公司 Face recognition method and apparatus, terminal device, and computer-readable medium

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160189006A1 (en) * 2014-12-31 2016-06-30 TCL Research America Inc. Robust error correction with multi-model representation for face recognition
CN105095856A * 2015-06-26 2015-11-25 上海交通大学 Mask-based recognition method for occluded faces
CN106570464A * 2016-10-31 2017-04-19 华南理工大学 Face recognition method and apparatus for rapidly handling face occlusion
CN110322416A * 2019-07-09 2019-10-11 腾讯科技(深圳)有限公司 Image data processing method and apparatus, and computer-readable storage medium
CN111914812A * 2020-08-20 2020-11-10 腾讯科技(深圳)有限公司 Image processing model training method, apparatus, device, and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP4099217A4 *

Also Published As

Publication number Publication date
CN111914812B (zh) 2022-09-16
EP4099217A4 (en) 2023-06-21
US20230033052A1 (en) 2023-02-02
CN111914812A (zh) 2020-11-10
EP4099217A1 (en) 2022-12-07


Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 21857635; Country of ref document: EP; Kind code of ref document: A1)
ENP Entry into the national phase (Ref document number: 2021857635; Country of ref document: EP; Effective date: 20220901)
NENP Non-entry into the national phase (Ref country code: DE)