CN111597936A - Face data set labeling method, system, terminal and medium based on deep learning - Google Patents

Face data set labeling method, system, terminal and medium based on deep learning

Info

Publication number
CN111597936A
Authority
CN
China
Prior art keywords
face
data set
picture
labeled
face data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010374477.1A
Other languages
Chinese (zh)
Inventor
张攀 (Zhang Pan)
闵梁 (Min Liang)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Inveno Technology Co., Ltd.
Original Assignee
Shenzhen Inveno Technology Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Inveno Technology Co., Ltd.
Priority to CN202010374477.1A
Publication of CN111597936A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161 Detection; Localisation; Normalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Molecular Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Databases & Information Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a deep-learning-based method, system, terminal and medium for labeling face data sets. The method comprises the following steps: acquiring a face data set to be labeled, which comprises a plurality of pictures; constructing and training a face detection model, and storing the trained model; detecting faces in the pictures of the face data set to be labeled using the trained face detection model to obtain the face positions in each picture, thereby obtaining samples; and receiving, for each picture in the samples, a face name entered by the user at the face position, thereby completing the labeling of the pictures in the face data set to be labeled. The method greatly improves the labeling efficiency of face data sets and can provide a large number of accurate samples for training face detection or face recognition models.

Description

Face data set labeling method, system, terminal and medium based on deep learning
Technical Field
The invention belongs to the technical field of face labeling, and particularly relates to a deep-learning-based method, system, terminal and medium for labeling face data sets.
Background
Face recognition is a biometric technology that identifies a person based on facial feature information. In existing face recognition technology, a face data set is used to train a face recognition model, and the trained model is then used for face matching. The face data set requires that the location and identity of each face be marked in the face images.
Existing labeling of face data sets usually relies on manual annotation, which consumes a large amount of manpower, is inefficient, and cannot meet the demand for large labeled data sets in model training. To address these problems, some methods assist annotation with traditional machine learning or third-party services, for example:
the application No. 201410053879.6 discloses a method for labeling the identity of a human face image and a method for identifying the identity of a human face. The method comprises the steps of searching a webpage of an image to be marked by using a search engine, and determining the identity of the image according to the frequency of names appearing in the webpage and a third-party API. The specific flow is shown in fig. 1. However, this method is too much believed to be a search engine, and the results returned by the search engine may be web pages related to the sample to be labeled, and many times are not people in the sample to be labeled. Moreover, the method relies on a third-party API, and no self-existing face recognition scheme is given.
The face labeling method, apparatus and device of application No. 201310268319.8 use clustering to label the face data set. The specific steps are: 1. obtain the face distance between any two faces in a face database; 2. obtain the neighboring faces of the face to be clustered according to its face distance to other faces; 3. compute a composite shared-neighbor score between the face to be clustered and its neighboring faces; 4. cluster the faces according to the face distances and composite shared-neighbor scores to obtain a class containing the faces; 5. label the faces in the class that are not yet labeled. The specific flow is shown in fig. 2. First, clustering is poorly suited to data sets with too many categories: with hundreds of thousands or millions of faces, the clustering model takes a long time and is difficult to converge. It also requires the number of categories to be specified in advance, and if a single picture in the data set contains many faces, clustering is infeasible in this scenario.
Disclosure of Invention
Aiming at the defects in the prior art, the invention provides a deep-learning-based method, system, terminal and medium for labeling face data sets, which greatly improve labeling efficiency and can provide a large number of accurate samples for training face detection or face recognition models.
In a first aspect, a face data set labeling method based on deep learning includes the following steps:
acquiring a face data set to be labeled, which comprises a plurality of pictures;
constructing and training a face detection model, and storing the trained face detection model;
detecting faces in the pictures of the face data set to be labeled using the trained face detection model to obtain the face positions in each picture, thereby obtaining samples;
and receiving, for each picture in the samples, a face name entered by the user at the face position, thereby completing the labeling of the pictures in the face data set to be labeled.
Preferably, the acquiring of a face data set to be labeled comprising a plurality of pictures specifically includes:
receiving keywords input by a user for the person to be labeled;
searching a preset database according to the keywords to obtain information matching the person to be labeled;
extracting the pictures in the information;
constructing the face data set to be labeled from all the extracted pictures;
and storing the face data set to be labeled.
Preferably, the face detection model is an SSH model, and the backbone network of the SSH model is VGG16.
Preferably, the detecting of faces in the pictures of the face data set to be labeled using the trained face detection model to obtain the face positions in each picture specifically includes:
loading the trained face detection model and the face data set to be labeled;
detecting faces in the pictures of the face data set to be labeled using the trained face detection model to obtain bounding boxes identifying the face positions;
detecting whether the number of bounding boxes in a picture is 0, or whether it exceeds a preset upper limit;
if so, setting the labeling information of the picture to invalid;
if not, obtaining all bounding boxes in the picture and using the NMS algorithm to remove boxes whose IOU with another box exceeds a preset value, thereby obtaining the face positions in the picture.
Preferably, the receiving of a face name entered by the user at the face position for each picture in the samples, completing the labeling of the pictures in the face data set to be labeled, specifically includes:
obtaining the samples;
and receiving the user's selection, for the face position in each picture of the samples, of the corresponding person from a preset library of persons to be labeled, thereby completing the labeling of the pictures in the face data set to be labeled.
In a second aspect, a face data set labeling system based on deep learning includes:
a face data set to be labeled: comprising a plurality of pictures;
a face detection model;
an online prediction service: used for detecting faces in the pictures of the face data set to be labeled using the face detection model to obtain the face positions in each picture, thereby obtaining samples;
an annotation tool: used for receiving, for each picture in the samples, a face name entered by the user at the face position, completing the labeling of the pictures in the face data set to be labeled.
Preferably, the online prediction service is specifically configured to:
load the face detection model and the face data set to be labeled;
detect faces in the pictures of the face data set to be labeled using the face detection model to obtain bounding boxes identifying the face positions;
detect whether the number of bounding boxes in a picture is 0, or whether it exceeds a preset upper limit;
if so, set the labeling information of the picture to invalid;
if not, obtain all bounding boxes in the picture and use the NMS algorithm to remove boxes whose IOU with another box exceeds a preset value, thereby obtaining the face positions in the picture.
Preferably, the annotation tool is specifically configured to:
obtain the samples;
and receive the user's selection, for the face position in each picture of the samples, of the corresponding person from a preset library of persons to be labeled, thereby completing the labeling of the pictures in the face data set to be labeled.
In a third aspect, a terminal comprises a processor, an input device, an output device, and a memory, the processor, the input device, the output device, and the memory being connected to each other, wherein the memory is configured to store a computer program, the computer program comprising program instructions, and the processor is configured to call the program instructions to execute the method of the first aspect.
In a fourth aspect, a computer-readable storage medium stores a computer program comprising program instructions which, when executed by a processor, cause the processor to perform the method of the first aspect.
According to the above technical solutions, the deep-learning-based face data set labeling method, system, terminal and medium greatly improve the labeling efficiency of face data sets and can provide a large number of accurate samples for training face detection or face recognition models.
Drawings
In order to illustrate the embodiments of the invention or the technical solutions in the prior art more clearly, the drawings needed in their description are briefly introduced below. Throughout the drawings, like elements or portions are generally identified by like reference numerals. In the drawings, elements or portions are not necessarily drawn to scale.
Fig. 1 is a flowchart of a first identity labeling method provided in the background of the present invention.
Fig. 2 is a flowchart of a second face labeling method provided in the background art of the present invention.
Fig. 3 is a flowchart of a face data set labeling method according to an embodiment of the present invention.
Fig. 4 is a flowchart of a method for acquiring a face data set to be labeled according to an embodiment of the present invention.
Fig. 5 is a flowchart of a method for acquiring a face position in a picture according to an embodiment of the present invention.
Fig. 6 is an image with the face positions obtained according to the first embodiment of the present invention.
Fig. 7 is a diagram illustrating an image before frame culling by using the NMS algorithm according to an embodiment of the present invention.
Fig. 8 is a diagram illustrating an image with frames removed by the NMS algorithm according to an embodiment of the present invention.
Fig. 9 is an image before labeling a face name according to an embodiment of the present invention.
Fig. 10 is an image after labeling a face name according to an embodiment of the present invention.
Detailed Description
Embodiments of the present invention will be described in detail below with reference to the accompanying drawings. The following examples are only for illustrating the technical solutions of the present invention more clearly, and therefore are only examples, and the protection scope of the present invention is not limited thereby. It is to be noted that, unless otherwise specified, technical or scientific terms used herein shall have the ordinary meaning as understood by those skilled in the art to which the invention pertains.
It will be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in the specification of the present invention and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
As used in this specification and the appended claims, the term "if" may be interpreted contextually as "when", "upon" or "in response to a determination" or "in response to a detection". Similarly, the phrase "if it is determined" or "if a [ described condition or event ] is detected" may be interpreted contextually to mean "upon determining" or "in response to determining" or "upon detecting [ described condition or event ]" or "in response to detecting [ described condition or event ]".
In particular implementations, the terminals described in embodiments of the invention include, but are not limited to, portable devices such as mobile phones, laptop computers, or tablet computers having touch-sensitive surfaces (e.g., touch-screen displays and/or touch pads). It should also be understood that in some embodiments the device is not a portable communication device but a desktop computer having a touch-sensitive surface (e.g., a touch-screen display and/or touchpad).
In the discussion that follows, a terminal that includes a display and a touch-sensitive surface is described. However, it should be understood that the terminal may include one or more other physical user interface devices such as a physical keyboard, mouse, and/or joystick.
The terminal supports various applications, such as one or more of the following: a drawing application, a presentation application, a word processing application, a website creation application, a disc burning application, a spreadsheet application, a gaming application, a telephone application, a video conferencing application, an email application, an instant messaging application, an exercise support application, a photo management application, a digital camera application, a web browsing application, a digital music player application, and/or a digital video player application.
Various applications that may be executed on the terminal may use at least one common physical user interface device, such as a touch-sensitive surface. One or more functions of the touch-sensitive surface and corresponding information displayed on the terminal can be adjusted and/or changed between applications and/or within respective applications. In this way, a common physical architecture (e.g., touch-sensitive surface) of the terminal can support various applications with user interfaces that are intuitive and transparent to the user.
Labeling a face data set: framing the location and marking the identity of each face in the face images. Face detection: if a face exists in an image, its position is framed; if multiple faces exist, multiple boxes are drawn. Face recognition: if an image contains a face, the identity of that face is determined; if multiple faces exist, the identities of all of them are given.
In the prior art, sample labeling is sometimes treated as a classification problem: only the identity of the face in the image is labeled, not its specific position. In fact, face recognition is a target detection problem, not a classification problem. Treated as classification, when the image contains too much background or several faces appear, the classification model cannot identify the specific identities of the multiple faces in the image.
Embodiment one:
a face data set labeling method based on deep learning is disclosed, referring to FIG. 3, and comprises the following steps:
S1: acquiring a face data set to be labeled, which comprises a plurality of pictures;
S2: constructing and training a face detection model, and storing the trained face detection model;
S3: detecting faces in the pictures of the face data set to be labeled using the trained face detection model to obtain the face positions in each picture, thereby obtaining samples;
S4: and receiving, for each picture in the samples, a face name entered by the user at the face position, thereby completing the labeling of the pictures in the face data set to be labeled.
Specifically, the method uses deep learning to train a face detection model that can identify the specific positions of faces in pictures and frames those positions in the pictures of the face data set to be labeled, so that the user (i.e., the annotator) only needs to mark the face name inside each framed face. If a box framed by the face detection model is inaccurate, the annotator can correct its position manually.
The labeled face data set then serves as samples for subsequent face detection or face recognition. The method thus greatly improves the labeling efficiency of face data sets and can provide a large number of accurate samples for training face detection or face recognition models.
Referring to fig. 4, the acquiring of a face data set to be labeled comprising a plurality of pictures specifically includes:
receiving keywords input by a user for the person to be labeled;
searching a preset database according to the keywords to obtain information matching the person to be labeled;
extracting the pictures in the information;
constructing the face data set to be labeled from all the extracted pictures;
and storing the face data set to be labeled.
In particular, both face detection and face recognition require a large amount of annotation data (i.e., samples) to train the model, so the quality of the labeled data largely determines how good the final model is. In this embodiment, the face data to be labeled mainly come from pictures in news-feed information. For example, keywords are matched against ES (short for Elasticsearch) to retrieve as much information matching the person to be labeled as possible, and pictures are then extracted from that information to form the face data set to be labeled. The information includes news, web pages, etc. The database may be a company database or an online database. The resulting face data set to be labeled includes fields such as picture name, picture link, target list, target count, picture size, labeling-box list, task batch, operator, remarks, picture state, quality-inspection state, quality inspector, pre-classification, quality-inspection time, pre-inspection target count and sample source.
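As an illustration of this keyword-matching step, the sketch below queries Elasticsearch for information matching a person and extracts picture links. The index name, field names, client address and the example keywords are hypothetical placeholders, not details given by the patent.

```python
from elasticsearch import Elasticsearch

def collect_pictures(person_keywords, size=100):
    # Hypothetical client and index; adjust to the actual deployment.
    es = Elasticsearch("http://localhost:9200")
    query = {
        "query": {
            "multi_match": {
                "query": person_keywords,
                "fields": ["title", "content"],  # assumed text fields
            }
        },
        "size": size,
    }
    hits = es.search(index="news_info", body=query)["hits"]["hits"]
    pictures = []
    for hit in hits:
        # "image_urls" is an assumed field holding the picture links.
        pictures.extend(hit["_source"].get("image_urls", []))
    return pictures

# Usage: the returned links form the face data set to be labeled.
urls = collect_pictures("Zhang San actor")  # hypothetical person keywords
```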
Preferably, the face detection model is an SSH model, and the backbone network of the SSH model is VGG16.
Specifically, this embodiment uses the SSH (Single Stage Headless) face detector as the face detection algorithm; the SSH model has the advantages of fast inference, low memory consumption, and the ability to detect faces at multiple scales.
The backbone network of SSH can be chosen by the annotator; in this embodiment it is VGG16. The SSH model has three detection modules, M1, M2 and M3, whose strides are 8, 16 and 32 respectively, used for detecting small, medium and large faces respectively.
The data set used to train the SSH model in this embodiment is the open-source WIDER FACE data set. The training environment and configuration are as follows: TensorFlow 1.14, a GTX 1080 Ti GPU, Python 3.6, Ubuntu 16.04 and CUDA 10.0. The trained face detection model predicts a single picture (800 × 1200) in around 80 ms.
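Under the stated environment (TensorFlow 1.14, Python 3.6), single-picture inference with the stored model might look like the following sketch, which loads a frozen .pb graph in standard TF1 style. The file name, tensor names and preprocessing are assumptions for illustration; the patent does not specify the graph's interface.

```python
import cv2
import numpy as np
import tensorflow as tf  # TensorFlow 1.x API assumed, per the stated environment

def load_frozen_graph(pb_path):
    """Load a frozen TF1 GraphDef from a .pb file into a new Graph."""
    graph = tf.Graph()
    with graph.as_default():
        graph_def = tf.GraphDef()
        with tf.gfile.GFile(pb_path, "rb") as f:
            graph_def.ParseFromString(f.read())
        tf.import_graph_def(graph_def, name="")
    return graph

graph = load_frozen_graph("ssh_face_detector.pb")  # hypothetical model path
with tf.Session(graph=graph) as sess:
    img = cv2.imread("sample.jpg")
    img = cv2.resize(img, (1200, 800))  # the ~80 ms figure is for 800x1200 inputs
    # "image:0", "boxes:0" and "scores:0" are assumed tensor names.
    boxes, scores = sess.run(
        ["boxes:0", "scores:0"],
        feed_dict={"image:0": img[np.newaxis].astype(np.float32)},
    )
```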
Referring to fig. 5, the detecting of faces in the pictures of the face data set to be labeled using the trained face detection model to obtain the face positions in each picture specifically includes:
loading the trained face detection model and the face data set to be labeled;
detecting faces in the pictures of the face data set to be labeled using the trained face detection model to obtain bounding boxes identifying the face positions;
detecting whether the number of bounding boxes in a picture is 0, or whether it exceeds a preset upper limit;
if so, setting the labeling information of the picture to invalid;
if not, obtaining all bounding boxes in the picture and using the NMS algorithm to remove boxes whose IOU with another box exceeds a preset value, thereby obtaining the face positions in the picture.
Specifically, when screening samples, the method removes pictures in the data set to be labeled that contain no faces or too many faces: samples that need no labeling, i.e., pictures whose bounding-box count is 0, are filtered out, as are pictures with too many faces, i.e., whose bounding-box count exceeds a preset upper limit. When a picture is filtered out, the method sets its labeling information to invalid and ignores it. Fig. 6 shows a picture with the face positions obtained by the method; three boxes are recognized because 3 faces exist.
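A minimal sketch of this filtering rule follows. The sample dictionary layout and the value of the upper limit are assumptions for illustration, not a format defined by the patent.

```python
MAX_FACES = 10  # assumed preset upper limit; the patent leaves the value open

def filter_sample(sample):
    """Keep the picture only if it has at least one face and not too many."""
    n = len(sample["boxes"])
    sample["valid"] = 0 < n <= MAX_FACES
    return sample["valid"]

# Usage: drop pictures with no faces (nothing to label) or too many faces.
detected = [{"boxes": [(10, 20, 80, 90)]}, {"boxes": []}]
to_label = [s for s in detected if filter_sample(s)]  # the empty one is filtered out
```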
Since the bounding boxes output by the SSH model may overlap heavily (i.e., boxes whose IOU exceeds a preset value), the method removes such boxes with the NMS (non-maximum suppression) algorithm, as shown in fig. 7 and 8, which are images before and after box removal, respectively. The NMS algorithm essentially comprises the following steps (a minimal implementation is sketched after the list):
1) sort all boxes by score and select the highest score and its corresponding box;
2) traverse the remaining boxes and delete any box whose overlap (IOU) with the highest-scoring box exceeds a preset value;
3) continue by selecting the highest-scoring box among the unprocessed boxes and repeat the process.
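The following NumPy sketch implements the three steps above. The (x1, y1, x2, y2) box format and the 0.5 threshold are illustrative assumptions, not values fixed by the patent.

```python
import numpy as np

def nms(boxes, scores, iou_threshold=0.5):
    """Return indices of boxes kept after non-maximum suppression."""
    x1, y1, x2, y2 = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]  # step 1: sort by score, highest first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # step 2: IOU of the highest-scoring box with each remaining box
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.maximum(0.0, xx2 - xx1) * np.maximum(0.0, yy2 - yy1)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        # step 3: keep only boxes below the threshold, then repeat
        order = order[1:][iou <= iou_threshold]
    return keep
```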
Preferably, the receiving of a face name entered by the user at the face position for each picture in the samples, completing the labeling of the pictures in the face data set to be labeled, specifically includes:
obtaining the samples;
and receiving the user's selection, for the face position in each picture of the samples, of the corresponding person from a preset library of persons to be labeled, thereby completing the labeling of the pictures in the face data set to be labeled.
Specifically, this step calibrates the pictures in which face positions have been identified. All samples seen by the annotator are therefore pictures containing faces, face positions and face boxes, and the annotator only needs to label the identity of the person in each box. Images before and after the person's identity is annotated are shown in fig. 9 and 10. To make annotation easier, all persons in the library of persons to be labeled can be listed, so that the annotator only needs to select from the listed persons rather than typing them manually, which is quick and convenient.
Embodiment two:
a face dataset labeling system based on deep learning, comprising:
a face data set to be labeled: comprising a plurality of pictures;
a face detection model;
an online prediction service: used for detecting faces in the pictures of the face data set to be labeled using the face detection model to obtain the face positions in each picture, thereby obtaining samples;
Specifically, the online prediction service can adopt a Java development environment, with the TensorFlow development toolkit loading the pb model and Spring Boot as the service framework to handle high-concurrency request scenarios; a sketch of such a service follows this list.
an annotation tool: used for receiving, for each picture in the samples, a face name entered by the user at the face position, completing the labeling of the pictures in the face data set to be labeled.
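The patent specifies a Java service built on Spring Boot with a TensorFlow toolkit loading the pb model. Purely to illustrate the request/response shape of such an online prediction service, here is a minimal equivalent sketched in Python with Flask; the endpoint path, payload fields and the detect_faces stub are hypothetical.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

def detect_faces(image_url):
    # Placeholder for the loaded face detection model (see the inference
    # sketch in embodiment one); returns a list of bounding boxes.
    return [{"x1": 10, "y1": 20, "x2": 80, "y2": 90, "score": 0.98}]

@app.route("/predict", methods=["POST"])
def predict():
    payload = request.get_json()
    boxes = detect_faces(payload["image_url"])
    # Mirror the filtering rule: no faces or too many faces -> invalid sample.
    valid = 0 < len(boxes) <= 10  # assumed upper limit
    return jsonify({"valid": valid, "boxes": boxes})

if __name__ == "__main__":
    app.run()  # a production setup would sit behind a concurrent server
```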
Preferably, the online prediction service is specifically configured to:
load the face detection model and the face data set to be labeled;
detect faces in the pictures of the face data set to be labeled using the face detection model to obtain bounding boxes identifying the face positions;
detect whether the number of bounding boxes in a picture is 0, or whether it exceeds a preset upper limit;
if so, set the labeling information of the picture to invalid;
if not, obtain all bounding boxes in the picture and use the NMS algorithm to remove boxes whose IOU with another box exceeds a preset value, thereby obtaining the face positions in the picture.
Preferably, the annotation tool is specifically configured to:
obtain the samples;
and receive the user's selection, for the face position in each picture of the samples, of the corresponding person from a preset library of persons to be labeled, thereby completing the labeling of the pictures in the face data set to be labeled.
Those of ordinary skill in the art will appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein may be implemented in electronic hardware, computer software, or a combination of both. To illustrate the interchangeability of hardware and software clearly, the components and steps of the examples have been described above in general functional terms. Whether such functionality is implemented as hardware or software depends on the particular application and the design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
In the several embodiments provided in the present application, it should be understood that the disclosed system may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may also be an electric, mechanical or other form of connection.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment of the present invention.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention essentially or partially contributes to the prior art, or all or part of the technical solution can be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
For the sake of brevity, for aspects of the system provided by the embodiment of the present invention that are not described here, refer to the corresponding content in the foregoing method embodiment.
Embodiment three:
a terminal comprising a processor, an input device, an output device and a memory, the processor, the input device, the output device and the memory being interconnected, wherein the memory is configured to store a computer program comprising program instructions, the processor being configured to invoke the program instructions to perform the method described above.
It should be understood that in the embodiments of the present invention, the processor may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
The input device may include a touch pad, a fingerprint sensor (for collecting fingerprint information of a user and direction information of the fingerprint), a microphone, etc., and the output device may include a display (LCD, etc.), a speaker, etc.
The memory may include both read-only memory and random access memory, and provides instructions and data to the processor. A portion of the memory may also include non-volatile random access memory. For example, the memory may also store device type information.
For the sake of brevity, for aspects of this embodiment that are not described here, refer to the corresponding content in the foregoing method embodiments.
Embodiment four:
a computer-readable storage medium, in which a computer program is stored, the computer program comprising program instructions which, when executed by a processor, cause the processor to carry out the above-mentioned method.
The computer readable storage medium may be an internal storage unit of the terminal according to any of the foregoing embodiments, for example, a hard disk or a memory of the terminal. The computer readable storage medium may also be an external storage device of the terminal, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like provided on the terminal. Further, the computer-readable storage medium may also include both an internal storage unit and an external storage device of the terminal. The computer-readable storage medium is used for storing the computer program and other programs and data required by the terminal. The computer readable storage medium may also be used to temporarily store data that has been output or is to be output.
For the sake of brevity, for aspects of the medium provided by the embodiments of the present invention that are not described here, refer to the corresponding content in the foregoing method embodiments.
Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention, not to limit them. While the invention has been described in detail with reference to the foregoing embodiments, those skilled in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some or all of their technical features may be equivalently replaced; such modifications and substitutions do not depart from the spirit and scope of the present invention and should be covered by the protection scope of the claims.

Claims (10)

1. A face data set labeling method based on deep learning, characterized by comprising the following steps:
acquiring a face data set to be labeled, which comprises a plurality of pictures;
constructing and training a face detection model, and storing the trained face detection model;
detecting faces in the pictures of the face data set to be labeled using the trained face detection model to obtain the face positions in each picture, thereby obtaining samples;
and receiving, for each picture in the samples, a face name entered by the user at the face position, thereby completing the labeling of the pictures in the face data set to be labeled.
2. The deep-learning-based face data set labeling method of claim 1, wherein the acquiring of the face data set to be labeled comprising a plurality of pictures specifically comprises:
receiving keywords input by a user for the person to be labeled;
searching a preset database according to the keywords to obtain information matching the person to be labeled;
extracting the pictures in the information;
constructing the face data set to be labeled from all the extracted pictures;
and storing the face data set to be labeled.
3. The deep-learning-based face data set labeling method of claim 1, wherein the face detection model is an SSH model, and the backbone network of the SSH model is VGG16.
4. The deep-learning-based face data set labeling method of claim 2, wherein the detecting of faces in the pictures of the face data set to be labeled using the trained face detection model to obtain the face positions in each picture specifically comprises:
loading the trained face detection model and the face data set to be labeled;
detecting faces in the pictures of the face data set to be labeled using the trained face detection model to obtain bounding boxes identifying the face positions;
detecting whether the number of bounding boxes in a picture is 0, or whether it exceeds a preset upper limit;
if so, setting the labeling information of the picture to invalid;
if not, obtaining all bounding boxes in the picture and using the NMS algorithm to remove boxes whose IOU with another box exceeds a preset value, thereby obtaining the face positions in the picture.
5. The deep-learning-based face data set labeling method of claim 4, wherein the receiving of a face name entered by the user at the face position for each picture in the samples to complete the labeling of the pictures in the face data set to be labeled specifically comprises:
obtaining the samples;
and receiving the user's selection, for the face position in each picture of the samples, of the corresponding person from a preset library of persons to be labeled, thereby completing the labeling of the pictures in the face data set to be labeled.
6. A face data set labeling system based on deep learning, characterized by comprising:
a face data set to be labeled: comprising a plurality of pictures;
a face detection model;
an online prediction service: used for detecting faces in the pictures of the face data set to be labeled using the face detection model to obtain the face positions in each picture, thereby obtaining samples;
an annotation tool: used for receiving, for each picture in the samples, a face name entered by the user at the face position, completing the labeling of the pictures in the face data set to be labeled.
7. The deep-learning-based face data set labeling system of claim 6, wherein the online prediction service is specifically configured to:
load the face detection model and the face data set to be labeled;
detect faces in the pictures of the face data set to be labeled using the face detection model to obtain bounding boxes identifying the face positions;
detect whether the number of bounding boxes in a picture is 0, or whether it exceeds a preset upper limit;
if so, set the labeling information of the picture to invalid;
if not, obtain all bounding boxes in the picture and use the NMS algorithm to remove boxes whose IOU with another box exceeds a preset value, thereby obtaining the face positions in the picture.
8. The deep-learning-based face data set labeling system of claim 6, wherein the annotation tool is specifically configured to:
obtain the samples;
and receive the user's selection, for the face position in each picture of the samples, of the corresponding person from a preset library of persons to be labeled, thereby completing the labeling of the pictures in the face data set to be labeled.
9. A terminal, comprising a processor, an input device, an output device, and a memory, the processor, the input device, the output device, and the memory being interconnected, wherein the memory is configured to store a computer program comprising program instructions, the processor being configured to invoke the program instructions to perform the method of any of claims 1-5.
10. A computer-readable storage medium, characterized in that the computer storage medium stores a computer program comprising program instructions that, when executed by a processor, cause the processor to perform the method according to any of claims 1-5.
CN202010374477.1A (filed 2020-05-06, priority date 2020-05-06): Face data set labeling method, system, terminal and medium based on deep learning. Published as CN111597936A; status: Pending.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010374477.1A CN111597936A (en) 2020-05-06 2020-05-06 Face data set labeling method, system, terminal and medium based on deep learning


Publications (1)

Publication Number Publication Date
CN111597936A 2020-08-28

Family

ID=72191009

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010374477.1A Pending CN111597936A (en) 2020-05-06 2020-05-06 Face data set labeling method, system, terminal and medium based on deep learning

Country Status (1)

Country Link
CN (1) CN111597936A (en)



Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103530652A (en) * 2013-10-23 2014-01-22 北京中视广信科技有限公司 Face clustering based video categorization method and retrieval method as well as systems thereof
CN103824053A (en) * 2014-02-17 2014-05-28 北京旷视科技有限公司 Face image gender marking method and face gender detection method
CN108985214A (en) * 2018-07-09 2018-12-11 上海斐讯数据通信技术有限公司 The mask method and device of image data
CN109934115A (en) * 2019-02-18 2019-06-25 苏州市科远软件技术开发有限公司 Construction method, face identification method and the electronic equipment of human face recognition model
CN110427912A (en) * 2019-08-12 2019-11-08 深圳市捷顺科技实业股份有限公司 A kind of method for detecting human face and its relevant apparatus based on deep learning

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112766049A (en) * 2020-12-29 2021-05-07 清华大学 Large-scale face recognition test set construction method and device based on difficult sample mining
CN115248831A (en) * 2021-04-28 2022-10-28 马上消费金融股份有限公司 Labeling method, device, system, equipment and readable storage medium
CN115248831B (en) * 2021-04-28 2024-03-15 马上消费金融股份有限公司 Labeling method, labeling device, labeling system, labeling equipment and readable storage medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20200828)