CN110807108A - Asian face data automatic collection and cleaning method and system - Google Patents

Publication number
CN110807108A
Authority
CN
China
Prior art keywords
data
asian
face
target
character
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910977959.3A
Other languages
Chinese (zh)
Inventor
丁长兴
黄英杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China University of Technology SCUT
Priority to CN201910977959.3A
Priority to PCT/CN2020/070658 (WO2021072998A1)
Publication of CN110807108A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50 Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/51 Indexing; Data structures therefor; Storage structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50 Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/53 Querying
    • G06F16/535 Filtering based on additional data, e.g. user or group profiles
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50 Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/54 Browsing; Visualisation therefor
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168 Feature extraction; Face representation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Processing Or Creating Images (AREA)
  • Collating Specific Patterns (AREA)

Abstract

The invention discloses a method and a system for automatically collecting and cleaning Asian face data. The method comprises the following steps: presetting a plurality of Asian target character identifications, acquiring an official photo link for each Asian target character, and constructing a data list comprising key information of the Asian target character identifications; searching, according to the content of the data list, for reference character data associated with the Asian target character identifications alone and with the identifications combined with keywords; storing the reference character data in association with the corresponding Asian target character identification and keywords; and cleaning the stored reference character data to obtain target face data associated with the Asian target characters. The invention automates the collection and cleaning of Asian face data, replaces laborious manual procedures such as labeling and classification, greatly reduces the labor and time cost of establishing an Asian face database, and also solves the problem of class imbalance in existing face databases.

Description

Asian face data automatic collection and cleaning method and system
Technical Field
The invention relates to the technical field of image processing and recognition, in particular to a method and a system for automatically collecting and cleaning Asian face data.
Background
Most existing face recognition technologies use a face recognition model trained with deep learning. To improve recognition accuracy, the model must be trained on a database containing a large number of labeled face photos. These photos are mainly downloaded from the internet with web crawlers and must then be labeled and cleaned through tedious manual work; this series of tasks places very high demands on computing and storage equipment and requires a large investment of labor and time. Large internet companies have unique advantages in image resources and operating capital and hold private large-scale face data sets, but so far very few large-scale public face data sets are available to ordinary users free of charge. The mainstream public face data sets mainly include YouTube Faces, CASIA-WebFace, MS-Celeb-1M and the like.
Moreover, most existing face data sets are dominated by European and American faces and contain only a very small amount of Asian face data. Training a neural network on such class-imbalanced face data sets easily causes the network to exhibit latent 'racial discrimination'. In addition, very few Asian face data sets can currently be obtained on the internet, and those that exist often cover few people and contain little face data. Therefore, establishing a large Asian face data set under limited labor and time cost has very important scientific research and commercial value.
Disclosure of Invention
In order to overcome the defects and shortcomings of the prior art, the invention provides a method and a system for automatically collecting and cleaning Asian face data, which automatically clean the collected Asian face photo data, achieve a low time cost and a small per-person workload, and can establish an Asian face database with a higher recall rate.
In order to achieve the purpose, the invention adopts the following technical scheme:
the invention provides a method for automatically collecting and cleaning Asian face data, which comprises the following steps:
presetting a plurality of Asian target character identifications, acquiring an official photo link of the Asian target character, and constructing a data list, wherein the data list comprises a plurality of Asian target character identification key information;
according to the content of the data list, searching with the Asian target character identification alone and with the identification combined with keywords to obtain reference character data, the specific steps comprising:
obtaining an official photo of the Asian target character according to the Asian target character official photo link;
searching the Asian target character identification alone to obtain the associated reference character data for the single identification;
adding a plurality of keywords to the Asian target character identification for searching, and respectively acquiring the associated reference character data corresponding to each combination of the identification with a different keyword;
storing the searched reference character data in association with the corresponding Asian target character identification and keywords until all the reference character data have been stored;
and cleaning the stored reference character data to obtain target face data associated with the Asian target characters.
As a preferred technical solution, the Asian target character identifications are the names of different Asian target characters or numbers for distinguishing different Asian target characters, the Asian target character official photo links are URL links, and each row of the data list corresponds to one Asian target character identification and its official photo URL link.
As a preferred technical solution, the step of storing the searched reference character data in association with the corresponding asian target character identifier and the keyword comprises the following specific steps:
creating a main folder, wherein the main folder is named by adopting an Asian target character identifier, and a plurality of subfolders are created in the main folder and are named by adopting keywords respectively;
and storing the reference person data acquired by different searching modes in corresponding subfolders in a related manner.
As a preferred technical scheme, the step of cleaning the stored reference person data comprises the following steps:
performing readability inspection and format unification on all reference character data by adopting a picture processing tool, and removing the reference character data which cannot be read and written normally;
deleting the obtained repeatedly downloaded reference character data, and reserving official photos of the Asian target characters;
preliminary cleaning: performing face detection on the reference character data with a face detection algorithm to obtain reference face data after face detection processing;
deep cleaning: examining the reference face data after face detection processing with a face recognition algorithm, updating the Asian target character official photo list, and checking whether the reference face data match the Asian target character official photo list; if not, deleting the reference face data after face detection processing, and if so, retaining them as target face data associated with the Asian target character.
As a preferred technical solution, the deleting the obtained repeatedly downloaded reference character data specifically includes:
after the reference character data have been stored in folders in association with the corresponding Asian target character identification and keywords, identical file names of reference character data are taken as the criterion for repeated downloading; if repeated downloads exist, one copy of the repeatedly downloaded reference character data is retained and the remaining copies are deleted.
As a preferred technical scheme, the face detection of the reference person data by using the face detection algorithm specifically comprises the following steps:
locating the positions of the key points of the human face in the reference character data and detecting a face frame through the face detection algorithm;
if no face frame exists, deleting the reference character data;
if exactly one face frame exists, cutting out the reference face data in the face frame, and retaining the reference character data;
if a plurality of face frames exist, cutting out each reference face data in the plurality of face frames respectively, simultaneously retaining the reference character data, extracting the official photo corresponding to the Asian target character and the face features of each reference face data, respectively calculating the matching degree of the face features of each reference face data and the face features of the official photo corresponding to the Asian target character, and retaining the reference character data corresponding to the reference face data with the highest matching degree as the reference face data after face detection processing.
As a preferred technical scheme, the deep cleaning comprises the following specific steps:
extracting, based on a face recognition algorithm, the face features of the official photo of the Asian target character and of the corresponding preliminarily cleaned reference face data;
respectively calculating the matching degree between the face features of the Asian target character official photo and the face features of each piece of preliminarily cleaned reference face data, adding the reference face data whose matching degree is greater than or equal to a first preset threshold to the target character official photo list, and thereby updating the target character official photo list;
matching the face features of the remaining preliminarily cleaned reference face data one by one with the face features in the updated Asian target character official photo list, retaining the reference face data whose matching degree is greater than or equal to a second preset threshold, and deleting the rest;
wherein the first preset threshold is greater than the second preset threshold.
The invention also provides a system for automatically collecting and cleaning Asian face data, which comprises: the system comprises a data list construction module, a reference character data acquisition module, an association storage module and a reference character data cleaning module;
the data list building module is used for obtaining Asian target character official photo links by presetting a plurality of Asian target character identifications and building a data list;
the reference character data acquisition module is used for acquiring, according to the content of the data list, reference character data associated with the Asian target character identification alone and with the identification combined with keywords;
the association storage module is used for performing association storage on the reference character data, the corresponding Asian target character identification and the keywords;
the reference character data cleaning module is used for cleaning the stored reference character data to obtain target face data associated with the Asian target characters.
As a preferred technical scheme, the reference character data cleaning module comprises a preliminary cleaning submodule and a deep cleaning submodule; the preliminary cleaning submodule is used for performing face detection on the reference character data with a face detection algorithm to obtain reference face data after face detection processing; the deep cleaning submodule is used for examining the reference face data after face detection processing with a face recognition algorithm, updating the Asian target character official photo list, checking whether the reference face data match the Asian target character official photo list, and taking the matched reference face data as the target face data associated with the Asian target character.
Compared with the prior art, the invention has the following advantages and beneficial effects:
(1) The invention uses a picture processing tool to perform a readability check and format unification on all reference character data and deletes repeatedly downloaded reference character data, thereby improving the smoothness and processing efficiency of the subsequent cleaning process.
(2) The invention acquires the reference character data through a plurality of searching modes, which increases the diversity and accuracy of the acquired reference character data.
(3) The invention cleans the data through preliminary cleaning followed by deep cleaning to obtain target face data associated with the target character identification, and updates the Asian target character official photo list during deep cleaning, that is, updates the reference set used for comparison, thereby improving the accuracy of face feature comparison.
(4) The invention automatically processes the whole process from the collection to the cleaning of the Asian face data, replaces the traditional heavy procedures of manual labeling, classification and the like, greatly reduces the time cost for establishing the Asian face database, and also solves the problems of unbalanced category and the like in the existing face database.
Drawings
Fig. 1 is a schematic flow chart of the Asian face data automatic collection and cleaning method of the present embodiment;
Fig. 2 is a schematic diagram of the data list of the Asian face data automatic collection and cleaning method of the present embodiment;
Fig. 3 is a schematic diagram illustrating the effect of the Asian face data automatic collection and cleaning method of the present embodiment when cleaning is not performed;
Fig. 4 is a schematic diagram illustrating the effect of the preliminary cleaning in the Asian face data automatic collection and cleaning method of the present embodiment;
Fig. 5 is a schematic diagram illustrating the effect of Asian face data collection in the Asian face data automatic collection and cleaning method of the present embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Examples
As shown in Fig. 1, a method for automatically collecting and cleaning Asian face data includes the following steps:
S1, presetting a plurality of Asian target character identifications, acquiring the corresponding official photo links, and constructing a data list comprising key information of the Asian target character identifications;
the Asian target character identifications in step S1 are the names of different Asian target characters or preset numeric labels for distinguishing different Asian target characters; for example, Baidu provides a star ranking list on its encyclopedia, divided into sub-lists of stars by region, such as stars from mainland China and stars from Hong Kong, Taiwan and Southeast Asia; the selected list can be acquired automatically with crawler technology, the names of the Asian stars in the sub-lists are recorded in order in the data list, and integers increasing from 0 can optionally be used as numeric labels for distinguishing the stars; using public figures as the Asian target character identifications has two advantages in this embodiment: a large number of photos of a specified target character can conveniently be obtained from a search engine through the identification, and privacy and infringement problems caused by using the photos can be avoided;
for the official photo link in step S1, this embodiment uses the encyclopedia to provide the URL link from which the official photo of the Asian target character is downloaded. For example, the star ranking list of the encyclopedia displays the official photos and names of the stars in order of their real-time popularity; clicking a star's official photo or name opens the corresponding encyclopedia introduction page, and the URL link of the official photo displayed on that page is obtained with crawler technology and recorded in the data list.
As shown in Fig. 2, a schematic diagram of the data list used for obtaining the reference character data, the data list contains key information of a plurality of Asian target character identifications; each row corresponds to one Asian target character identification and its official photo URL link, with the numeric label, the name and the URL link arranged from left to right and separated by the tab character '\t';
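As a non-limiting illustration, a data list with the row layout described above could be read as follows. This is a minimal sketch, not the implementation of the embodiment: the `TargetEntry` type, the `parse_data_list` function and the sample rows with their example.com URLs are hypothetical and serve only to show the tab-separated layout of numeric label, name and URL link.

```python
import csv
import io
from typing import NamedTuple


class TargetEntry(NamedTuple):
    label: int      # numeric label distinguishing the target character
    name: str       # identification used as the search term
    photo_url: str  # URL link of the official photo


def parse_data_list(text: str) -> list[TargetEntry]:
    """Parse tab-separated rows of (label, name, official photo URL)."""
    entries = []
    for row in csv.reader(io.StringIO(text), delimiter="\t"):
        if len(row) != 3:
            continue  # skip blank or malformed rows
        label, name, url = (field.strip() for field in row)
        entries.append(TargetEntry(int(label), name, url))
    return entries


# Hypothetical sample rows; a real list would be produced by the crawler in step S1.
sample = (
    "0\tPerson A\thttps://example.com/a.jpg\n"
    "1\tPerson B\thttps://example.com/b.jpg\n"
)
entries = parse_data_list(sample)
```

Keeping the numeric label separate from the name allows the label to be used later for folder naming while the name is used for searching.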
S2, according to the content of the data list, searching with the Asian target character identification alone and with the identification combined with keywords to obtain reference character data;
the step S2 includes the following steps:
obtaining the official photo of the Asian target character from its official photo URL link by computer means, which may be, but are not limited to, a web crawler, a downloader tool and the like;
searching the Asian target character identification alone and acquiring at least one piece of associated reference character data for the single identification; specifically, if the Asian target character identification is the name of the target character, the reference character data are related photos of the Asian target character; for example, this embodiment may use a Python script that simulates the manual process of searching for and downloading pictures, so that a large amount of reference character data can easily be obtained by searching the Asian target character identification on the Baidu image search engine;
in this embodiment, the target character identification is also searched in combination with keywords, and at least one piece of associated reference character data is acquired for each combination of the identification with a different keyword; the keywords may be, but are not limited to, glasses, hat, actor, singer (occupation types) and the like, so that varied reference character data such as 'target character name + glasses', 'target character name + hat' and 'target character name + occupation type' are obtained, which increases the diversity and accuracy of the acquired reference character data;
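By way of example only, the search terms of step S2 could be assembled as in the sketch below. The function name `build_queries`, the key "name" for the identification-only search and the use of a space to join identification and keyword are assumptions made for illustration; the actual query syntax depends on the search engine used.

```python
def build_queries(identifier: str, keywords: list[str]) -> dict[str, str]:
    """Map a storage key to a search query for one target character.

    The key "name" stands for the identification-only search; every other
    key is the keyword that was appended to the identification.
    """
    queries = {"name": identifier}
    for keyword in keywords:
        queries[keyword] = f"{identifier} {keyword}"
    return queries


# "Person A" is a placeholder identification, not a name from the data list.
queries = build_queries("Person A", ["glasses", "hat", "singer"])
```

Returning a mapping rather than a plain list keeps each query tied to the keyword that produced it, which is what the associated storage in the next step relies on.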
S3, sequentially storing the at least one piece of reference character data in association with the corresponding Asian target character identification and the added keywords until all the reference character data have been stored;
specifically, photo collection and storage are performed on each asian target person identifier in the data list, for example:
creating a main folder and naming the main folder by using an Asian target character identifier (such as a digital label corresponding to the Asian target character), then creating a plurality of sub-folders in the main folder, and naming the sub-folders by using keywords (such as English spelling of the keywords) respectively;
storing at least one reference character data of the Asian target characters acquired by different searching modes in a corresponding subfolder in a related manner;
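The folder layout described above can be sketched as follows, purely as an illustration. The function `create_target_folders`, the temporary root directory and the subfolder names are hypothetical; the sketch only assumes that the main folder is named by the numeric label and that each search mode gets its own subfolder.

```python
import tempfile
from pathlib import Path


def create_target_folders(root: Path, label: int, subfolder_names: list[str]) -> Path:
    """Create <root>/<label>/<subfolder> for every search mode and return the main folder."""
    main = root / str(label)
    for name in subfolder_names:
        (main / name).mkdir(parents=True, exist_ok=True)
    return main


# A temporary directory stands in for the real storage location.
root = Path(tempfile.mkdtemp())
main_folder = create_target_folders(root, 0, ["standard", "name", "hat", "glass", "job"])
created = sorted(p.name for p in main_folder.iterdir())
```

Because `mkdir` is called with `exist_ok=True`, the function can be re-run safely when a collection job is resumed.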
S4, automatically cleaning the stored reference character data in sequence to obtain target face data associated with the Asian target characters;
the step S4 includes the following steps:
S41, performing a readability check and format unification on all reference character data with a picture processing tool, and removing the small portion of reference character data that cannot be read or written normally because of download errors, format errors and the like; the picture processing tool may be, but is not limited to, image processing software, libraries or programming languages such as MATLAB, Python, OpenCV and Photoshop; for example, when pictures are downloaded from the internet in batches by a crawler, the downloads are often affected by network fluctuation and anti-crawler mechanisms, which leads to incomplete picture content, failed downloads and similar problems; such pictures generally cannot be read or written normally by software and seriously disrupt the cleaning process, so removing them improves data processing efficiency; in addition, to facilitate subsequent processing and data management, MATLAB may be used to convert all reference character data to the common JPEG format before the data cleaning work starts;
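As a simplified stand-in for the readability check of step S41, the sketch below tests only whether a file begins and ends like a complete JPEG or PNG, using the Python standard library. It is not the method of the embodiment: a fuller check would decode each picture with an image processing tool and re-save it as JPEG, and files in other formats would be rejected by this sketch. The function names and the sample files are hypothetical.

```python
import tempfile
from pathlib import Path

JPEG_START, JPEG_END = b"\xff\xd8", b"\xff\xd9"
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
PNG_END = b"IEND\xaeB`\x82"  # IEND chunk type followed by its fixed CRC


def looks_complete(path: Path) -> bool:
    """Cheap check that a file starts and ends like a whole JPEG or PNG."""
    data = path.read_bytes()
    if data.startswith(JPEG_START):
        return data.endswith(JPEG_END)
    if data.startswith(PNG_SIGNATURE):
        return data.endswith(PNG_END)
    return False


def remove_unreadable(folder: Path) -> list[str]:
    """Delete files that fail the check and return their names."""
    removed = []
    for path in sorted(folder.iterdir()):
        if path.is_file() and not looks_complete(path):
            path.unlink()
            removed.append(path.name)
    return removed


# Synthetic files: one whole, one truncated download, one anti-crawler error page.
folder = Path(tempfile.mkdtemp())
(folder / "ok.jpg").write_bytes(JPEG_START + b"payload" + JPEG_END)
(folder / "truncated.jpg").write_bytes(JPEG_START + b"payload")
(folder / "error.html").write_bytes(b"<html>blocked</html>")
removed = remove_unreadable(folder)
remaining = sorted(p.name for p in folder.iterdir())
```

A truncated download usually lacks the end-of-image marker, and an error page saved under a picture name lacks the start marker, so both typical failure cases named above are caught.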
S42, deleting the repeatedly downloaded reference character data obtained for the Asian target character through the different searching modes, while retaining the official photo of the Asian target character;
wherein identical file names of reference character data are taken as the criterion for repeated downloading; for example, a main folder is created for a searched public figure, and the searching modes may be 'person name', 'person name + hat', 'person name + glasses', 'person name + singer' and the like, so subfolders such as name, hat, glass and job are created in the main folder, and the official photo is stored in a standard subfolder; when reference character data are obtained from a search engine through different combinations, the engine inevitably returns some reference character data with identical names and contents; such repeated data easily cause a neural network to overfit and seriously degrade its face recognition performance, so the aim of this step is to keep only one copy of each set of files with the same name across the subfolders; in this embodiment, the official photo downloaded from the URL is named standard.jpg, while the photos obtained by batch searching and downloading keep the names they have on the source website, so that the name of the official photo in the standard subfolder does not collide with any other photo; alternatively, the name of the official photo in the standard subfolder is left unchanged and a check is added so that, if one of the repeated pictures is located in the standard subfolder, the copy in the standard subfolder is the one retained.
The specific steps of step S42 include:
recording, with a program, the file names of the reference character data in each subfolder of the Asian target character in turn and checking them for identical names, wherein the program may be written in MATLAB, Python or other languages;
if repeated downloads exist, only one copy of the reference character data is retained, and the remaining repeated reference character data are deleted;
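One possible way to carry out step S42 is sketched below, under the assumption that duplicates are detected purely by file name across the subfolders of one main folder and that the copy in the standard subfolder, if any, is the one retained. The function `remove_duplicate_names` and the synthetic folder layout are hypothetical.

```python
import tempfile
from pathlib import Path


def remove_duplicate_names(main_folder: Path, protected: str = "standard") -> list[Path]:
    """Keep one file per file name across subfolders; prefer the copy in `protected`."""
    kept: dict[str, Path] = {}
    removed: list[Path] = []
    # Visit the protected subfolder first so that its files are always the ones kept.
    subfolders = sorted(
        (p for p in main_folder.iterdir() if p.is_dir()),
        key=lambda p: (p.name != protected, p.name),
    )
    for sub in subfolders:
        for path in sorted(sub.iterdir()):
            if path.name in kept:
                path.unlink()          # a file with this name was already kept
                removed.append(path)
            else:
                kept[path.name] = path
    return removed


# Synthetic layout with duplicate names a.jpg and b.jpg in different subfolders.
main = Path(tempfile.mkdtemp())
layout = {
    "standard": ["standard.jpg"],
    "name": ["a.jpg", "b.jpg"],
    "hat": ["a.jpg"],
    "glass": ["b.jpg", "c.jpg"],
}
for sub, files in layout.items():
    (main / sub).mkdir()
    for file_name in files:
        (main / sub / file_name).write_bytes(b"x")
removed = [(p.parent.name, p.name) for p in remove_duplicate_names(main)]
```

Which of several identical copies survives outside the standard subfolder is arbitrary here (alphabetical order of subfolder names); the text above only requires that one copy remains.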
S43, performing face detection work on the reference character data based on a face detection algorithm so that the cleaned reference character data include only reference face data, wherein the face detection work includes a series of steps such as face detection, face correction and face alignment, and the reference character data before cleaning include both reference face data and reference non-face data;
the specific steps of step S43 include:
performing face detection work on the reference character data based on a face detection algorithm to obtain the reference face data after face detection processing; the face detection algorithm may be, but is not limited to, a deep learning method such as MTCNN (Multi-task Cascaded Convolutional Networks), which covers a series of steps such as face detection, face correction and face alignment; face detection removes non-face data such as pictures of glasses or hats, while face correction and face alignment straighten and align tilted or side-facing faces, improving the efficiency of subsequent face feature matching.
As shown in Fig. 3, before any cleaning the reference character data include both reference face data and reference non-face data. The preliminary cleaning of this embodiment removes the reference non-face data from the reference character data; common methods include the MTCNN deep learning method and the face detection toolkit shipped with OpenCV. The principle is that the algorithm detects and locates five key points of the face in a photo (the two eyes, the nose and the two mouth corners) and returns a face frame, so whether a piece of reference character data is reference face data can be judged by whether a face frame is returned: if no face frame is returned, the photo is deleted; if one face frame is returned, the reference face data in the face frame are cut out and the photo is retained; if more than one face frame is returned, the reference face data in each face frame are cut out separately while the reference character data subjected to face detection processing are retained, then the one piece of reference face data that best matches the official photo of the Asian target character is selected based on a face recognition algorithm, and the remaining reference face data are deleted;
further, when the face detection algorithm returns more than one face frame, a folder named after the photo is created, and the faces in the face frames are cut out of the photo and stored in that folder under distinguishable names; then, based on a face recognition algorithm, the face features of the official photo of the corresponding Asian target character and of each face frame photo in the folder are extracted, and the matching degree between the face features of each face frame photo and those of the official photo is calculated; the face frame photo with the highest matching degree is retained and the rest are deleted; the retained photo is moved out of the folder to replace the original reference character data, and the folder is deleted; as shown in Fig. 4, a large part of the reference non-face data has been removed after the preliminary cleaning;
S44, detecting, based on a face recognition algorithm, whether the preliminarily cleaned reference face data match the official photo of the Asian target character identification; if not, deleting the preliminarily cleaned reference face data, and if so, retaining them to obtain the target face data associated with the target character; specifically, after the preliminary cleaning of the reference character data there is no guarantee that the resulting reference face data belong to the target character, so deep cleaning is required: face data that belong to the same Asian target character identification are retained and face data that belong to a different identity are deleted, which completes the data cleaning work;
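The decision on the number of returned face frames, including the selection of the best-matching face when several are returned, can be illustrated with the sketch below. The feature vectors are hypothetical placeholders for the output of a face recognition model, and cosine similarity is assumed as the matching degree; the text above does not fix a particular measure.

```python
import math
from typing import Optional


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def select_face(official: list[float], faces: list[list[float]]) -> Optional[int]:
    """Return the index of the face crop to keep, or None if the photo is deleted."""
    if not faces:
        return None  # no face frame returned: the photo is deleted
    if len(faces) == 1:
        return 0     # one face frame: its crop is kept
    # Several face frames: keep the crop that best matches the official photo.
    scores = [cosine_similarity(official, face) for face in faces]
    return max(range(len(faces)), key=scores.__getitem__)


# Toy two-dimensional features; real features would have many more dimensions.
official = [1.0, 0.0]
choice = select_face(official, [[0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]])
```

In this toy case the second crop points almost in the same direction as the official feature, so its index is the one returned.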
the specific steps of step S44 include:
extracting the official photo of the Asian target figure and the face features of the reference face data after the official photo is correspondingly cleaned based on a face recognition algorithm;
respectively calculating the matching degrees of the face features of the Asian target person official photo and the face features of the preliminarily cleaned reference face data, and classifying the cleaned reference face data with the matching degree being more than or equal to a first preset threshold value into a target person official photo list, wherein the target person official photo list not only is an official photo, but also comprises a target person photo with high matching degree obtained by the search of the target person adding keywords, because in the face data of the Asian target person obtained by different search modes, the face data of the Asian target person obtained by the keyword search is compared with the face features of the target person official photo obtained by only URL link of the Asian target person official photo in the step S2, the situation with low matching degree is easy to occur, for example, the official photo is a front face photo and is not decorated with ornaments such as hats, glasses and the like, the photo searched by adding the keyword (such as a hat and glasses) (after face recognition and screening) is compared with the official photo which is not decorated with ornaments such as a hat and glasses, so that the matching deviation is easy to occur, therefore, the target person photo with high matching degree obtained by searching the target person adding the keyword needs to be added to a target person official photo list, and the target person official photo list is updated, so as to increase the accuracy of the face feature comparison;
and matching the face characteristics of the residual preliminarily cleaned reference face data with the updated data in the official photo list of the target person one by one, keeping the residual preliminarily cleaned reference face data with the matching degree larger than or equal to a second preset threshold value, and deleting the residual preliminarily cleaned reference face data.
In this embodiment, a face recognition algorithm sequentially extracts feature vectors from the preliminarily cleaned reference face data of an Asian target character, and the first preset threshold is set to 0.9. The feature vector of each item of reference face data is then matched against the feature vector of the official photo; reference face data with a matching degree greater than or equal to 0.9 is screened out in the first round, and these photos are treated as official photos of the target character. The second preset threshold is then set to 0.7, and the feature vectors of the remaining reference face data are matched one by one against the feature vectors of the photos in the official photo list produced by the first round. A remaining item is kept if its matching degree with any photo in the official photo list is greater than or equal to 0.7; otherwise it is deleted. The first and second preset thresholds used in this embodiment may be adjusted according to actual conditions.
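The two-round screening described above can be sketched as follows, using the thresholds 0.9 and 0.7 from this embodiment. The sketch assumes cosine similarity as the matching degree (the text does not name a metric) and takes precomputed feature vectors; the function name `deep_clean` and the dictionary layout are illustrative assumptions.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity, assumed here as the matching degree between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def deep_clean(
    official: Sequence[float],
    candidates: Dict[str, Sequence[float]],
    first_threshold: float = 0.9,
    second_threshold: float = 0.7,
) -> Tuple[List[str], List[str]]:
    """Return (kept, deleted) photo names for one target character."""
    # Round 1: compare every candidate with the official photo only.
    official_list = [official]
    kept: List[str] = []
    remaining: List[str] = []
    for name, feature in candidates.items():
        if cosine_similarity(feature, official) >= first_threshold:
            kept.append(name)
            official_list.append(feature)  # promoted into the official photo list
        else:
            remaining.append(name)

    # Round 2: keep a remaining photo if it matches ANY entry of the updated list.
    deleted: List[str] = []
    for name in remaining:
        feature = candidates[name]
        if any(cosine_similarity(feature, ref) >= second_threshold for ref in official_list):
            kept.append(name)
        else:
            deleted.append(name)
    return kept, deleted
```

A photo that matches the official photo only weakly can still survive the second round if it is close to a photo promoted in the first round, which is the purpose of updating the official photo list.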
As shown in Fig. 5, two rounds of cleaning yield a high-purity Asian face database: the great majority of the pictures in each Asian target character's folder belong to that target character, and very few noise pictures remain.
This embodiment also provides a system for automatically collecting and cleaning Asian face data, comprising: a data list construction module, a reference character data acquisition module, an association storage module, and a reference character data cleaning module;
In this embodiment, the data list construction module is used for presetting a plurality of Asian target character identifications, obtaining the official photo links of the Asian target characters, and constructing a data list; the reference character data acquisition module is used for acquiring, according to the content of the data list, the reference character data associated with the Asian target character identifications alone and with the character identifications combined with keywords; the association storage module is used for storing the reference character data in association with the corresponding Asian target character identification and keywords; and the reference character data cleaning module is used for cleaning the stored reference character data to obtain the target face data associated with the Asian target characters.
In this embodiment, the reference character data cleaning module includes a preliminary cleaning sub-module and a deep cleaning sub-module. The preliminary cleaning sub-module performs face detection on the reference character data using a face detection algorithm to obtain the reference face data after face detection processing. The deep cleaning sub-module examines the reference face data after face detection processing using a face recognition algorithm, updates the Asian target character identification official photo list, checks whether the reference face data matches the Asian target character identification official photo list, and takes the matched reference face data as the target face data associated with the Asian target character.
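How the four modules hand data to one another can be sketched as below. This is only an illustration under stated assumptions: the search engine and the cleaning decision are injected as plain callables (`search`, `keep`), storage is an in-memory dictionary rather than folders, and all names (`Entry`, `build_data_list`, `acquire_and_store`, `clean`) are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class Entry:
    """One row of the data list: an identification and its official photo link."""
    identifier: str
    official_photo_url: str


def build_data_list(rows: Iterable[Tuple[str, str]]) -> List[Entry]:
    """Data list construction module (sketch)."""
    return [Entry(identifier=i, official_photo_url=u) for i, u in rows]


def acquire_and_store(
    data_list: List[Entry],
    keywords: Iterable[str],
    search: Callable[[str], List[str]],
) -> Dict[Tuple[str, str], List[str]]:
    """Acquisition plus association storage (sketch).

    Results are keyed by (identification, keyword); the empty keyword
    stands for the identification-only search.
    """
    store: Dict[Tuple[str, str], List[str]] = {}
    for entry in data_list:
        for keyword in [""] + list(keywords):
            query = f"{entry.identifier} {keyword}".strip()
            store[(entry.identifier, keyword)] = search(query)
    return store


def clean(
    store: Dict[Tuple[str, str], List[str]],
    keep: Callable[[str, str], bool],
) -> Dict[str, List[str]]:
    """Cleaning module (sketch): keep(identifier, item) decides what survives."""
    result: Dict[str, List[str]] = {}
    for (identifier, _keyword), items in store.items():
        bucket = result.setdefault(identifier, [])
        for item in items:
            if keep(identifier, item) and item not in bucket:
                bucket.append(item)
    return result
```

In a real system `keep` would wrap the preliminary and deep cleaning sub-modules, and `search` would wrap an image search and download step.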
Throughout the process from collection to cleaning of the Asian face data, automatic processing replaces traditionally laborious procedures such as manual labeling and classification. This greatly reduces the time cost of building an Asian face database, addresses problems such as class imbalance in existing face databases, and promotes the development of the corresponding technology.
The above embodiments are preferred embodiments of the present invention, but the present invention is not limited to the above embodiments, and any other changes, modifications, substitutions, combinations, and simplifications which do not depart from the spirit and principle of the present invention should be construed as equivalents thereof, and all such changes, modifications, substitutions, combinations, and simplifications are intended to be included in the scope of the present invention.

Claims (9)

1. A method for automatically collecting and cleaning Asian face data, characterized by comprising the following steps:
presetting a plurality of Asian target character identifications, acquiring the official photo links of the Asian target characters, and constructing a data list, wherein the data list comprises key information for the plurality of Asian target character identifications;
searching, according to the content of the data list, with the Asian target character identification alone and with the character identification plus added keywords, to obtain reference character data, the specific steps comprising:
obtaining the official photo of the Asian target character according to the Asian target character official photo link;
searching with the Asian target character identification alone to obtain the associated reference character data for the identification-only case;
adding a plurality of keywords to the Asian target character identification for searching, and respectively acquiring the associated reference character data for each combination of the identification with a different keyword;
storing the searched reference character data in association with the corresponding Asian target character identification and keywords until all the reference character data has been stored in association;
and cleaning the stored reference character data to obtain the target face data associated with the Asian target characters.
2. The method of claim 1, wherein said Asian target character identifications are different Asian target character names or different Asian target character numbers, said Asian target character official photo links are URL links, and each row of said data list corresponds to one Asian target character identification and its official photo URL link.
3. The Asian face data automatic collection and cleaning method as claimed in claim 1, wherein the storing of the searched reference character data in association with the corresponding Asian target character identifier and keywords comprises the following specific steps:
creating a main folder named after the Asian target character identification, and creating a plurality of subfolders in the main folder, each named after a keyword;
and storing the reference character data obtained through each search mode in the corresponding subfolder.
4. The Asian face data automatic collection and cleaning method as claimed in claim 1, wherein the cleaning of the stored reference character data comprises the following steps:
performing readability inspection and format unification on all reference character data by adopting a picture processing tool, and removing the reference character data which cannot be read and written normally;
deleting repeatedly downloaded reference character data while retaining the official photos of the Asian target characters;
preliminary cleaning: performing face detection on the reference character data using a face detection algorithm to obtain the reference face data after face detection processing;
deep cleaning: examining the reference face data after face detection processing using a face recognition algorithm, updating the Asian target character identification official photo list, and checking whether the reference face data matches the Asian target character identification official photo list; if not, deleting the reference face data after face detection processing; if so, keeping the reference face data after face detection processing as the target face data associated with the Asian target character.
5. The Asian face data automatic collection and cleaning method according to claim 4, wherein the deleting of the repeatedly downloaded reference character data comprises the following specific steps:
after the reference character data has been stored in folders in association with the corresponding Asian target character identification and keywords, determining repeated downloads by whether the file names of the reference character data are the same; if repeated downloads exist, retaining one copy of the repeatedly downloaded reference character data and deleting the rest.
6. The Asian face data automatic collection and cleaning method as claimed in claim 4, wherein the face detection of the reference character data by the face detection algorithm comprises the following steps:
locating the positions of the face key points in the reference character data through the face detection algorithm and detecting a face frame;
if no face frame exists, deleting the reference character data;
if one face frame exists, cropping out the reference face data in the face frame and keeping the reference character data;
if a plurality of face frames exist, cropping out the reference face data in each of the face frames while retaining the reference character data, extracting the face features of the official photo of the corresponding Asian target character and of each item of reference face data, calculating the matching degree between the face features of each item of reference face data and the face features of the official photo of the corresponding Asian target character, and retaining the reference face data with the highest matching degree as the reference face data after face detection processing.
7. The Asian face data automatic collection and cleaning method as claimed in claim 4, wherein the deep cleaning comprises the following specific steps:
extracting, based on a face recognition algorithm, the face features of the official photo of the Asian target character and of the corresponding preliminarily cleaned reference face data;
calculating the matching degree between the face features of the Asian target character's official photo and the face features of each item of the corresponding preliminarily cleaned reference face data, adding the reference face data whose matching degree is greater than or equal to a first preset threshold to a target character official photo list, and updating the target character official photo list;
matching the face features of the remaining preliminarily cleaned reference face data one by one against the face features in the updated target character official photo list, keeping the reference face data whose matching degree is greater than or equal to a second preset threshold, and deleting the rest of the reference face data;
the first preset threshold is greater than a second preset threshold.
8. An Asian face data automatic collection and cleaning system, comprising: the system comprises a data list construction module, a reference character data acquisition module, an association storage module and a reference character data cleaning module;
the data list building module is used for obtaining Asian target character official photo links by presetting a plurality of Asian target character identifications and building a data list;
the reference character data acquisition module is used for acquiring, according to the content of the data list, the reference character data associated with the Asian target character identifications alone and with the character identifications combined with keywords;
the association storage module is used for performing association storage on the reference character data, the corresponding Asian target character identification and the keywords;
the reference character data cleaning module is used for cleaning the stored reference character data to obtain target face data associated with the Asian target characters.
9. The Asian face data automatic collection and cleaning system according to claim 8, wherein the reference character data cleaning module comprises a preliminary cleaning sub-module and a deep cleaning sub-module, the preliminary cleaning sub-module is used for performing face detection on the reference character data by adopting a face detection algorithm to obtain the reference face data after face detection processing, the deep cleaning sub-module is used for detecting the reference face data after face detection processing by adopting a face recognition algorithm, updating the Asian target character identification official photo list, checking whether the reference face data are matched with the Asian target character identification official photo list, and taking the matched reference face data as the target face data associated with the Asian target character.
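The folder layout of claim 3 and the file-name-based duplicate removal of claim 5 can be illustrated with the following sketch. It is an assumption-laden example rather than the claimed implementation: the helper names `make_folders` and `remove_duplicate_downloads` are hypothetical, which copy of a duplicate is kept is arbitrary here (first in sorted path order), and the special handling that retains official photos is left out.

```python
from pathlib import Path
from typing import Iterable, List


def make_folders(root: Path, identifier: str, keywords: Iterable[str]) -> List[Path]:
    """Create <root>/<identifier>/<keyword>/ for every keyword and return the subfolders."""
    main_folder = root / identifier
    subfolders = []
    for keyword in keywords:
        subfolder = main_folder / keyword
        subfolder.mkdir(parents=True, exist_ok=True)
        subfolders.append(subfolder)
    return subfolders


def remove_duplicate_downloads(main_folder: Path) -> List[Path]:
    """Delete files whose name already occurred under the same main folder.

    Two files count as the same download when their file names are equal,
    regardless of which keyword subfolder they were stored in.
    """
    seen = set()
    removed = []
    for path in sorted(p for p in main_folder.rglob("*") if p.is_file()):
        if path.name in seen:
            path.unlink()
            removed.append(path)
        else:
            seen.add(path.name)
    return removed
```

Comparing file names is cheap but only catches duplicates that the downloader names identically; content hashing would be a different technique and is not what claim 5 describes.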
CN201910977959.3A 2019-10-15 2019-10-15 Asian face data automatic collection and cleaning method and system Pending CN110807108A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201910977959.3A CN110807108A (en) 2019-10-15 2019-10-15 Asian face data automatic collection and cleaning method and system
PCT/CN2020/070658 WO2021072998A1 (en) 2019-10-15 2020-01-07 Method and system for automatic collection and cleaning of asian face data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910977959.3A CN110807108A (en) 2019-10-15 2019-10-15 Asian face data automatic collection and cleaning method and system

Publications (1)

Publication Number Publication Date
CN110807108A true CN110807108A (en) 2020-02-18

Family

ID=69488429

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910977959.3A Pending CN110807108A (en) 2019-10-15 2019-10-15 Asian face data automatic collection and cleaning method and system

Country Status (2)

Country Link
CN (1) CN110807108A (en)
WO (1) WO2021072998A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111680202A (en) * 2020-04-24 2020-09-18 烽火通信科技股份有限公司 Body-based face image data collection method and device
CN112418169A (en) * 2020-12-10 2021-02-26 上海芯翌智能科技有限公司 Method and equipment for processing human body attribute data

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115083572B (en) * 2022-07-25 2023-07-21 广州思德医疗科技有限公司 Picture storing and extracting method, system, equipment and medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103984738A (en) * 2014-05-22 2014-08-13 中国科学院自动化研究所 Role labelling method based on search matching
US20170083755A1 (en) * 2014-06-16 2017-03-23 Beijing Sensetime Technology Development Co., Ltd Method and a system for face verification
CN106844412A (en) * 2016-11-02 2017-06-13 厦门中控生物识别信息技术有限公司 A kind of human face data collection method and device
CN108319938A (en) * 2017-12-31 2018-07-24 奥瞳系统科技有限公司 High quality training data preparation system for high-performance face identification system
CN109063784A (en) * 2018-08-23 2018-12-21 深圳码隆科技有限公司 A kind of character costume image data screening technique and its device

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102938065B (en) * 2012-11-28 2017-10-20 北京旷视科技有限公司 Face feature extraction method and face identification method based on large-scale image data
CN106874898B (en) * 2017-04-08 2021-03-30 复旦大学 Large-scale face recognition method based on deep convolutional neural network model
CN109241310B (en) * 2018-07-25 2020-05-01 南京甄视智能科技有限公司 Data duplication removing method and system for human face image database
CN109034106B (en) * 2018-08-15 2022-06-10 北京小米移动软件有限公司 Face data cleaning method and device

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111680202A (en) * 2020-04-24 2020-09-18 烽火通信科技股份有限公司 Body-based face image data collection method and device
CN111680202B (en) * 2020-04-24 2022-04-26 烽火通信科技股份有限公司 Body-based face image data collection method and device
CN112418169A (en) * 2020-12-10 2021-02-26 上海芯翌智能科技有限公司 Method and equipment for processing human body attribute data

Also Published As

Publication number Publication date
WO2021072998A1 (en) 2021-04-22

Similar Documents

Publication Publication Date Title
CN110807108A (en) Asian face data automatic collection and cleaning method and system
CN111753120B (en) Question searching method and device, electronic equipment and storage medium
CN113343012B (en) News matching method, device, equipment and storage medium
CN113205046B (en) Method, system, device and medium for identifying books
WO2021219117A1 (en) Image retrieval method, image retrieval device, image retrieval system and image display system
CN114780370A (en) Data correction method and device based on log, electronic equipment and storage medium
CN105183950B (en) A kind of method and system for consulting engineering drawing based on mobile terminal
CN117952209A (en) Knowledge graph construction method and system
CN114611618A (en) Cross-modal retrieval-oriented data acquisition processing method and system
CN106055636B (en) portable intelligent rock identification method
CN114936840A (en) Intelligent identification method for power business work order information based on image classification and OCR technology
CN113723501A (en) Maximum diversity clustering construction method of pathogenic microorganism reference knowledge base
CN112559785A (en) Bird image recognition system and method based on big data training
CN108334602B (en) Data annotation method and device, electronic equipment and computer storage medium
CN116524263A (en) Semi-automatic labeling method for fine-grained images
CN112597862B (en) Method and equipment for cleaning face data
CN116226108A (en) Data management method and system capable of realizing different management degrees
CN113313178B (en) Cross-domain image example level active labeling method
CN113361395B (en) AI face-changing video detection method based on multitask learning model
CN112364790B (en) Airport work order information identification method and system based on convolutional neural network
CN114118305A (en) Sample screening method, device, equipment and computer medium
CN115033543B (en) Self-service government affair data storage system and self-service government affair terminal
CN113850301B (en) Training data acquisition method and device, model training method and device
TWI497425B (en) Method, apparatus and reptile server for digital image recognition
CN112508102B (en) Text recognition method, device, storage medium and computer equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200218
