CN110807108A - Asian face data automatic collection and cleaning method and system - Google Patents

Asian face data automatic collection and cleaning method and system

Info

Publication number
CN110807108A
Authority
CN
China
Prior art keywords
data
asian
face
target
character
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910977959.3A
Other languages
Chinese (zh)
Inventor
丁长兴
黄英杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China University of Technology SCUT
Priority to CN201910977959.3A (patent CN110807108A)
Priority to PCT/CN2020/070658 (patent WO2021072998A1)
Publication of CN110807108A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/953Querying, e.g. by the use of web search engines
    • G06F16/9535Search customisation based on user profiles and personalisation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/51Indexing; Data structures therefor; Storage structures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/53Querying
    • G06F16/535Filtering based on additional data, e.g. user or group profiles
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/54Browsing; Visualisation therefor
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168Feature extraction; Face representation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Software Systems (AREA)
  • Processing Or Creating Images (AREA)
  • Collating Specific Patterns (AREA)

Abstract

The invention discloses a method and system for automatically collecting and cleaning Asian face data. The method comprises the following steps: presetting a plurality of Asian target person identifiers, obtaining links to the official photos of the Asian target persons, and constructing a data list containing the key information of the identifiers; searching, according to the content of the data list, for reference person data associated with the Asian target person identifiers and keywords; storing the reference person data in association with the corresponding Asian target person identifier and keyword; and cleaning the stored reference person data to obtain target face data associated with the Asian target persons. The invention automates the collection and cleaning of Asian face data, replaces heavy traditional steps such as manual labeling and classification, greatly reduces the labor and time cost of building an Asian face database, and also addresses the class imbalance problem in existing face databases.

Figure 201910977959

Description

A method and system for automatic collection and cleaning of Asian face data

Technical Field

The present invention relates to the technical field of image processing and recognition, and in particular to a method and system for automatically collecting and cleaning Asian face data.

Background

Most existing face recognition technologies use face recognition models trained with deep learning methods. To improve recognition accuracy, a model must be trained on a database containing a large number of labeled face photos. These photos are mainly downloaded from the Internet by web crawlers and must then be labeled and cleaned through tedious manual work. This work places very high demands on computing and storage equipment and requires a large investment of labor and time. Thanks to their unique image resources and funding, the large Internet companies all maintain private large-scale face datasets, but so far very few large public face datasets are freely available to ordinary users. The mainstream public face datasets include Youtube Face, CASIA-WebFace and MS-1M-Celeb.

Moreover, most existing face datasets are dominated by European and American faces and contain only a very small amount of Asian face data. Training a neural network on such a class-imbalanced face dataset can easily leave the network with a latent "racial discrimination" problem. In addition, very few Asian face datasets can currently be obtained online, and those that exist tend to contain few persons and few face images. Building a large-scale Asian face dataset with limited labor and time is therefore of great scientific and commercial value.

Summary of the Invention

To overcome the defects and deficiencies of the prior art, the present invention provides a method and system for automatically collecting and cleaning Asian face data. The collected Asian face photos are cleaned automatically, which keeps the time cost and per-person workload low and makes it possible to build an Asian face database with a high recall rate.

To achieve the above object, the present invention adopts the following technical solutions:

The present invention provides a method for automatically collecting and cleaning Asian face data, comprising the following steps:

presetting a plurality of Asian target person identifiers, obtaining links to the official photos of the Asian target persons, and constructing a data list, the data list containing the key information of the plurality of Asian target person identifiers;

according to the content of the data list, searching with each Asian target person identifier alone and with the identifier plus added keywords to obtain reference person data, specifically comprising:

obtaining the official photo of the Asian target person from the official photo link;

searching with the Asian target person identifier to obtain the associated reference person data for the identifier alone;

searching with the Asian target person identifier combined with various keywords to obtain the associated reference person data for each combination of the identifier and a keyword;

storing the retrieved reference person data in association with the corresponding Asian target person identifier and keyword until all reference person data have been stored;

cleaning the stored reference person data to obtain the target face data associated with the Asian target person.

As a preferred technical solution, the Asian target person identifier is the name of the Asian target person or a numeric label used to distinguish different Asian target persons, the official photo link is a URL, and each row of the data list corresponds to one Asian target person identifier and the URL of the corresponding official photo.

As a preferred technical solution, storing the retrieved reference person data in association with the corresponding Asian target person identifier and keyword specifically comprises:

creating a main folder named after the Asian target person identifier, and creating a plurality of subfolders in the main folder, each named after a keyword;

saving the reference person data obtained by each search method in the corresponding subfolder.

As a preferred technical solution, cleaning the stored reference person data specifically comprises:

using an image processing tool to check the readability of all reference person data and unify their format, and discarding reference person data that cannot be read or written normally;

deleting reference person data that were downloaded more than once, while keeping the official photo of the Asian target person;

preliminary cleaning: performing face detection on the reference person data with a face detection algorithm to obtain face-detected reference face data;

deep cleaning: examining the face-detected reference face data with a face recognition algorithm, updating the official photo list of the Asian target person identifier, and checking whether each item matches the official photo list; if it does not match, the face-detected reference face data is deleted; if it matches, the face-detected reference face data is kept as target face data associated with the Asian target person.

As a preferred technical solution, deleting the repeatedly downloaded reference person data specifically comprises:

after the reference person data have been stored in folders in association with the corresponding Asian target person identifier and keyword, a repeated download is identified by identical file names of the reference person data; if repeated downloads exist, one copy is kept and the other duplicate reference person data are deleted.

As a preferred technical solution, performing face detection on the reference person data with a face detection algorithm specifically comprises:

locating the facial key points in the reference person data and detecting face boxes with the face detection algorithm;

if no face box exists, deleting the reference person data;

if exactly one face box exists, cropping the reference face data inside the face box and keeping the reference person data;

if multiple face boxes exist, cropping the reference face data inside each face box while keeping the reference person data, extracting facial features from the official photo of the corresponding Asian target person and from each cropped reference face, computing the matching degree between the facial features of each reference face and those of the official photo, and keeping the reference person data corresponding to the best-matching reference face as the face-detected reference face data.

As a preferred technical solution, the deep cleaning specifically comprises:

extracting, with the face recognition algorithm, facial features from the official photo of the Asian target person and from the corresponding reference face data after preliminary cleaning;

computing the matching degree between the facial features of the official photo and those of each preliminarily cleaned reference face, adding the reference face data whose matching degree is greater than or equal to a first preset threshold to the official photo list of the target person, and thereby updating the list;

matching the facial features of the remaining preliminarily cleaned reference face data one by one against the facial features in the updated official photo list, keeping the reference face data whose matching degree is greater than or equal to a second preset threshold, and deleting the rest;

the first preset threshold is greater than the second preset threshold.
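The two-threshold deep cleaning described above can be sketched as follows. This is only an illustrative sketch: the text does not say how the "matching degree" is computed, so cosine similarity between feature vectors is assumed here, the threshold values 0.75 and 0.5 are hypothetical, and "matching one by one" is interpreted as taking the best match against any entry in the updated list. The function names `cosine` and `deep_clean` are likewise made up for the example.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity, used here as an assumed 'matching degree'."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def deep_clean(
    official: List[Sequence[float]],
    candidates: Dict[str, Sequence[float]],
    first_threshold: float = 0.75,   # hypothetical value
    second_threshold: float = 0.5,   # hypothetical value
) -> List[str]:
    """Return the names of candidate faces that survive deep cleaning.

    `official` must contain at least one feature vector.
    """
    if first_threshold <= second_threshold:
        raise ValueError("first threshold must exceed second threshold")
    gallery = list(official)  # the 'official photo list' to be expanded
    kept: List[str] = []
    remaining: Dict[str, Sequence[float]] = {}
    # Stage 1: strong matches against the official photo join the list.
    for name, feat in candidates.items():
        if max(cosine(feat, ref) for ref in official) >= first_threshold:
            gallery.append(feat)
            kept.append(name)
        else:
            remaining[name] = feat
    # Stage 2: remaining faces are compared with the updated list
    # using the looser second threshold.
    for name, feat in remaining.items():
        if max(cosine(feat, ref) for ref in gallery) >= second_threshold:
            kept.append(name)
    return kept


kept = deep_clean(
    official=[[1.0, 0.0]],
    candidates={"a": [0.8, 0.6], "b": [0.3, 0.95], "c": [-1.0, 0.0]},
)
print(kept)
```

In this toy input, face "b" is too far from the official photo to pass on its own, but it is close to face "a", which was added to the list in the first stage. This is the effect the updated list is meant to achieve.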

The present invention further provides a system for automatically collecting and cleaning Asian face data, comprising a data list construction module, a reference person data acquisition module, an associated storage module and a reference person data cleaning module;

the data list construction module is configured to preset a plurality of Asian target person identifiers, obtain the official photo links of the Asian target persons, and construct the data list;

the reference person data acquisition module is configured to obtain, according to the content of the data list, the reference person data associated with the Asian target person identifiers and with the identifiers plus keywords;

the associated storage module is configured to store the reference person data in association with the corresponding Asian target person identifier and keyword;

the reference person data cleaning module is configured to clean the stored reference person data to obtain the target face data associated with the Asian target person.

As a preferred technical solution, the reference person data cleaning module comprises a preliminary cleaning submodule and a deep cleaning submodule. The preliminary cleaning submodule is configured to perform face detection on the reference person data with a face detection algorithm to obtain face-detected reference face data. The deep cleaning submodule is configured to examine the face-detected reference face data with a face recognition algorithm, update the official photo list of the Asian target person identifier, check whether each item matches the official photo list, and take the matching reference face data as the target face data associated with the Asian target person.

Compared with the prior art, the present invention has the following advantages and beneficial effects:

(1) The present invention uses an image processing tool to check the readability of all reference person data and unify their format, and deletes repeatedly downloaded reference person data, which makes the subsequent cleaning process smoother and more efficient.

(2) The present invention obtains reference person data with multiple search methods, increasing the diversity and accuracy of the reference person data obtained.

(3) The present invention cleans the data by preliminary cleaning and deep cleaning to obtain the target face data associated with the target person identifier, and during deep cleaning updates the official photo list of the Asian target person identifier, that is, the baseline database used for reference, which improves the accuracy of facial feature comparison.

(4) The entire process of the present invention, from collecting the Asian face data to cleaning it, is automatic. This replaces heavy traditional steps such as manual labeling and classification, greatly reduces the time cost of building an Asian face database, and also addresses problems such as class imbalance in existing face databases.

Brief Description of the Drawings

FIG. 1 is a flow chart of the method for automatically collecting and cleaning Asian face data according to this embodiment;

FIG. 2 is a schematic diagram of the data list used in the method according to this embodiment;

FIG. 3 is a schematic diagram of the result before any cleaning has been performed in the method according to this embodiment;

FIG. 4 is a schematic diagram of the result of the preliminary cleaning in the method according to this embodiment;

FIG. 5 is a schematic diagram of the result of Asian face data collection in the method according to this embodiment.

Detailed Description

To make the objects, technical solutions and advantages of the present invention clearer, the invention is described in further detail below with reference to the drawings and an embodiment. It should be understood that the specific embodiment described here only explains the invention and does not limit it.

Embodiment

As shown in FIG. 1, a method for automatically collecting and cleaning Asian face data comprises the following steps:

S1. Preset a plurality of Asian target person identifiers, obtain the corresponding official photo links, and construct a data list containing the key information of the plurality of Asian target person identifiers;

The Asian target person identifiers in step S1 are the names of the different Asian target persons or preset numeric labels used to distinguish them. For example, Baidu provides a ranking called the "Baidu Baike Star Popularity List", which contains sub-lists such as "Mainland China Male Stars", "Mainland China Female Stars", "Hong Kong, Taiwan and Southeast Asia Male Stars" and "Hong Kong, Taiwan and Southeast Asia Female Stars". The selected lists can be retrieved automatically with crawler technology, and the names of the stars in the Asian-star sub-lists are recorded one by one in a data list; optionally, incrementing integers starting from 0 are used as numeric labels to distinguish the stars. According to this embodiment, using public figures as the Asian target persons has two benefits: a large number of photos of a specified target person can easily be obtained from a search engine via the identifier, and privacy or infringement issues arising from the use of these photos can be avoided;

The official photo link in step S1 is, in this embodiment, the URL of the official photo of the Asian target person as provided for download by Baidu Baike. For example, the "Baidu Baike Star Popularity List" displays the official photos and names of the stars in order of real-time popularity. Clicking a star's official photo or name opens the corresponding Baidu Baike page, and crawler technology can be used to obtain the URL of the official photo shown on that page and record it in the data list.

As shown in FIG. 2, the data list used to obtain reference person data contains the key information of the plurality of Asian target person identifiers. Each row corresponds to one Asian target person identifier and the URL of its official photo; from left to right, the fields are the numeric label, the name and the URL, separated by the tab character '\t';
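As an illustration of the tab-separated layout just described, a minimal parsing sketch is given below. The names `TargetPerson` and `parse_data_list` and the sample rows (placeholder names and example.com URLs) are hypothetical and only serve the example; the patent does not specify any file-reading code.

```python
from typing import List, NamedTuple


class TargetPerson(NamedTuple):
    label: int       # numeric label, e.g. starting from 0
    name: str        # name of the target person
    photo_url: str   # URL of the official photo


def parse_data_list(text: str) -> List[TargetPerson]:
    """Parse rows of the form: label<TAB>name<TAB>url."""
    people: List[TargetPerson] = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # skip blank lines
        fields = line.split("\t")
        if len(fields) != 3:
            raise ValueError(
                f"line {line_no}: expected 3 fields, got {len(fields)}"
            )
        label, name, url = (field.strip() for field in fields)
        people.append(TargetPerson(int(label), name, url))
    return people


# Placeholder rows; real rows would hold actual names and photo URLs.
sample = (
    "0\tPerson A\thttps://example.com/a.jpg\n"
    "1\tPerson B\thttps://example.com/b.jpg\n"
)
people = parse_data_list(sample)
print(people[0].name, people[0].photo_url)
```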

S2. According to the content of the data list, search with each Asian target person identifier alone and with the identifier plus added keywords to obtain reference person data;

Step S2 specifically comprises:

Using the URL of the official photo of the Asian target person, obtain the official photo by computer means, which may be, but are not limited to, a web crawler or a downloader tool;

Search with the Asian target person identifier and obtain at least one item of associated reference person data for the identifier alone. Specifically, if the Asian target person identifier is the name of the target person, the reference person data are photos related to that person. For example, this embodiment may use a Python script to simulate the manual process of searching for and downloading pictures; searching the Baidu image search engine with the Asian target person identifier readily yields a large amount of reference person data;

This embodiment also searches with the target person identifier plus a keyword, obtaining at least one item of associated reference person data for each combination of the identifier and a keyword. The keywords may be, but are not limited to, glasses, hat, or an occupation such as actor or singer, giving queries such as "target person name + glasses", "target person name + hat" and "target person name + occupation". This increases the diversity and accuracy of the reference person data obtained;
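The identifier-plus-keyword searches above amount to building one query string per keyword. A small sketch follows, assuming the query is simply the name and the keyword joined by a space. The function `build_queries` and the folder keys are hypothetical, and the actual search and download step is deliberately left out because the patent does not describe its interface.

```python
from typing import Dict


def build_queries(name: str, keywords: Dict[str, str]) -> Dict[str, str]:
    """Map a folder key to the query string used for that search.

    `keywords` maps a folder key (e.g. "hat") to the keyword text that is
    appended to the person's name. The key "name" is reserved for the
    identifier-only search.
    """
    queries = {"name": name}
    for key, keyword in keywords.items():
        queries[key] = f"{name} {keyword}"
    return queries


queries = build_queries(
    "Person A",
    {"glass": "glasses", "hat": "hat", "job": "singer"},
)
for key, query in queries.items():
    print(key, "->", query)
```

Keeping the folder key next to each query makes it straightforward to store the results of each search in its own subfolder later on.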

S3. Store each item of reference person data in turn in association with the corresponding Asian target person identifier and the added keyword, until all reference person data have been stored;

Specifically, photos are collected and stored separately for each Asian target person identifier in the data list, for example:

Create a main folder named after the Asian target person identifier (for example, the numeric label of the Asian target person), then create a plurality of subfolders inside it, each named after a keyword (for example, the English spelling of the keyword);

Save the reference person data obtained for the Asian target person by each search method in the corresponding subfolder;
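The folder layout described in step S3 could be created as sketched below. The helper name `create_person_folders` is hypothetical, and the subfolder names are borrowed from the example given later in this embodiment (standard, name, hat, glass, job). A temporary directory is used so the sketch leaves nothing behind.

```python
import tempfile
from pathlib import Path
from typing import Iterable


def create_person_folders(
    root: Path, label: int, subfolders: Iterable[str]
) -> Path:
    """Create <root>/<label>/<subfolder> for every subfolder name."""
    main = root / str(label)  # main folder named after the numeric label
    for sub in subfolders:
        (main / sub).mkdir(parents=True, exist_ok=True)
    return main


with tempfile.TemporaryDirectory() as tmp:
    main = create_person_folders(
        Path(tmp), 0, ["standard", "name", "hat", "glass", "job"]
    )
    created = sorted(p.name for p in main.iterdir())
    print(main.name, created)
```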

S4. Automatically clean the stored reference person data in turn to obtain the target face data associated with the Asian target person;

Step S4 specifically comprises:

S41. Use an image processing tool to check the readability of all reference person data and unify their format, discarding the small portion of reference person data that cannot be read or written normally because of download errors, format errors and the like. The image processing tool may be, but is not limited to, image processing software or a programming language such as MATLAB, Python, OpenCV or Photoshop. For example, when pictures are downloaded in bulk from the Internet by a crawler, network fluctuations and anti-crawler mechanisms often cause incomplete or failed downloads. Such pictures generally cannot be read or written normally by software and seriously disrupt the cleaning process, so they should be discarded to improve data processing efficiency. In addition, to simplify later processing and data management, MATLAB may be used to convert all reference person data to the common JPEG format before data cleaning begins;
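A rough stand-in for the readability check in S41 is sketched below. Note that it is not the method named in the text, which relies on an image processing tool actually decoding the file. This sketch only inspects file signatures with the Python standard library to catch truncated downloads and non-image responses. The function name `looks_readable` is hypothetical, and the heuristic may reject valid files that carry trailing bytes.

```python
JPEG_START = b"\xff\xd8"              # JPEG start-of-image marker
JPEG_END = b"\xff\xd9"                # JPEG end-of-image marker
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"  # 8-byte PNG signature
PNG_END = b"IEND\xaeB`\x82"           # IEND chunk type plus its CRC


def looks_readable(data: bytes) -> bool:
    """Heuristic: does the byte string look like a complete JPEG or PNG?

    This checks only the leading signature and the trailing end marker.
    It does not decode the image, so it is a cheap pre-filter rather
    than a replacement for opening the file with an image library.
    """
    if data.startswith(JPEG_START):
        return data.endswith(JPEG_END)
    if data.startswith(PNG_SIGNATURE):
        return data.endswith(PNG_END)
    return False


complete_jpeg = JPEG_START + b"payload" + JPEG_END
truncated_jpeg = JPEG_START + b"payl"
html_error_page = b"<html>blocked</html>"

print(looks_readable(complete_jpeg))
print(looks_readable(truncated_jpeg))
print(looks_readable(html_error_page))
```

Files that pass this pre-filter would still need to be decoded and re-encoded by a real image tool to unify the format as the step requires.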

S42. Delete the repeatedly downloaded reference person data obtained for the Asian target person under different search methods, while keeping the official photo of the Asian target person;

A repeated download is identified by identical file names of the reference person data. For example, to search for a public figure, a main folder is created first. The available search methods include "person's name", "person's name + hat", "person's name + glasses" and "person's name + singer", so subfolders such as name, hat, glass and job are created in the main folder, and the official photo is saved in the standard subfolder. When reference person data are retrieved from the search engine with these different combinations, the engine inevitably returns some reference person data whose names and contents are identical. Such duplicate data can easily cause a neural network to overfit and seriously degrade its face recognition performance, so the purpose of this step is to find all identically named files in the subfolders and keep only one copy of each. To keep the official photo in the standard subfolder, this embodiment may name the official photo downloaded from the URL standard.jpg, while the photos obtained by bulk search and download keep the file names from their source, so that the name of the official photo in the standard subfolder never coincides with that of another picture. Alternatively, the name of the official photo may be left unchanged and an additional check added: if one of the duplicate pictures is located in the standard subfolder, the copy in the standard subfolder is the one kept.

Step S42 specifically includes:

A program records, in turn, the file names of the reference person data in the subfolders associated with the Asian target person; the program may be written in, but is not limited to, MATLAB or Python;

If repeated downloads exist, only one copy of the reference person data is kept and the remaining duplicates are deleted;
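The file-name-based deduplication of step S42 can be sketched in Python as follows. This is a minimal illustration rather than the implementation of the embodiment: the function name `remove_duplicate_downloads` and the `protected` parameter are hypothetical, and the sketch follows the second option described above, in which a duplicate located in the standard subfolder is the copy that is kept.

```python
from pathlib import Path


def remove_duplicate_downloads(main_folder, protected="standard"):
    """Delete files whose names repeat across the subfolders of main_folder.

    Hypothetical helper: the first file seen under a given name is kept,
    and the `protected` subfolder is visited first so that its files are
    never the ones deleted. Returns the list of deleted paths.
    """
    main = Path(main_folder)
    # Visit the protected subfolder first, then the others in name order.
    subfolders = sorted(
        (p for p in main.iterdir() if p.is_dir()),
        key=lambda p: (p.name != protected, p.name),
    )
    seen = set()      # file names already kept
    deleted = []      # paths removed as repeated downloads
    for sub in subfolders:
        for f in sorted(sub.iterdir()):
            if not f.is_file():
                continue
            if f.name in seen:
                f.unlink()            # same name already kept elsewhere
                deleted.append(f)
            else:
                seen.add(f.name)
    return deleted
```

Calling `remove_duplicate_downloads` on the main folder of one target person leaves at most one file per file name across name, hat, glass, job and standard. The comparison uses file names only, as in the embodiment; comparing file contents would be a different technique.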

S43. Face detection is performed on the reference person data using a face detection algorithm, so that the cleaned reference person data contains only reference face data. The face detection work comprises a series of steps including face detection, face correction and face alignment; before cleaning, the reference person data contains both reference face data and reference non-face data;

Step S43 specifically includes:

Face detection is performed on the reference person data using a face detection algorithm to obtain face-detected reference face data. The face detection algorithm may be, but is not limited to, a deep learning method such as MTCNN, which covers face detection, face correction and face alignment. Face detection removes non-face data, for example images that show only glasses or a hat, while face correction and face alignment rectify and align faces that are turned to the side, improving the efficiency of the subsequent face feature matching.

As shown in FIG. 3, before any cleaning the reference person data contains both reference face data and reference non-face data. The preliminary cleaning in this embodiment removes the reference non-face data. Commonly used methods include the MTCNN deep learning method and the face detection toolkit bundled with OpenCV. The principle is that the algorithm detects and locates five facial key points in the photo (the two eyes, the nose and the two mouth corners) and returns face boxes; whether a face box is returned determines whether the reference person data is reference face data. If no face box is returned, the photo is deleted. If exactly one face box is returned, the reference face data inside that box is cropped out and the photo is kept. If more than one face box is returned, the reference face data inside each box is cropped out separately, the face-detected reference person data is kept, and a face recognition algorithm then selects the one piece of reference face data that best matches the official photo of the Asian target person; the remaining reference face data is deleted;

Further, when the face detection algorithm returns more than one face box, a folder named after the photo is created first, and the face in each face box is cropped out and saved into that folder under distinguishable names. A face recognition algorithm then extracts the face features of the official photo of the corresponding Asian target person and of each cropped face in the folder, and the matching degree between each cropped face and the official photo is computed. The cropped face with the highest matching degree is kept and the others are deleted; the kept crop is moved out of the folder to replace the original reference person data, and the folder is deleted. As FIG. 4 shows, the vast majority of the reference non-face data has been removed after preliminary cleaning;
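The decision made in step S43 for zero, one or several face boxes can be sketched as follows. The sketch assumes that face detection and feature extraction have already been done elsewhere and that each cropped face is represented by a feature vector. The embodiment does not name the matching metric, so cosine similarity is used here purely as an assumption, and `resolve_detections` is a hypothetical name.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (assumed matching degree)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def resolve_detections(face_features, official_feature):
    """Return the index of the cropped face to keep, or None to delete the photo.

    face_features: one feature vector per detected face box (hypothetical input).
    official_feature: feature vector of the target person's official photo.
    """
    if not face_features:
        return None                      # no face box: delete the photo
    if len(face_features) == 1:
        return 0                         # one face box: keep that crop
    # Several face boxes: keep the crop that best matches the official photo.
    scores = [cosine_similarity(f, official_feature) for f in face_features]
    return max(range(len(scores)), key=scores.__getitem__)
```

With several faces in one photo, only the crop returned by `resolve_detections` would replace the original reference person data; the other crops would be discarded.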

S44. A face recognition algorithm checks whether the preliminarily cleaned reference face data matches the official photo of the Asian target person identifier. Data that does not match is deleted and data that matches is kept, yielding the target face data associated with the target person. Specifically, preliminary cleaning of the reference person data cannot guarantee that all of the resulting reference face data belongs to the target person, so the reference face data must undergo deep cleaning: face data whose identity is the same as the Asian target person identifier is kept and face data whose identity differs is deleted, which completes the final data cleaning;

Step S44 specifically includes:

The face features of the official photo of the Asian target person and of the corresponding preliminarily cleaned reference face data are extracted using a face recognition algorithm;

The matching degree between the face features of the official photo and those of each piece of preliminarily cleaned reference face data is computed, and reference face data whose matching degree is greater than or equal to a first preset threshold is added to the official photo list of the target person. At this point the list contains not only the official photo but also high-matching photos of the target person obtained by keyword search. This is needed because face data obtained through keyword search tends to match poorly against the official photo obtained solely from the URL link in step S2. For example, the official photo may be a frontal face without accessories such as a hat or glasses, whereas photos found by adding keywords such as "hat" or "glasses" (after face recognition screening) show those accessories, so comparing them with the accessory-free official photo is prone to matching errors. The high-matching photos obtained by keyword search are therefore added to the official photo list, and the updated list improves the accuracy of face feature comparison;

The face features of the remaining preliminarily cleaned reference face data are matched one by one against the data in the updated official photo list. Remaining reference face data whose matching degree is greater than or equal to a second preset threshold is kept, and the rest is deleted.

In this embodiment, a face recognition algorithm extracts, in turn, the feature vector of each piece of preliminarily cleaned reference face data for the Asian target person. The first preset threshold is set to 0.9, and the feature vector of each piece of reference face data is matched against the feature vector of the official photo. The first round selects the reference face data whose matching degree is greater than or equal to 0.9, and these photos are all treated as official photos of the target person. The second preset threshold is then set to 0.7, and the feature vectors of the remaining reference face data are matched one by one against the feature vectors of the photos in the official photo queue produced by the first round. A remaining piece of reference face data is kept if its matching degree with any photo in the queue is greater than or equal to 0.7, and is deleted otherwise. The first and second preset thresholds used in this embodiment can be adjusted to suit the actual situation.
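The two-round screening of step S44 can be sketched as follows, using the thresholds 0.9 and 0.7 of this embodiment as defaults. Feature extraction is assumed to happen elsewhere; the matching degree is again taken to be cosine similarity, which is an assumption, and `deep_clean` and its return format are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (assumed matching degree)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def deep_clean(official, candidates, first_threshold=0.9, second_threshold=0.7):
    """Two-round cleaning of one target person's reference face data.

    official: feature vector of the official photo.
    candidates: dict mapping photo name to feature vector (hypothetical input).
    Returns (promoted, kept, deleted) lists of photo names.
    """
    gallery = [official]          # the official photo list, updated in round one
    promoted, remaining = [], {}
    # Round one: compare every candidate with the official photo only.
    for name, feature in candidates.items():
        if cosine_similarity(feature, official) >= first_threshold:
            promoted.append(name)
            gallery.append(feature)
        else:
            remaining[name] = feature
    kept, deleted = [], []
    # Round two: keep a remaining photo if it matches any photo in the list.
    for name, feature in remaining.items():
        if any(cosine_similarity(feature, g) >= second_threshold for g in gallery):
            kept.append(name)
        else:
            deleted.append(name)
    return promoted, kept, deleted
```

The second round is where the updated list matters: a photo that scores below 0.7 against the official photo alone can still be kept when it scores at least 0.7 against a photo promoted in the first round.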

As shown in FIG. 5, two rounds of cleaning yield an Asian face database of relatively high purity: the vast majority of the photos in the folder of each Asian target person identifier belong to the target person, and noise photos are very few.

This embodiment also provides a system for automatically collecting and cleaning Asian face data, comprising a data list construction module, a reference person data acquisition module, an association storage module and a reference person data cleaning module;

In this embodiment, the data list construction module presets a plurality of Asian target person identifiers, obtains the official photo link of each Asian target person and constructs the data list; the reference person data acquisition module obtains, according to the content of the data list, the reference person data associated with each Asian target person identifier and with the identifier combined with keywords; the association storage module stores the reference person data in association with the corresponding Asian target person identifier and keywords; and the reference person data cleaning module cleans the stored reference person data to obtain the target face data associated with the Asian target person.

In this embodiment, the reference person data cleaning module comprises a preliminary cleaning submodule and a deep cleaning submodule. The preliminary cleaning submodule performs face detection on the reference person data using a face detection algorithm to obtain face-detected reference face data. The deep cleaning submodule examines the face-detected reference face data using a face recognition algorithm, updates the official photo list of the Asian target person identifier, checks whether the reference face data matches that list, and takes the matching reference face data as the target face data associated with the Asian target person.

Throughout the whole process, from collecting the Asian face data to cleaning it, this embodiment replaces laborious traditional steps such as manual labeling and classification with automatic processing. This greatly reduces the time cost of building an Asian face database, addresses problems such as class imbalance in existing face databases, and promotes the development of the related technology.

The above embodiment is a preferred implementation of the present invention, but implementations of the invention are not limited to it. Any other change, modification, substitution, combination or simplification that does not depart from the spirit and principle of the invention is an equivalent replacement and falls within the scope of protection of the invention.

Claims (9)

1. A method for automatically collecting and cleaning Asian face data, characterized by comprising the following steps:
presetting a plurality of Asian target person identifiers, obtaining an official photo link for each Asian target person, and constructing a data list, wherein the data list contains key information for the plurality of Asian target person identifiers;
according to the content of the data list, searching with the Asian target person identifier alone and with the identifier plus added keywords to obtain reference person data, which specifically comprises:
obtaining the official photo of the Asian target person from the official photo link of the Asian target person;
searching with the Asian target person identifier alone to obtain the associated reference person data for the identifier-only case;
adding a plurality of keywords to the Asian target person identifier and searching, to obtain the associated reference person data for each combination of the identifier with a different keyword;
storing the retrieved reference person data in association with the corresponding Asian target person identifier and keywords, until all of the reference person data has been stored in association;
and cleaning the stored reference person data to obtain target face data associated with the Asian target person.
2. The method for automatically collecting and cleaning Asian face data according to claim 1, wherein the Asian target person identifiers are distinct Asian target person names or distinct Asian target person numbers, the official photo links of the Asian target persons are URL links, and each row of the data list corresponds to one Asian target person identifier and its official photo URL link.
3. The method for automatically collecting and cleaning Asian face data according to claim 1, wherein storing the retrieved reference person data in association with the corresponding Asian target person identifier and keywords specifically comprises:
creating a main folder named after the Asian target person identifier, and creating a plurality of subfolders in the main folder, each named after a keyword;
and storing the reference person data obtained by the different search queries in the corresponding subfolders.
4. The method for automatically collecting and cleaning Asian face data according to claim 1, wherein cleaning the stored reference person data specifically comprises:
performing a readability check and format unification on all of the reference person data using an image processing tool, and removing reference person data that cannot be read or written normally;
deleting repeatedly downloaded reference person data while keeping the official photo of the Asian target person;
preliminary cleaning: performing face detection on the reference person data using a face detection algorithm to obtain face-detected reference face data;
deep cleaning: examining the face-detected reference face data using a face recognition algorithm, updating the official photo list of the Asian target person identifier, and checking whether the reference face data matches that list; if it does not match, deleting the face-detected reference face data, and if it matches, keeping the face-detected reference face data as target face data associated with the Asian target person.
5. The method for automatically collecting and cleaning Asian face data according to claim 4, wherein deleting the repeatedly downloaded reference person data specifically comprises:
after the reference person data has been stored in folders in association with the corresponding Asian target person identifier and keywords, identifying repeated downloads by whether the file names of the reference person data are identical; if repeated downloads exist, keeping one of the repeatedly downloaded pieces of reference person data and deleting the rest.
6. The method for automatically collecting and cleaning Asian face data according to claim 4, wherein performing face detection on the reference person data using the face detection algorithm specifically comprises:
locating the facial key points in the reference person data using the face detection algorithm and detecting face boxes;
if no face box exists, deleting the reference person data;
if one face box exists, cropping out the reference face data inside the face box and keeping the reference person data;
if a plurality of face boxes exist, cropping out the reference face data inside each face box separately while keeping the reference person data, extracting the face features of the official photo of the corresponding Asian target person and of each piece of reference face data, computing the matching degree between the face features of each piece of reference face data and those of the official photo, and keeping the reference face data with the highest matching degree as the face-detected reference face data.
7. The method for automatically collecting and cleaning Asian face data according to claim 4, wherein the deep cleaning specifically comprises:
extracting, using a face recognition algorithm, the face features of the official photo of the Asian target person and of the corresponding preliminarily cleaned reference face data;
computing the matching degree between the face features of the official photo of the Asian target person and those of each piece of preliminarily cleaned reference face data, adding the reference face data whose matching degree is greater than or equal to a first preset threshold to the official photo list of the target person, and updating that list;
matching the face features of the remaining preliminarily cleaned reference face data one by one against the face features in the updated official photo list of the Asian target person, keeping the reference face data whose matching degree is greater than or equal to a second preset threshold, and deleting the rest;
wherein the first preset threshold is greater than the second preset threshold.
8. A system for automatically collecting and cleaning Asian face data, characterized by comprising: a data list construction module, a reference person data acquisition module, an association storage module and a reference person data cleaning module;
the data list construction module is configured to preset a plurality of Asian target person identifiers, obtain the official photo link of each Asian target person, and construct a data list;
the reference person data acquisition module is configured to obtain, according to the content of the data list, the reference person data associated with each Asian target person identifier and with the identifier combined with keywords;
the association storage module is configured to store the reference person data in association with the corresponding Asian target person identifier and keywords;
the reference person data cleaning module is configured to clean the stored reference person data to obtain target face data associated with the Asian target person.
9. The system for automatically collecting and cleaning Asian face data according to claim 8, wherein the reference person data cleaning module comprises a preliminary cleaning submodule and a deep cleaning submodule; the preliminary cleaning submodule is configured to perform face detection on the reference person data using a face detection algorithm to obtain face-detected reference face data; and the deep cleaning submodule is configured to examine the face-detected reference face data using a face recognition algorithm, update the official photo list of the Asian target person identifier, check whether the reference face data matches that list, and take the matching reference face data as the target face data associated with the Asian target person.
CN201910977959.3A 2019-10-15 2019-10-15 Asian face data automatic collection and cleaning method and system Pending CN110807108A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201910977959.3A CN110807108A (en) 2019-10-15 2019-10-15 Asian face data automatic collection and cleaning method and system
PCT/CN2020/070658 WO2021072998A1 (en) 2019-10-15 2020-01-07 Method and system for automatic collection and cleaning of asian face data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910977959.3A CN110807108A (en) 2019-10-15 2019-10-15 Asian face data automatic collection and cleaning method and system

Publications (1)

Publication Number Publication Date
CN110807108A true CN110807108A (en) 2020-02-18

Family

ID=69488429

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910977959.3A Pending CN110807108A (en) 2019-10-15 2019-10-15 Asian face data automatic collection and cleaning method and system

Country Status (2)

Country Link
CN (1) CN110807108A (en)
WO (1) WO2021072998A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115083572B (en) * 2022-07-25 2023-07-21 广州思德医疗科技有限公司 Picture storing and extracting method, system, equipment and medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103984738A (en) * 2014-05-22 2014-08-13 中国科学院自动化研究所 Role labelling method based on search matching
US20170083755A1 (en) * 2014-06-16 2017-03-23 Beijing Sensetime Technology Development Co., Ltd Method and a system for face verification
CN106844412A (en) * 2016-11-02 2017-06-13 厦门中控生物识别信息技术有限公司 A kind of human face data collection method and device
CN108319938A (en) * 2017-12-31 2018-07-24 奥瞳系统科技有限公司 High quality training data preparation system for high-performance face identification system
CN109063784A (en) * 2018-08-23 2018-12-21 深圳码隆科技有限公司 A kind of character costume image data screening technique and its device

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102938065B (en) * 2012-11-28 2017-10-20 北京旷视科技有限公司 Face feature extraction method and face identification method based on large-scale image data
CN106874898B (en) * 2017-04-08 2021-03-30 复旦大学 Large-scale face recognition method based on deep convolutional neural network model
CN109241310B (en) * 2018-07-25 2020-05-01 南京甄视智能科技有限公司 Data duplication removing method and system for human face image database
CN109034106B (en) * 2018-08-15 2022-06-10 北京小米移动软件有限公司 Face data cleaning method and device

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111680202A (en) * 2020-04-24 2020-09-18 烽火通信科技股份有限公司 Body-based face image data collection method and device
CN111680202B (en) * 2020-04-24 2022-04-26 烽火通信科技股份有限公司 Body-based face image data collection method and device
CN112418169A (en) * 2020-12-10 2021-02-26 上海芯翌智能科技有限公司 Method and equipment for processing human body attribute data

Also Published As

Publication number Publication date
WO2021072998A1 (en) 2021-04-22

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200218