CN106649610A - Image labeling method and apparatus - Google Patents
Image labeling method and apparatus
- Publication number
- CN106649610A, CN201611073445.8A
- Authority
- CN
- China
- Prior art keywords
- picture
- data
- internet
- mark
- result
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/21—Design, administration or maintenance of databases
- G06F16/215—Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/27—Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/50—Information retrieval; Database structures therefor; File system structures therefor of still image data
- G06F16/58—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/5866—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using information manually generated, e.g. tags, keywords, comments, manually generated location and time information
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/46—Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
- G06V10/462—Salient features, e.g. scale invariant feature transforms [SIFT]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
Abstract
The invention relates to an image labeling method and apparatus. The method comprises the following steps: acquiring Internet image data according to a target task requirement; performing data cleaning on the acquired Internet image data; carrying out image labeling according to the cleaned Internet image data, and receiving a corresponding finishing result after image labeling; and generating a labeling data set according to the finishing result. The apparatus includes an image acquisition unit, an image cleaning unit, an image labeling unit, and a data set generation unit. According to the image labeling method and apparatus of the invention, the quantity and quality of data to be labeled and labeling speed can be improved; the purpose of outputting high-quality data labeled results with fast speed and low cost is achieved; and an effective training data set can be provided for subsequent model training.
Description
Technical field
The present invention relates to the field of Internet technology, and more particularly to an image labeling method and apparatus.
Background
With the spread of the Internet and smart terminals, image data on the Internet keeps growing in both volume and variety. How to make effective use of Internet image data, and how to collect and process it into training samples for the related machine learning and deep learning tasks, has become an important problem.
At present, training samples for a model are usually formed either by finding an available open-source image data set online and using it directly, or by photographing and collecting the relevant images oneself. The data are then screened and reviewed one by one, and finally a data set usable for training is generated.
Clearly, these approaches suffer from relatively homogeneous data types, slow collection, small data volumes, and long processing cycles.
Summary of the invention
To address the defects of existing methods for forming model training samples, namely homogeneous data types, slow collection, small data volume, high cost, and long processing cycles, the present invention proposes the following technical solution:
In one aspect, the present invention provides an image labeling method, comprising:
acquiring Internet image data according to a target task requirement;
performing data cleaning on the acquired Internet image data;
performing image labeling according to the cleaned Internet image data, and receiving the corresponding completion result after the image labeling;
generating a labeled data set according to the completion result.
Optionally, acquiring the Internet image data according to the target task requirement comprises:
acquiring the Internet image data through a preset general data resource platform or a vertical website of the target category.
Optionally, the method further comprises:
when acquiring the Internet image data, comparing the crawled Internet image data with locally stored image data by means of preset identical-image and similar-image search algorithms;
discarding duplicate image data according to the comparison result, and downloading and storing the image resources that do not yet exist locally.
Optionally, performing data cleaning on the acquired Internet image data comprises:
using computer vision and deep learning techniques to clean the saved Internet image data at the content and semantic levels, so as to filter out image data that does not meet a preset requirement.
Optionally, using computer vision and deep learning techniques to clean the saved Internet image data at the content and semantic levels comprises:
recognizing the objects in the Internet image and tagging the image according to the recognized content, so as to filter out image data that does not meet the preset requirement according to the tagging result;
recognizing the content of the Internet image and generating phrases that describe that content, so as to filter out image data that does not meet the preset requirement according to the generated phrases;
detecting the saliency of the objects in the Internet image, and filtering out images that have no salient features at all;
detecting the number of entities appearing in the Internet image, and filtering out images whose entity count exceeds a preset number.
Optionally, detecting the saliency of the objects in the Internet image comprises:
analyzing indicators such as the brightness and contrast of the pixels in the Internet image, and determining the salient regions of the image from the pixel gradient values and statistical principles.
Optionally, performing image labeling according to the cleaned Internet image data comprises:
generating a labeling candidate data set from the cleaned Internet image data, and determining the current tasks to be labeled from the labeling candidate data set;
labeling the current tasks to be labeled according to preset indicators, using the preset task allocation algorithm of the labeling system back end.
Optionally, before the labeled data set is generated according to the completion result of the labeling task, the method further comprises:
reviewing the completion result of the labeling task;
correspondingly, after the review succeeds, generating the labeled data set according to the completion result of the labeling task.
In another aspect, the present invention also provides an image labeling apparatus, comprising:
an image acquisition unit, configured to acquire Internet image data according to a target task requirement;
an image cleaning unit, configured to perform data cleaning on the acquired Internet image data;
an image labeling unit, configured to perform image labeling according to the cleaned Internet image data and to receive the corresponding completion result after the image labeling;
a data set generation unit, configured to generate a labeled data set according to the completion result of the labeling task.
Optionally, the image acquisition unit is specifically configured to acquire the Internet image data through a preset general data resource platform or a vertical website of the target category.
Optionally, the image cleaning unit is specifically configured to use computer vision and deep learning techniques to clean the saved Internet image data at the content and semantic levels, so as to filter out image data that does not meet a preset requirement.
Optionally, the image cleaning unit is further configured to:
recognize the objects in the Internet image and tag the image according to the recognized content, so as to filter out image data that does not meet the preset requirement according to the tagging result;
recognize the content of the Internet image and generate phrases that describe that content, so as to filter out image data that does not meet the preset requirement according to the generated phrases;
detect the saliency of the objects in the Internet image, and filter out images that have no salient features at all;
detect the number of entities appearing in the Internet image, and filter out images whose entity count exceeds a preset number.
Optionally, the image labeling unit is specifically configured to generate a labeling candidate data set from the cleaned Internet image data and to determine the current tasks to be labeled from the labeling candidate data set; and
to label the current tasks to be labeled according to preset indicators, using the preset task allocation algorithm of the labeling system back end.
The image labeling method and apparatus of the present invention acquire Internet image data according to a target task requirement, clean the acquired data, label images based on the cleaned data, receive the corresponding completion result after labeling, and generate a labeled data set from that result. This improves the quantity and quality of the labeled data and the labeling speed, achieves fast, low-cost output of high-quality labeling results, and provides an effective training data set for subsequent model training.
Brief description of the drawings
In order to explain the technical solutions of the embodiments of the present invention or of the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show some embodiments of the present invention, and a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a schematic flowchart of an image labeling method according to an embodiment of the present invention;
Fig. 2 is a schematic flowchart of a method for cleaning Internet image data according to an embodiment of the present invention;
Fig. 3 is a schematic flowchart of a method for crawling and saving Internet image data according to an embodiment of the present invention;
Fig. 4 is a schematic flowchart of a food image labeling method according to another embodiment of the present invention;
Fig. 5 is a schematic structural diagram of an image labeling apparatus according to an embodiment of the present invention.
Detailed description
To make the purpose, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly below with reference to the accompanying drawings. Obviously, the described embodiments are some, but not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on these embodiments without creative effort fall within the scope of protection of the present invention.
Fig. 1 is a schematic flowchart of an image labeling method according to an embodiment of the present invention. As shown in Fig. 1, the method includes:
S1: Acquire Internet image data according to a target task requirement;
As a preferred option of this embodiment, this step may acquire the Internet image data through a preset general data resource platform or a vertical website of the target category.
Specifically, in this step the approximate category of the required entities is determined from the labeling requirement, and the data sources to be searched are made explicit. The search for Internet image data resources can proceed in two directions. The first is to search for the relevant entity keywords on a preset general data resource platform (for example, image search engines such as Baidu and Bing) and obtain general retrieval results. The second is to find vertical websites for the relevant entity category or resource, and to select well-structured image data and related text data from the candidate vertical websites.
It should be noted that an entity here is a relatively abstract general term for objects that have a real form or structure and can be perceived and touched by people. For example, an entity may be a person, such as a teacher or a student; a thing, such as a book or a warehouse; an abstract event, such as a performance or a football match; or a relationship between things, such as a student's needs or a customer's order.
Well-structured image data refers to images that carry classification information and related attribute information. For example, if an image of a cake is displayed on a vertical site under the mousse subcategory of the chocolate category, together with its ingredients and preparation process, then the image has a hierarchical classification structure and preparation attribute information.
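To make the notion of well-structured image data concrete, the cake example can be written down as a small record type. This is only a sketch: the class name, the field names, the placeholder URLs, and the sample attribute values are hypothetical illustrations and are not defined by the patent.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class StructuredImage:
    """One crawled image plus the structure inherited from its vertical site.

    All names and values here are illustrative assumptions.
    """
    url: str
    category_path: List[str]                       # hierarchical classification
    attributes: Dict[str, str] = field(default_factory=dict)  # e.g. ingredients

    def is_well_structured(self) -> bool:
        # "Well-structured" as described above: the image carries both
        # classification information and attribute information.
        return bool(self.category_path) and bool(self.attributes)


cake = StructuredImage(
    url="http://example.com/cake.jpg",             # placeholder URL
    category_path=["chocolate", "mousse"],
    attributes={"ingredients": "made-up list", "process": "made-up steps"},
)
bare = StructuredImage(url="http://example.com/unknown.jpg", category_path=[])
```

Under these assumptions, `cake.is_well_structured()` is true, while `bare`, which has neither a category path nor attributes, would not be selected from a candidate vertical site.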
S2: Perform data cleaning on the acquired Internet image data;
Specifically, after the image data acquisition step is completed, the image data needs to be cleaned.
It can be understood that many data cleaning methods are available, depending on the data requirement. The traditional approach uses the text surrounding the image (the text in the web page containing the image that describes the image content, such as the page title, subtitle, image caption, and the text around the image) to help determine the image content, and filters the image data by the names of the entities to be labeled.
This approach has two problems. First, filtering images by text alone may filter out valid images. Second, images retained because their text is relevant may be completely unrelated in content.
To address the problems of the traditional data cleaning approach, performing data cleaning on the acquired Internet image data in step S2 may include:
S2': Using computer vision and deep learning techniques, clean the saved Internet image data at the content and semantic levels, so as to filter out image data that does not meet the preset requirement;
For example, Fig. 2 shows the flow of a method for cleaning Internet image data according to an embodiment of the present invention. As shown in Fig. 2, step S2' may include:
S21': Image recognition: use deep learning to recognize the objects in the image and tag them. This step takes the top-5 predictions and, according to the label probabilities, filters out images that with a probability above 50% do not belong to the required entity categories;
It should be noted that the top-5 predictions are obtained by sorting all predicted labels for the current image by probability score and taking the first five. It is generally accepted in the industry that top-5 results are more reliable.
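The top-5 filtering rule in S21' can be sketched as follows. The function name and the dictionary of label probabilities are hypothetical, and no particular classifier is implied. The sketch also assumes one reading of the 50% rule: an image is discarded when the top-5 probability mass outside the required categories exceeds 50%.

```python
from typing import Dict, Set


def keep_by_top5(label_probs: Dict[str, float],
                 required: Set[str],
                 threshold: float = 0.5) -> bool:
    """Return True if the image should be kept.

    label_probs maps each predicted label to its probability; it is assumed
    to be the output of some image classifier.
    """
    # Sort all predicted labels by probability and keep the first five.
    top5 = sorted(label_probs.items(), key=lambda kv: kv[1], reverse=True)[:5]
    # Probability mass of the top-5 labels that fall in the required categories.
    p_required = sum(p for label, p in top5 if label in required)
    # Discard when the image is more than `threshold` likely to be off-target.
    return (1.0 - p_required) <= threshold


probs = {"cake": 0.6, "bread": 0.2, "dog": 0.1, "car": 0.05, "tree": 0.03, "sky": 0.02}
kept = keep_by_top5(probs, {"cake", "bread"})
dropped = keep_by_top5(probs, {"dog"})
```

Here `kept` is true because the required categories account for most of the top-5 probability, and `dropped` is false because only a small share does.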
S22': Image semantic understanding: use deep learning to recognize the image content and describe the result in phrases (that is, use a deep learning algorithm to generate phrases describing the image content), extract the top-3 predictions, and, based on the phrase descriptions and the surrounding text of the web page containing the image, filter out images for which the entity name appears in none of the words;
S23': Image saliency detection: detect the saliency of the objects in the image and filter out images that have no salient features at all, that is, images considered to contain no valid entity and no usable image information;
As a preferred option of this embodiment, this step may use an existing computer vision algorithm (such as a saliency detection algorithm) that analyzes indicators such as the brightness and contrast of the pixels in the image and uses pixel gradient values and statistical principles to compute, or predict, which sub-regions of the image appear salient, that is, which are most eye-catching to human vision at first glance.
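A very rough sketch of the gradient-and-statistics idea described above is given below. It is not the saliency detection algorithm of the patent, which is not specified; it is a simplified stand-in that marks a pixel as salient when its gradient magnitude lies well above the image-wide mean. The function names and the thresholds `k` and `min_contrast` are illustrative assumptions.

```python
from typing import List

Gray = List[List[float]]  # grayscale image, values in [0, 255]


def gradient_magnitude(img: Gray) -> Gray:
    """Approximate per-pixel gradient magnitude with forward differences."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            dx = img[y][min(x + 1, w - 1)] - img[y][x]
            dy = img[min(y + 1, h - 1)][x] - img[y][x]
            out[y][x] = (dx * dx + dy * dy) ** 0.5
    return out


def has_salient_region(img: Gray, k: float = 2.0, min_contrast: float = 10.0) -> bool:
    """True if some pixels stand out statistically from the rest of the image."""
    flat = [v for row in gradient_magnitude(img) for v in row]
    mean = sum(flat) / len(flat)
    std = (sum((v - mean) ** 2 for v in flat) / len(flat)) ** 0.5
    if max(flat) < min_contrast:      # nearly uniform image: nothing stands out
        return False
    # A pixel counts as salient when its gradient exceeds mean + k * std.
    return any(v > mean + k * std for v in flat)


uniform = [[128.0] * 8 for _ in range(8)]
spot = [[0.0] * 8 for _ in range(8)]
for y in (3, 4):
    for x in (3, 4):
        spot[y][x] = 255.0            # small bright square on a dark background
```

With these inputs, `has_salient_region(uniform)` is false and `has_salient_region(spot)` is true, so only the uniform image would be filtered out as having no salient features.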
S24': Image entity count detection: detect the number of entities appearing in the image and filter out images containing more than 5 entities, which reduces the complexity of the subsequent labeling task and lightens the labeling workload;
Specifically, this step uses deep learning to estimate the approximate number of entities in an image. The principle is similar to deep-learning-based image classification: images are grouped into broad classes by the number of entities they contain, and the class of a new image is then predicted. For example, images containing 1 entity belong to class 1, images containing 2 entities belong to class 2, and so on.
S3: Perform image labeling according to the cleaned Internet image data, and receive the corresponding completion result after the image labeling;
For example, once data cleaning is complete, a labeling candidate data set can be generated. According to the preset task allocation algorithm of the labeling system back end, the candidate data set is treated as the current tasks to be labeled and is distributed evenly to annotator accounts based on indicators such as the number of annotator accounts and their labeling workload. The corresponding completion results are received after labeling, and rewards or penalties can be applied according to the quantity and quality of the completed tasks.
The rewards and penalties may take the form of experience points or the like, which can be exchanged for prizes such as physical goods.
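The even distribution of tasks across annotator accounts can be sketched with a least-loaded-first strategy. The patent does not specify the task allocation algorithm, so the function below, its name, and its inputs are assumptions chosen only to illustrate balancing by current workload.

```python
import heapq
from typing import Dict, List


def allocate_evenly(tasks: List[str], workload: Dict[str, int]) -> Dict[str, List[str]]:
    """Give each task to the annotator account with the least current work."""
    # Min-heap of (current workload, account); ties break on account name.
    heap = [(load, account) for account, load in workload.items()]
    heapq.heapify(heap)
    assignment: Dict[str, List[str]] = {account: [] for account in workload}
    for task in tasks:
        load, account = heapq.heappop(heap)
        assignment[account].append(task)
        heapq.heappush(heap, (load + 1, account))
    return assignment


result = allocate_evenly(["t1", "t2", "t3", "t4"], {"a": 0, "b": 2})
```

In this example account `a` starts with no work and account `b` with two tasks, so `a` receives three of the four new tasks and both accounts end with the same total workload.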
Specifically, this step may include:
S31: Obtain the preset labeling requirement, which describes the labeling need in detail and includes selected positive and negative sample images as illustrations;
S32: Perform trial labeling according to the requirement, obtain the trial labeling data, compile statistics on the labeling results and annotator feedback, and refine the wording of the labeling requirement;
Specifically, based on annotator feedback, unclear or ambiguous passages in the labeling requirement can be revised and invalid or incorrect requirements removed, making the requirement and the labeling method clearer and more explicit.
S33: Collect and tally the labeling results, and intelligently allocate and issue labeling tasks;
S34: Obtain the validity confirmation result for the labeling results, so that rewards and penalties can be applied according to it;
For example, within a batch labeling task, labeled images are sampled, the confirmation results obtained by human inspection are received, and each label is judged as correct and compliant or not. If more than 98% of the labels in the sample set meet the requirement, the batch is considered acceptable and passes. It should be explained that the human-inspected labeling result is a statistical result: each image is labeled repeatedly by an odd number of annotators, such as 3 or 5. Taking 5 as an example, the label supported by a 3:2, 4:1, or 5:0 vote is taken as correct and assigned to the image, and the remaining labels are treated as wrong. The human-inspected labeling result is this statistical value.
S35: Confirm the valid batch labeling results, and apply rewards and penalties according to the confirmation result;
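The voting and sampling rules described in S34 can be sketched in a few lines. The function names are hypothetical, and treating "more than 98%" as "at least 98%" is an assumption.

```python
from collections import Counter
from typing import List, Optional


def majority_label(votes: List[str]) -> Optional[str]:
    """Label chosen by a strict majority of an odd number of annotators."""
    if len(votes) % 2 == 0:
        raise ValueError("use an odd number of annotators, e.g. 3 or 5")
    label, count = Counter(votes).most_common(1)[0]
    # With 5 annotators a 3:2, 4:1, or 5:0 split yields a winner.
    return label if count > len(votes) // 2 else None


def batch_passes(sample_correct: List[bool], required_ratio: float = 0.98) -> bool:
    """Accept the batch if enough sampled labels were judged correct."""
    return sum(sample_correct) / len(sample_correct) >= required_ratio


winner = majority_label(["mousse", "mousse", "tart", "tart", "mousse"])
no_winner = majority_label(["mousse", "tart", "pie"])
```

Here `winner` is `"mousse"` (a 3:2 vote), while `no_winner` is `None` because no label reaches a strict majority when there are more than two candidate labels. A sample with 99 correct labels out of 100 passes `batch_passes`, and one with 97 out of 100 does not.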
S4: Generate a labeled data set according to the completion result.
It can be understood that the diversity of data processing tasks means that data labeling tasks are equally diverse. In this step, the images to be labeled can be imported into a predefined labeling template and provided to annotator accounts. The initiator of a labeling task can also define a custom labeling template as needed to complete the data labeling for a particular task, and the template can be stored in a template library so that similar labeling work in the future follows a standardized process.
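One way to picture the template mechanism is as a small registry of reusable templates into which images are imported. The classes, method names, template name, and file names below are hypothetical and serve only as a sketch; the patent does not describe the template format.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class LabelTemplate:
    name: str
    instructions: str
    choices: List[str]


class TemplateLibrary:
    """In-memory stand-in for the template library mentioned above."""

    def __init__(self) -> None:
        self._templates: Dict[str, LabelTemplate] = {}

    def save(self, template: LabelTemplate) -> None:
        # Store (or overwrite) a template so later tasks can reuse it.
        self._templates[template.name] = template

    def build_tasks(self, name: str, image_urls: List[str]) -> List[dict]:
        """Import the images to be labeled into the named template."""
        t = self._templates[name]
        return [
            {"image": url, "instructions": t.instructions, "choices": t.choices}
            for url in image_urls
        ]


library = TemplateLibrary()
library.save(LabelTemplate("dish-check", "Does the image show the named dish?", ["yes", "no"]))
tasks = library.build_tasks("dish-check", ["img_001.jpg", "img_002.jpg"])
```

Each entry in `tasks` pairs one image with the shared instructions and answer choices, which is what would be presented to an annotator account.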
The image labeling method of this embodiment acquires Internet image data according to a target task requirement, cleans the acquired data, labels images based on the cleaned data, receives the corresponding completion result after labeling, and generates a labeled data set from that result. This improves the quantity and quality of the labeled data and the labeling speed, achieves fast, low-cost output of high-quality labeling results, and provides an effective training data set for subsequent model training.
Further, as a preferred option of the above method embodiment, acquiring Internet image data according to the target task requirement in step S1 may also include:
S1': Crawl and save the Internet image data using web crawler technology;
As a preferred option of this embodiment, this step may compare the crawled Internet image data with locally stored image data by means of preset identical-image and similar-image search algorithms, discard duplicate image data according to the comparison result, and download and store image resources that do not yet exist locally.
Specifically, after the image data resources to be crawled and downloaded are selected, crawling tools and techniques such as web crawlers are used to obtain the Internet data. Because the volume of Internet image data is large, much image data may be downloaded repeatedly, which wastes bandwidth and, through duplicate storage, takes up additional storage space. To address this, this step uses identical-image and similar-image search algorithms to compare local and Internet image data resources, so that duplicate data is discarded and image resources not yet in the local library are downloaded and stored. Specifically, Fig. 3 shows the flow of a method for crawling and saving Internet image data according to an embodiment of the present invention. As shown in Fig. 3, this step may include:
S11': Build a database for the downloaded and labeled data, and save the relevant image information, such as the source, attributes (width and height, format, file size, EXIF information, etc.), and the unique code of the image used for similar-image and identical-image search;
S12': For the URL of each image to be acquired, look up data with the same URL in the database and, if found, use the valid image data already saved in the database;
S13': For URLs that have no match in the database, download the missing data, generate a unique code for the image using a preset coding algorithm (such as consistent hashing, cryptographic hash algorithms, or MD5/SHA1/SHA256), and compare it with the codes of the image data in the database as follows:
If more than 50% of the code of the downloaded image matches the code of some image in the library, the data is considered a similar image;
If 100% matches, it is considered an identical image;
Here, a similar image is one obtained by simple image operations on the original, such as cropping, rotation, adding text, or scaling; an identical image is an exact duplicate.
S14': Build the database from the final filtered results, and save the relevant image information.
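The code comparison in S13' can be sketched as follows. One caveat is worth stating plainly: with cryptographic hashes such as MD5 or SHA-256, partial agreement between two digests says nothing about visual similarity, so this sketch swaps in a simple average hash computed from a small grayscale thumbnail. That substitution, the function names, and the thumbnail inputs are assumptions; only the 50% and 100% thresholds come from the text above.

```python
from typing import List

Gray = List[List[int]]  # small grayscale thumbnail, values 0-255


def average_hash(thumb: Gray) -> str:
    """Bit string with one bit per pixel: 1 if the pixel is above the mean."""
    flat = [v for row in thumb for v in row]
    mean = sum(flat) / len(flat)
    return "".join("1" if v > mean else "0" for v in flat)


def agreement(code_a: str, code_b: str) -> float:
    """Fraction of positions at which two equal-length codes agree."""
    if len(code_a) != len(code_b):
        raise ValueError("codes must have the same length")
    return sum(a == b for a, b in zip(code_a, code_b)) / len(code_a)


def classify(new_code: str, stored_codes: List[str]) -> str:
    """Return 'identical', 'similar', or 'new' using the thresholds in S13'."""
    best = max((agreement(new_code, c) for c in stored_codes), default=0.0)
    if best == 1.0:
        return "identical"
    if best > 0.5:
        return "similar"
    return "new"


original = average_hash([[0, 0, 255, 255], [0, 0, 255, 255]])
edited = average_hash([[0, 0, 255, 255], [0, 0, 255, 0]])     # one pixel changed
inverted = average_hash([[255, 255, 0, 0], [255, 255, 0, 0]])
stored = [original]
```

Against the stored code, `classify(original, stored)` gives `"identical"`, `classify(edited, stored)` gives `"similar"`, and `classify(inverted, stored)` gives `"new"`. In a pipeline following S12' and S13', images classified as identical or similar would be discarded and only new ones written to the database.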
The present invention is illustrated below with a food recognition embodiment, which does not limit the scope of protection of the present invention.
Food recognition is the process of inferring the name of a dish from its appearance in an image. Fig. 4 shows the flow of a food image labeling method according to another embodiment of the present invention. As shown in Fig. 4, the method includes:
A1: Acquire Internet image data according to the food recognition requirement;
For example, the requirement is to find dish names and a large amount of food image data corresponding to each name, with the amount of image data for each dish reaching several hundred or more.
According to this requirement, a list of dish names (the entity list) can first be searched for and filtered from a number of websites. The dishes in the list are then used as entity keywords in general image search engines (such as Baidu and Bing), and the quality of the retrieved data is examined. If the appearance of a dish is fairly consistent across results and the data volume is large, the results can be regarded as a satisfactory candidate data resource;
Vertical food websites (such as Xiachufang and Meishijie) can also be searched for on the Internet. Within each site, the dish entity names, image resources, and text data are searched, the sites are screened by data quality and by how well structured the related text data is, and the specific categories in the dish list are then screened in turn.
A2: Build an image database in advance using HBase (an open-source, non-relational distributed database, NoSQL) to store the downloaded and labeled data, saving information such as the image source URL, width and height, size, format, and hash code.
After the food image data resources to be crawled and downloaded are selected, crawling tools and techniques such as a web crawler are used to obtain the image data from the search engine results and the vertical food sites. The specific steps include:
A21: Crawl the image URLs and look up data with the same URL in the database; if matching data is found, use the valid image data;
A22: Download the data for URLs that do not exist in the database, generate a unique code for each image using a hash algorithm, and compare it with the hash codes of the image data in the image database:
If more than 50% of the code of the downloaded image matches the hash code of some image in the database, the data is considered a similar image;
If 100% matches, it is considered an identical image.
Here, a similar image is one obtained by simple image operations on the original, such as cropping, rotation, adding text, or scaling; an identical image is an exact duplicate.
A23: Discard the images that meet the above conditions, and write the remaining results of the final screening to the image database.
A3: After the food image data has been downloaded and pre-screened, the food image data needs to be cleaned, which specifically includes:
A31: Image recognition: use deep learning to recognize the objects in the image and assign them coarse classification labels. This step takes the top-5 predictions and, according to the probabilities of the coarse labels, filters out images that with a probability above 50% do not belong to the required coarse categories;
A32: Image semantic understanding: use deep learning to recognize the image content and describe it in phrases. This step takes the top-3 predictions and, based on the phrase descriptions and the text surrounding the food image, filters out images for which the dish name appears in none of the words;
A33: Image saliency detection: detect the saliency of the objects in the crawled images and filter out images that have no salient features at all, that is, images that contain no valid entity and no usable image information;
A34: Image entity count detection: detect the number of entities appearing in the image and filter out images containing more than 5 entities, which reduces the complexity of the subsequent labeling task and lightens the labeling workload;
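The checks in A32 and A34 can be chained as simple predicates over precomputed model outputs, as in the sketch below. The `Candidate` record, its field names, and the sample values are hypothetical, and the captions and entity counts are assumed to come from models that are not shown. The sketch also assumes that an image is kept when the dish name appears in any top-3 phrase or in the surrounding page text.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Candidate:
    dish_name: str        # entity name the image was crawled for
    captions: List[str]   # generated phrases, best first (model output assumed)
    page_text: str        # text surrounding the image on its web page
    entity_count: int     # predicted number of entities in the image


def name_is_mentioned(c: Candidate) -> bool:
    """A32: keep if the dish name occurs in a top-3 caption or the page text."""
    texts = c.captions[:3] + [c.page_text]
    return any(c.dish_name.lower() in t.lower() for t in texts)


def few_enough_entities(c: Candidate, limit: int = 5) -> bool:
    """A34: keep images with at most `limit` entities."""
    return c.entity_count <= limit


def clean(candidates: List[Candidate],
          checks: List[Callable[[Candidate], bool]]) -> List[Candidate]:
    """Keep only the candidates that pass every check."""
    return [c for c in candidates if all(check(c) for check in checks)]


candidates = [
    Candidate("mousse", ["a chocolate mousse in a glass"], "dessert recipe", 1),
    Candidate("mousse", ["a dog on a sofa"], "pet photos", 1),
    Candidate("mousse", ["a table of mousse cups"], "party menu", 9),
]
kept = clean(candidates, [name_is_mentioned, few_enough_entities])
```

Only the first candidate survives: the second never mentions the dish name, and the third contains more than 5 entities. The recognition check (A31) and the saliency check (A33) could be added to the same list of predicates.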
A4: After data cleaning, generate the food labeling candidate data set, which consists of a dish list and the set of image data corresponding to each dish in the list.
A labeling system is built in advance in this step for screening the data to be labeled. The system has a user interface and a back-end administration interface, from which an administrator can distribute labeling tasks using the task allocation algorithm, so that the current tasks to be labeled are distributed evenly to annotator accounts based on indicators such as the number of annotators and their labeling workload, and rewards or penalties are applied according to the quantity and quality of the completed tasks. This specifically includes:
A41: Obtain the preset food labeling requirement, which describes the labeling need in detail and includes selected positive and negative sample images as illustrations;
A42: Perform trial labeling according to the requirement, obtain the trial labeling data, compile statistics on the labeling results and annotator feedback, and refine the wording of the labeling requirement;
A43: Collect and tally the labeling results, and intelligently allocate and issue labeling tasks;
A44: Check and review the labeling results to determine their validity;
A45: Obtain the validity confirmation result for the labeling results, so that rewards and penalties can be applied according to it;
A5: Generate the annotated data set according to the annotation results;
Specifically, the diversity of data processing tasks means that data annotation tasks are equally diverse. In this step, pictures to be annotated can be imported using a predefined annotation template and provided to annotator accounts. The initiator of an annotation task can also define a custom annotation template as needed to complete the data annotation for a particular task. The template can be stored in a template library so that similar annotation work later on follows a standardized procedure.
Fig. 5 is a schematic structural diagram of a picture annotation apparatus according to an embodiment of the present invention. As shown in Fig. 5, the apparatus includes:
a picture acquiring unit 10, a picture cleaning unit 20, a picture annotation unit 30 and a data set generating unit 40, wherein:
The picture acquiring unit 10 is configured to acquire Internet picture data according to target task requirements;
As a preferred implementation of this embodiment, the picture acquiring unit 10 can obtain the Internet picture data through a preset general data resource platform or a vertical website of the target category.
Specifically, the search for Internet picture data resources can be divided into two directions. The first is to search for related entity keywords on a preset general data resource platform (such as the image search engines of Baidu, Bing, etc.) and obtain general retrieval results. The second is to find vertical websites for the related entity category or resource, and to select well-structured picture data and related text data from the candidate vertical websites.
The picture cleaning unit 20 is configured to perform data cleaning on the acquired Internet picture data;
Further, as a preferred implementation of the above apparatus embodiment, the picture cleaning unit 20 can be specifically configured to use computer vision and deep learning techniques to clean the saved Internet picture data at the content and semantic levels, so as to filter out picture data that does not meet preset requirements.
As an optional implementation, the picture cleaning unit 20 can be configured to recognize objects in the Internet picture and label the Internet picture according to the recognized content, so as to filter out picture data that does not meet the preset requirements according to the labeling result;
For example, deep learning is used to recognize the objects in a picture and label them. In this step the top-5 predictions are taken and, according to the probability values of the labels, pictures that, with a probability above 50%, do not belong to the required entity categories are filtered out;
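The top-5 label filter described above can be illustrated with a minimal Python sketch. It assumes a classifier (not shown) has already produced `(label, probability)` pairs for a picture, and it reads the 50% rule as "the probability mass of top-5 labels outside the required categories exceeds 0.5"; this reading, the function name `keep_by_top5` and the parameter `max_off_target` are illustrative assumptions, not details given in the text.

```python
from typing import List, Set, Tuple

Prediction = Tuple[str, float]  # (label, probability) from a hypothetical classifier


def keep_by_top5(predictions: List[Prediction],
                 required: Set[str],
                 max_off_target: float = 0.5) -> bool:
    """Return True if the picture should be kept.

    Assumption: a picture is filtered out when more than `max_off_target`
    of the top-5 probability mass falls on labels outside `required`.
    """
    # Keep only the five most probable labels.
    top5 = sorted(predictions, key=lambda p: p[1], reverse=True)[:5]
    # Sum the probability assigned to labels that are not required.
    off_target = sum(prob for label, prob in top5 if label not in required)
    return off_target <= max_off_target
```

A picture whose top-5 labels are dominated by required categories is kept; one dominated by unrelated labels is dropped. The threshold is exposed as a parameter because the text fixes it at 50% only as an example.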
Further, the picture cleaning unit 20 can also be configured to recognize the content of the Internet picture and generate phrases describing that content, so as to filter out picture data that does not meet the preset requirements according to the phrase generation result;
For example, deep learning is used to recognize the picture content and the result is described with phrases (that is, a deep learning algorithm generates phrases describing the picture content). The top-3 predictions are extracted and, based on the phrase descriptions, pictures are filtered out when 100% of the words appear neither in the entity name nor in the surrounding text of the web page the picture belongs to;
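The phrase-based filter can be sketched as follows. This is only an illustration under stated assumptions: the captions are taken to come from some captioning model (not shown), ordered best first, and words are split with a simple regular expression that suits English text; Chinese text would need a proper word segmenter. The names `tokenize` and `keep_by_captions` are hypothetical.

```python
import re
from typing import List, Set


def tokenize(text: str) -> Set[str]:
    """Lower-case the text and split it into alphanumeric words (assumed tokenizer)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def keep_by_captions(captions: List[str],
                     entity_name: str,
                     surrounding_text: str) -> bool:
    """Return True if the picture should be kept.

    The picture is filtered out only when none of the words in the top-3
    captions occurs in the entity name or in the surrounding page text.
    """
    top3 = captions[:3]
    reference = tokenize(entity_name) | tokenize(surrounding_text)
    caption_words = set().union(*(tokenize(c) for c in top3))
    return bool(caption_words & reference)
```

In practice common function words ("a", "of", "on") would probably be removed before matching, since they can produce accidental overlaps; that step is omitted here to keep the sketch short.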
Further, the picture cleaning unit 20 can also be configured to detect the saliency level of the objects in the Internet picture and filter out pictures entirely without salient features;
For example, the saliency level of the objects in the captured picture is detected, and pictures entirely without salient features are filtered out, that is, pictures in which no valid entity exists and no valid picture information is available;
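For the saliency check performed by the picture cleaning unit 20, the claims mention analyzing pixel brightness and contrast and using pixel gradient values together with statistical principles. The sketch below is one simple way to do that, not the patent's actual algorithm: it treats the standard deviation of brightness as the contrast measure and the largest finite-difference gradient as the edge measure. The thresholds `min_contrast` and `min_edge` and both function names are assumptions chosen for illustration.

```python
from statistics import pstdev
from typing import List

Gray = List[List[float]]  # grayscale picture, brightness values in [0, 255]


def gradient_magnitudes(img: Gray) -> List[float]:
    """Central-difference gradient magnitude for every pixel (borders clamped)."""
    h, w = len(img), len(img[0])
    mags = []
    for y in range(h):
        for x in range(w):
            gx = img[y][min(x + 1, w - 1)] - img[y][max(x - 1, 0)]
            gy = img[min(y + 1, h - 1)][x] - img[max(y - 1, 0)][x]
            mags.append((gx * gx + gy * gy) ** 0.5)
    return mags


def has_salient_region(img: Gray,
                       min_contrast: float = 10.0,
                       min_edge: float = 30.0) -> bool:
    """Return False for pictures that show no salient structure at all."""
    pixels = [v for row in img for v in row]
    # Global contrast: spread of brightness over the whole picture.
    if pstdev(pixels) < min_contrast:
        return False
    # At least one sufficiently strong edge must exist.
    return max(gradient_magnitudes(img)) >= min_edge
```

A uniformly colored picture fails the contrast test and is filtered out, while a picture containing a bright object on a dark background passes both tests.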
Further, the picture cleaning unit 20 can also be configured to detect the number of entities appearing in the Internet picture and filter out pictures whose entity count exceeds a preset number.
For example, the number of entities appearing in a picture is detected, and pictures containing more than 5 entities are filtered out, which reduces the complexity of the subsequent annotation task and lightens the annotation workload;
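The entity count rule can be sketched in a few lines. The sketch assumes an object detector (not shown) returns `(label, score)` pairs per picture and that only detections above a confidence threshold count as entities; the threshold `min_score` and the function names are illustrative assumptions, while the limit of 5 comes from the example in the text.

```python
from typing import List, Tuple

Detection = Tuple[str, float]  # (label, confidence score) from a hypothetical detector


def count_entities(detections: List[Detection], min_score: float = 0.5) -> int:
    """Count detections that are confident enough to be treated as entities."""
    return sum(1 for _, score in detections if score >= min_score)


def keep_by_entity_count(detections: List[Detection],
                         max_entities: int = 5,
                         min_score: float = 0.5) -> bool:
    """Keep the picture only if it contains at most `max_entities` entities."""
    return count_entities(detections, min_score) <= max_entities
```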
The picture annotation unit 30 is configured to perform picture annotation according to the cleaned Internet picture data, and to receive the corresponding completion result after picture annotation;
Further, as a preferred implementation of this embodiment, the picture annotation unit 30 can be specifically configured to generate an annotation candidate data set from the cleaned Internet picture data, and to determine the current task to be annotated according to the annotation candidate data set; and,
according to a preset task allocation algorithm of the annotation system back end, to have the current task to be annotated annotated according to preset indicators.
Specifically, once data cleaning is complete, the annotation candidate data set can be generated. According to the preset task allocation algorithm of the annotation system back end, the annotation candidate data set is taken as the current task to be annotated and issued evenly to annotator accounts according to indicators such as the number of annotator accounts and their annotation workload. The corresponding completion result is received after picture annotation, and related rewards and penalties can be applied according to the quantity and quality of completed tasks.
The rewards and penalties may take the form of experience points or the like, which can be exchanged for prizes such as physical goods.
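The text does not specify the task allocation algorithm beyond issuing tasks evenly according to the number of annotator accounts and their workload. One common way to achieve this is a greedy least-loaded assignment, sketched below with a min-heap; the function name `allocate_tasks` and the representation of workload as a count of pending tasks are assumptions made for illustration.

```python
import heapq
from typing import Dict, List


def allocate_tasks(task_ids: List[str],
                   current_workload: Dict[str, int]) -> Dict[str, List[str]]:
    """Assign each task to the annotator account with the smallest workload.

    `current_workload` maps an account name to its number of pending tasks.
    Ties are broken by account name, which keeps the result deterministic.
    """
    if not current_workload:
        raise ValueError("at least one annotator account is required")
    heap = [(load, account) for account, load in current_workload.items()]
    heapq.heapify(heap)
    assignment: Dict[str, List[str]] = {account: [] for account in current_workload}
    for task in task_ids:
        load, account = heapq.heappop(heap)   # least-loaded account so far
        assignment[account].append(task)
        heapq.heappush(heap, (load + 1, account))
    return assignment
```

With this scheme an account that already has pending work receives fewer new tasks, so the total workload across accounts ends up differing by at most one task once the initial imbalance has been absorbed.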
The data set generating unit 40 is configured to generate an annotated data set according to the completion result of the annotation task.
Further, as a preferred implementation of the above apparatus embodiment, the apparatus also includes a result auditing unit, which can be used to audit the completion result of the annotation task;
Correspondingly, after the audit succeeds, the data set generating unit 40 can further be configured to generate the annotated data set according to the completion result of the annotation task.
The picture annotation apparatus of this embodiment acquires Internet picture data according to target task requirements, performs data cleaning on the acquired Internet picture data, performs picture annotation according to the cleaned Internet picture data, receives the corresponding completion result after picture annotation, and generates an annotated data set according to the completion result. This can improve the quantity and quality of annotated data and the annotation speed, achieves the goal of producing high-quality data annotation results quickly and at low cost, and can provide an effective training data set for future model training.
It should be noted that, since the apparatus embodiment is substantially similar to the method embodiment, its description is relatively brief; for related details, refer to the description of the method embodiment.
The above embodiments are only intended to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments can still be modified, or some of their technical features can be replaced by equivalents, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.
Claims (13)
1. A picture annotation method, characterized by comprising:
acquiring Internet picture data according to target task requirements;
performing data cleaning on the acquired Internet picture data;
performing picture annotation according to the cleaned Internet picture data, and receiving the corresponding completion result after picture annotation;
generating an annotated data set according to the completion result.
2. The method according to claim 1, characterized in that acquiring Internet picture data according to target task requirements comprises:
obtaining the Internet picture data through a preset general data resource platform or a vertical website of the target category.
3. The method according to claim 1 or 2, characterized in that the method further comprises:
when acquiring the Internet picture data, comparing the captured Internet picture data with locally stored picture data by means of preset identical-picture and similar-picture search algorithms;
discarding duplicate picture data according to the comparison result, and downloading and storing into the database the picture resources that do not yet exist locally.
4. The method according to claim 1, characterized in that performing data cleaning on the acquired Internet picture data comprises:
using computer vision and deep learning techniques to clean the saved Internet picture data at the content and semantic levels, so as to filter out picture data that does not meet preset requirements.
5. The method according to claim 4, characterized in that using computer vision and deep learning techniques to clean the saved Internet picture data at the content and semantic levels comprises:
recognizing objects in the Internet picture, and labeling the Internet picture according to the recognized content, so as to filter out picture data that does not meet the preset requirements according to the labeling result;
recognizing the content of the Internet picture, and generating phrases describing the content of the Internet picture, so as to filter out picture data that does not meet the preset requirements according to the phrase generation result;
detecting the saliency level of the objects in the Internet picture, and filtering out pictures entirely without salient features;
detecting the number of entities appearing in the Internet picture, and filtering out pictures whose entity count exceeds a preset number.
6. The method according to claim 4, characterized in that detecting the saliency level of the objects in the Internet picture comprises:
analyzing the brightness and contrast indicators of the pixels in the Internet picture, and determining the salient region of the Internet picture according to the gradient values of the pixels and statistical principles.
7. The method according to claim 1, characterized in that performing picture annotation according to the cleaned Internet picture data comprises:
generating an annotation candidate data set from the cleaned Internet picture data, and determining the current task to be annotated according to the annotation candidate data set;
according to a preset task allocation algorithm of the annotation system back end, having the current task to be annotated annotated according to preset indicators.
8. The method according to any one of claims 1-2 and 4-7, characterized in that, before generating the annotated data set according to the completion result of the annotation task, the method further comprises:
auditing the completion result of the annotation task;
correspondingly, after the audit succeeds, generating the annotated data set according to the completion result of the annotation task.
9. A picture annotation apparatus, characterized by comprising:
a picture acquiring unit, configured to acquire Internet picture data according to target task requirements;
a picture cleaning unit, configured to perform data cleaning on the acquired Internet picture data;
a picture annotation unit, configured to perform picture annotation according to the cleaned Internet picture data, and to receive the corresponding completion result after picture annotation;
a data set generating unit, configured to generate an annotated data set according to the completion result of the annotation task.
10. The apparatus according to claim 9, characterized in that the picture acquiring unit is specifically configured to obtain the Internet picture data through a preset general data resource platform or a vertical website of the target category.
11. The apparatus according to claim 9, characterized in that the picture cleaning unit is specifically configured to use computer vision and deep learning techniques to clean the saved Internet picture data at the content and semantic levels, so as to filter out picture data that does not meet preset requirements.
12. The apparatus according to claim 11, characterized in that the picture cleaning unit is further configured to:
recognize objects in the Internet picture, and label the Internet picture according to the recognized content, so as to filter out picture data that does not meet the preset requirements according to the labeling result;
recognize the content of the Internet picture, and generate phrases describing the content of the Internet picture, so as to filter out picture data that does not meet the preset requirements according to the phrase generation result;
detect the saliency level of the objects in the Internet picture, and filter out pictures entirely without salient features;
detect the number of entities appearing in the Internet picture, and filter out pictures whose entity count exceeds a preset number.
13. The apparatus according to claim 9, characterized in that the picture annotation unit is specifically configured to generate an annotation candidate data set from the cleaned Internet picture data, and to determine the current task to be annotated according to the annotation candidate data set; and,
according to a preset task allocation algorithm of the annotation system back end, to have the current task to be annotated annotated according to preset indicators.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201611073445.8A CN106649610A (en) | 2016-11-29 | 2016-11-29 | Image labeling method and apparatus |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201611073445.8A CN106649610A (en) | 2016-11-29 | 2016-11-29 | Image labeling method and apparatus |
Publications (1)
Publication Number | Publication Date |
---|---|
CN106649610A (en) | 2017-05-10 |
Family
ID=58814135
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201611073445.8A Pending CN106649610A (en) | 2016-11-29 | 2016-11-29 | Image labeling method and apparatus |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106649610A (en) |
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1936892A (en) * | 2006-10-17 | 2007-03-28 | 浙江大学 | Image content semanteme marking method |
CN102375845A (en) * | 2010-08-19 | 2012-03-14 | 腾讯科技(深圳)有限公司 | Picture searching method and system |
US20120203764A1 (en) * | 2011-02-04 | 2012-08-09 | Wood Mark D | Identifying particular images from a collection |
CN103425715A (en) * | 2012-05-25 | 2013-12-04 | 百度在线网络技术(北京)有限公司 | Method and system for confirming text annotations of pictures |
CN103793697A (en) * | 2014-02-17 | 2014-05-14 | 北京旷视科技有限公司 | Identity labeling method of face images and face identity recognition method of face images |
CN105094760A (en) * | 2014-04-28 | 2015-11-25 | 小米科技有限责任公司 | Picture marking method and device |
CN104021207A (en) * | 2014-06-18 | 2014-09-03 | 厦门美图之家科技有限公司 | Food information providing method based on image |
Non-Patent Citations (1)
Title |
---|
杨阳 等: "基于深度学习的图像自动标注算法", 《数据采集与处理》 * |
Cited By (28)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107368565A (en) * | 2017-07-10 | 2017-11-21 | 美的集团股份有限公司 | Data processing method, data processing equipment and computer-readable recording medium |
CN107423815A (en) * | 2017-08-07 | 2017-12-01 | 北京工业大学 | A kind of computer based low quality classification chart is as data cleaning method |
CN107423815B (en) * | 2017-08-07 | 2020-07-31 | 北京工业大学 | Low-quality classified image data cleaning method based on computer |
CN107657269A (en) * | 2017-08-24 | 2018-02-02 | 百度在线网络技术(北京)有限公司 | A kind of method and apparatus for being used to train picture purification model |
CN107943588A (en) * | 2017-11-22 | 2018-04-20 | 用友金融信息技术股份有限公司 | Data processing method, system, computer equipment and readable storage medium storing program for executing |
US11182593B2 (en) | 2017-11-30 | 2021-11-23 | Guangdong Oppo Mobile Telecommunications Corp., Ltd. | Image processing method, computer device, and computer readable storage medium |
WO2019105456A1 (en) * | 2017-11-30 | 2019-06-06 | Guangdong Oppo Mobile Telecommunications Corp., Ltd. | Image processing method, computer device, and computer readable storage medium |
CN108171699A (en) * | 2018-01-11 | 2018-06-15 | 平安科技(深圳)有限公司 | Setting loss Claims Resolution method, server and computer readable storage medium |
CN108319975A (en) * | 2018-01-24 | 2018-07-24 | 北京墨丘科技有限公司 | Data identification method, device, electronic equipment and computer readable storage medium |
CN108427970A (en) * | 2018-03-29 | 2018-08-21 | 厦门美图之家科技有限公司 | Picture mask method and device |
CN108875821A (en) * | 2018-06-08 | 2018-11-23 | Oppo广东移动通信有限公司 | The training method and device of disaggregated model, mobile terminal, readable storage medium storing program for executing |
US11138478B2 (en) | 2018-06-08 | 2021-10-05 | Guangdong Oppo Mobile Telecommunications Corp., Ltd. | Method and apparatus for training, classification model, mobile terminal, and readable storage medium |
CN108764372A (en) * | 2018-06-08 | 2018-11-06 | Oppo广东移动通信有限公司 | Construction method and device, mobile terminal, the readable storage medium storing program for executing of data set |
CN108764372B (en) * | 2018-06-08 | 2019-07-16 | Oppo广东移动通信有限公司 | Construction method and device, mobile terminal, the readable storage medium storing program for executing of data set |
CN108829815A (en) * | 2018-06-12 | 2018-11-16 | 四川希氏异构医疗科技有限公司 | A kind of medical image method for screening images |
CN110942081B (en) * | 2018-09-25 | 2023-08-18 | 北京嘀嘀无限科技发展有限公司 | Image processing method, device, electronic equipment and readable storage medium |
CN110942081A (en) * | 2018-09-25 | 2020-03-31 | 北京嘀嘀无限科技发展有限公司 | Image processing method and device, electronic equipment and readable storage medium |
CN111381743B (en) * | 2018-12-29 | 2022-07-12 | 深圳光启高等理工研究院 | Data marking method, computer device and computer readable storage medium |
CN111381743A (en) * | 2018-12-29 | 2020-07-07 | 杭州光启人工智能研究院 | Data marking method, computer device and computer readable storage medium |
CN109783673A (en) * | 2019-01-11 | 2019-05-21 | 海东市平安正阳互联网中医医院有限公司 | A kind of mask method and device of tongue picture image |
CN110096480A (en) * | 2019-03-28 | 2019-08-06 | 厦门快商通信息咨询有限公司 | A kind of text marking system, method and storage medium |
CN109948727A (en) * | 2019-03-28 | 2019-06-28 | 北京周同科技有限公司 | The training and classification method of image classification model, computer equipment and storage medium |
CN110363222A (en) * | 2019-06-18 | 2019-10-22 | 中国平安财产保险股份有限公司 | Picture mask method, device, computer equipment and storage medium for model training |
WO2020253742A1 (en) * | 2019-06-20 | 2020-12-24 | 杭州睿琪软件有限公司 | Sample labeling checking method and device |
CN110533066A (en) * | 2019-07-19 | 2019-12-03 | 浙江工业大学 | A kind of image data set method for auto constructing based on deep neural network |
CN110533066B (en) * | 2019-07-19 | 2021-12-17 | 浙江工业大学 | Image data set automatic construction method based on deep neural network |
CN110413821A (en) * | 2019-07-31 | 2019-11-05 | 四川长虹电器股份有限公司 | Data mask method |
CN112445924A (en) * | 2019-09-04 | 2021-03-05 | 天津职业技术师范大学(中国职业培训指导教师进修中心) | Data mining and transfer learning system based on internet picture resources and method and application thereof |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20170510 |