CN107729378A - A data annotation method - Google Patents
- Publication number
- CN107729378A, CN201710828902.8A
- Authority
- CN
- China
- Prior art keywords
- mark
- data
- task
- marked
- person
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2458—Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
- G06F16/2465—Query processing support for facilitating data mining operations in structured databases
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Software Systems (AREA)
- Probability & Statistics with Applications (AREA)
- Mathematical Physics (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Fuzzy Systems (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
The present invention provides a data annotation method comprising: a data annotation task allocation step, in which a data annotation task is matched with annotators according to the data identification code of the data to be annotated and the annotator identification codes, and the task is assigned to an annotator according to the matching result; a data annotation step, in which the data to be annotated are annotated in the required annotation form; and a result collection and integration step, in which, after all annotation results for the task have been submitted, the annotation results are integrated according to the annotators' annotation scores and the results themselves, and the correct label is inferred.
Description
Technical field:
The present invention relates to the field of technology foresight, and in particular to a multi-source heterogeneous data annotation system based on swarm intelligence.
Technical background:
In recent years, with the rapid development of computer technology and the Internet, big data has appeared in many forms. As data volumes grow, however, manually annotating corpora has become extremely difficult and costly, so filtering and annotating big data repositories is a challenge. Crowdsourcing platforms emerged in response. Crowdsourcing platforms nevertheless suffer from high input cost, low efficiency and small data throughput, and their annotation quality cannot be guaranteed.
To address this problem, Chinese patent application publication No. CN106489149A discloses a data annotation method and system based on data mining and crowdsourcing. That application proposes a method that flags annotation results during the annotation process, which makes it easier to improve the accuracy of annotation results, can effectively improve annotation quality, and reduces annotation cost.
In CN106489149A, crowdsourced annotation results are obtained and audited automatically with an integration algorithm. Problematic annotation results are screened out and flagged, and the crowdsourced annotation results that have passed the automated audit are output; these output results include the flagged problematic results.
In the field of technology foresight, however, the data to be annotated are data in a broad sense. The scope of data annotation includes annotating the technical field of papers, patents, news and other online text data, and it also includes annotation needs specific to technology foresight, such as annotating the development stage of a technology, the technology type, the importance of a journal, or the influence of a research institution. The forms are very flexible, and the annotation tasks themselves are difficult. In technology foresight, therefore, different data types in different fields require annotators with the corresponding annotation ability to complete the corresponding tasks. For these reasons, technology foresight annotation tasks place high demands on an annotator's domain knowledge, and the technology disclosed in CN106489149A is not adequate for annotation work in this field. An annotation system is therefore needed that can meet the annotation needs of technology foresight and provide data annotation support for it.
Summary of the invention:
The scope of the present invention is defined only by the appended claims and is not limited in any way by the statements in this summary.
To overcome the above technical problem, the present invention provides a data annotation method comprising: a data annotation task allocation step, in which a data annotation task is matched with annotators according to the data identification code of the data to be annotated and the annotator identification codes, and the task is assigned to an annotator according to the matching result; a data annotation step, in which the data to be annotated are annotated in the required annotation form; and a result collection and integration step, in which, after all annotation results for the task have been submitted, the annotation results are integrated according to the annotators' annotation scores and the results themselves, and the correct label is inferred. By matching annotation tasks with annotators and selecting annotators who have the relevant domain knowledge, this technical solution achieves higher annotation accuracy, greatly reduces the cost of technology foresight, and improves the ability to carry out technology foresight.
Preferably, the data annotation method further comprises an annotation progress monitoring step, in which the annotation progress of the data to be annotated is monitored. When a data annotation task has not been started within a specified time, the task is reassigned: the data to be annotated are assigned to another annotator who has a higher annotation score in the task category of that task, and that annotator continues the annotation. With this technical solution, data annotation tasks can proceed smoothly and on time, which improves the efficiency of big data analysis and the accuracy of technology foresight.
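The reassignment rule in the progress monitoring step can be sketched as follows. This is a minimal illustration only: the function name `reassign_if_stalled`, the dictionary layout of a task, and the use of plain numbers for timestamps are all assumptions, not details taken from the patent.

```python
def reassign_if_stalled(task, category_scores, now, timeout):
    """Reassign a task that was not started within `timeout` time units.

    task: dict with keys 'annotator', 'assigned_at', 'started' (hypothetical layout).
    category_scores: annotator id -> annotation score in the task's category.
    Returns the id of the annotator responsible for the task afterwards.
    """
    stalled = (not task["started"]) and (now - task["assigned_at"] > timeout)
    if not stalled:
        return task["annotator"]

    # Candidates are all other annotators scored in this task category.
    others = {a: s for a, s in category_scores.items() if a != task["annotator"]}
    if not others:
        return task["annotator"]  # nobody else is available; keep the assignment

    # Hand the task to the highest-scoring remaining annotator.
    replacement = max(others, key=others.get)
    task["annotator"] = replacement
    task["assigned_at"] = now
    return replacement


task = {"annotator": "P003", "assigned_at": 0, "started": False}
scores = {"P001": 92.0, "P002": 85.0, "P003": 70.0}
new_owner = reassign_if_stalled(task, scores, now=3600, timeout=1800)
```

In this sketch the task moves from `P003` to `P001`, the highest-scoring annotator in the category who does not already hold it.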
Preferably, the data annotation method further comprises a score update step, in which the annotator's score for the corresponding data annotation task is updated according to the quality of the annotator's annotations. With this technical solution, data annotation tasks can be assigned accurately and effectively to the right annotators, which improves the efficiency of big data analysis and the accuracy of technology foresight.
Preferably, the data annotation method further comprises a data annotation task category definition step, in which the data annotation tasks are divided into different categories and a unique task identification code is provided for each data annotation task category. With this technical solution, data annotation tasks can be assigned accurately and effectively to the right annotators, which improves the efficiency of big data analysis and the accuracy of technology foresight.
Preferably, in the data annotation method, the data annotation task category of the data to be annotated is obtained from the task identification code, and the data identification code is generated for each item of data to be annotated based on the obtained category. With this technical solution, data annotation tasks can be assigned accurately and effectively to the right annotators, which improves the efficiency of big data analysis and the accuracy of technology foresight.
Preferably, the data annotation method further comprises a preprocessing step, in which the information to be annotated is extracted from the raw data of the data to be annotated uploaded by the annotation task publisher.
Preferably, in the preprocessing step, the corresponding fields are extracted from the data to be annotated.
Preferably, the data annotation method further comprises an annotation qualification test step, in which, for the data annotation task categories the annotator has applied for, a test score under each applied-for category is generated based on the annotator's performance on the test content. With this technical solution, data annotation tasks can be assigned accurately and effectively to the right annotators, which improves the efficiency of big data analysis and the accuracy of technology foresight.
Preferably, in the annotation qualification test step, the test content is generated according to the background knowledge and skills required by the different data annotation tasks in the different fields. This technical solution makes the annotation qualification test more targeted.
Preferably, in the data annotation method, if the annotator's test score is higher than a preset threshold, the annotator obtains the annotation qualification for that data annotation task. With this technical solution, data annotation tasks can be assigned accurately and effectively to the right annotators, which improves the efficiency of big data analysis and the accuracy of technology foresight.
Preferably, in the annotation qualification test step, identity information for the annotator is generated according to the annotator's test results, wherein the identity information includes the data annotation task categories the annotator can perform and the annotation score for performing tasks of each such category. This technical solution makes the allocation of data annotation tasks more targeted.
Preferably, in the data annotation method, the annotator identification code includes personnel number information, information on the annotator's technical field, and the score for each data annotation task type. This technical solution makes the allocation of data annotation tasks more targeted.
Preferably, in the data annotation method, the data identification code includes a data number and the task identification code. This technical solution makes the allocation of data annotation tasks more targeted.
Preferably, in the data annotation method, the task identification code includes task number information, task type information and information on the technical field covered. This technical solution makes the allocation of data annotation tasks more targeted.
Preferably, in the data annotation task allocation step, the data annotation task is matched with multiple annotators, and the task is assigned to the matched annotators who have the higher annotation scores for that task; in the result collection and integration step, after the annotation results of all of these annotators for the task have been submitted, the multiple annotation results are integrated according to the annotators' annotation scores and the results themselves, and the correct label is inferred. With this technical solution, data annotation tasks can be assigned accurately and effectively to the right annotators, which improves the accuracy of technology foresight.
In another aspect, the present invention provides a data annotation apparatus comprising at least one processor, the at least one processor being able to perform the following operations: matching a data annotation task with annotators according to the data identification code of the data to be annotated and the annotator identification codes, and assigning the task to an annotator according to the matching result; annotating the data to be annotated in the required annotation form; and, after all annotation results for the task have been submitted, integrating the annotation results according to the annotators' annotation scores and the results themselves, and inferring the correct label.
Preferably, the at least one processor can also perform the following operation: monitoring the annotation progress of the data to be annotated, wherein, when a data annotation task has not been started within a specified time, the task is reassigned and the data to be annotated are assigned to another annotator who has a higher annotation score in the task category of that task, who continues the annotation.
Preferably, the at least one processor can also perform the following operation: updating the annotator's score for the corresponding data annotation task according to the quality of the annotator's annotations.
Preferably, the at least one processor can also perform the following operation: dividing the data annotation tasks into different categories and providing a unique task identification code for each data annotation task category.
Preferably, the data annotation task category of the data to be annotated is obtained from the task identification code, and the data identification code is generated for each item of data to be annotated based on the obtained category.
Preferably, the at least one processor can also perform the following operation: extracting the information to be annotated from the raw data of the data to be annotated uploaded by the annotation task publisher.
Preferably, the corresponding fields are extracted from the data to be annotated.
Preferably, the at least one processor can also perform the following operation: for the data annotation task categories the annotator has applied for, generating a test score under each applied-for category based on the annotator's performance on the test content.
Preferably, the test content is generated in a targeted manner according to the background knowledge and skills required by the different data annotation tasks in the different fields.
Preferably, if the annotator's test score is higher than a preset threshold, the annotator obtains the annotation qualification for that data annotation task.
Preferably, identity information for the annotator is generated according to the annotator's test results, wherein the identity information includes the data annotation task categories the annotator can perform and the annotation score for performing tasks of each such category.
Preferably, the annotator identification code includes personnel number information, information on the annotator's technical field, and the score for each data annotation task type.
Preferably, the data identification code includes a data number and the task identification code.
Preferably, the task identification code includes task number information, task type information and information on the technical field covered.
Preferably, the data annotation task is matched with multiple annotators, and the task is assigned to the matched annotators who have the higher annotation scores for that task; after the annotation results of all of these annotators for the task have been submitted, the multiple annotation results are integrated according to the annotators' annotation scores and the results themselves, and the correct label is inferred.
The present invention further provides a storage medium storing a program that enables at least one processor to perform the following operations: matching a data annotation task with annotators according to the data identification code of the data to be annotated and the annotator identification codes, and assigning the task to an annotator according to the matching result; annotating the data to be annotated in the required annotation form; and, after all annotation results for the task have been submitted, integrating the annotation results according to the annotators' annotation scores and the results themselves, and inferring the correct label.
With the above technical solution, multi-source heterogeneous data can be annotated systematically. The solution can annotate not only text data from news, papers, patents and the like, but also the development stage of a technology, the technology type, the importance of a journal, the influence of a research institution and so on, which improves the efficiency of big data analysis in technology forecasting and the accuracy of technology foresight. In addition, because the annotation is done by annotators who have passed a test and have the relevant domain knowledge, annotation accuracy is further improved, the cost of technology foresight is greatly reduced, and the ability to carry out technology foresight is improved.
Brief description of the drawings:
Fig. 1 is a schematic diagram of the architecture of the data annotation system in one embodiment of the present invention;
Fig. 2 is a structural block diagram of the annotation platform in one embodiment of the present invention;
Fig. 3 is a structural block diagram of the data annotation processing system of the annotation platform in one embodiment of the present invention;
Fig. 4 is a flowchart of data annotation processing in one embodiment of the present invention;
Fig. 5 is a flowchart of the annotator qualification test in one embodiment of the present invention;
Fig. 6 is a flowchart of data annotation task allocation in one embodiment of the present invention.
Detailed description of the embodiments
The present invention is described below with reference to the embodiments shown in the drawings. The embodiments disclosed here should be regarded as illustrative in all respects and not restrictive.
Fig. 1 is a schematic diagram of the architecture of the data annotation system in this embodiment. As shown in Fig. 1, the multi-source heterogeneous data annotation system includes a task publisher terminal 1, an annotation platform 2 and an annotator terminal 3. The annotation platform 2 is communicatively connected to the task publisher terminal 1 and the annotator terminal 3 through networks 4 and 5, respectively. The task publisher terminal 1 and the annotator terminal 3 may be terminal devices such as PCs, tablets or mobile phones. The annotation platform 2 may be a platform device such as a server. The networks 4 and 5 may be wired or wireless networks, computer networks, mobile communication networks and so on. Besides the networks 4 and 5, a communication connection such as Bluetooth may also be used.
The annotation task publisher logs in to the annotation platform 2 through the task publisher terminal 1 to publish and define data annotation tasks. The annotator logs in to the annotation platform 2 through the annotator terminal 3 to accept data annotation tasks, take annotation qualification tests, perform data annotation operations and so on. The annotation platform 2 carries out data annotation processing based on the data annotation tasks that the publisher has published and defined through the task publisher terminal 1 and on the annotators' annotation operations.
Fig. 2 is a structural block diagram of the annotation platform in this embodiment. As shown in Fig. 2, the annotation platform 2 may be a platform device such as a server and mainly consists of a data processing controller 21 (comprising a CPU, ROM, RAM and so on), a display 22 and a keyboard 23. The data processing controller 21 mainly consists of a CPU 21a, ROM 21b, RAM 21c, hard disk 21d, reading device 21e, input/output interface 21f and communication interface 21g. These components are interconnected by a bus 21i and can exchange control signals, data and the like with one another.
The CPU 21a can execute the computer programs stored in the ROM 21b and the computer programs read into the RAM 21c. The ROM 21b consists of read-only memory, PROM, EPROM, EEPROM or the like, and stores the computer programs executed by the CPU 21a and the data they use. The RAM 21c consists of SRAM, DRAM or the like and is used to read the computer programs stored in the ROM 21b and the hard disk 21d. The RAM 21c also serves as working space when the CPU 21a executes these computer programs.
The hard disk 21d stores the operating system, application programs and other computer programs executed by the CPU 21a, together with the data used in executing them. The data annotation application program 7a of this embodiment is also stored in the hard disk 21d.
The reading device 21e consists of a floppy disk drive, CD-ROM drive, DVD-ROM drive or the like and can read computer programs or data stored on a portable storage medium 7. The portable storage medium 7 stores the data annotation application program 7a; the annotation platform 2 can read the application program 7a from the portable storage medium 7 and install it on the hard disk 21d.
The application program 7a need not be provided by the portable storage medium 7. It can also be downloaded over an electric communication line (wired or wireless) from an external device that is connected to that line and can communicate with the annotation platform 2. For example, the application program 7a may be stored on the hard disk of a network server; the annotation platform 2 can access that server, download the application program 7a and install it on the hard disk 21d.
The hard disk 21d is loaded with an operating system that provides a graphical user interface, such as Windows (registered trademark) produced by Microsoft. In the following description, the application program 7a of this embodiment runs on that operating system.
The input/output interface 21f consists of serial interfaces such as USB, IEEE1394 and RS-232C, parallel interfaces such as SCSI, IDE and IEEE1284, and an analog signal interface made up of a D/A converter, an A/D converter and the like. The input/output interface 21f is connected to the keyboard 23, with which the user can enter data directly into the annotation platform 2.
The communication interface 21g may be, for example, an Ethernet (registered trademark) interface. Through the communication interface 21g, the annotation platform 2 can exchange data with the task publisher terminal 1 and the annotator terminal 3 using a given communication protocol.
The main function of the data annotation application program 7a on the hard disk 21d of the data processing controller 21 is to carry out data annotation processing based on the data annotation tasks that the publisher has published and defined through the task publisher terminal 1 and on the annotators' annotation operations.
Fig. 3 is a structural block diagram of the data annotation processing system of the annotation platform in this embodiment. As shown in Fig. 3, the data annotation processing system includes a task definition module 31, a data upload module 32, a data processing module 33, a task allocation module 34, an annotation module 35, a result collection and integration module 36, an annotator management module 37, an annotation qualification test module 38 and a real-time annotation monitoring module 39.
The task definition module 31 carries out the operations by which the annotation task publisher, logged in to the annotation platform 2 through the task publisher terminal 1, defines data annotation tasks. The publisher defines data annotation task categories according to the needs of technology foresight, for example: technical field classification (technical fields such as robotics or biotechnology); technical subfield classification (taking robotics as an example, the subfields may include reducers, sensors and so on); and technology type judgment (judging which technology type a technology belongs to, such as disruptive technology or emerging technology). After the tasks have been divided, the task definition module 31 provides a task identification code TI for the data annotation tasks under each task category. The format of the task identification code is: TI = {task number; task type; covered technical field; covered technical subfield}. Here, the task number uniquely identifies the current data annotation task; the task type indicates which category the current task belongs to; the covered technical field indicates which technical field the data to be annotated in the task cover; and the covered technical subfield indicates which subfields within that technical field the data to be annotated cover.
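The four-part task identification code can be modeled as a small record type. The sketch below is illustrative only: the class name, the field names and the semicolon-joined string form are assumptions, since the patent specifies the four components of TI but not a concrete encoding.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskIdentificationCode:
    """TI = {task number; task type; covered technical field; covered technical subfield}."""

    task_number: str         # uniquely identifies the data annotation task
    task_type: str           # category of the task
    technical_field: str     # technical field covered by the data to be annotated
    technical_subfield: str  # subfield covered within that technical field

    def encode(self) -> str:
        # Hypothetical serialization: the four components joined by semicolons.
        return ";".join(
            (self.task_number, self.task_type,
             self.technical_field, self.technical_subfield)
        )


ti = TaskIdentificationCode("T0001", "technology type judgment", "robotics", "sensors")
encoded = ti.encode()
```

Making the record frozen keeps a published task code immutable, which fits its role as a unique identifier.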
The data upload module 32 is used by the publisher to upload the data to be annotated for a data annotation task to the annotation platform 2, and it generates a data identification code DI for these data according to the data annotation task category. The format of the data identification code is: DI = {data number; task identification code TI}. Here, the data number is the unique identity of one specific data set to be annotated, and the task identification code carries the information defined for the task identification code above.
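As a rough sketch, data identification codes could be generated by pairing a running data number with the task code. The names `DataIdentificationCode` and `make_data_codes`, the `D000001` numbering style, and the treatment of TI as an opaque string are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataIdentificationCode:
    """DI = {data number; task identification code TI}."""

    data_number: str  # unique identity of one data set to be annotated
    task_code: str    # the task identification code TI, kept as an opaque string


def make_data_codes(task_code, count, prefix="D"):
    """Create `count` data identification codes for one annotation task."""
    # Hypothetical numbering scheme: prefix plus a zero-padded running number.
    return [
        DataIdentificationCode(f"{prefix}{i:06d}", task_code)
        for i in range(1, count + 1)
    ]


codes = make_data_codes("T0001;technology type judgment;robotics;sensors", 3)
```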
The data processing module 33 preprocesses the data to be annotated so that the information to be annotated can be extracted from the raw data uploaded by the publisher. Preprocessing here mainly means extracting the corresponding fields from the data to be annotated. The data processing module 33 can extract different fields according to the publisher's requirements, for example the abstract or the keywords.
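Field extraction of this kind can be sketched in a few lines, assuming (this is not stated in the patent) that each raw record arrives as a dictionary of string fields. The function name `extract_fields` and the field names are hypothetical.

```python
def extract_fields(raw_record, wanted):
    """Keep only the requested fields of a raw record, with whitespace trimmed.

    Fields that the record does not contain are silently skipped.
    """
    return {name: raw_record[name].strip() for name in wanted if name in raw_record}


raw = {
    "title": "A reducer for industrial robots",
    "abstract": "  A compact reducer design is described.  ",
    "keywords": "reducer; robot",
    "full_text": "...",
}
to_annotate = extract_fields(raw, ["abstract", "keywords", "claims"])
```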
The task allocation module 34 matches data annotation tasks with annotators according to the data annotation task categories defined by the task definition module 31 and allocates the tasks according to the matching result. For a given data annotation task, the task allocation module 34 matches the task with annotators according to the data identification code DI of the data to be annotated and the annotator identification codes ID, and allocates the task according to the matching result. When allocating, the module gives priority to the matched annotators with the higher annotation scores for that category of task. To ensure annotation quality, the data annotation processing system lets the publisher set an annotation redundancy (an odd number) according to its own quality requirements for the annotation results; the redundancy is the number of annotators to whom one item of data can be assigned at the same time, that is, how many people annotate the same task. If, for example, the publisher sets the redundancy of a data annotation task to 7, the task allocation module 34 assigns the data to be annotated to 7 annotators matched to that category of task.
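The allocation rule above (match on field and task type, prefer higher scores, respect an odd redundancy) can be sketched as follows. The dictionary layout of an annotator and the function name `allocate` are assumptions; the patent does not say how ties or a shortage of qualified annotators are handled, so this sketch simply keeps input order for ties and returns fewer annotators when not enough qualify.

```python
def allocate(task_type, technical_field, annotators, redundancy):
    """Pick up to `redundancy` qualified annotators, highest score first.

    annotators: list of dicts with keys 'id', 'field' and 'scores'
                (task type -> annotation score); a hypothetical layout.
    """
    if redundancy < 1 or redundancy % 2 == 0:
        raise ValueError("redundancy must be a positive odd number")

    # An annotator matches if the field agrees and a score exists for the task type.
    qualified = [
        a for a in annotators
        if a["field"] == technical_field and task_type in a["scores"]
    ]
    # Stable sort: annotators with equal scores keep their input order.
    qualified.sort(key=lambda a: a["scores"][task_type], reverse=True)
    return [a["id"] for a in qualified[:redundancy]]


pool = [
    {"id": "P001", "field": "robotics", "scores": {"field classification": 88}},
    {"id": "P002", "field": "robotics", "scores": {"field classification": 95}},
    {"id": "P003", "field": "biotechnology", "scores": {"field classification": 99}},
    {"id": "P004", "field": "robotics", "scores": {"technology type judgment": 90}},
    {"id": "P005", "field": "robotics", "scores": {"field classification": 70}},
    {"id": "P006", "field": "robotics", "scores": {"field classification": 91}},
]
chosen = allocate("field classification", "robotics", pool, redundancy=3)
```

Here `P003` is excluded by field and `P004` by task type, so the three highest-scoring robotics annotators are selected.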
The annotation module 35 is used to annotate the data to be annotated. The annotator logs in to the annotation platform 2 through the annotator terminal 3 to perform data annotation, and the annotation module 35 carries out the annotator's annotation operations on the data. Depending on the annotation form required, the data annotation processing system can provide different preset annotation forms and an efficient interactive annotation interface so that annotators can complete their tasks easily. For a technical field classification task, for example, the annotation form is for the annotator to choose one label from several candidate labels as the class label of the data to be annotated; the candidate labels are generated automatically by the data annotation processing system from the information provided by the publisher.
The result collection and integration module 36 integrates multiple annotation results and infers the correct label. For a single data annotation task, the data annotation processing system obtains annotation results from multiple annotators, and the result collection and integration module 36 integrates these results and infers the correct label. The integration method is as follows: the reference weight of each annotator's result is determined from the annotator's annotation score in the task category; the reference weights and the annotation results are then used to compute a degree of correctness for each label; and the label with the highest degree of correctness is taken as the correct label of the data to be annotated. Once the correct labels for all tasks have been obtained, the results are returned to the publisher.
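The integration method can be illustrated as a score-weighted vote. The patent does not give the formula that turns a score into a reference weight, so the sketch below assumes the simplest choice, each annotator's score divided by the sum of scores; the function name `infer_label` is likewise hypothetical.

```python
from collections import defaultdict


def infer_label(annotations, scores):
    """Infer the correct label from several annotators' results.

    annotations: annotator id -> chosen label.
    scores: annotator id -> annotation score in the task category.
    Returns (best_label, degree) where degree maps each label to its
    summed reference weight.
    """
    total = sum(scores[a] for a in annotations)
    if total <= 0:
        raise ValueError("annotation scores must sum to a positive value")

    degree = defaultdict(float)
    for annotator, label in annotations.items():
        # Assumed weighting: an annotator's share of the total score.
        degree[label] += scores[annotator] / total

    best = max(degree, key=degree.get)
    return best, dict(degree)


results = {"P001": "emerging", "P002": "disruptive", "P003": "disruptive"}
category_scores = {"P001": 90, "P002": 40, "P003": 30}
label, degree = infer_label(results, category_scores)
```

With these numbers the single high-scoring annotator outweighs the two lower-scoring ones (0.5625 against 0.4375), which is the behavior that distinguishes score weighting from a plain majority vote.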
The annotator management module 37 manages information about annotators. Based on an applicant's
test results, the annotator management module 37 automatically generates an identity ID (i.e. the
annotator identification code ID) for the annotation task applicant. The annotator identification
code ID mainly records the categories of data annotation task that the annotator may perform and
the annotation score for each category (the annotation score is initialized to the test score
obtained for the corresponding data annotation task). The format of the annotator identification
code ID is ID = {annotator number; technical field; sub-technical field; data annotation task
type 1, score 1; data annotation task type 2, score 2; …}. Here, the annotator number is the unique
identifier that the data annotation processing system assigns to the annotator; the technical field
indicates which technical field the data annotation tasks the annotator can perform belong to; the
sub-technical field indicates which technical sub-field those tasks belong to specifically; the task
type indicates a task category for which the annotator holds an annotation qualification; and the
score, paired with a task category, represents the annotator's level in that category of data
annotation task. The score under each category is not fixed: it is updated in real time according
to the annotator's annotation accuracy as data annotation tasks are performed.
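As a rough illustration of the identification code layout just described, the record could be modelled as follows. The class name `AnnotatorID`, its attribute names, and the exact string encoding (semicolons between parts, a comma between task type and score) are assumptions made for this sketch; the text only fixes the order of the parts.

```python
from dataclasses import dataclass, field


@dataclass
class AnnotatorID:
    """Hypothetical model of ID = {number; field; sub-field; type 1, score 1; ...}."""
    annotator_number: str
    technical_field: str
    sub_technical_field: str
    # Maps a data annotation task type to the annotator's current score.
    task_scores: dict = field(default_factory=dict)

    def encode(self):
        """Render the record in the brace-and-semicolon layout of the text."""
        parts = [self.annotator_number, self.technical_field, self.sub_technical_field]
        parts += [f"{task},{score}" for task, score in self.task_scores.items()]
        return "{" + ";".join(parts) + "}"


annotator = AnnotatorID("A001", "medicine", "radiology", {"image_classification": 85})
encoded = annotator.encode()
```

Keeping the scores in a per-task-type mapping makes the later real-time score updates a single dictionary assignment.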
The annotation qualification test module 38 tests an annotator's ability to perform data annotation
tasks. When registering on the annotation platform 2, an annotation task applicant first selects
his or her technical field and sub-technical field, and then takes the qualification test
corresponding to the data annotation task applied for. The test content is generated by the
annotation qualification test module 38 in a targeted manner, according to the background knowledge
and skills required by the different data annotation tasks in different fields, so that it can
comprehensively check whether the annotator is able to complete a given data annotation task. After
the applicant completes the test, the annotation qualification test module 38 generates the
applicant's test score for each task category applied for. The annotation task publisher sets a
threshold on the test score for each category of data annotation task; if the applicant's test
score is higher than the corresponding threshold, the applicant obtains the annotation
qualification for the corresponding task.
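The threshold check can be sketched in a few lines. The function name `grant_qualifications` and the dictionary inputs are hypothetical; the strict "higher than" comparison follows the wording of the text, and the returned scores double as the initial annotation scores, since the description initializes the annotation score to the test score.

```python
def grant_qualifications(test_scores, thresholds):
    """Return {task_category: initial annotation score} for every passed category.

    test_scores: {task_category: applicant's test score}
    thresholds:  {task_category: threshold set by the task publisher}
    """
    granted = {}
    for category, score in test_scores.items():
        # A category without a publisher-defined threshold is skipped (assumption).
        if category in thresholds and score > thresholds[category]:
            granted[category] = score
    return granted


qualified = grant_qualifications(
    {"ner": 82, "sentiment": 55, "pos": 70},
    {"ner": 70, "sentiment": 60, "pos": 70},
)
```

Here only the `ner` category is granted: `sentiment` falls below its threshold, and `pos` merely equals its threshold rather than exceeding it.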
The annotation real-time monitoring module 39 is mainly responsible for monitoring annotators'
annotation progress and for updating annotators' annotation scores according to the annotation
results. By monitoring how data annotation tasks are being completed, the module optimizes the
annotation process. If it finds that an annotator has not started a data annotation task within a
specified time, the data to be annotated is reassigned to an annotator who has not yet been
assigned the data annotation task and who has a higher annotation score; preferably, it is assigned
to the annotator with the highest annotation score among those not yet assigned the task. The
annotation real-time monitoring module 39 also updates annotators' annotation scores according to
the degree of agreement between the annotator's annotation results and the correct labels. If the
agreement is high, the annotator's annotation score for the corresponding task category rises; if
the agreement is low, that score falls.
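The score update can be illustrated with a simple agreement rate. The text says only that the score rises when agreement with the correct labels is high and falls when it is low; the cutoff of 0.8, the fixed step of 1.0, and the name `update_annotation_score` are assumptions chosen for this sketch.

```python
def update_annotation_score(score, annotator_labels, correct_labels,
                            cutoff=0.8, step=1.0):
    """Raise or lower an annotator's score for one task category.

    annotator_labels: {data_number: label submitted by the annotator}
    correct_labels:   {data_number: inferred correct label}
    cutoff, step:     hypothetical tuning parameters, not given in the text
    """
    if not annotator_labels:
        return score  # nothing annotated yet, so nothing to update

    matches = sum(1 for item, label in annotator_labels.items()
                  if correct_labels.get(item) == label)
    agreement = matches / len(annotator_labels)

    # High agreement raises the score; low agreement lowers it.
    return score + step if agreement >= cutoff else score - step


raised = update_annotation_score(80, {"d1": "x", "d2": "y"}, {"d1": "x", "d2": "y"})
lowered = update_annotation_score(80, {"d1": "x", "d2": "z"}, {"d1": "x", "d2": "y"})
```

A proportional update (for example, a step scaled by the agreement rate) would fit the description equally well; the two-band rule is used here only because it is the simplest reading.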
Fig. 4 is a flowchart of the data annotation processing of the present embodiment. As shown in
Fig. 4, the annotation task publisher performs a defining operation on the data annotation task
categories according to the expected technical requirements, and the task definition module 31
defines the data annotation tasks on the basis of that operation (step S1). The annotation task
publisher divides the data annotation tasks into different categories according to the annotation
requirements. After the division, the system assigns each data annotation task category a unique
task identification code TI according to prescribed rules, so that each category of task, and the
data to be annotated under it, is uniquely distinguished in the system.
After the data annotation tasks have been divided into categories, the annotation task publisher
uploads the data to be annotated. The data upload module 32 obtains the data annotation task
category from the task identification code TI and, based on that category, generates a data
identification code DI for each item of data (step S2). The data processing module 33 then
preprocesses the data set (step S3); the preprocessing includes field extraction. The preprocessed
data waits to be distributed to different annotators for annotation.
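Steps S2 and S3 can be sketched as follows. The composition of the codes follows claims 13 and 14 (a data identification code holds a data number and a task identification code; a task identification code holds a task number, a task type and a technical field). The class and function names, the sequential numbering of data items, and the dictionary form of the raw records are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskID:
    """Task identification code TI (layout per claim 14)."""
    task_number: str
    task_type: str
    technical_field: str


@dataclass(frozen=True)
class DataID:
    """Data identification code DI (layout per claim 13)."""
    data_number: int
    task_id: TaskID


def generate_data_ids(task_id, records, start=1):
    """Step S2: give every uploaded record a DI under the task's TI."""
    return [DataID(start + i, task_id) for i in range(len(records))]


def extract_fields(record, wanted):
    """Step S3: keep only the fields that the annotation task needs."""
    return {name: record[name] for name in wanted if name in record}


ti = TaskID("T01", "sentiment", "e-commerce")
raw = [{"text": "great phone", "html": "<p>great phone</p>"}, {"text": "too slow"}]
ids = generate_data_ids(ti, raw)
cleaned = [extract_fields(r, ["text"]) for r in raw]
```

Because each DI embeds its TI, the allocation step can read the task type of any data item directly from its identification code.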
For a given data annotation task, the task allocation module 34 matches the data annotation task
to be annotated with annotators according to the data identification code DI of the data to be
annotated and the annotator identification codes ID, and allocates the data annotation task
according to the matching result (step S4).
While annotators are annotating the data, the annotation real-time monitoring module 39 monitors
their annotation progress in real time (step S5). If the module finds that an annotator has not
started the data annotation task within the specified time (step S5: No), the task allocation
module 34 reallocates the task: the data to be annotated under that task is assigned to the
annotator who has the highest annotation score in the task category and has not yet worked on the
task. If the module finds that the annotator has started the task within the specified time
(step S5: Yes), the annotator goes on to complete it (step S6). In step S6, the annotation module 35
annotates the data according to the annotator's input on the data to be annotated under the task,
completing the annotation in the labeling form set in advance. During annotation, the annotation
real-time monitoring module 39 updates the annotator's score for the corresponding data annotation
task according to the quality of the annotator's annotations, and the annotator management
module 37 records the updated score (step S7).
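The timeout branch of step S5 can be sketched as below. The function name `reassign_stalled`, the shape of the assignment records, and the use of plain numeric timestamps are assumptions; the selection rule (highest score in the task category among annotators who have not been given the task) follows the text.

```python
def reassign_stalled(assignments, scores, now, timeout):
    """Replace annotators who have not started within `timeout`.

    assignments: {annotator: {"assigned_at": float, "started": bool}}
    scores:      {annotator: annotation score in this task category}
    Returns a new assignment mapping; the input is left unchanged.
    """
    updated = dict(assignments)
    already_given = set(assignments)  # annotators who have had this task

    for annotator, state in assignments.items():
        if state["started"] or now - state["assigned_at"] <= timeout:
            continue  # step S5: Yes, or the deadline has not passed yet
        del updated[annotator]
        candidates = [a for a in scores if a not in already_given]
        if candidates:
            # Pick the highest-scoring annotator not yet given the task.
            best = max(candidates, key=scores.get)
            already_given.add(best)
            updated[best] = {"assigned_at": now, "started": False}
    return updated


result = reassign_stalled(
    {"a": {"assigned_at": 0, "started": True},
     "b": {"assigned_at": 0, "started": False}},
    {"a": 90, "b": 80, "c": 70, "d": 85},
    now=100,
    timeout=60,
)
```

In this example annotator `b` never started, so the task moves to `d`, the best-scoring annotator who had not been given it.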
After all annotation results of a data annotation task have been submitted, the result collection
and integration module 36 integrates them according to the annotation scores of the task's
annotators and the annotation results themselves, and infers the correct label (step S8). When the
data annotation task is complete, the result collection and integration module 36 collects the
annotation results and returns them to the annotation task publisher.
The annotation real-time monitoring module 39 monitors in real time whether the data annotation
tasks of every category issued by the annotation task publisher have been completed (step S9). If
they have not all been completed (step S9: No), processing returns to step S4, and the task
allocation module 34 allocates the unfinished annotation tasks again. If they have all been
completed (step S9: Yes), the data annotation processing ends (step S10).
Fig. 5 shows the annotator qualification test flow of the present embodiment. As shown in Fig. 5,
an annotation applicant logs in to the annotation platform 2 through the annotator terminal 3
(step S51) and selects the data annotation tasks to apply for according to his or her technical
field, background knowledge and skills (step S52). The annotation qualification test module 38
receives the data annotation task categories applied for and generates targeted test content
according to the background knowledge and skills required by the different data annotation tasks in
different fields (step S53). After the applicant completes the test, the module generates the
applicant's test score for each category applied for, and the annotator management module 37
records the test scores (step S54). The annotation qualification test module 38 then judges whether
the applicant's test score is higher than the threshold that the annotation task publisher set for
that category of task (step S55). If it is higher, the annotator management module 37 automatically
generates the applicant's identity ID, the applicant obtains the annotation qualification for the
corresponding task (step S56), and the test ends (step S57). If the test score is lower than the
threshold, the test ends directly (step S57).
Fig. 6 is a flowchart of data annotation task allocation in the present embodiment. As shown in
Fig. 6, for a given data annotation task, the task allocation module 34 obtains the data
identification code DI of the task and the annotator identification codes ID (step S61) and, based
on the annotator identification codes, selects an annotator with a higher annotation score for that
category of task (step S62). The task allocation module 34 then judges whether the number of
selected annotators meets the set annotation redundancy requirement (step S63). If not
(step S63: No), the module continues to select, from the remaining annotators, those with higher
annotation scores for that category of task. If so (step S63: Yes), the module distributes the data
to be annotated to all the selected annotators matched to that category of task (step S64).
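The selection loop of steps S62 and S63 can be sketched as repeatedly taking the best remaining annotator until the redundancy requirement is met. The function name `select_annotators` and the nested-dictionary input are assumptions; annotators with no score for the task type are treated as unqualified and never selected.

```python
def select_annotators(task_type, annotators, redundancy):
    """Pick up to `redundancy` annotators with the highest scores for a task type.

    annotators: {annotator_number: {task_type: annotation score}}
    """
    # Only annotators holding a score (a qualification) for this type are eligible.
    remaining = {number: scores[task_type]
                 for number, scores in annotators.items()
                 if task_type in scores}

    selected = []
    # Step S63 loop: keep selecting until the redundancy requirement is reached
    # or no eligible annotator is left.
    while len(selected) < redundancy and remaining:
        best = max(remaining, key=remaining.get)  # step S62
        selected.append(best)
        del remaining[best]
    return selected


pool = {
    "a": {"ner": 70},
    "b": {"ner": 90, "pos": 60},
    "c": {"pos": 95},
    "d": {"ner": 80},
}
chosen = select_annotators("ner", pool, redundancy=2)
```

If fewer qualified annotators exist than the redundancy requirement asks for, this sketch returns all of them; how the real system handles that shortfall is not stated in the text.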
In the above embodiment, the task definition module 31 is located on the annotation platform 2, but
the present invention is not limited to this. The task definition module 31 may instead be located
on the task publisher terminal 1; in that case the task publisher uses the task publisher
terminal 1 to upload the data annotation tasks defined on it to the annotation platform 2, where
the data annotation is carried out.
The scope of the present invention is defined not by the above description of the embodiment but by
the claims, and includes all modifications within the meaning and scope equivalent to the claims.
Claims (15)
1. A data annotation method, comprising:
a data annotation task allocation step of matching a data annotation task to be annotated with an
annotator according to a data identification code of data to be annotated and an annotator
identification code, and allocating the data annotation task to be annotated to the annotator
according to the matching result;
a data annotation step of annotating the data to be annotated in the required labeling form; and
a collection and integration step of, after all annotation results of the data annotation task to
be annotated have been submitted, integrating the annotation results according to the annotation
score of the annotator and the annotation results, and inferring the correct label.
2. The data annotation method according to claim 1, further comprising:
an annotation progress monitoring step of monitoring the annotation progress of the data to be
annotated;
wherein, when the data annotation task to be annotated has not been started within a specified
time, the data annotation task is reallocated, and the data to be annotated is assigned to another
annotator having a higher annotation score in the task category of the data annotation task, who
continues the annotation.
3. The data annotation method according to claim 1, further comprising:
a score updating step of updating the annotator's score for the corresponding data annotation task
according to the quality of the annotator's annotations.
4. The data annotation method according to any one of claims 1 to 3, further comprising:
a data annotation task category definition step of dividing the data annotation tasks to be
annotated into different categories and providing a unique task identification code for each data
annotation task category.
5. The data annotation method according to claim 4, characterized in that the data annotation task
category to be annotated is obtained based on the task identification code, and the data
identification code is generated for each item of data to be annotated based on the obtained data
annotation task category.
6. The data annotation method according to any one of claims 1 to 3, further comprising: a
preprocessing step of extracting the data information to be annotated from the raw data of the data
to be annotated uploaded by the annotation task publisher.
7. The data annotation method according to claim 6, characterized in that, in the preprocessing
step, the corresponding fields are extracted from the data to be annotated.
8. The data annotation method according to any one of claims 1 to 3, further comprising:
an annotation qualification test step of generating, according to the data annotation task
categories applied for by the annotator and based on the annotator's performance on test content, a
test score for each data annotation task category applied for.
9. The data annotation method according to claim 8, characterized in that, in the annotation
qualification test step, the test content is generated in a targeted manner according to the
background knowledge and skills required by the different data annotation tasks in different
fields.
10. The data annotation method according to claim 8, characterized in that, if the annotator's test
score is higher than a preset threshold, the annotator obtains the annotation qualification for the
data annotation task.
11. The data annotation method according to claim 8, characterized in that, in the annotation
qualification test step, identity information of the annotator is generated according to the
annotator's test result, wherein the identity information includes the categories of data
annotation task that the annotator can perform and the annotation score for tasks of each such
category.
12. The data annotation method according to any one of claims 1 to 3, characterized in that the
annotator identification code contains annotator number information, technical field information,
and the score for each data annotation task type.
13. The data annotation method according to any one of claims 1 to 3, characterized in that the
data identification code contains a data number and a task identification code.
14. The data annotation method according to claim 13, characterized in that the task identification
code contains task number information, task type information and information on the technical
field covered.
15. The data annotation method according to any one of claims 1 to 3, characterized in that:
in the data annotation task allocation step, the data annotation task to be annotated is matched
with a plurality of annotators and is allocated to the annotators having higher annotation scores
corresponding to the data annotation task to be annotated; and
in the collection and integration step, after the annotation results of the plurality of annotators
for the data annotation task to be annotated have all been submitted, the plurality of annotation
results are integrated according to the annotation scores of the plurality of annotators and the
annotation results, and the correct label is inferred.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710569496 | 2017-07-13 | ||
CN2017105694968 | 2017-07-13 |
Publications (1)
Publication Number | Publication Date |
---|---|
CN107729378A (en) | 2018-02-23 |
Family
ID=61206268
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710828902.8A Pending CN107729378A (en) | 2017-07-13 | 2017-09-14 | A kind of data mask method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107729378A (en) |
-
2017
- 2017-09-14 CN CN201710828902.8A patent/CN107729378A/en active Pending
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103324620A (en) * | 2012-03-20 | 2013-09-25 | 北京百度网讯科技有限公司 | Method and device for rectifying marking results |
CN104573359A (en) * | 2014-12-31 | 2015-04-29 | 浙江大学 | Method for integrating crowdsource annotation data based on task difficulty and annotator ability |
CN104573988A (en) * | 2015-01-28 | 2015-04-29 | 数据堂(北京)科技股份有限公司 | Task outsourcing method and system |
CN106156025A (en) * | 2015-03-25 | 2016-11-23 | 阿里巴巴集团控股有限公司 | The management method of a kind of data mark and device |
CN105787521A (en) * | 2016-03-25 | 2016-07-20 | 浙江大学 | Semi-monitoring crowdsourcing marking data integration method facing imbalance of labels |
CN106779079A (en) * | 2016-11-23 | 2017-05-31 | 北京师范大学 | A kind of forecasting system and method that state is grasped based on the knowledge point that multimodal data drives |
Cited By (22)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110400029A (en) * | 2018-04-24 | 2019-11-01 | 北京京东尚科信息技术有限公司 | A kind of method and system of mark management |
CN108681811A (en) * | 2018-05-09 | 2018-10-19 | 北京慧听科技有限公司 | A kind of data ecosystem of decentralization |
CN108681811B (en) * | 2018-05-09 | 2022-10-18 | 北京慧听科技有限公司 | Decentralized data ecosystem |
CN108984490A (en) * | 2018-07-17 | 2018-12-11 | 北京猎户星空科技有限公司 | A kind of data mask method, device, electronic equipment and storage medium |
CN109063043A (en) * | 2018-07-17 | 2018-12-21 | 北京猎户星空科技有限公司 | A kind of data processing method, device, medium and equipment |
CN109255582A (en) * | 2018-07-24 | 2019-01-22 | 武汉空心科技有限公司 | Development approach and system based on fault tolerant mechanism |
CN109492997A (en) * | 2018-10-31 | 2019-03-19 | 四川长虹电器股份有限公司 | A kind of image labeling plateform system based on SpringBoot |
CN111339068A (en) * | 2018-12-18 | 2020-06-26 | 北京奇虎科技有限公司 | Crowdsourcing quality control method, apparatus, computer storage medium and computing device |
CN111339068B (en) * | 2018-12-18 | 2024-04-19 | 北京奇虎科技有限公司 | Crowd-sourced quality control method, device, computer storage medium and computing equipment |
CN109710933A (en) * | 2018-12-25 | 2019-05-03 | 广州天鹏计算机科技有限公司 | Acquisition methods, device, computer equipment and the storage medium of training corpus |
CN110096480A (en) * | 2019-03-28 | 2019-08-06 | 厦门快商通信息咨询有限公司 | A kind of text marking system, method and storage medium |
CN111079376A (en) * | 2019-11-14 | 2020-04-28 | 贝壳技术有限公司 | Data labeling method, device, medium and electronic equipment |
CN111177132A (en) * | 2019-12-20 | 2020-05-19 | 中国平安人寿保险股份有限公司 | Label cleaning method, device, equipment and storage medium for relational data |
CN113032649A (en) * | 2019-12-24 | 2021-06-25 | 华为技术有限公司 | Method and device for labeling data, terminal equipment and storage medium |
CN111414950A (en) * | 2020-03-13 | 2020-07-14 | 天津美腾科技股份有限公司 | Ore picture labeling method and system based on professional degree management of annotator |
CN111414950B (en) * | 2020-03-13 | 2023-08-18 | 天津美腾科技股份有限公司 | Ore picture labeling method and system based on labeling person professional management |
CN111626835A (en) * | 2020-04-27 | 2020-09-04 | 口碑(上海)信息技术有限公司 | Task configuration method, device, system, storage medium and computer equipment |
CN111626835B (en) * | 2020-04-27 | 2024-02-02 | 口碑(上海)信息技术有限公司 | Task configuration method, device, system, storage medium and computer equipment |
CN111859855A (en) * | 2020-06-11 | 2020-10-30 | 第四范式(北京)技术有限公司 | Method, device and equipment for processing labeling task and storage medium |
CN115248831A (en) * | 2021-04-28 | 2022-10-28 | 马上消费金融股份有限公司 | Labeling method, device, system, equipment and readable storage medium |
CN115248831B (en) * | 2021-04-28 | 2024-03-15 | 马上消费金融股份有限公司 | Labeling method, labeling device, labeling system, labeling equipment and readable storage medium |
CN113554130A (en) * | 2021-09-22 | 2021-10-26 | 平安科技(深圳)有限公司 | Data labeling method and device based on artificial intelligence, electronic equipment and medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
RJ01 | Rejection of invention patent application after publication | Application publication date: 20180223 |