CN107729378A - A data annotation method - Google Patents

Publication number
CN107729378A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710828902.8A
Other languages
Chinese (zh)
Inventor
陈吉红
陈峥
周源
杨建中
刘宇飞
张凯
林亨
董放
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Huazhong University of Science and Technology
Original Assignee
Tsinghua University
Huazhong University of Science and Technology

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20: Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24: Querying
    • G06F16/245: Query processing
    • G06F16/2458: Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2465: Query processing support for facilitating data mining operations in structured databases

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Fuzzy Systems (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The present invention provides a data annotation method comprising: a data annotation task allocation step of matching a pending data annotation task with annotators according to the data identification code of the data to be annotated and the annotator identification codes, and assigning the pending data annotation task to an annotator according to the matching result; a data annotation step of annotating the data to be annotated in the required annotation form; and a result collection and integration step of, after all annotation results of the pending data annotation task have been submitted, integrating the annotation results according to the annotators' annotation scores and the annotation results themselves to infer the correct label.

Description

A data annotation method
Technical field:
The present invention relates to the field of technology foresight, and more particularly to a multi-source heterogeneous data annotation system based on swarm intelligence.
Background art:
In recent years, with the rapid development of computer technology and the internet, big data has emerged in many forms. As data volumes grow, however, manually annotating corpora has become extremely difficult and costly, so filtering and annotating big-data repositories is challenging, and crowdsourcing platforms have emerged in response. Crowdsourcing platforms nevertheless suffer from high investment, low efficiency, small data throughput, and no guarantee of annotation quality.
To address these problems, Chinese patent application publication No. CN106489149A discloses a data annotation method and system based on data mining and crowdsourcing. That application proposes flagging annotation results during the annotation process, which makes it easier to improve the accuracy of annotation results, can effectively improve annotation quality, and reduces annotation cost. In CN106489149A, crowdsourced annotation results are obtained, an integration algorithm is used to audit them automatically, problematic annotation results are screened out and flagged, and the automatically audited crowdsourced annotation results, which include the flagged problematic results, are output.
In the field of technology foresight, however, the data to be annotated are data in a broad sense. The scope of annotation covers both labeling the technical field of papers, patents, news, and other web text data, and annotation needs that are specific to technology foresight, such as annotating the development stage of a technology, the type of a technology, the importance of a journal, or the influence of a research institution. The annotation forms are highly flexible, and the annotation tasks themselves are difficult. In technology foresight, therefore, different data types in different fields require annotators with the corresponding annotation ability to complete the corresponding tasks. For these reasons, technology foresight annotation tasks place high demands on annotators' domain knowledge, and the technology disclosed in CN106489149A is not adequate for data annotation work in this field. What is currently needed is an annotation system that can meet the annotation demands of technology foresight and provide data annotation support for the field.
Summary of the invention:
The scope of the present invention is defined only by the appended claims and is not limited in any way by the statements in this summary.
To overcome the above technical problems, the present invention provides a data annotation method comprising: a data annotation task allocation step of matching a pending data annotation task with annotators according to the data identification code of the data to be annotated and the annotator identification codes, and assigning the pending data annotation task to an annotator according to the matching result; a data annotation step of annotating the data to be annotated in the required annotation form; and a result collection and integration step of, after all annotation results of the pending data annotation task have been submitted, integrating the annotation results according to the annotators' annotation scores and the annotation results themselves to infer the correct label. By matching pending data annotation tasks with annotators and selecting annotators who have the relevant domain knowledge, this technical solution achieves higher annotation accuracy, greatly reduces the cost of technology foresight, and improves the ability to carry out technology foresight.
Preferably, the data annotation method further comprises an annotation progress monitoring step of monitoring the annotation progress of the data to be annotated, wherein, when a pending data annotation task has not been started within a specified time, the task is reassigned: the data to be annotated are distributed to another annotator who has a higher annotation score for the task category of that data annotation task, and that annotator carries on the annotation. With this technical solution, data annotation tasks can proceed smoothly and on time, improving the efficiency of big-data analysis and the accuracy of technology foresight.
Preferably, the data annotation method further comprises a score update step of updating the annotator's score for the corresponding data annotation task according to the quality of the annotator's annotations. With this technical solution, data annotation tasks can be assigned accurately and effectively to the right annotators, improving the efficiency of big-data analysis and the accuracy of technology foresight.
Preferably, the data annotation method further comprises a data annotation task category definition step of dividing the pending data annotation tasks into different categories and providing a unique task identification code for each data annotation task category. With this technical solution, data annotation tasks can be assigned accurately and effectively to the right annotators, improving the efficiency of big-data analysis and the accuracy of technology foresight.
Preferably, in the data annotation method, the data annotation task category of the pending task is obtained based on the task identification code, and the data identification code is generated for each item of data to be annotated based on the acquired data annotation task category. With this technical solution, data annotation tasks can be assigned accurately and effectively to the right annotators, improving the efficiency of big-data analysis and the accuracy of technology foresight.
Preferably, the data annotation method further comprises a preprocessing step of extracting the information of the data to be annotated from the raw data uploaded by the annotation task publisher.
Preferably, in the preprocessing step, the corresponding fields are extracted from the data to be annotated.
Preferably, the data annotation method further comprises an annotation qualification test step of generating, according to the data annotation task categories applied for by the annotator and based on the annotator's performance on the test content, a test score under each applied-for data annotation task category. With this technical solution, data annotation tasks can be assigned accurately and effectively to the right annotators, improving the efficiency of big-data analysis and the accuracy of technology foresight.
Preferably, in the annotation qualification test step, the test content is generated according to the background knowledge and skills required by the different data annotation tasks in different fields. With this technical solution, the annotation qualification test is more targeted.
Preferably, in the data annotation method, if the annotator's test score is higher than a preset threshold, the annotator obtains the annotation qualification for the data annotation task. With this technical solution, data annotation tasks can be assigned accurately and effectively to the right annotators, improving the efficiency of big-data analysis and the accuracy of technology foresight.
Preferably, in the annotation qualification test step, the annotator's identity information is generated according to the annotator's test results, wherein the identity information includes the data annotation task categories the annotator may perform and the annotation score for tasks of each such category. With this technical solution, the allocation of data annotation tasks is more targeted.
Preferably, in the data annotation method, the annotator identification code includes personnel number information, technical field information, and the scores for the data annotation task types. With this technical solution, the allocation of data annotation tasks is more targeted.
Preferably, in the data annotation method, the data identification code includes a data number and a task identification code. With this technical solution, the allocation of data annotation tasks is more targeted.
Preferably, in the data annotation method, the task identification code includes task number information, task type information, and covered technical field information. With this technical solution, the allocation of data annotation tasks is more targeted.
Preferably, in the data annotation task allocation step, the pending data annotation task is matched with multiple annotators and is assigned to those matched annotators who have higher annotation scores for that task; in the result collection and integration step, after the annotation results of the multiple annotators for the pending data annotation task have all been submitted, the multiple annotation results are integrated according to the annotation scores of the multiple annotators and the annotation results themselves to infer the correct label. With this technical solution, data annotation tasks can be assigned accurately and effectively to the right annotators, improving the accuracy of technology foresight.
Another aspect of the present invention provides a data annotation device comprising at least one processor, the at least one processor being capable of performing the following operations: matching a pending data annotation task with annotators according to the data identification code of the data to be annotated and the annotator identification codes, and assigning the pending data annotation task to an annotator according to the matching result; annotating the data to be annotated in the required annotation form; and, after all annotation results of the pending data annotation task have been submitted, integrating the annotation results according to the annotators' annotation scores and the annotation results themselves to infer the correct label.
Preferably, the at least one processor can also perform the following operation: monitoring the annotation progress of the data to be annotated, wherein, when a pending data annotation task has not been started within a specified time, the task is reassigned: the data to be annotated are distributed to another annotator who has a higher annotation score for the task category of that data annotation task, and that annotator carries on the annotation.
Preferably, the at least one processor can also perform the following operation: updating the annotator's score for the corresponding data annotation task according to the quality of the annotator's annotations.
Preferably, the at least one processor can also perform the following operation: dividing the pending data annotation tasks into different categories and providing a unique task identification code for each data annotation task category.
Preferably, the data annotation task category of the pending task is obtained based on the task identification code, and the data identification code is generated for each item of data to be annotated based on the acquired data annotation task category.
Preferably, the at least one processor can also perform the following operation: extracting the information of the data to be annotated from the raw data uploaded by the annotation task publisher.
Preferably, the corresponding fields are extracted from the data to be annotated.
Preferably, the at least one processor can also perform the following operation: generating, according to the data annotation task categories applied for by the annotator and based on the annotator's performance on the test content, a test score under each applied-for data annotation task category.
Preferably, the test content is generated in a targeted manner according to the background knowledge and skills required by the different data annotation tasks in different fields.
Preferably, if the annotator's test score is higher than a preset threshold, the annotator obtains the annotation qualification for the data annotation task.
Preferably, the annotator's identity information is generated according to the annotator's test results, wherein the identity information includes the data annotation task categories the annotator may perform and the annotation score for tasks of each such category.
Preferably, the annotator identification code includes personnel number information, technical field information, and the scores for the data annotation task types.
Preferably, the data identification code includes a data number and a task identification code.
Preferably, the task identification code includes task number information, task type information, and covered technical field information.
Preferably, the pending data annotation task is matched with multiple annotators and is assigned to those matched annotators who have higher annotation scores for that task; after the annotation results of the multiple annotators for the pending data annotation task have all been submitted, the multiple annotation results are integrated according to the annotation scores of the multiple annotators and the annotation results themselves to infer the correct label.
The present invention further provides a storage medium storing a program that enables at least one processor to perform the following operations: matching a pending data annotation task with annotators according to the data identification code of the data to be annotated and the annotator identification codes, and assigning the pending data annotation task to an annotator according to the matching result; annotating the data to be annotated in the required annotation form; and, after all annotation results of the pending data annotation task have been submitted, integrating the annotation results according to the annotators' annotation scores and the annotation results themselves to infer the correct label.
With the above technical solutions, multi-source heterogeneous data can be annotated systematically: not only text data from news, papers, patents, and the like, but also the development stage of a technology point, the type of a technology point, the importance of a journal, the influence of a research institution, and so on. This improves the efficiency of big-data analysis and the accuracy of technology foresight. In addition, because the annotation is performed by annotators who have passed a test and have the relevant domain knowledge, annotation accuracy is further improved, the cost of technology foresight is greatly reduced, and the ability to carry out technology foresight is improved.
Brief description of the drawings:
Fig. 1 is an architecture diagram of the data annotation system in an embodiment of the present invention;
Fig. 2 is a structural block diagram of the annotation platform in an embodiment of the present invention;
Fig. 3 is a structural block diagram of the data annotation processing system of the annotation platform in an embodiment of the present invention;
Fig. 4 is a flowchart of data annotation processing in an embodiment of the present invention;
Fig. 5 is a flowchart of the annotator qualification test in an embodiment of the present invention;
Fig. 6 is a flowchart of data annotation task allocation in an embodiment of the present invention.
Detailed description of the embodiments:
The present invention is described below with reference to the embodiments shown in the drawings. The embodiments disclosed here are to be regarded as illustrative in all respects and not restrictive.
Fig. 1 is an architecture diagram of the data annotation system in this embodiment. As shown in Fig. 1, the multi-source heterogeneous data annotation system includes a task publisher terminal 1, an annotation platform 2, and an annotator terminal 3. The annotation platform 2 is communicatively connected to the task publisher terminal 1 and the annotator terminal 3 through networks 4 and 5, respectively. The task publisher terminal 1 and the annotator terminal 3 may be terminal devices such as PCs, tablets, or mobile phones. The annotation platform 2 may be a platform device such as a server. The networks 4 and 5 may be wired or wireless networks, computer networks, mobile communication networks, or the like. Besides the networks 4 and 5, other communication connections such as Bluetooth may also be used.
The annotation task publisher logs in to the annotation platform 2 through the task publisher terminal 1 to publish and define data annotation tasks. An annotator logs in to the annotation platform 2 through the annotator terminal 3 to take the annotation qualification test, receive data annotation tasks, and perform data annotation operations. The annotation platform 2 carries out data annotation processing according to the data annotation tasks that the publisher publishes and defines through the task publisher terminal 1 and according to the annotators' annotation operations.
Fig. 2 is a structural block diagram of the annotation platform in this embodiment. As shown in Fig. 2, the annotation platform 2 may be a platform device such as a server and is mainly composed of a data processing controller 21 (comprising a CPU, ROM, RAM, and so on), a display 22, and a keyboard 23. The data processing controller 21 is mainly composed of a CPU 21a, a ROM 21b, a RAM 21c, a hard disk 21d, a reading device 21e, an input/output interface 21f, and a communication interface 21g. These components are connected to one another by a bus 21i and can exchange control signals, data, and the like with one another.
The CPU 21a can execute the computer programs stored in the ROM 21b and the computer programs loaded into the RAM 21c.
The ROM 21b is composed of read-only memory such as PROM, EPROM, or EEPROM and stores the computer programs executed by the CPU 21a and the data they use. The RAM 21c is composed of SRAM, DRAM, or the like and is used to read out the computer programs stored in the ROM 21b and the hard disk 21d. The RAM 21c also serves as the working space of the CPU 21a when it executes these computer programs.
The hard disk 21d stores the various computer programs executed by the CPU 21a, such as the operating system and application programs, together with the data used in executing them. The data annotation application program 7a of this embodiment is also stored on the hard disk 21d.
The reading device 21e is composed of a floppy disk drive, CD-ROM drive, DVD-ROM drive, or the like and can read computer programs or data stored on a portable storage medium 7. The portable storage medium 7 stores the data annotation application program 7a; the annotation platform 2 can read the application program 7a from the portable storage medium 7 and install it on the hard disk 21d.
The application program 7a need not be provided only on the portable storage medium 7. It can also be downloaded over an electric communication line (wired or wireless) from an external device that is connected to that line and able to communicate with the annotation platform 2. For example, the application program 7a may be stored on the hard disk of a network server; the annotation platform 2 accesses that server, downloads the application program 7a, and installs it on the hard disk 21d.
The hard disk 21d is loaded with an operating system that provides a graphical user interface, such as Windows (registered trademark) produced by Microsoft. In the following description, the application program 7a of this embodiment runs on this operating system.
The input/output interface 21f is composed of serial interfaces such as USB, IEEE1394, and RS-232C, parallel interfaces such as SCSI, IDE, and IEEE1284, and analog signal interfaces composed of D/A converters, A/D converters, and the like. The keyboard 23 is connected to the input/output interface 21f, so a user can enter data directly into the annotation platform 2 with the keyboard 23.
The communication interface 21g may be, for example, an Ethernet (registered trademark) interface. Through the communication interface 21g, the annotation platform 2 can exchange data with the task publisher terminal 1 and the annotator terminal 3 using a given communication protocol.
The main function of the data annotation application program 7a on the hard disk 21d of the data processing controller 21 is to carry out data annotation processing according to the data annotation tasks that the publisher publishes and defines through the task publisher terminal 1 and according to the annotators' annotation operations.
Fig. 3 is a structural block diagram of the data annotation processing system of the annotation platform in this embodiment. As shown in Fig. 3, the data annotation processing system includes a task definition module 31, a data upload module 32, a data processing module 33, a task allocation module 34, an annotation module 35, a result collection and integration module 36, an annotator management module 37, an annotation qualification test module 38, and an annotation real-time monitoring module 39.
The task definition module 31 carries out the operation in which the annotation task publisher, logged in to the annotation platform 2 through the task publisher terminal 1, defines data annotation tasks. The publisher defines data annotation task categories according to technology foresight needs, for example: technical field division (technical fields such as robotics or biotechnology); sub-technical field division (taking robotics as an example, sub-technical fields may include reducers, sensors, and so on); and technology type judgment (judging which type a technology belongs to, such as disruptive technology or emerging technology). Once the tasks have been divided, the task definition module 31 provides a task identification code TI for the data annotation tasks under each task category. The task identification code has the following format: TI = {task number; task type; covered technical field; covered sub-technical field}. The task number uniquely identifies the current data annotation task; the task type indicates which category the current data annotation task belongs to; the covered technical field indicates which technical field the data to be annotated in the task cover; and the covered sub-technical field indicates which subfield within that technical field the data to be annotated cover.
The data upload module 32 is used by the publisher to upload the data to be annotated that correspond to a data annotation task to the annotation platform 2, and it generates a data identification code DI for those data according to the data annotation task category. The data identification code has the following format: DI = {data number; task identification code TI}. The data number is the unique identity of a specific set of data to be annotated; the task identification code carries the information defined for the task identification code above.
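The two identification codes described above can be sketched as plain record types. This is only an illustration: the class names `TaskId` and `DataId`, the helper `make_data_id`, the field names, and the sample values are hypothetical and do not come from the patent, which specifies only which pieces of information each code carries.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskId:
    """Task identification code TI (field names are illustrative)."""
    task_number: str          # uniquely identifies the data annotation task
    task_type: str            # category of the task
    technical_field: str      # technical field covered by the data
    sub_technical_field: str  # subfield covered by the data


@dataclass(frozen=True)
class DataId:
    """Data identification code DI = {data number; TI}."""
    data_number: str  # unique identity of one data set to be annotated
    task_id: TaskId   # carries the task definition along with the data


def make_data_id(data_number: str, task_id: TaskId) -> DataId:
    """Attach the task identification code to an uploaded data set."""
    return DataId(data_number=data_number, task_id=task_id)


# Hypothetical sample values.
ti = TaskId("T001", "technical field division", "robotics", "sensor")
di = make_data_id("D0001", ti)
print(di.task_id.technical_field)
```

Embedding the whole TI inside DI, rather than only the task number, mirrors the stated format and lets the allocation step read the task type and field directly from the data identification code.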
The data processing module 33 preprocesses the data to be annotated so that the information to be annotated can be extracted from the raw data uploaded by the annotation task publisher. Preprocessing here mainly means extracting the corresponding fields from the data to be annotated. Depending on the publisher's requirements, the data processing module 33 can extract different fields, such as the abstract or the keywords.
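A minimal sketch of this field extraction step, assuming each raw record arrives as a dictionary. The function name `extract_fields` and the record layout are hypothetical; the patent does not specify a data format.

```python
from typing import Any, Dict, Iterable


def extract_fields(raw_record: Dict[str, Any], wanted: Iterable[str]) -> Dict[str, Any]:
    """Keep only the requested fields that are present in the raw record."""
    return {name: raw_record[name] for name in wanted if name in raw_record}


# Hypothetical raw record uploaded by a task publisher.
raw = {
    "title": "A harmonic reducer for service robots",
    "abstract": "This paper describes a compact reducer design.",
    "keywords": ["reducer", "robotics"],
    "full_text": "(long body omitted)",
}

# The publisher asks for the abstract and keywords only; a requested field
# that the record lacks ("claims") is simply skipped.
item = extract_fields(raw, ["abstract", "keywords", "claims"])
print(sorted(item))
```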
The task allocation module 34 matches pending data annotation tasks with annotators according to the data annotation task categories defined by the task definition module 31 and assigns the tasks according to the matching result. For a given data annotation task, the task allocation module 34 matches the pending task with annotators according to the data identification code DI of the data to be annotated and the annotator identification codes ID, and assigns the task according to the matching result. When assigning tasks, the module gives priority to matched annotators with higher annotation scores for that kind of data annotation task. To ensure annotation quality, the data annotation processing system lets the annotation task publisher set the annotation redundancy (an odd number) according to its own quality requirements for the annotation results, that is, the number of annotators to whom one item of data to be annotated can be assigned at the same time. For example, if the publisher sets the redundancy of a data annotation task to 7 (redundancy being the number of people who annotate the same task), the task allocation module 34 distributes the data to be annotated to 7 annotators matched to that kind of data annotation task.
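The allocation rule described above can be sketched as a filter followed by a sort. The `Annotator` class, the `assign_task` function, and the exact matching criteria (equal technical field and subfield, plus a score for the task type) are assumptions made for illustration; the patent states only that matching uses the two identification codes and that higher-scoring annotators are preferred.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Annotator:
    """Illustrative stand-in for the annotator identification code ID."""
    number: str
    technical_field: str
    sub_technical_field: str
    scores: Dict[str, float]  # task type -> annotation score


def assign_task(task_type: str, technical_field: str, sub_technical_field: str,
                annotators: List[Annotator], redundancy: int) -> List[Annotator]:
    """Pick up to `redundancy` matching annotators, highest score first."""
    if redundancy < 1 or redundancy % 2 == 0:
        raise ValueError("redundancy must be a positive odd number")
    matched = [
        a for a in annotators
        if a.technical_field == technical_field
        and a.sub_technical_field == sub_technical_field
        and task_type in a.scores  # holds a qualification for this task type
    ]
    matched.sort(key=lambda a: a.scores[task_type], reverse=True)
    # Fewer than `redundancy` annotators are returned if too few match.
    return matched[:redundancy]


pool = [
    Annotator("A1", "robotics", "sensor", {"technology type judgment": 80.0}),
    Annotator("A2", "robotics", "sensor", {"technology type judgment": 70.0}),
    Annotator("A3", "robotics", "sensor", {"technology type judgment": 95.0}),
    Annotator("A4", "biotechnology", "genomics", {"technology type judgment": 99.0}),
    Annotator("A5", "robotics", "sensor", {"technology type judgment": 60.0}),
]
chosen = assign_task("technology type judgment", "robotics", "sensor", pool, redundancy=3)
print([a.number for a in chosen])
```

An odd redundancy is enforced here because the source describes the setting as an odd number; with score-weighted integration a tie is still possible, so the odd count reduces ties rather than ruling them out.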
The annotation module 35 is used to annotate the data to be annotated. An annotator logs in to the annotation platform 2 through the annotator terminal 3 to perform data annotation, and the annotation module 35 carries out the annotator's annotation operations on the data. For different annotation forms, the data annotation processing system can provide different preset annotation forms and efficient annotation interfaces that make it easier for annotators to complete data annotation tasks. For a technical field division task, for example, the annotation form is to choose one label from several candidate labels as the category label of the data to be annotated, where the candidate labels are generated automatically by the data annotation processing system from the information provided by the annotation task publisher.
The result collection and integration module 36 integrates multiple annotation results to infer the correct label. For a single data annotation task, the data annotation processing system obtains annotation results from multiple annotators, and the result collection and integration module 36 integrates them to infer the correct label. The integration works as follows: the reference weight of each annotator's result is determined from that annotator's annotation score for the task category; the reference weights and the annotation results are then used to compute a correctness degree for each label; and the label with the greatest correctness degree is taken as the correct label of the data to be annotated. Once the correct labels for all tasks have been obtained, the results are returned to the annotation task publisher.
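The integration rule above amounts to a score-weighted vote. The sketch below assumes that the reference weight of an annotator is that annotator's score divided by the total score of everyone who labeled the item, and that the correctness degree of a label is the sum of the weights of the annotators who chose it. Both formulas are assumptions for illustration, since the patent does not give them; the function name `integrate` is likewise hypothetical.

```python
from collections import defaultdict
from typing import Dict


def integrate(annotations: Dict[str, str], scores: Dict[str, float]) -> str:
    """Return the label with the largest score-weighted support.

    annotations maps annotator number -> chosen label;
    scores maps annotator number -> annotation score for this task category.
    """
    total = sum(scores[a] for a in annotations)
    if total <= 0:
        raise ValueError("annotators must have a positive total score")
    correctness: Dict[str, float] = defaultdict(float)
    for annotator, label in annotations.items():
        correctness[label] += scores[annotator] / total  # assumed reference weight
    # On an exact tie, the label encountered first wins.
    return max(correctness, key=correctness.get)


# Two lower-scoring annotators outvote the third by head count, but the
# weighted vote follows the higher-scoring annotator (90 vs 40 + 45).
votes = {"A1": "emerging technology", "A2": "disruptive technology", "A3": "disruptive technology"}
points = {"A1": 90.0, "A2": 40.0, "A3": 45.0}
print(integrate(votes, points))
```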
The annotator management module 37 manages information about annotators. Based on an applicant's test results, the annotator management module 37 automatically generates the identity ID of the annotation task applicant (that is, the annotator identification code ID). The annotator identification code mainly records the data annotation task categories the annotator may perform and the annotation score for tasks of each category (the annotation score is initialized to the test score obtained for the corresponding data annotation task). The annotator identification code has the following format: ID = {personnel number; technical field; sub-technical field; data annotation task type 1, score 1; data annotation task type 2, score 2; …}. The personnel number is the unique code by which the data annotation processing system identifies the annotator; the technical field indicates which technical field the data annotation tasks the annotator can do belong to; the sub-technical field indicates which technical subfield those tasks specifically belong to; the task type indicates the task categories for which the annotator holds an annotation qualification; and the score, paired with a task category, represents the annotator's level on that kind of data annotation task. An annotator's score under each data annotation task category is not fixed; it is updated in real time according to the annotator's accuracy as data annotation tasks are carried out.
The annotation qualification test module 38 tests an annotator's ability to perform data annotation tasks. When registering on the annotation platform 2, an annotation task applicant first selects his or her technical field and subfield, and then takes a qualification test corresponding to the data annotation task applied for. The test content is generated by the annotation qualification test module 38 specifically for the background knowledge and skills required by each data annotation task in each field, so that it can comprehensively check whether the annotator is able to complete a given data annotation task. After the applicant completes the test, the annotation qualification test module 38 generates the applicant's test points for each task category applied for. The annotation task publisher sets a threshold on the test points for each category of data annotation task; if the applicant's test points exceed the corresponding threshold, the applicant obtains the annotation qualification for that task.
The real-time annotation monitoring module 39 is mainly responsible for monitoring annotators' progress and for updating their annotation points according to the annotation results. It monitors each annotator's progress and, by tracking how the data annotation tasks are being completed, optimizes the annotation process. If it finds that an annotator has not started a data annotation task within the specified time, the data to be annotated is reassigned to an annotator who has not yet been assigned the task and has higher annotation points; preferably, it is reassigned to the annotator with the highest annotation points among those not yet assigned the task. The module also updates annotators' points according to how closely their annotation results agree with the correct labels. If an annotator's results agree closely with the correct labels, the annotator's points for the corresponding task category rise; if the agreement is low, the points fall.
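The text states only the direction of the points update (up when agreement is high, down when it is low), not the rule itself. The sketch below is one possible reading under stated assumptions: agreement is measured as the fraction of items on which the annotator's label equals the inferred correct label, and the thresholds `high` and `low`, the step size, and the floor are all hypothetical parameters.

```python
def agreement(annotator_labels, correct_labels):
    """Fraction of shared items where the annotator matched the correct label.

    Both arguments map a data item id to a label. Returns None when the
    annotator has no item with a known correct label.
    """
    shared = [item for item in annotator_labels if item in correct_labels]
    if not shared:
        return None
    hits = sum(annotator_labels[item] == correct_labels[item] for item in shared)
    return hits / len(shared)


def update_points(points, rate, high=0.8, low=0.5, step=5.0, floor=0.0):
    """Raise points on high agreement, lower them on low agreement.

    All numeric defaults are assumptions made for this sketch.
    """
    if rate is None:
        return points          # nothing to judge yet
    if rate >= high:
        return points + step   # results agree closely: points rise
    if rate <= low:
        return max(floor, points - step)  # low agreement: points fall
    return points              # in between: leave unchanged


rate = agreement(
    {"d1": "A", "d2": "B", "d3": "A", "d4": "A"},
    {"d1": "A", "d2": "A", "d3": "A", "d4": "A"},
)
new_points = update_points(60.0, rate)
```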
Fig. 4 is a flow chart of the data annotation processing in this embodiment. As shown in Fig. 4, the annotation task publisher performs a defining operation on the data annotation task categories according to the anticipated technical requirements, and the task definition module 31 defines the data annotation tasks based on that operation (step S1). The annotation task publisher divides the data annotation tasks into different categories according to the annotation requirements; once the division is complete, the system assigns each data annotation task category a unique task identification code TI according to its rules, which uniquely distinguishes the data to be annotated for that kind of task within the system.
After the data annotation tasks have been divided into categories, the annotation task publisher uploads the data to be annotated. The data upload module 32 obtains the data annotation task category from the task identification code TI and, based on that category, generates a data identification code DI for each item of data (step S2). The data processing module 33 then preprocesses the data set (step S3); the preprocessing includes field extraction. The preprocessed data then waits to be allocated to different annotators for annotation.
For a given data annotation task, the task allocation module 34 matches the task to annotators according to the data identification code DI of the data to be annotated and the annotator identification codes ID, and allocates the task according to the matching result (step S4).
While an annotator is annotating the data, the real-time annotation monitoring module 39 monitors the annotator's progress in real time (step S5). If the module finds that the annotator has not started the data annotation task within the specified time (step S5: No), the task allocation module 34 reallocates the task, assigning the data to be annotated to the annotator who has the highest annotation points for the task category and has not yet annotated the task. If the module finds that the annotator has started the task within the specified time (step S5: Yes), the annotator goes on to complete the task (step S6). In step S6, the annotation module 35 annotates the data according to the annotator's input on the data to be annotated, in the preset annotation form, thereby completing the annotation task for that data. During annotation, the real-time annotation monitoring module 39 updates the annotator's points for the corresponding data annotation task according to the quality of the annotations, and the annotator management module 37 records the updated points (step S7).
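The step S5 check and the reallocation that follows can be sketched as below. The data layout (a dictionary of assignment records with an assignment time and a started flag) and all function names are assumptions made for illustration; the text does not say how the system stores this state.

```python
def stalled_annotators(assignments, now, timeout):
    """Annotators who have not started within the specified time (S5: No)."""
    return [
        annotator
        for annotator, state in assignments.items()
        if not state["started"] and now - state["assigned_at"] > timeout
    ]


def pick_replacement(category_points, excluded):
    """Highest-points annotator in this category not yet given the task."""
    candidates = {a: p for a, p in category_points.items() if a not in excluded}
    if not candidates:
        return None
    return max(candidates, key=candidates.get)


def reassign_stalled(assignments, category_points, now, timeout):
    """Move each stalled assignment to the best remaining annotator.

    Mutates `assignments` and returns the list of (old, new) moves.
    """
    tried = set(assignments)  # everyone who has already been given the task
    moves = []
    for old in stalled_annotators(assignments, now, timeout):
        new = pick_replacement(category_points, tried)
        if new is None:
            break  # nobody left to take over
        tried.add(new)
        del assignments[old]
        assignments[new] = {"assigned_at": now, "started": False}
        moves.append((old, new))
    return moves


assignments = {
    "a1": {"assigned_at": 0.0, "started": True},
    "a2": {"assigned_at": 0.0, "started": False},
}
points = {"a1": 90, "a2": 80, "a3": 70, "a4": 85}
moves = reassign_stalled(assignments, points, now=100.0, timeout=60.0)
```

The `tried` set keeps a stalled annotator from being chosen again as a replacement for the same task.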
After all annotation results for a data annotation task have been submitted, the result collection and integration module 36 integrates them according to the annotators' annotation points and annotation results, and infers the correct label (step S8). When the data annotation task is complete, the result collection and integration module 36 collects the annotation results and returns them to the annotation task publisher.
The real-time annotation monitoring module 39 monitors in real time whether the data annotation tasks of every category issued by the annotation task publisher have been completed (step S9). If they have not all been completed (step S9: No), processing returns to step S4, and the task allocation module 34 reallocates the annotation tasks for the data that remains unannotated. If they have all been completed (step S9: Yes), the data annotation processing ends (step S10).
Fig. 5 shows the annotator qualification test flow in this embodiment. As shown in Fig. 5, an annotation applicant logs in to the annotation platform 2 through the annotator terminal 3 (step S51) and selects the data annotation tasks to apply for according to his or her technical field, background knowledge, and skills (step S52). The annotation qualification test module 38 receives the task categories applied for and generates test content targeted at the background knowledge and skills that each data annotation task in each field requires (step S53). After the applicant completes the test, the module generates the applicant's test points for each category applied for, and the annotator management module 37 records them (step S54). Using the threshold that the annotation task publisher has set on test points for each category, the annotation qualification test module 38 judges whether the applicant's test points exceed the threshold (step S55). If they do, the annotator management module 37 automatically generates the applicant's identity ID, the applicant obtains the annotation qualification for the corresponding task (step S56), and the test ends (step S57). If the test points are below the threshold, the test ends directly (step S57).
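Steps S54 to S56 amount to a per-category threshold comparison followed by ID generation. The following minimal sketch assumes that test points and thresholds are held in dictionaries keyed by task category; the function names and the shape of the returned record are hypothetical.

```python
def qualify(test_points, thresholds):
    """Return the categories whose test points exceed the publisher's threshold.

    The returned mapping doubles as the initial annotation points, since the
    annotation points are initialized to the test points.
    """
    granted = {}
    for category, points in test_points.items():
        threshold = thresholds.get(category)
        # Step S55: qualification requires points strictly above the threshold.
        if threshold is not None and points > threshold:
            granted[category] = points
    return granted


def register(person_number, test_points, thresholds):
    """Step S56: issue an identity record only if some category was passed."""
    granted = qualify(test_points, thresholds)
    if not granted:
        return None  # step S57 reached directly, no qualification obtained
    return {"person_number": person_number, "task_points": granted}


record = register(
    "P0001",
    {"image classification": 82, "text tagging": 55, "speech": 70},
    {"image classification": 60, "text tagging": 60, "speech": 70},
)
```

A score exactly equal to the threshold does not qualify here, because the text requires the test points to be higher than the threshold.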
Fig. 6 is a flow chart of data annotation task allocation in this embodiment. As shown in Fig. 6, for a given data annotation task, the task allocation module 34 obtains the data identification code DI of the task and the annotator identification codes ID (step S61) and, based on the annotator identification codes, selects an annotator with higher annotation points for that category of task (step S62). The task allocation module 34 then judges whether the number of selected annotators meets the configured annotation redundancy requirement (step S63). If it does not (step S63: No), the module continues to select, from the remaining annotators, those with higher annotation points for that category. If it does (step S63: Yes), the module allocates the data to be annotated to all of the selected annotators matched to that category of task (step S64).
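The selection loop of steps S62 and S63 can be sketched as repeatedly taking the highest-scoring remaining annotator until the redundancy requirement is met. The function name `allocate` and the dictionary layout are assumptions. The text does not say what happens when fewer qualified annotators exist than the redundancy requires; this sketch simply returns those that are available.

```python
def allocate(category, annotators, redundancy):
    """Pick annotators for one task category, best points first.

    annotators: {annotator_id: {task_category: points}}
    redundancy: number of annotators who should label each data item
    Returns the selected annotator ids in order of selection.
    """
    # Only annotators qualified for this category are candidates.
    remaining = {
        annotator: pts[category]
        for annotator, pts in annotators.items()
        if category in pts
    }
    selected = []
    while len(selected) < redundancy and remaining:  # step S63
        best = max(remaining, key=remaining.get)     # step S62
        selected.append(best)
        del remaining[best]
    return selected                                  # step S64 uses this list


pool = {
    "a1": {"image classification": 70},
    "a2": {"image classification": 90, "text tagging": 50},
    "a3": {"text tagging": 95},
    "a4": {"image classification": 80},
}
chosen = allocate("image classification", pool, redundancy=2)
```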
In the above embodiment, the task definition module 31 is located on the annotation platform 2, but the present invention is not limited to this. The task definition module 31 may instead be located on the task publisher terminal 1; in that case the task publisher uses the task publisher terminal 1 to upload the data annotation tasks defined on it to the annotation platform 2, where the data annotation is carried out.
The scope of the present invention is not limited by the description of the embodiments; it is defined only by the scope of the claims and includes all modifications within the meaning and scope equivalent to the claims.

Claims (15)

1. A data annotation method, comprising:
a data annotation task allocation step of matching a data annotation task to be annotated with annotators according to a data identification code of the data to be annotated and annotator identification codes, and allocating the data annotation task to the annotators according to the matching result;
a data annotation step of annotating the data to be annotated in the required annotation form; and
a collection and integration step of, after all annotation results for the data annotation task have been submitted, integrating the annotation results according to the annotation points of the annotators and the annotation results, and inferring the correct label.
2. The data annotation method according to claim 1, further comprising:
an annotation progress monitoring step of monitoring the annotation progress of the data to be annotated;
wherein, when the data annotation task has not been started within a specified time, the data annotation task is reallocated, and the data to be annotated is assigned to another annotator having higher annotation points for the task category of the data annotation task, who continues the annotation.
3. The data annotation method according to claim 1, further comprising:
a points update step of updating the annotator's points for the corresponding data annotation task according to the quality of the annotator's annotations.
4. The data annotation method according to any one of claims 1 to 3, further comprising:
a data annotation task category definition step of dividing the data annotation tasks into different categories and providing a unique task identification code for each data annotation task category.
5. The data annotation method according to claim 4, characterized in that the data annotation task category is obtained from the task identification code, and the data identification code is generated for each item of data to be annotated based on the obtained data annotation task category.
6. The data annotation method according to any one of claims 1 to 3, further comprising: a preprocessing step of extracting the data information to be annotated from the raw data of the data to be annotated uploaded by the annotation task publisher.
7. The data annotation method according to claim 6, characterized in that, in the preprocessing step, the corresponding fields are extracted from the data to be annotated.
8. The data annotation method according to any one of claims 1 to 3, further comprising:
an annotation qualification test step of generating, for the data annotation task categories applied for by the annotator and based on the annotator's performance on test content, test points under each data annotation task category applied for.
9. The data annotation method according to claim 8, characterized in that, in the annotation qualification test step, the test content is generated specifically for the background knowledge and skills required by each data annotation task in each field.
10. The data annotation method according to claim 8, characterized in that, if the test points of the annotator are higher than a preset threshold, the annotator obtains the annotation qualification for the data annotation task.
11. The data annotation method according to claim 8, characterized in that, in the annotation qualification test step, identity information of the annotator is generated according to the annotator's test results, wherein the identity information includes the categories of data annotation task that the annotator can perform and the annotation points for performing tasks of each category.
12. The data annotation method according to any one of claims 1 to 3, characterized in that the annotator identification code contains person number information, technical field information, and the points for each data annotation task type.
13. The data annotation method according to any one of claims 1 to 3, characterized in that the data identification code contains a data number and a task identification code.
14. The data annotation method according to claim 13, characterized in that the task identification code contains task number information, task type information, and information on the technical field covered.
15. The data annotation method according to any one of claims 1 to 3, characterized in that:
in the data annotation task allocation step, the data annotation task is matched with a plurality of annotators and is allocated to those matched annotators who have higher annotation points for the data annotation task;
in the collection and integration step, after all annotation results of the plurality of annotators for the data annotation task have been submitted, the plurality of annotation results are integrated according to the annotation points of the plurality of annotators and the annotation results, and the correct label is inferred.
CN201710828902.8A 2017-07-13 2017-09-14 A kind of data mask method Pending CN107729378A (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201710569496 2017-07-13
CN2017105694968 2017-07-13

Publications (1)

Publication Number Publication Date
CN107729378A true CN107729378A (en) 2018-02-23

Family

ID=61206268

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710828902.8A Pending CN107729378A (en) 2017-07-13 2017-09-14 A kind of data mask method

Country Status (1)

Country Link
CN (1) CN107729378A (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103324620A (en) * 2012-03-20 2013-09-25 北京百度网讯科技有限公司 Method and device for rectifying marking results
CN104573359A (en) * 2014-12-31 2015-04-29 浙江大学 Method for integrating crowdsource annotation data based on task difficulty and annotator ability
CN104573988A (en) * 2015-01-28 2015-04-29 数据堂(北京)科技股份有限公司 Task outsourcing method and system
CN106156025A (en) * 2015-03-25 2016-11-23 阿里巴巴集团控股有限公司 The management method of a kind of data mark and device
CN105787521A (en) * 2016-03-25 2016-07-20 浙江大学 Semi-monitoring crowdsourcing marking data integration method facing imbalance of labels
CN106779079A (en) * 2016-11-23 2017-05-31 北京师范大学 A kind of forecasting system and method that state is grasped based on the knowledge point that multimodal data drives

Cited By (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110400029A (en) * 2018-04-24 2019-11-01 北京京东尚科信息技术有限公司 A kind of method and system of mark management
CN108681811A (en) * 2018-05-09 2018-10-19 北京慧听科技有限公司 A kind of data ecosystem of decentralization
CN108681811B (en) * 2018-05-09 2022-10-18 北京慧听科技有限公司 Decentralized data ecosystem
CN108984490A (en) * 2018-07-17 2018-12-11 北京猎户星空科技有限公司 A kind of data mask method, device, electronic equipment and storage medium
CN109063043A (en) * 2018-07-17 2018-12-21 北京猎户星空科技有限公司 A kind of data processing method, device, medium and equipment
CN109255582A (en) * 2018-07-24 2019-01-22 武汉空心科技有限公司 Development approach and system based on fault tolerant mechanism
CN109492997A (en) * 2018-10-31 2019-03-19 四川长虹电器股份有限公司 A kind of image labeling plateform system based on SpringBoot
CN111339068A (en) * 2018-12-18 2020-06-26 北京奇虎科技有限公司 Crowdsourcing quality control method, apparatus, computer storage medium and computing device
CN111339068B (en) * 2018-12-18 2024-04-19 北京奇虎科技有限公司 Crowd-sourced quality control method, device, computer storage medium and computing equipment
CN109710933A (en) * 2018-12-25 2019-05-03 广州天鹏计算机科技有限公司 Acquisition methods, device, computer equipment and the storage medium of training corpus
CN110096480A (en) * 2019-03-28 2019-08-06 厦门快商通信息咨询有限公司 A kind of text marking system, method and storage medium
CN111079376A (en) * 2019-11-14 2020-04-28 贝壳技术有限公司 Data labeling method, device, medium and electronic equipment
CN111177132A (en) * 2019-12-20 2020-05-19 中国平安人寿保险股份有限公司 Label cleaning method, device, equipment and storage medium for relational data
CN113032649A (en) * 2019-12-24 2021-06-25 华为技术有限公司 Method and device for labeling data, terminal equipment and storage medium
CN111414950A (en) * 2020-03-13 2020-07-14 天津美腾科技股份有限公司 Ore picture labeling method and system based on professional degree management of annotator
CN111414950B (en) * 2020-03-13 2023-08-18 天津美腾科技股份有限公司 Ore picture labeling method and system based on labeling person professional management
CN111626835A (en) * 2020-04-27 2020-09-04 口碑(上海)信息技术有限公司 Task configuration method, device, system, storage medium and computer equipment
CN111626835B (en) * 2020-04-27 2024-02-02 口碑(上海)信息技术有限公司 Task configuration method, device, system, storage medium and computer equipment
CN111859855A (en) * 2020-06-11 2020-10-30 第四范式(北京)技术有限公司 Method, device and equipment for processing labeling task and storage medium
CN115248831A (en) * 2021-04-28 2022-10-28 马上消费金融股份有限公司 Labeling method, device, system, equipment and readable storage medium
CN115248831B (en) * 2021-04-28 2024-03-15 马上消费金融股份有限公司 Labeling method, labeling device, labeling system, labeling equipment and readable storage medium
CN113554130A (en) * 2021-09-22 2021-10-26 平安科技(深圳)有限公司 Data labeling method and device based on artificial intelligence, electronic equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20180223
