CN109241513A - Method and device for data annotation based on a big data crowdsourcing model - Google Patents

Method and device for data annotation based on a big data crowdsourcing model

Info

Publication number
CN109241513A
Authority
CN
China
Prior art keywords
mark
data
crowdsourcing
big data
crowdsourcing model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810980947.1A
Other languages
Chinese (zh)
Inventor
黄佳威
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Bao Zun Agel Ecommerce Ltd
Original Assignee
Shanghai Bao Zun Agel Ecommerce Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Bao Zun Agel Ecommerce Ltd
Priority to CN201810980947.1A
Publication of CN109241513A
Current legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/166 Editing, e.g. inserting or deleting
    • G06F40/169 Annotation, e.g. comment data or footnotes

Abstract

The present invention relates to a method and device for data annotation based on a big data crowdsourcing model. The method comprises the following steps: step S1, provide preliminary data annotations using automated algorithmic annotation; step S2, distribute all annotations through a crowdsourcing model, with annotators judging only whether each annotation is right or wrong, which lowers the professional-domain expertise required of annotators; step S3, integrate the crowdsourced data and compile statistics over all annotation results to improve data accuracy. Its advantages are as follows: the method and device address the efficiency and training-cost problems of manual annotation as well as the relatively low accuracy of automated annotation. Combined with crowdsourcing's reduced demand on human expertise, a matching algorithm brings together the strengths of all three approaches to improve annotation efficiency, accuracy, and scalability.

Description

Method and device for data annotation based on a big data crowdsourcing model
Technical field
The present invention relates to the field of annotating data such as user reviews, user questions and answers, and pictures, to the big data field, and to the field of computer technology. Specifically, it is a method and device for data annotation based on a big data crowdsourcing model.
Background
In recent years, with the rapid development of computer technology and the internet, big data has appeared in many forms. However, the growth in data volume has made manual corpus annotation extremely difficult and expensive. Data annotation is a foundational step in big data processing, and generating high-quality annotation information is a challenge for existing big data algorithm systems.
Existing data annotation systems fall mainly into the following types:
The first is annotation based on manual identification, which relies entirely on human prior knowledge to annotate text.
The second is automated annotation based on templates, rules, or machine learning.
The third is semi-automatic annotation, in which a machine annotates the data first and the annotations considered ambiguous are then checked manually.
In current data annotation systems, the main problem with manual annotation is that it relies entirely on human effort. Accuracy can be improved, but this places certain demands on the annotators' own knowledge, and annotation is both slow and costly. Text annotation based on machine learning relies entirely on the knowledge of the algorithm researchers. It is highly efficient, but its accuracy has shortcomings and it adapts poorly to many special cases. At present, the accuracy of the second method is still usually verified by manual proofreading, and the method cannot adapt quickly to new annotation requirements.
Under the combined effect of big data and AI, more than ten billion new items of related data are generated every day. This poses a great challenge for data annotation. The main problems concern the accuracy, efficiency, cost, and scalability of annotation.
Given current data volumes, a high-quality annotated data set can no longer be obtained by manual methods alone. In addition, annotators who work on text or image annotation generally need considerable skill to keep the data consistent.
Fully automated methods also have problems. First, large volumes of data may appear highly uncertain to the algorithm even though the uncertainty stems only from possible grammatical variation in part of the information. Second, written and spoken language evolves constantly, and it is unclear how an algorithm should adapt to new grammatical phenomena, or even to related knowledge such as language settings.
Most importantly, new models and new businesses keep emerging in the internet era, and old annotations often do not suit new business scenarios, which places heavy demands on the scalability of annotation. In traditional annotation systems, extending to a new annotation often means starting over, which also burdens system design.
Chinese patent document CN201510130022.4, filing date 20150325, titled "Management method and device for data annotation", discloses a management method and device for data annotation. The method includes: obtaining the data set corresponding to a data annotation task and the annotation rules corresponding to each type of data in the data set; dividing the data set into data subsets; generating and publishing data annotation subtask descriptions for the data subsets according to the obtained annotation rules; sending a data subset to the sender of a first claim request for that subset; and receiving the annotated data from the sender of the first claim request. The management method further includes sending the call instruction of the annotation tool corresponding to each type of data in the data subset, and/or including in the published subtask description the target post-annotation data format for each type of data in the data subset.
The above patent document avoids format conversion of the annotated data. However, it does not disclose a technical solution that addresses the efficiency and training-cost problems of manual annotation and the relatively low accuracy of automated annotation, and that, combined with crowdsourcing's reduced demand on human expertise, uses a matching algorithm to bring together the strengths of all three approaches and improve annotation efficiency, accuracy, and scalability.
In summary, what is needed is a data annotation method and device that addresses the efficiency and training-cost problems of manual annotation and the relatively low accuracy of automated annotation, and that, combined with crowdsourcing's reduced demand on human expertise, uses a matching algorithm to bring together the strengths of all three approaches and improve annotation efficiency, accuracy, and scalability. No such data annotation method or device has yet been reported.
Summary of the invention
The purpose of the present invention is to address the shortcomings of the prior art by providing a data annotation method that tackles the efficiency and training-cost problems of manual annotation and the relatively low accuracy of automated annotation, and that, combined with crowdsourcing's reduced demand on human expertise, uses a matching algorithm to bring together the strengths of all three approaches and improve annotation efficiency, accuracy, and scalability.
Another purpose of the present invention is to provide a device for data annotation based on a big data crowdsourcing model.
To achieve the above purpose, the technical solution adopted by the present invention is:
A method for data annotation based on a big data crowdsourcing model, the method comprising the following steps:
Step S1. Provide preliminary data annotations using automated algorithmic annotation;
Step S2. Distribute all annotations through the crowdsourcing model, with annotators judging only whether each annotation is right or wrong, which lowers the professional-domain expertise required of annotators;
Step S3. Integrate the crowdsourced data and compile statistics over all annotation results to improve data accuracy.
As a preferred technical solution, the algorithm that annotates the data in step S1 is a classifier designed on a deep LSTM, bi-LSTM, GRU, or CNN neural network.
As a preferred technical solution, imbalanced annotations and new annotations are annotated repeatedly by follow-up generation from rules and similar annotations.
As a preferred technical solution, after annotation in step S2 is complete, statistics over the crowdsourcing results provide a large amount of machine learning resources for new annotations, so that the machine learning effect improves while annotation proceeds.
As a preferred technical solution, the method further includes mixing new types of annotation with false information.
As a preferred technical solution, the method presents new annotations for judgment by following rules and similar annotations.
As a preferred technical solution, the crowdsourcing model in step S2 is based on simple true/false judgments, but is not limited to the judgment format.
To achieve the second purpose above, the technical solution adopted by the present invention is:
A device for data annotation based on a big data crowdsourcing model, the device comprising an application of at least one of the methods described above.
The advantages of the invention are:
1. The method and device of the present invention for data annotation based on a big data crowdsourcing model address the efficiency and training-cost problems of manual annotation as well as the relatively low accuracy of automated annotation. Combined with crowdsourcing's reduced demand on human expertise, a matching algorithm brings together the strengths of all three approaches to improve annotation efficiency, accuracy, and scalability.
2. The present invention uses crowdsourcing, a new technique in the big data field whose main purpose is to reduce the erroneous information, whether human or automated, introduced when data is generated. Current applications and research concentrate mainly on managing uneven annotation quality in crowdsourcing and on improving crowdsourcing results to raise accuracy.
3. When the base data is pre-annotated, the algorithm that annotates the data is a classifier designed on neural networks such as deep LSTM, bi-LSTM, GRU, or CNN. When learning to classify a new annotation, transfer learning can be used to speed up the algorithm's learning, improving efficiency and adaptability to new business.
4. When all annotations are distributed through the crowdsourcing model, imbalanced annotations and new annotations are annotated repeatedly by follow-up generation from rules and similar annotations. This helps the system adapt to new business and new requirements and allows differences and details between closely related annotations or rules to be distinguished in the data. Old data can also be re-annotated, and false information can be used for reverse annotation, so that annotation accuracy keeps improving.
5. After all annotations have been distributed through the crowdsourcing model, statistics over the crowdsourcing results can also provide a large amount of learning resources for new annotations, so that learning improves while annotation proceeds, raising the efficiency of verifying manual annotations.
6. Pre-annotation uses an automated algorithm, and human annotators only judge whether each automated annotation is right or wrong, which avoids the efficiency loss caused by selecting among superfluous information during manual judgment.
7. Data annotation mixes automated annotations with new types of annotation and false information, which provides better material for comparison and for collection by the crowdsourcing algorithm and avoids human-caused data corruption.
Brief description of the drawings
Figure 1 is a flow diagram of a method of the present invention based on a big data crowdsourcing model.
Detailed description
The specific embodiments provided by the invention are described in detail below with reference to the drawing.
Referring to Fig. 1, Fig. 1 is a flow diagram of a method of the present invention based on a big data crowdsourcing model. A method based on a big data crowdsourcing model comprises the following steps:
Step S1. Provide preliminary data annotations using automated algorithmic annotation;
Step S2. Distribute all annotations through the crowdsourcing model, with annotators judging only whether each annotation is right or wrong, which lowers the professional-domain expertise required of annotators;
Step S3. Integrate the crowdsourced data and compile statistics over all annotation results to improve data accuracy.
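The three steps can be sketched end to end in a few lines. This is a minimal illustration, not the patented implementation: the keyword rule in `pre_label` stands in for the neural classifier, and the label names, the worker IDs, the redundancy of three, and the majority-vote rule in `integrate` are all assumptions, since the source only says that results are integrated statistically.

```python
from collections import Counter

def pre_label(text):
    """Step S1 stand-in: a trivial keyword rule in place of the neural classifier."""
    return "complaint" if "refund" in text.lower() else "praise"

def build_tasks(items, workers, redundancy=3):
    """Step S2: pair every pre-labeled item with several workers for a true/false check."""
    tasks = []
    for i, text in enumerate(items):
        label = pre_label(text)
        for k in range(redundancy):
            worker = workers[(i + k) % len(workers)]
            tasks.append({"item": i, "text": text, "label": label, "worker": worker})
    return tasks

def integrate(judgments):
    """Step S3: accept a pre-label only if most workers judged it correct."""
    votes = {}
    for j in judgments:
        votes.setdefault(j["item"], Counter())[j["correct"]] += 1
    return {item: c[True] > c[False] for item, c in votes.items()}

items = ["I want a refund", "Great shoes"]
tasks = build_tasks(items, ["w1", "w2", "w3"])
# Simulated answers: everyone confirms item 0; only w1 confirms item 1.
judgments = [dict(t, correct=(t["item"] == 0 or t["worker"] == "w1")) for t in tasks]
accepted = integrate(judgments)
```

Each worker only answers yes or no to a proposed label, which is what keeps the task simple; the cost is that every item is judged several times.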
In step S1, the algorithm that annotates the data is a classifier designed on neural networks such as deep LSTM, bi-LSTM, GRU, or CNN. When learning to classify a new annotation, transfer learning can be used to speed up the algorithm's learning, improving efficiency and adaptability to new business.
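The transfer-learning idea can be illustrated without a deep learning library: keep a previously trained feature extractor fixed and fit only a small new output layer for the new annotation. In the sketch below everything is hypothetical. `frozen_encoder` is a word-count stand-in for the body of a pretrained LSTM or CNN, and the word lists, training texts, and hyperparameters are invented for illustration.

```python
import math

POSITIVE = {"good", "great", "love"}
NEGATIVE = {"bad", "poor", "hate"}

def frozen_encoder(text):
    """Stand-in for a pretrained network body; it is never updated during training."""
    words = text.lower().split()
    return [sum(w in POSITIVE for w in words), sum(w in NEGATIVE for w in words)]

def train_head(examples, lr=0.5, epochs=200):
    """Fit only a new logistic output layer on top of the frozen features."""
    w, b = [0.0, 0.0], 0.0
    feats = [(frozen_encoder(t), y) for t, y in examples]
    n = len(feats)
    for _ in range(epochs):
        gw, gb = [0.0, 0.0], 0.0
        for x, y in feats:
            p = 1.0 / (1.0 + math.exp(-(w[0] * x[0] + w[1] * x[1] + b)))
            gw[0] += (p - y) * x[0]
            gw[1] += (p - y) * x[1]
            gb += p - y
        w = [w[0] - lr * gw[0] / n, w[1] - lr * gw[1] / n]
        b -= lr * gb / n
    return w, b

def predict(head, text):
    """True if the new head assigns the new label to the text."""
    w, b = head
    x = frozen_encoder(text)
    return w[0] * x[0] + w[1] * x[1] + b > 0

# A handful of examples for a new label is enough because the features are reused.
head = train_head([
    ("great product love it", 1),
    ("good", 1),
    ("bad quality hate it", 0),
    ("poor", 0),
])
```

Only three numbers are learned here (two weights and a bias), which is why a new annotation can be picked up from very little data; with a real network the same pattern means freezing the recurrent or convolutional layers and training a fresh classification layer.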
In step S2, imbalanced annotations and new annotations are annotated repeatedly by follow-up generation from rules and similar annotations. This helps the system adapt to new business and new requirements and allows differences and details between closely related annotations or rules to be distinguished in the data. Old data can also be re-annotated, and false information can be used for reverse annotation, so that annotation accuracy keeps improving.
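One possible reading of follow-up generation is sketched below: once an item has been judged against one label, further true/false questions are queued for labels that are easy to confuse with it, and new labels are asked more than once. The `SIMILAR` map, the label names, and the choice of two repeats are assumptions made for illustration; the source does not specify how follow-up tasks are produced.

```python
SIMILAR = {  # hypothetical map of labels that are easy to confuse
    "refund": ["return", "exchange"],
    "return": ["refund"],
}

def follow_up_tasks(text, judged_label, similar=SIMILAR, new_labels=frozenset()):
    """After a label is judged, queue true/false checks for its close neighbours.

    Labels listed in `new_labels` are asked twice so that they collect more judgments.
    """
    tasks = []
    for label in similar.get(judged_label, []):
        repeats = 2 if label in new_labels else 1
        for _ in range(repeats):
            tasks.append({"text": text, "question": f"Is this '{label}'?"})
    return tasks

queue = follow_up_tasks("Please send my money back", "refund", new_labels={"exchange"})
```

Asking about neighbouring labels on the same item is what lets the collected data separate close categories, and repeating new labels counteracts the imbalance between well-covered and freshly introduced annotations.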
After step S2 is complete, statistics over the crowdsourcing results can also provide a large amount of learning resources for new annotations, so that learning improves while annotation proceeds, raising the efficiency of verifying manual annotations.
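A hedged sketch of how crowdsourcing statistics could become learning resources: confirmed pre-labels turn into positive training rows and rejected ones into negative rows. The tuple layout, the `min_votes` threshold, and the rule of skipping ties are assumptions, not details from the source.

```python
from collections import defaultdict

def to_training_examples(judgments, min_votes=2):
    """Turn true/false judgments into (text, label, target) rows.

    target is 1 when most workers confirmed the label and 0 when most rejected it;
    ties and items with fewer than `min_votes` judgments are skipped.
    """
    tally = defaultdict(lambda: [0, 0])  # (text, label) -> [rejections, confirmations]
    for text, label, correct in judgments:
        tally[(text, label)][int(correct)] += 1
    rows = []
    for (text, label), (no, yes) in tally.items():
        if no + yes >= min_votes and no != yes:
            rows.append((text, label, int(yes > no)))
    return rows

rows = to_training_examples([
    ("I want a refund", "complaint", True),
    ("I want a refund", "complaint", True),
    ("Great shoes", "complaint", False),
    ("Great shoes", "complaint", False),
    ("Great shoes", "complaint", True),
    ("Arrived late", "complaint", True),
])
```

Rejections are kept rather than discarded because a confidently wrong pre-label is a useful negative example when the classifier is retrained.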
In this embodiment, it should be noted that:
The crowdsourcing model in this example is based on simple true/false judgments, which reduces the complexity of the manual work. It is not limited to the judgment format; other algorithms and presentation methods that can quickly obtain human matching results also fall within the concept of the invention.
This example presents new annotations for judgment by following rules and similar annotations, but is not limited to this approach; other algorithms that improve the ability to match new annotations also fall within the concept of the invention.
This example pre-annotates with an automated algorithm, and human annotators only judge whether each automated annotation is right or wrong, which avoids the efficiency loss caused by selecting among superfluous information during manual judgment.
This example mixes automated annotations with new types of annotation and false information, to provide better material for comparison and for collection by the crowdsourcing algorithm, and to avoid human-caused data corruption.
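Mixing false information into the task stream is commonly used to check how carefully workers answer. The sketch below is one way this could look, under stated assumptions: decoys are (text, label) pairs known to be wrong, the `decoy` flag is kept on the server side and never shown to workers, and reliability is simply the fraction of decoys a worker rejected. None of these names or rules come from the source.

```python
import random

def mix_in_decoys(real_items, decoys, seed=0):
    """Shuffle known-wrong (text, label) pairs in with the real pre-labeled items."""
    batch = [dict(item, decoy=False) for item in real_items]
    batch += [dict(item, decoy=True) for item in decoys]
    random.Random(seed).shuffle(batch)  # seeded only to make the sketch repeatable
    return batch

def worker_reliability(batch, answers):
    """Fraction of decoys each worker rejected; answers[worker][i] is the verdict on batch[i]."""
    decoy_idx = [i for i, item in enumerate(batch) if item["decoy"]]
    if not decoy_idx:
        raise ValueError("batch contains no decoys")
    return {
        worker: sum(not verdicts[i] for i in decoy_idx) / len(decoy_idx)
        for worker, verdicts in answers.items()
    }

batch = mix_in_decoys(
    [{"text": "I want a refund", "label": "complaint"}],
    [{"text": "Great shoes", "label": "complaint"}],
)
answers = {
    "careful": [not item["decoy"] for item in batch],  # rejects every decoy
    "careless": [True] * len(batch),                   # confirms everything
}
scores = worker_reliability(batch, answers)
```

A worker who confirms everything scores zero on the decoys, so their judgments can be down-weighted or dropped before the results are integrated.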
A device for data annotation based on a big data crowdsourcing model comprises an application of at least one of the methods described above.
Once put into use, the present invention can achieve the following technical effects:
With simple judgments, the skill required of annotators is lowered and the chance of error is reduced. By estimate, a simple annotation judgment is 20 times as efficient as manually selecting among complex annotations. However, because crowdsourcing requires distributing the data to multiple annotators, the full 20-fold efficiency cannot be reached in theory. Nevertheless, the method greatly lowers the skill required of annotators and, while improving efficiency without reducing accuracy, greatly increases the scalability of the system.
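A back-of-the-envelope model shows why the 20-fold figure is an upper bound. It assumes that total human effort grows linearly with the number of annotators who judge each item; the source gives only the 20-fold estimate, so the redundancy values below are illustrative.

```python
def effective_speedup(per_judgment_speedup=20.0, workers_per_item=3):
    """Speedup that remains after each item is judged by several workers."""
    if workers_per_item < 1:
        raise ValueError("each item needs at least one worker")
    return per_judgment_speedup / workers_per_item

for k in (1, 3, 5):
    print(k, round(effective_speedup(workers_per_item=k), 2))
```

Under this assumption, judging each item three times leaves a speedup of roughly 6.7 and judging it five times leaves 4, still well above manual selection while adding the redundancy needed for statistical integration.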
The above is only a preferred embodiment of the present invention. It should be noted that a person of ordinary skill in the art can make several improvements and additions without departing from the method of the present invention, and these improvements and additions should also be regarded as falling within the protection scope of the present invention.

Claims (8)

1. A method for data annotation based on a big data crowdsourcing model, characterized in that the method comprises the following steps:
Step S1. Provide preliminary data annotations using automated algorithmic annotation;
Step S2. Distribute all annotations through the crowdsourcing model, with annotators judging only whether each annotation is right or wrong, which lowers the professional-domain expertise required of annotators;
Step S3. Integrate the crowdsourced data and compile statistics over all annotation results to improve data accuracy.
2. The method for data annotation based on a big data crowdsourcing model according to claim 1, characterized in that the algorithm that annotates the data in step S1 is a classifier designed on a deep LSTM, bi-LSTM, GRU, or CNN neural network.
3. The method for data annotation based on a big data crowdsourcing model according to claim 1, characterized in that imbalanced annotations and new annotations are annotated repeatedly by follow-up generation from rules and similar annotations.
4. The method for data annotation based on a big data crowdsourcing model according to claim 1, characterized in that, after annotation in step S2 is complete, statistics over the crowdsourcing results provide a large amount of machine learning resources for new annotations, so that the machine learning effect improves while annotation proceeds.
5. The method for data annotation based on a big data crowdsourcing model according to claim 1, characterized in that the method further includes mixing new types of annotation with false information.
6. The method for data annotation based on a big data crowdsourcing model according to claim 1, characterized in that the method presents new annotations for judgment by following rules and similar annotations.
7. The method for data annotation based on a big data crowdsourcing model according to claim 1, characterized in that the crowdsourcing model in step S2 is based on simple true/false judgments, but is not limited to the judgment format.
8. A device for data annotation based on a big data crowdsourcing model, characterized in that the device applies the method of any one of claims 1-7.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810980947.1A CN109241513A (en) 2018-08-27 2018-08-27 Method and device for data annotation based on a big data crowdsourcing model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810980947.1A CN109241513A (en) 2018-08-27 2018-08-27 Method and device for data annotation based on a big data crowdsourcing model

Publications (1)

Publication Number Publication Date
CN109241513A (en) 2019-01-18

Family

ID=65068498

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810980947.1A Pending CN109241513A (en) 2018-08-27 2018-08-27 Method and device for data annotation based on a big data crowdsourcing model

Country Status (1)

Country Link
CN (1) CN109241513A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109993315A (en) * 2019-03-29 2019-07-09 联想(北京)有限公司 A kind of data processing method, device and electronic equipment
CN110135289A (en) * 2019-04-28 2019-08-16 北京天地玛珂电液控制系统有限公司 A kind of underground coal mine intelligent use cloud service platform based on deep learning
CN110597240A (en) * 2019-10-24 2019-12-20 福州大学 Hydroelectric generating set fault diagnosis method based on deep learning
CN110647985A (en) * 2019-08-02 2020-01-03 杭州电子科技大学 Crowdsourcing data labeling method based on artificial intelligence model library
CN111683131A (en) * 2020-06-01 2020-09-18 深圳大学 Disaster monitoring method, device, equipment and storage medium based on crowdsourcing mode
CN111985394A (en) * 2020-08-19 2020-11-24 东南大学 Semi-automatic instance labeling method and system for KITTI data set
CN113297902A (en) * 2021-04-14 2021-08-24 中国科学院计算机网络信息中心 Method and device for generating sample data set by marking remote sensing image on line based on crowdsourcing mode

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104573130A (en) * 2015-02-12 2015-04-29 北京航空航天大学 Entity resolution method based on group calculation and entity resolution device based on group calculation
CN105426826A (en) * 2015-11-09 2016-03-23 张静 Tag noise correction based crowd-sourced tagging data quality improvement method
CN107247972A (en) * 2017-06-29 2017-10-13 哈尔滨工程大学 One kind is based on mass-rent technology classification model training method

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109993315A (en) * 2019-03-29 2019-07-09 联想(北京)有限公司 A kind of data processing method, device and electronic equipment
CN109993315B (en) * 2019-03-29 2021-05-18 联想(北京)有限公司 Data processing method and device and electronic equipment
CN110135289A (en) * 2019-04-28 2019-08-16 北京天地玛珂电液控制系统有限公司 A kind of underground coal mine intelligent use cloud service platform based on deep learning
CN110647985A (en) * 2019-08-02 2020-01-03 杭州电子科技大学 Crowdsourcing data labeling method based on artificial intelligence model library
CN110597240A (en) * 2019-10-24 2019-12-20 福州大学 Hydroelectric generating set fault diagnosis method based on deep learning
CN110597240B (en) * 2019-10-24 2021-03-30 福州大学 Hydroelectric generating set fault diagnosis method based on deep learning
CN111683131A (en) * 2020-06-01 2020-09-18 深圳大学 Disaster monitoring method, device, equipment and storage medium based on crowdsourcing mode
CN111683131B (en) * 2020-06-01 2021-10-22 深圳大学 Disaster monitoring method, device, equipment and storage medium based on crowdsourcing mode
CN111985394A (en) * 2020-08-19 2020-11-24 东南大学 Semi-automatic instance labeling method and system for KITTI data set
CN111985394B (en) * 2020-08-19 2021-05-28 东南大学 Semi-automatic instance labeling method and system for KITTI data set
CN113297902A (en) * 2021-04-14 2021-08-24 中国科学院计算机网络信息中心 Method and device for generating sample data set by marking remote sensing image on line based on crowdsourcing mode
CN113297902B (en) * 2021-04-14 2023-08-08 中国科学院计算机网络信息中心 Method and device for generating sample data set based on crowdsourcing mode on-line labeling remote sensing image

Similar Documents

Publication Publication Date Title
CN109241513A (en) Method and device for data annotation based on a big data crowdsourcing model
CN112115736B (en) Job correction method and system based on image recognition and intelligent terminal
CN107705652A (en) A kind of teaching system
CN106373444B (en) A kind of Multifunctional English classroom with English teaching aid
CN110472494A (en) Face feature extracts model training method, facial feature extraction method, device, equipment and storage medium
CN108399525A (en) A kind of talent's appraisal procedure based on data mining and machine learning
WO2022170985A1 (en) Exercise selection method and apparatus, and computer device and storage medium
CN109657675B (en) Image annotation method and device, computer equipment and readable storage medium
Duy Khuat et al. Vietnamese sign language detection using Mediapipe
CN110489747A (en) A kind of image processing method, device, storage medium and electronic equipment
CN106777336A (en) A kind of exabyte composition extraction system and method based on deep learning
CN103413470B (en) C language teaching programming examination system ensemble and method
CN109740473A (en) A kind of image content automark method and system based on marking system
CN111223015A (en) Course recommendation method and device and terminal equipment
CN108491459A (en) Optimization method for software code abstract automatic generation model
CN112116840B (en) Job correction method and system based on image recognition and intelligent terminal
CN110348328A (en) Appraisal procedure, device, storage medium and the electronic equipment of quality of instruction
CN113888757A (en) Examination paper intelligent analysis method, examination paper intelligent analysis system and storage medium based on benchmarking evaluation
CN108172063A (en) A kind of intelligence is practised handwriting copybook generation method and system
KR20190068841A (en) System for training and evaluation of english pronunciation using artificial intelligence speech recognition application programming interface
CN104732320A (en) Computer professional technical ability verification training system
CN108305193B (en) Dynamic course creation method and system
JP2004094521A (en) Inquiry type learning method, learning device, inquiry type learning program, recording medium recorded with the program, recording medium recorded with learning data, inquiry type identification method and device using learning data, program, and recording medium with the program
CN113963306B (en) Courseware title making method and device based on artificial intelligence
CN109543512A (en) The evaluation method of picture and text abstract

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20190118