CN109241513A - A kind of method and device based on big data crowdsourcing model data mark - Google Patents
- Publication number
- CN109241513A (application number CN201810980947.1A)
- Authority
- CN
- China
- Prior art keywords
- mark
- data
- crowdsourcing
- big data
- crowdsourcing model
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/166—Editing, e.g. inserting or deleting
- G06F40/169—Annotation, e.g. comment data or footnotes
Abstract
The present invention relates to a method and device for data annotation based on a big-data crowdsourcing model. The method comprises the following steps: step S1, pre-annotate the data using an automated annotation algorithm; step S2, distribute all annotations through a crowdsourcing model, in which annotators only judge whether each annotation is right or wrong, lowering the professional-domain expertise required of annotators; step S3, integrate the crowdsourced data and compile statistics over all annotation results to improve data accuracy. Advantages: the method and device address the efficiency and training-cost problems of manual annotation as well as the relatively low accuracy of automated annotation. Building on the reduced skill requirement that crowdsourcing allows, a matching algorithm combines the advantages of the three approaches to improve annotation efficiency, accuracy, and scalability.
Description
Technical field
The present invention relates to the annotation of data such as user reviews, user questions and answers, and pictures, and to the fields of big data and computer technology. Specifically, it is a method and device for data annotation based on a big-data crowdsourcing model.
Background art
In recent years, with the rapid development of computer technology and the internet, big data has appeared in many forms. However, the growth in data volume has made manual corpus annotation extremely difficult and costly. Data annotation is a foundational step in big data processing, and generating high-quality annotation information is a challenge for existing big data algorithm systems.
Existing data annotation systems fall into several main types:
The first is annotation based on manual identification, which relies entirely on human prior knowledge to annotate text.
The second is automated annotation based on templates, rules, or machine learning.
The third is semi-automatic annotation, in which the machine annotates first and the annotations judged to be ambiguous are then checked manually.
In current annotation systems, the main problem with manual annotation is that it relies entirely on human effort. Accuracy can be high, but the annotators themselves need a certain level of knowledge, and both the time and the cost of annotation are high. Text annotation based on machine learning relies entirely on the knowledge of the algorithm researchers. It is very efficient, but its accuracy has shortcomings and it adapts poorly to many special cases. At present, the accuracy of the second approach is usually still verified by manual proofreading, and the approach cannot adapt quickly to new annotation requirements.
Under the combined effect of big data and AI, more than ten billion new data items are generated every day, which poses a great challenge for data annotation. The main problems are the accuracy, efficiency, cost, and scalability of annotation.
Given current data volumes, obtaining a high-quality annotated data set by manual methods alone is no longer feasible. In addition, annotators engaged in text or image annotation generally need strong skills to guarantee data consistency.
Fully automated methods also have problems. First, a large amount of data may show strong uncertainty under the algorithm, sometimes arising merely from grammatical variation in part of the information. Second, spoken and written language is constantly evolving, and it is unclear how an algorithm should adapt to new grammatical phenomena, or even to related knowledge such as language settings.
Most importantly, new models and new businesses keep emerging in the internet era, and old annotations often do not suit new business scenarios, which places heavy demands on the scalability of annotation. In traditional annotation systems, extending to a new annotation often means starting over, which also places a heavy burden on system design.
Chinese patent document CN201510130022.4, filed 2015-03-25 and entitled "A management method and device for data annotation", discloses a management method and device for data annotation. The method includes: obtaining a data set corresponding to a data annotation task, together with the annotation rules corresponding to each type of data in the data set; dividing the data set into data subsets; generating and publishing data annotation subtask description information for the data subsets according to the obtained annotation rules; sending a data subset to the sender of a first claim request for that subset; and receiving the annotated data from the sender of the first claim request. The management method further includes sending a call instruction for the annotation tool corresponding to each type of data in the data subset, and/or including, in the published subtask description information, the target post-annotation data format for each type of data in the data subset.
The above patent document avoids format conversion of the annotated data. However, it does not disclose a technical solution that addresses the efficiency and training-cost problems of manual annotation and the relatively low accuracy of automated annotation, and that, building on the reduced skill requirement crowdsourcing allows, uses a matching algorithm to combine the advantages of the three approaches and improve annotation efficiency, accuracy, and scalability.
In summary, there is a need for a data annotation method and device that addresses the efficiency and training-cost problems of manual annotation and the relatively low accuracy of automated annotation, and that, building on the reduced skill requirement crowdsourcing allows, uses a matching algorithm to combine the advantages of the three approaches and improve annotation efficiency, accuracy, and scalability. No such data annotation method or device has been reported so far.
Summary of the invention
The object of the present invention is to overcome the shortcomings of the prior art by providing a data annotation method that addresses the efficiency and training-cost problems of manual annotation and the relatively low accuracy of automated annotation, and that, building on the reduced skill requirement crowdsourcing allows, uses a matching algorithm to combine the advantages of the three approaches and improve annotation efficiency, accuracy, and scalability.
Another object of the present invention is to provide a device for data annotation based on a big-data crowdsourcing model.
To achieve the above object, the technical solution adopted by the present invention is:
A method for data annotation based on a big-data crowdsourcing model, the method comprising the following steps:
Step S1. Pre-annotate the data using an automated annotation algorithm;
Step S2. Distribute all annotations through a crowdsourcing model, in which annotators only judge whether each annotation is right or wrong, lowering the professional-domain expertise required of annotators;
Step S3. Integrate the crowdsourced data and compile statistics over all annotation results to improve data accuracy.
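The hand-off from step S1 to step S2 can be illustrated with a minimal Python sketch. It assumes that each pre-annotation is turned into a yes/no question and handed to several workers in round-robin order. The names PreAnnotation, JudgmentTask, build_judgment_tasks and the redundancy parameter are hypothetical and do not come from the patent, which does not specify a distribution scheme.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PreAnnotation:
    """Output of step S1: an item and the label proposed by the algorithm."""
    item_id: str
    text: str
    proposed_label: str


@dataclass(frozen=True)
class JudgmentTask:
    """Input to step S2: a yes/no question shown to one crowd worker."""
    item_id: str
    worker_id: str
    question: str


def build_judgment_tasks(pre_annotations, worker_ids, redundancy=3):
    """Assign each pre-annotation to `redundancy` distinct workers, round-robin.

    The round-robin assignment is an illustrative assumption, not a scheme
    described in the source.
    """
    if redundancy > len(worker_ids):
        raise ValueError("redundancy cannot exceed the number of workers")
    tasks = []
    for index, ann in enumerate(pre_annotations):
        for offset in range(redundancy):
            worker = worker_ids[(index + offset) % len(worker_ids)]
            question = f'Is "{ann.proposed_label}" a correct label for: {ann.text}'
            tasks.append(JudgmentTask(ann.item_id, worker, question))
    return tasks
```

Because the worker only answers yes or no, the question text carries everything needed for the judgment; no label taxonomy has to be learned by the worker.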
As a preferred technical solution, the algorithm that annotates the data in step S1 is a classifier designed on a deep LSTM, bi-LSTM, GRU, or CNN neural network.
As a preferred technical solution, unbalanced annotations and new annotations are annotated repeatedly using follow-up generation based on rules and similar annotations.
As a preferred technical solution, after annotation in step S2 is complete, statistics over the crowdsourcing results provide a large amount of machine learning resources for new annotations, so that the machine learning effect is improved while annotation proceeds.
As a preferred technical solution, the method further includes mixing new-type annotations with false information.
As a preferred technical solution, the method presents new annotations for judgment by following rules and similar annotations.
As a preferred technical solution, the crowdsourcing model in step S2 is based on simple true/false questions, but is not limited to true/false judgment.
To achieve the second object above, the technical solution adopted by the present invention is:
A device for data annotation based on a big-data crowdsourcing model, the device including at least one application of the method described above.
The advantages of the invention are:
1. The method and device for data annotation based on a big-data crowdsourcing model address the efficiency and training-cost problems of manual annotation as well as the relatively low accuracy of automated annotation. Building on the reduced skill requirement that crowdsourcing allows, a matching algorithm combines the advantages of the three approaches to improve annotation efficiency, accuracy, and scalability.
2. The invention uses crowdsourcing technology, a new technology in the big data field whose main purpose is to reduce the erroneous information introduced manually or automatically when data is generated. Current applications and research focus mainly on managing annotator skill levels in crowdsourcing and on improving crowdsourcing results to raise accuracy.
3. When the base data is pre-annotated, the annotation algorithm is a classifier designed on a neural network such as a deep LSTM, bi-LSTM, GRU, or CNN. When learning to classify a new annotation, transfer learning can be used to speed up the algorithm's learning, improving efficiency and adaptability to new business.
4. When all annotations are distributed through the crowdsourcing model, unbalanced annotations and new annotations are annotated repeatedly using follow-up generation based on rules and similar annotations. This helps the system adapt to new business and new requirements, and lets the data distinguish the differences and details between similar annotations or rules. Old data can also be re-annotated, and false information can be annotated in reverse, so that annotation accuracy keeps improving.
5. After all annotations have been distributed through the crowdsourcing model, statistics over the crowdsourcing results can also provide a large amount of learning resources for new annotations, so that learning improves while annotation proceeds and the efficiency of verifying manual annotations increases.
6. Pre-annotation uses an automated algorithm, and human annotators only judge whether the automated annotation is right or wrong, which avoids the loss of efficiency caused by choosing among extra information during manual judgment.
7. Data annotation mixes automated annotations, new-type annotations, and false information, so as to provide better information comparison and collection for the crowdsourcing algorithm and to avoid human-caused data contamination.
Brief description of the drawings
Fig. 1 is a flow diagram of the method for data annotation based on a big-data crowdsourcing model according to the invention.
Specific embodiment
The specific embodiments provided by the invention are described in detail below with reference to the accompanying drawing.
Referring to Fig. 1, Fig. 1 is a flow diagram of the method for data annotation based on a big-data crowdsourcing model according to the invention.
A method for data annotation based on a big-data crowdsourcing model, the method comprising the following steps:
Step S1. Pre-annotate the data using an automated annotation algorithm;
Step S2. Distribute all annotations through a crowdsourcing model, in which annotators only judge whether each annotation is right or wrong, lowering the professional-domain expertise required of annotators;
Step S3. Integrate the crowdsourced data and compile statistics over all annotation results to improve data accuracy.
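Step S3 does not name a particular integration statistic. One common choice is a majority vote over the true/false judgments, sketched below in Python. The function name integrate_judgments, the min_votes threshold, and the accept_ratio threshold are illustrative assumptions and are not taken from the patent.

```python
from collections import defaultdict


def integrate_judgments(judgments, min_votes=3, accept_ratio=0.5):
    """Aggregate (item_id, is_correct) votes into one decision per item.

    Returns a dict mapping item_id to "accept", "reject", or "undecided".
    Majority voting is an assumed stand-in for the unspecified integration
    step; min_votes and accept_ratio are hypothetical tuning parameters.
    """
    votes = defaultdict(list)
    for item_id, is_correct in judgments:
        votes[item_id].append(bool(is_correct))

    decisions = {}
    for item_id, item_votes in votes.items():
        if len(item_votes) < min_votes:
            decisions[item_id] = "undecided"  # too few votes to trust
            continue
        share = sum(item_votes) / len(item_votes)
        if share > accept_ratio:
            decisions[item_id] = "accept"
        elif share < accept_ratio:
            decisions[item_id] = "reject"
        else:
            decisions[item_id] = "undecided"  # exact tie
    return decisions
```

Items that end up "undecided" could be sent out for further judgments; how to handle them is a design choice the source leaves open.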
In step S1, the algorithm that annotates the data is a classifier designed on a neural network such as a deep LSTM, bi-LSTM, GRU, or CNN. When learning to classify a new annotation, transfer learning can be used to speed up the algorithm's learning, improving efficiency and adaptability to new business.
In step S2, unbalanced annotations and new annotations are annotated repeatedly using follow-up generation based on rules and similar annotations. This helps the system adapt to new business and new requirements, and lets the data distinguish the differences and details between similar annotations or rules. Old data can also be re-annotated, and false information can be annotated in reverse, so that annotation accuracy keeps improving.
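The source does not spell out how "following" rules and similar annotations works. One possible reading, sketched here in Python, is that when a new label is introduced, items are queued for re-judgment if they already carry a similar label or match a simple rule. The function follow_up_candidates, the similar_labels set, and the rule predicate are all hypothetical.

```python
def follow_up_candidates(labeled_items, new_label, similar_labels, rule=None):
    """Select items to re-judge when `new_label` is introduced.

    labeled_items: iterable of (item_id, text, current_label)
    similar_labels: existing labels considered close to the new one
    rule: optional predicate on the text (for example a keyword check)

    Returns (item_id, text, new_label) triples, each of which can be shown
    to the crowd as a true/false question about the new label.
    """
    candidates = []
    for item_id, text, current_label in labeled_items:
        by_similarity = current_label in similar_labels
        by_rule = rule is not None and rule(text)
        if by_similarity or by_rule:
            candidates.append((item_id, text, new_label))
    return candidates
```

Under this reading, old data is re-annotated only where a rule or a similar label suggests the new label might apply, rather than starting the whole annotation effort over.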
After step S2 is complete, statistics over the crowdsourcing results can also provide a large amount of learning resources for new annotations, so that learning improves while annotation proceeds and the efficiency of verifying manual annotations increases.
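As a rough illustration of how crowdsourcing results could become learning resources, the Python sketch below converts per-item crowd decisions into labelled training examples. It assumes decisions are encoded as the strings "accept", "reject", and "undecided"; that encoding and the function name build_training_examples are hypothetical.

```python
def build_training_examples(pre_annotations, decisions):
    """Turn crowd decisions into (text, label, target) training triples.

    pre_annotations: dict item_id -> (text, proposed_label)
    decisions: dict item_id -> "accept" | "reject" | "undecided"

    target is 1 when the crowd confirmed the proposed label and 0 when it
    rejected it; undecided items are skipped so they do not add noise.
    """
    examples = []
    for item_id, decision in decisions.items():
        if decision == "undecided" or item_id not in pre_annotations:
            continue
        text, label = pre_annotations[item_id]
        examples.append((text, label, 1 if decision == "accept" else 0))
    return examples
```

Rejected pairs are kept as negative examples, since knowing that a label does not apply to an item is also useful to a classifier.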
For this embodiment, it should be understood that:
The crowdsourcing model in this example is based on simple true/false questions, which reduces the complexity of the manual operation. It is not limited to true/false judgment; other algorithms and presentation methods that can quickly obtain a human matching result also fall within the concept of the invention.
In this example, new annotations are presented for judgment by following rules and similar annotations, but the approach is not limited to this; other algorithms that improve the ability to match new annotations also fall within the concept of the invention.
In this example, pre-annotation uses an automated algorithm, and human annotators only judge whether the automated annotation is right or wrong, which avoids the loss of efficiency caused by choosing among extra information during manual judgment.
In this example, data annotation mixes automated annotations, new-type annotations, and false information, so as to provide better information comparison and collection for the crowdsourcing algorithm and to avoid human-caused data contamination.
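The source does not describe how the mixed-in false information is used. A common crowdsourcing practice, assumed here, is to treat deliberately planted items as controls and to score each worker on them. The Python sketch below follows that assumption; worker_accuracy and control_truth are hypothetical names.

```python
from collections import defaultdict


def worker_accuracy(judgments, control_truth):
    """Score each worker on control items only.

    judgments: iterable of (worker_id, item_id, answered_correct)
    control_truth: dict item_id -> True if the shown label really is correct,
                   False if it was deliberately planted as wrong

    Returns a dict worker_id -> fraction of control items answered properly.
    Workers who saw no control items are absent from the result.
    """
    hits = defaultdict(int)
    seen = defaultdict(int)
    for worker_id, item_id, answer in judgments:
        if item_id not in control_truth:
            continue  # ordinary item, no ground truth available
        seen[worker_id] += 1
        if bool(answer) == control_truth[item_id]:
            hits[worker_id] += 1
    return {worker: hits[worker] / seen[worker] for worker in seen}
```

Scores like these could be used to down-weight or exclude unreliable workers before the results are integrated, which is one way to limit human-caused data contamination.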
A device for data annotation based on a big-data crowdsourcing model, the device including at least one application of the method described above.
Once put into use, the present invention can achieve the following technical effects:
Because annotators only make simple judgments, the skill required of them is reduced, and so is the likelihood of error. By estimate, complex manual annotation selection takes about 20 times the effort of a simple annotation judgment. However, because crowdsourcing requires distributing the data to multiple annotators, the full 20-fold efficiency gain cannot be reached. Even so, the method greatly reduces the skill required of annotators and, while improving efficiency without reducing accuracy, greatly increases the scalability of the system.
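The reason the 20-fold figure is not reached can be illustrated with simple arithmetic. The Python sketch below assumes that total effort grows linearly with the number of annotators who judge each item, which the source does not state; effective_speedup is a hypothetical helper.

```python
def effective_speedup(per_judgment_speedup, annotators_per_item):
    """Estimate the net speedup when each item is judged by several workers.

    per_judgment_speedup: how many times cheaper one simple judgment is than
        one complex manual annotation (the source estimates about 20)
    annotators_per_item: number of workers who judge the same item

    Assumes cost scales linearly with the number of judgments per item.
    """
    if annotators_per_item < 1:
        raise ValueError("each item needs at least one annotator")
    return per_judgment_speedup / annotators_per_item
```

For example, under this assumption, sending each item to 4 workers would leave a net gain of 20 / 4 = 5 times over complex manual annotation.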
The above is only a preferred embodiment of the present invention. It should be noted that a person of ordinary skill in the art can make improvements and additions without departing from the method of the present invention, and these improvements and additions shall also be regarded as falling within the protection scope of the present invention.
Claims (8)
1. A method for data annotation based on a big-data crowdsourcing model, characterized in that the method comprises the following steps:
Step S1. Pre-annotating the data using an automated annotation algorithm;
Step S2. Distributing all annotations through a crowdsourcing model, wherein annotators only judge whether each annotation is right or wrong, lowering the professional-domain expertise required of annotators;
Step S3. Integrating the crowdsourced data and compiling statistics over all annotation results to improve data accuracy.
2. The method for data annotation based on a big-data crowdsourcing model according to claim 1, characterized in that the algorithm that annotates the data in step S1 is a classifier designed on a deep LSTM, bi-LSTM, GRU, or CNN neural network.
3. The method for data annotation based on a big-data crowdsourcing model according to claim 1, characterized in that unbalanced annotations and new annotations are annotated repeatedly using follow-up generation based on rules and similar annotations.
4. The method for data annotation based on a big-data crowdsourcing model according to claim 1, characterized in that, after annotation in step S2 is complete, statistics over the crowdsourcing results provide a large amount of machine learning resources for new annotations, so that the machine learning effect is improved while annotation proceeds.
5. The method for data annotation based on a big-data crowdsourcing model according to claim 1, characterized in that the method further comprises mixing new-type annotations with false information.
6. The method for data annotation based on a big-data crowdsourcing model according to claim 1, characterized in that the method presents new annotations for judgment by following rules and similar annotations.
7. The method for data annotation based on a big-data crowdsourcing model according to claim 1, characterized in that the crowdsourcing model in step S2 is based on simple true/false questions, but is not limited to true/false judgment.
8. A device for data annotation based on a big-data crowdsourcing model, characterized in that the device applies the method of any one of claims 1-7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810980947.1A CN109241513A (en) | 2018-08-27 | 2018-08-27 | A kind of method and device based on big data crowdsourcing model data mark |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810980947.1A CN109241513A (en) | 2018-08-27 | 2018-08-27 | A kind of method and device based on big data crowdsourcing model data mark |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109241513A true CN109241513A (en) | 2019-01-18 |
Family
ID=65068498
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810980947.1A Pending CN109241513A (en) | 2018-08-27 | 2018-08-27 | A kind of method and device based on big data crowdsourcing model data mark |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109241513A (en) |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109993315A (en) * | 2019-03-29 | 2019-07-09 | 联想(北京)有限公司 | A kind of data processing method, device and electronic equipment |
CN110135289A (en) * | 2019-04-28 | 2019-08-16 | 北京天地玛珂电液控制系统有限公司 | A kind of underground coal mine intelligent use cloud service platform based on deep learning |
CN110597240A (en) * | 2019-10-24 | 2019-12-20 | 福州大学 | Hydroelectric generating set fault diagnosis method based on deep learning |
CN110647985A (en) * | 2019-08-02 | 2020-01-03 | 杭州电子科技大学 | Crowdsourcing data labeling method based on artificial intelligence model library |
CN111683131A (en) * | 2020-06-01 | 2020-09-18 | 深圳大学 | Disaster monitoring method, device, equipment and storage medium based on crowdsourcing mode |
CN111985394A (en) * | 2020-08-19 | 2020-11-24 | 东南大学 | Semi-automatic instance labeling method and system for KITTI data set |
CN113297902A (en) * | 2021-04-14 | 2021-08-24 | 中国科学院计算机网络信息中心 | Method and device for generating sample data set by marking remote sensing image on line based on crowdsourcing mode |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104573130A (en) * | 2015-02-12 | 2015-04-29 | 北京航空航天大学 | Entity resolution method based on group calculation and entity resolution device based on group calculation |
CN105426826A (en) * | 2015-11-09 | 2016-03-23 | 张静 | Tag noise correction based crowd-sourced tagging data quality improvement method |
CN107247972A (en) * | 2017-06-29 | 2017-10-13 | 哈尔滨工程大学 | One kind is based on mass-rent technology classification model training method |
Cited By (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109993315A (en) * | 2019-03-29 | 2019-07-09 | 联想(北京)有限公司 | A kind of data processing method, device and electronic equipment |
CN109993315B (en) * | 2019-03-29 | 2021-05-18 | 联想(北京)有限公司 | Data processing method and device and electronic equipment |
CN110135289A (en) * | 2019-04-28 | 2019-08-16 | 北京天地玛珂电液控制系统有限公司 | A kind of underground coal mine intelligent use cloud service platform based on deep learning |
CN110647985A (en) * | 2019-08-02 | 2020-01-03 | 杭州电子科技大学 | Crowdsourcing data labeling method based on artificial intelligence model library |
CN110597240A (en) * | 2019-10-24 | 2019-12-20 | 福州大学 | Hydroelectric generating set fault diagnosis method based on deep learning |
CN110597240B (en) * | 2019-10-24 | 2021-03-30 | 福州大学 | Hydroelectric generating set fault diagnosis method based on deep learning |
CN111683131A (en) * | 2020-06-01 | 2020-09-18 | 深圳大学 | Disaster monitoring method, device, equipment and storage medium based on crowdsourcing mode |
CN111683131B (en) * | 2020-06-01 | 2021-10-22 | 深圳大学 | Disaster monitoring method, device, equipment and storage medium based on crowdsourcing mode |
CN111985394A (en) * | 2020-08-19 | 2020-11-24 | 东南大学 | Semi-automatic instance labeling method and system for KITTI data set |
CN111985394B (en) * | 2020-08-19 | 2021-05-28 | 东南大学 | Semi-automatic instance labeling method and system for KITTI data set |
CN113297902A (en) * | 2021-04-14 | 2021-08-24 | 中国科学院计算机网络信息中心 | Method and device for generating sample data set by marking remote sensing image on line based on crowdsourcing mode |
CN113297902B (en) * | 2021-04-14 | 2023-08-08 | 中国科学院计算机网络信息中心 | Method and device for generating sample data set based on crowdsourcing mode on-line labeling remote sensing image |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109241513A (en) | A kind of method and device based on big data crowdsourcing model data mark | |
CN112115736B (en) | Job correction method and system based on image recognition and intelligent terminal | |
CN107705652A (en) | A kind of teaching system | |
CN106373444B (en) | A kind of Multifunctional English classroom with English teaching aid | |
CN110472494A (en) | Face feature extracts model training method, facial feature extraction method, device, equipment and storage medium | |
CN108399525A (en) | A kind of talent's appraisal procedure based on data mining and machine learning | |
WO2022170985A1 (en) | Exercise selection method and apparatus, and computer device and storage medium | |
CN109657675B (en) | Image annotation method and device, computer equipment and readable storage medium | |
Duy Khuat et al. | Vietnamese sign language detection using Mediapipe | |
CN110489747A (en) | A kind of image processing method, device, storage medium and electronic equipment | |
CN106777336A (en) | A kind of exabyte composition extraction system and method based on deep learning | |
CN103413470B (en) | C language teaching programming examination system ensemble and method | |
CN109740473A (en) | A kind of image content automark method and system based on marking system | |
CN111223015A (en) | Course recommendation method and device and terminal equipment | |
CN108491459A (en) | Optimization method for software code abstract automatic generation model | |
CN112116840B (en) | Job correction method and system based on image recognition and intelligent terminal | |
CN110348328A (en) | Appraisal procedure, device, storage medium and the electronic equipment of quality of instruction | |
CN113888757A (en) | Examination paper intelligent analysis method, examination paper intelligent analysis system and storage medium based on benchmarking evaluation | |
CN108172063A (en) | A kind of intelligence is practised handwriting copybook generation method and system | |
KR20190068841A (en) | System for training and evaluation of english pronunciation using artificial intelligence speech recognition application programming interface | |
CN104732320A (en) | Computer professional technical ability verification training system | |
CN108305193B (en) | Dynamic course creation method and system | |
JP2004094521A (en) | Inquiry type learning method, learning device, inquiry type learning program, recording medium recorded with the program, recording medium recorded with learning data, inquiry type identification method and device using learning data, program, and recording medium with the program | |
CN113963306B (en) | Courseware title making method and device based on artificial intelligence | |
CN109543512A (en) | The evaluation method of picture and text abstract |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
RJ01 | Rejection of invention patent application after publication | Application publication date: 20190118 |