CN109241513A

CN109241513A - Method and device for data annotation based on a big data crowdsourcing model

Info

Publication number: CN109241513A
Application number: CN201810980947.1A
Authority: CN
Inventors: 黄佳威
Original assignee: Shanghai Bao Zun Agel Ecommerce Ltd
Current assignee: Shanghai Bao Zun Agel Ecommerce Ltd
Priority date: 2018-08-27
Filing date: 2018-08-27
Publication date: 2019-01-18

Abstract

The present invention relates to a method and device for data annotation based on a big data crowdsourcing model. The method comprises the following steps: Step S1, providing preliminary data annotations using an automated algorithmic annotation method; Step S2, distributing all annotations through a crowdsourcing model, in which annotators only judge whether each annotation is right or wrong, which lowers the professional-domain ability required of annotators; Step S3, integrating the crowdsourced data and compiling statistics over all annotation results to improve data accuracy. The advantages are as follows: the invention addresses the efficiency and training-cost problems of manual annotation as well as the relatively low accuracy of automated annotation. Taking advantage of the reduced demand on human ability in crowdsourcing, it uses a matching algorithm to combine the strengths of all three and thereby improves the efficiency, accuracy, and scalability of annotation.

Description

Method and device for data annotation based on a big data crowdsourcing model

Technical field

The present invention relates to the field of annotating data such as user reviews, user questions and answers, and pictures, to the field of big data, and to the field of computer technology. Specifically, it is a method and device for data annotation based on a big data crowdsourcing model.

Background art

In recent years, with the rapid development of computer technology and the internet, big data has emerged in many forms. However, the growth in data volume has made manual annotation of corpora extremely difficult and expensive. Data annotation is a foundational step in big data processing, and how to generate high-quality annotation information is a challenge for existing big data algorithm systems.

Existing data annotation systems fall mainly into the following types:

The first is annotation based on manual identification, which relies entirely on human prior knowledge to annotate text.

The second is automated annotation based on templates, rules, or machine learning.

The third is semi-automatic annotation, in which the machine annotates first and the annotations it considers ambiguous are then checked manually.

In current data annotation systems, the main problem with manual annotation is that it depends entirely on manual work: accuracy can be improved, but it places certain demands on the annotators' own knowledge, and the time and cost of annotation are both very high. Text annotation based on machine learning depends entirely on the knowledge of the algorithm researchers. It is very efficient, but its accuracy has certain shortcomings and it adapts poorly to many special cases. At present the accuracy of this second method is still usually verified by manual proofreading, and it cannot adapt quickly to new annotation requirements.

Under the combined effect of big data and AI, more than ten billion new items of related data are generated every day, which poses a great challenge for data annotation. The main problems concern the accuracy, efficiency, cost, and scalability of annotation.

At current data volumes, high-quality annotated data can no longer be obtained by manual methods alone. Moreover, annotators engaged in text or image annotation generally need very strong ability to guarantee data consistency.

Fully automated methods also have certain problems. First, a large amount of data may not show strong uncertainty to the algorithm; the uncertainty may stem only from possible grammatical variation in part of the information. Second, spoken and written language is constantly evolving, which raises the question of how to adapt to new grammatical phenomena and even to related knowledge such as language settings.

Most importantly, new models and new businesses are constantly emerging in the internet era, and old annotations often cannot be applied to new business scenarios, which places very high demands on the scalability of annotation. In traditional annotation systems, extending to new labels often means starting over, which also places a heavy burden on system design.

Chinese patent document CN201510130022.4, filed on 2015-03-25 and titled "Management method and device for data annotation", discloses a management method and device for data annotation. The method includes: obtaining a data set corresponding to a data annotation task, together with the annotation rules corresponding to each type of data in the data set; dividing the data set into data subsets; generating and publishing data annotation subtask description information for the data subsets according to the obtained annotation rules; sending a data subset to a first sender of a claim request for that data subset; and receiving the annotated data from the first sender of the claim request. The management method further includes sending a call instruction for the annotation tool corresponding to each type of data in the data subset, and/or including, in the published subtask description information, the target post-annotation data format of each type of data in the data subset.

The above patent document avoids data format conversion of the annotated data. However, it does not disclose a technical solution that addresses the efficiency and training-cost problems of manual annotation and the relatively low accuracy of automated annotation, and that, taking advantage of the reduced demand on human ability in crowdsourcing, uses a matching algorithm to combine the strengths of all three and thereby improve the efficiency, accuracy, and scalability of annotation.

In summary, there is a need for a data annotation method and device that addresses the efficiency and training-cost problems of manual annotation and the relatively low accuracy of automated annotation, and that, taking advantage of the reduced demand on human ability in crowdsourcing, uses a matching algorithm to combine the strengths of all three and thereby improve the efficiency, accuracy, and scalability of annotation. No such data annotation method or device has yet been reported.

Summary of the invention

The object of the present invention is to address the shortcomings of the prior art by providing a data annotation method that tackles the efficiency and training-cost problems of manual annotation and the relatively low accuracy of automated annotation, and that, taking advantage of the reduced demand on human ability in crowdsourcing, uses a matching algorithm to combine the strengths of all three and thereby improve the efficiency, accuracy, and scalability of annotation.

Another object of the present invention is to provide a device for data annotation based on a big data crowdsourcing model.

To achieve the above object, the technical solution adopted by the present invention is that:

A method for data annotation based on a big data crowdsourcing model, the method comprising the following steps:

Step S1. Provide preliminary data annotations using an automated algorithmic annotation method;

Step S2. Distribute all annotations through a crowdsourcing model, in which annotators only judge whether each annotation is right or wrong, which lowers the professional-domain ability required of annotators;

Step S3. Integrate the crowdsourced data and compile statistics over all annotation results to improve data accuracy.
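
By way of a non-limiting illustration, the statistical integration in step S3 could be sketched as a simple majority vote over the right/wrong judgements collected in step S2. The source does not specify the aggregation rule; the function name `aggregate_judgements`, the `min_votes` and `threshold` parameters, and the tuple layout are all assumptions made for this sketch.

```python
from collections import defaultdict


def aggregate_judgements(judgements, min_votes=3, threshold=0.5):
    """Combine right/wrong judgements per item by majority vote.

    judgements: iterable of (item_id, annotator_id, is_correct) tuples,
                where is_correct is the annotator's verdict on the
                machine-generated label (assumed layout).
    Returns {item_id: (verdict, agreement)} for items with enough votes.
    """
    votes = defaultdict(list)
    for item_id, _annotator_id, is_correct in judgements:
        votes[item_id].append(bool(is_correct))

    results = {}
    for item_id, item_votes in votes.items():
        if len(item_votes) < min_votes:
            continue  # too few judgements so far; keep collecting
        share_true = sum(item_votes) / len(item_votes)
        verdict = share_true > threshold             # majority says "right"
        agreement = max(share_true, 1 - share_true)  # size of the majority
        results[item_id] = (verdict, agreement)
    return results
```

Items whose agreement is low could be sent out for further judgement rather than accepted, although the source leaves that policy open.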

As a preferred technical solution, the algorithm that annotates the data in step S1 is a classifier designed on the basis of a deep LSTM, bi-LSTM, GRU, or CNN neural network.

As a preferred technical solution, unbalanced labels and new labels are annotated repeatedly by means of follow-up generation based on rules and similar labels.

As a preferred technical solution, after annotation in step S2 is completed, statistics over the crowdsourcing results provide a large amount of machine learning resources for new labels, so that the learning effect of the machine learning is improved while annotation proceeds.

As a preferred technical solution, the method further includes mixing and shuffling new-type labels with false information.

As a preferred technical solution, the method presents new labels for judgement by means of following based on rules and similar labels.

As a preferred technical solution, the crowdsourcing model in step S2 is based on simple true/false questions, but is not limited to true/false judgement.

To achieve the second object above, the technical solution adopted by the present invention is as follows:

A device for data annotation based on a big data crowdsourcing model, the device comprising at least one application of the method described above.

The advantages of the invention are:

1, The method and device for data annotation based on a big data crowdsourcing model address the efficiency and training-cost problems of manual annotation as well as the relatively low accuracy of automated annotation. Taking advantage of the reduced demand on human ability in crowdsourcing, the invention uses a matching algorithm to combine the strengths of all three and thereby improves the efficiency, accuracy, and scalability of annotation.

2, The invention uses crowdsourcing technology, a new technology in the big data field whose main purpose is to reduce human or automated errors when data is generated. Current application and research focus mainly on managing the uneven annotation quality that arises in crowdsourcing and on improving crowdsourcing results, that is, on improving accuracy.

3, When the basic data is pre-annotated, the algorithm that annotates the data is a classifier designed on the basis of a neural network such as a deep LSTM, bi-LSTM, GRU, or CNN. When learning to classify new labels, transfer learning can be used to speed up learning, improving efficiency and adaptability to new business.

4, When all annotations are distributed through the crowdsourcing model, unbalanced labels and new labels are annotated repeatedly by means of follow-up generation based on rules and similar labels. This helps the system adapt to new business and new requirements and allows the differences and details between similar labels or rules to be distinguished in the data. Old data can also be re-annotated, and false information can be annotated in reverse, so as to keep improving annotation accuracy.

5, After all annotations have been distributed through the crowdsourcing model, statistics over the crowdsourcing results can also provide a large amount of learning resources for new labels, so that learning improves while annotation proceeds and the efficiency of verifying manual annotations increases.

6, Pre-annotation is performed by an automated algorithm, and human annotators only judge whether the automated annotation is right or wrong, which avoids the efficiency loss caused by selecting among superfluous information during manual judgement.

7, Data annotation mixes and shuffles automated annotations with new-type labels and false information, which provides better information comparison and collection by the crowdsourcing algorithm and avoids data contamination caused by human factors.
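
By way of a non-limiting illustration, the shuffling of false information into the annotation stream could be sketched as follows. The source does not describe how the false information is used afterwards; scoring each annotator on the deliberately wrong labels is an assumption of this sketch, and the names `build_task_stream`, `annotator_accuracy`, and `decoys` are hypothetical.

```python
import random


def build_task_stream(pre_labelled, decoys, seed=None):
    """Shuffle machine-labelled items together with deliberately wrong ones.

    pre_labelled: list of (item_id, label) from the automated step.
    decoys: list of (item_id, wrong_label); the expected verdict is "wrong".
    Returns (tasks, answer_key); tasks do not reveal which items are decoys.
    """
    tasks = [{"item_id": i, "label": lab} for i, lab in pre_labelled]
    tasks += [{"item_id": i, "label": lab} for i, lab in decoys]
    random.Random(seed).shuffle(tasks)
    answer_key = {i: False for i, _ in decoys}  # decoy labels are never right
    return tasks, answer_key


def annotator_accuracy(judgements, answer_key):
    """Share of decoys on which each annotator gave the expected verdict.

    judgements: iterable of (item_id, annotator_id, is_correct) tuples.
    """
    hits, seen = {}, {}
    for item_id, annotator_id, is_correct in judgements:
        if item_id not in answer_key:
            continue  # only decoys have a known answer
        seen[annotator_id] = seen.get(annotator_id, 0) + 1
        if bool(is_correct) == answer_key[item_id]:
            hits[annotator_id] = hits.get(annotator_id, 0) + 1
    return {a: hits.get(a, 0) / n for a, n in seen.items()}
```

An annotator who accepts many decoys as correct is probably answering carelessly, so their judgements could be down-weighted or discarded before the results are integrated.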

Brief description of the drawings

Fig. 1 is a flow diagram of a data annotation method based on a big data crowdsourcing model according to the present invention.

Specific embodiment

Specific embodiments of the present invention are described in detail below with reference to the accompanying drawing.

Referring to Fig. 1, Fig. 1 is a flow diagram of a data annotation method based on a big data crowdsourcing model according to the present invention. The method comprises the following steps:

Step S1. Provide preliminary data annotations using an automated algorithmic annotation method;

In step S1, the algorithm that annotates the data is a classifier designed on the basis of a neural network such as a deep LSTM, bi-LSTM, GRU, or CNN. When learning to classify new labels, transfer learning can be used to speed up learning, improving efficiency and adaptability to new business.

In step S2, unbalanced labels and new labels are annotated repeatedly by means of follow-up generation based on rules and similar labels. This helps the system adapt to new business and new requirements and allows the differences and details between similar labels or rules to be distinguished in the data. Old data can also be re-annotated, and false information can be annotated in reverse, so as to keep improving annotation accuracy.
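
By way of a non-limiting illustration, one possible reading of follow-up generation is that items carrying a label easily confused with a rare or new label are re-issued as right/wrong questions about that rare label. The source does not define the mechanism, so this is an interpretation; `follow_up_tasks`, the `similar_labels` map, and the `min_count` cut-off are hypothetical.

```python
def follow_up_tasks(pre_labelled, similar_labels, min_count):
    """Generate extra right/wrong questions for under-represented labels.

    pre_labelled: list of (item_id, label) from the automated step.
    similar_labels: dict mapping a rare or new label to the labels that are
                    easily confused with it (assumed hand-maintained rules).
    min_count: labels with fewer pre-labelled items than this count as
               unbalanced and trigger follow-up questions.
    """
    counts = {}
    for _item_id, label in pre_labelled:
        counts[label] = counts.get(label, 0) + 1

    tasks = []
    for rare, neighbours in similar_labels.items():
        if counts.get(rare, 0) >= min_count:
            continue  # label is already well represented
        for item_id, label in pre_labelled:
            if label in neighbours:
                # Ask annotators whether the rare label fits this item.
                tasks.append({"item_id": item_id, "label": rare})
    return tasks
```

For example, if "exchange" is a new label that is often confused with "refund", every item pre-labelled "refund" would be shown again with the question "is this an exchange?".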

After step S2 is completed, statistics over the crowdsourcing results can also provide a large amount of learning resources for new labels, so that learning improves while annotation proceeds and the efficiency of verifying manual annotations increases.
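
By way of a non-limiting illustration, crowdsourcing statistics could be turned into learning resources by keeping only the machine labels that annotators confirmed with high agreement. The source does not specify this selection rule; `to_training_examples`, the `min_agreement` cut-off, and the `(is_correct, agreement)` verdict layout are assumptions of this sketch.

```python
def to_training_examples(pre_labels, verdicts, min_agreement=0.8):
    """Split crowd-checked items into training data and items to relabel.

    pre_labels: {item_id: label} produced by the automated step.
    verdicts: {item_id: (is_correct, agreement)}, where is_correct is the
              crowd's overall verdict and agreement is the share of
              annotators who gave that verdict (assumed layout).
    Returns (confirmed, rejected): confirmed is a list of (item_id, label)
    pairs usable as training examples; rejected lists item_ids whose
    machine label the crowd considered wrong.
    """
    confirmed, rejected = [], []
    for item_id, (is_correct, agreement) in verdicts.items():
        if agreement < min_agreement or item_id not in pre_labels:
            continue  # crowd is undecided, or item is unknown: skip it
        if is_correct:
            confirmed.append((item_id, pre_labels[item_id]))
        else:
            rejected.append(item_id)
    return confirmed, rejected
```

The confirmed pairs could be fed back into the classifier while annotation continues, and the rejected items could be given a different candidate label and sent out again.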

In this embodiment, it should be understood that:

The crowdsourcing model in this example is based on simple true/false questions, which reduces the complexity of manual operation. It is not limited to true/false judgement, however; other algorithms and presentation methods that can quickly obtain human matching results should also be covered by the concept of the invention.
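
By way of a non-limiting illustration, distributing each true/false question to several annotators could be sketched with a round-robin assignment. The source does not say how tasks are allocated; `distribute` and the `redundancy` parameter are hypothetical, and annotator identifiers are assumed to be unique.

```python
from itertools import cycle


def distribute(tasks, annotators, redundancy=3):
    """Assign every task to `redundancy` different annotators, round-robin.

    tasks: list of task identifiers (each a right/wrong question).
    annotators: list of unique annotator identifiers.
    Returns {annotator_id: [task, ...]}.
    """
    if not 1 <= redundancy <= len(annotators):
        raise ValueError("redundancy must be between 1 and len(annotators)")

    assignments = {a: [] for a in annotators}
    pool = cycle(annotators)
    for task in tasks:
        # Consecutive draws from the cycle are distinct as long as
        # redundancy does not exceed the number of annotators.
        for _ in range(redundancy):
            assignments[next(pool)].append(task)
    return assignments
```

Round-robin keeps the workload roughly even; a production system might instead weight the assignment by annotator reliability or availability.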

In this example, new labels are presented for judgement by means of following based on rules and similar labels, but the example is not limited to this approach; other algorithms that can improve the matching of new labels should also be covered by the concept of the invention.

In this example, pre-annotation is performed by an automated algorithm, and human annotators only judge whether the automated annotation is right or wrong, which avoids the efficiency loss caused by selecting among superfluous information during manual judgement.

In this example, data annotation mixes and shuffles automated annotations with new-type labels and false information, which provides better information comparison and collection by the crowdsourcing algorithm and avoids data contamination caused by human factors.

Once the present invention is put into use, the following technical effects can be achieved:

With simple judgement, the ability required of annotators is reduced, and so is the likelihood of error. By estimate, simply judging a label is 20 times as efficient as manually making a complex label selection. Because crowdsourcing requires distributing the data to multiple annotators, the theoretical 20-fold efficiency gain cannot be reached. Nevertheless, the method greatly lowers the ability required of annotators and, while improving efficiency without reducing accuracy, greatly increases the scalability of the system.
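
By way of a non-limiting illustration, the effect of multi-annotator distribution on the estimated 20-fold gain can be sketched with simple arithmetic. The source gives only the factor of 20; dividing it by the number of annotators per item assumes every judgement costs the same and ignores coordination overhead, and `effective_speedup` is a hypothetical helper.

```python
def effective_speedup(per_judgement_speedup, redundancy):
    """Overall speedup when each item is judged by `redundancy` annotators.

    per_judgement_speedup: how much faster one right/wrong judgement is
                           than one full manual label selection.
    redundancy: number of annotators who judge each item (assumed constant).
    """
    if redundancy < 1:
        raise ValueError("redundancy must be at least 1")
    return per_judgement_speedup / redundancy


# Under these assumptions, 20x per judgement with 5 annotators per item
# leaves a 4x overall gain in total annotator effort.
print(effective_speedup(20, 5))
```

This counts total annotator effort. Because the judgements for one item can be made in parallel by different annotators, the elapsed time per item may still be close to that of a single judgement.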

The above is only a preferred embodiment of the present invention. It should be noted that those of ordinary skill in the art may make further improvements and additions without departing from the method of the present invention, and such improvements and additions shall also be regarded as falling within the scope of protection of the present invention.

Claims

1. A method for data annotation based on a big data crowdsourcing model, characterized in that the method comprises the following steps:

Step S1. Provide preliminary data annotations using an automated algorithmic annotation method;

Step S2. Distribute all annotations through a crowdsourcing model, in which annotators only judge whether each annotation is right or wrong, which lowers the professional-domain ability required of annotators;

Step S3. Integrate the crowdsourced data and compile statistics over all annotation results to improve data accuracy.

2. The method for data annotation based on a big data crowdsourcing model according to claim 1, characterized in that the algorithm that annotates the data in step S1 is a classifier designed on the basis of a deep LSTM, bi-LSTM, GRU, or CNN neural network.

3. The method for data annotation based on a big data crowdsourcing model according to claim 1, characterized in that unbalanced labels and new labels are annotated repeatedly by means of follow-up generation based on rules and similar labels.

4. The method for data annotation based on a big data crowdsourcing model according to claim 1, characterized in that, after annotation in step S2 is completed, statistics over the crowdsourcing results provide a large amount of machine learning resources for new labels, so that the learning effect of the machine learning is improved while annotation proceeds.

5. The method for data annotation based on a big data crowdsourcing model according to claim 1, characterized in that the method further includes mixing and shuffling new-type labels with false information.

6. The method for data annotation based on a big data crowdsourcing model according to claim 1, characterized in that the method presents new labels for judgement by means of following based on rules and similar labels.

7. The method for data annotation based on a big data crowdsourcing model according to claim 1, characterized in that the crowdsourcing model in step S2 is based on simple true/false questions, but is not limited to true/false judgement.

8. A device for data annotation based on a big data crowdsourcing model, characterized in that the device is an application of the method according to any one of claims 1-7.