CN110427622A - Appraisal procedure, device and the storage medium of corpus labeling - Google Patents

Info

Publication number
CN110427622A
Authority
CN
China
Prior art keywords
corpus
assessed
mark
remaining
initial
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910668462.3A
Other languages
Chinese (zh)
Inventor
童丽霞
雷植程
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN201910668462.3A
Publication of CN110427622A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 - Querying
    • G06F16/3331 - Query processing
    • G06F16/334 - Query execution
    • G06F16/3344 - Query execution using natural language analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

This application relates to a method, device, and storage medium for assessing corpus labeling. The method includes: obtaining, from a corpus, at least one entry to be assessed and the first initial label of each such entry; determining a first label for the entry to be assessed from that entry and the remaining entries in the corpus; determining a second label for the entry to be assessed using a trained classification model; and determining, from the first label and the second label, an assessment result for the first initial label of the corresponding entry. During manual review, annotators can then use the assessment results to check only the entries whose labels are likely to be inaccurate instead of checking every entry in the corpus one by one, which reduces annotator workload and improves review efficiency.

Description

Appraisal procedure, device and the storage medium of corpus labeling
Technical field
This application relates to the field of computer technology, and in particular to a method, device, and storage medium for assessing corpus labeling.
Background
When an intelligent customer service system is built, a large number of corpus entries are usually labeled manually to strengthen the machine learning model's ability to understand the problems users describe. However, different annotators may understand the same business differently, and each annotator usually has to complete a large volume of labeling, so a certain proportion of the entries in the corpus end up mislabeled.
To ensure labeling accuracy, the labeled entries therefore need to be reviewed so that the mislabeled ones can be found. Existing technical solutions rely mainly on manual review.
However, as the number of entries in the corpus grows, a manual reviewer can hardly go through all existing entries for reference, and the review is time-consuming and laborious.
Summary of the invention
The embodiments of this application provide a method, device, and storage medium for assessing corpus labeling, in order to reduce the workload of manual corpus review and improve review efficiency.
The embodiments of this application provide a method for assessing corpus labeling, comprising:
Obtaining, from a corpus, at least one entry to be assessed and the first initial label of each entry to be assessed;
Determining a first label of the entry to be assessed according to the entry to be assessed and the remaining entries in the corpus;
Determining a second label of the entry to be assessed using a trained classification model;
Determining, according to the first label and the second label, an assessment result for the first initial label of the corresponding entry to be assessed.
Determining the first label of the entry to be assessed according to the entry to be assessed and the remaining entries in the corpus specifically includes:
Determining the similarity between the entry to be assessed and each remaining entry in the corpus;
Determining similar entries from the remaining entries according to the similarity;
Obtaining the second initial labels of the similar entries;
Determining the first label of the entry to be assessed according to the second initial labels.
Determining the similarity between the entry to be assessed and each remaining entry in the corpus specifically includes:
Determining the first word vectors corresponding to each entry to be assessed, and the second word vectors corresponding to each remaining entry in the corpus;
Determining a corresponding first sentence vector from the first word vectors, and a corresponding second sentence vector from the second word vectors;
Calculating the similarity between the corresponding entry to be assessed and the remaining entry from the first sentence vector and the second sentence vector.
Determining the first word vectors corresponding to each entry to be assessed and the second word vectors corresponding to each remaining entry in the corpus specifically includes:
Splitting each entry to be assessed into multiple first character segments, and splitting each remaining entry in the corpus into multiple second character segments;
Determining the corresponding first keywords from the first character segments, and the corresponding second keywords from the second character segments;
Determining the corresponding first word vectors from the first keywords, and the corresponding second word vectors from the second keywords.
Determining the first label of the entry to be assessed according to the second initial labels specifically includes:
Grouping the similar entries that share the same second initial label, to obtain at least one similar-entry group;
Counting the number of similar entries in each similar-entry group;
Taking the second initial label of the similar-entry group with the most entries as the first label of the entry to be assessed.
Before the second label of the entry to be assessed is determined using the trained classification model, the method further includes:
Obtaining a corpus sample set and the third initial label of each corpus sample in the set;
Training a preset classification model with the corpus sample set and the third initial labels, to obtain the trained classification model.
Determining, according to the first label and the second label, the assessment result for the first initial label of the corresponding entry to be assessed specifically includes:
Judging whether the first initial label of the entry to be assessed is the same as the corresponding first label and second label;
If the first initial label of the entry to be assessed is the same as both the corresponding first label and second label, taking a result indicating "correct" as the assessment result for the first initial label of the entry to be assessed;
If the first initial label of the entry to be assessed is the same as only one of the corresponding first label and second label, taking a result indicating "suspected error" as the assessment result for the first initial label of the entry to be assessed;
If the first initial label of the entry to be assessed differs from both the corresponding first label and second label, taking a result indicating "highly suspicious" as the assessment result for the first initial label of the entry to be assessed.
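The three cases can be sketched as a small comparison function. This is only an illustration: the function name and the result strings are assumptions chosen for the example, and the three inputs are the annotator's label, the label derived from similar entries, and the label predicted by the classifier.

```python
def assess_initial_label(initial_label: str, first_label: str, second_label: str) -> str:
    """Compare the annotator's label with the two independently derived labels."""
    # Count how many of the two derived labels agree with the annotator's label.
    matches = int(initial_label == first_label) + int(initial_label == second_label)
    if matches == 2:
        return "correct"            # agrees with both derived labels
    if matches == 1:
        return "suspected error"    # agrees with only one of them
    return "highly suspicious"      # agrees with neither


# Hypothetical labels, used only to exercise the three branches.
print(assess_initial_label("A", "A", "A"))
print(assess_initial_label("A", "A", "B"))
print(assess_initial_label("A", "B", "B"))
```

Note that the third branch applies whether or not the two derived labels agree with each other; only their agreement with the annotator's label matters.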
The embodiments of this application also provide a device for assessing corpus labeling, comprising:
An acquisition module, for obtaining, from a corpus, at least one entry to be assessed and the first initial label of each entry to be assessed;
A first determining module, for determining a first label of the entry to be assessed according to the entry to be assessed and the remaining entries in the corpus;
A second determining module, for determining a second label of the entry to be assessed using a trained classification model;
A third determining module, for determining, according to the first label and the second label, an assessment result for the first initial label of the corresponding entry to be assessed.
The first determining module specifically includes:
A first determination unit, for determining the similarity between the entry to be assessed and each remaining entry in the corpus;
A second determination unit, for determining similar entries from the remaining entries according to the similarity;
An acquiring unit, for obtaining the second initial labels of the similar entries;
A third determination unit, for determining the first label of the entry to be assessed according to the second initial labels.
The first determination unit specifically includes:
A first determining subunit, for determining the first word vectors corresponding to each entry to be assessed, and the second word vectors corresponding to each remaining entry in the corpus;
A second determining subunit, for determining a corresponding first sentence vector from the first word vectors, and a corresponding second sentence vector from the second word vectors;
A computation subunit, for calculating the similarity between the corresponding entry to be assessed and the remaining entry from the first sentence vector and the second sentence vector.
The first determining subunit is specifically used for:
Splitting each entry to be assessed into multiple first character segments, and splitting each remaining entry in the corpus into multiple second character segments;
Determining the corresponding first keywords from the first character segments, and the corresponding second keywords from the second character segments;
Determining the corresponding first word vectors from the first keywords, and the corresponding second word vectors from the second keywords.
The third determination unit is specifically used for:
Grouping the similar entries that share the same second initial label, to obtain at least one similar-entry group;
Counting the number of similar entries in each similar-entry group;
Taking the second initial label of the similar-entry group with the most entries as the first label of the entry to be assessed.
The device for assessing corpus labeling further includes a fourth determining module, which is used for:
Obtaining a corpus sample set and the third initial label of each corpus sample in the set;
Training a preset classification model with the corpus sample set and the third initial labels, to obtain the trained classification model.
The third determining module is specifically used for:
Judging whether the first initial label of the entry to be assessed is the same as the corresponding first label and second label;
If the first initial label of the entry to be assessed is the same as both the corresponding first label and second label, taking a result indicating "correct" as the assessment result for the first initial label of the entry to be assessed;
If the first initial label of the entry to be assessed is the same as only one of the corresponding first label and second label, taking a result indicating "suspected error" as the assessment result for the first initial label of the entry to be assessed;
If the first initial label of the entry to be assessed differs from both the corresponding first label and second label, taking a result indicating "highly suspicious" as the assessment result for the first initial label of the entry to be assessed.
The embodiments of this application also provide a computer-readable storage medium storing a plurality of instructions, the instructions being suitable for loading by a processor to execute any of the above methods for assessing corpus labeling.
In the method, device, and storage medium for assessing corpus labeling provided by this application, at least one entry to be assessed and the first initial label of each such entry are obtained from a corpus; a first label of the entry to be assessed is determined from that entry and the remaining entries in the corpus; a second label is determined using a trained classification model; and an assessment result for the first initial label of the corresponding entry is then determined from the first label and the second label. During manual review, annotators can therefore use the assessment results to check only the entries whose labels are likely to be inaccurate instead of checking every entry in the corpus one by one, which reduces annotator workload and improves review efficiency.
Brief description of the drawings
The technical solutions and other beneficial effects of this application will become apparent from the following detailed description of specific embodiments, taken in conjunction with the accompanying drawings.
Fig. 1 is a schematic diagram of a scenario of the corpus labeling assessment system provided by the embodiments of this application.
Fig. 2 is a flow diagram of the method for assessing corpus labeling provided by the embodiments of this application.
Fig. 3 is a flow diagram of S102 provided by the embodiments of this application.
Fig. 4 is a flow diagram of the execution of S1024 provided by the embodiments of this application.
Fig. 5 is another flow diagram of the method for assessing corpus labeling provided by the embodiments of this application.
Fig. 6 is a further flow diagram of the method for assessing corpus labeling provided by the embodiments of this application.
Fig. 7 is a structural diagram of the device for assessing corpus labeling provided by the embodiments of this application.
Fig. 8 is a structural diagram of the first determining module 120 provided by the embodiments of this application.
Fig. 9 is a structural diagram of the server provided by the embodiments of this application.
Detailed description
The technical solutions in the embodiments of this application are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of this application. All other embodiments obtained by those skilled in the art from the embodiments of this application without creative effort fall within the scope of protection of this application.
The embodiments of this application provide a method, device, and storage medium for assessing corpus labeling.
Referring to Fig. 1, Fig. 1 is a schematic diagram of a scenario of the corpus labeling assessment system provided by the embodiments of this application. The system may include any device for assessing corpus labeling provided by the embodiments of this application, and the device may be integrated in a server, such as the back-end server of an intelligent customer service system.
The server can obtain, from a corpus, at least one entry to be assessed and the first initial label of each such entry; determine a first label of the entry to be assessed from that entry and the remaining entries in the corpus; determine a second label of the entry to be assessed using a trained classification model; and determine, from the first label and the second label, an assessment result for the first initial label of the corresponding entry.
The corpus and the trained classification model may be stored on the server. The corpus may include several labeled entries that belong to the same application field, such as dialogue entries from customer service chat records, and it may serve as the training corpus for a machine language understanding model. The label of each labeled entry in the corpus is that entry's first initial label; it may have been assigned by an annotator, and its accuracy remains to be assessed.
In addition, the assessment system may also include a client on which a corpus labeling tool is installed. The client may be a terminal such as a mobile phone, tablet computer, or desktop computer. Through the client, a user can view the assessment results for the first initial labels of the entries to be assessed, review the entries whose first initial labels are unlikely to be accurate, and correct the entries whose first initial labels are wrong.
For example, in Fig. 1, the server obtains from the corpus entry 1 to be assessed, "cannot withdraw", with its first initial label "withdrawal error", and entry 2 to be assessed, "credit point deduction reached the upper limit", with its first initial label "restore credit points". From entry 1 and the remaining entries in the corpus, the server determines that the first label of entry 1 is "change withdrawal failed", and with the trained classification model it determines that the second label of entry 1 is also "change withdrawal failed". In the same way, the first label and the second label of entry 2 are both determined to be "restore credit points". From the first label and the second label of entry 1, the server then determines that the assessment result for the first initial label of entry 1 is "highly suspicious"; in the same way, the assessment result for the first initial label of entry 2 is "correct". The server can then receive a request from the client to view the assessment results for the first initial labels of the entries to be assessed, and send those results to the client in response.
As shown in Fig. 2, which is a flow diagram of the method for assessing corpus labeling provided by the embodiments of this application, the specific flow of the method may be as follows:
S101. Obtain, from a corpus, at least one entry to be assessed and the first initial label of each entry to be assessed.
The corpus may serve as the training corpus for a machine language understanding model. It includes several labeled entries that belong to the same or closely related application fields, such as dialogue entries from customer chat records. In the prior art, the labels of the labeled entries in a corpus are usually assigned by annotators. Because different annotators may understand the same entry differently, and each annotator usually has to complete a large volume of labeling, a certain proportion of the entries in the corpus may be mislabeled. The labels of the labeled entries therefore need to be assessed for accuracy so that the mislabeled entries can be found, which in turn improves the training of the machine language understanding model.
In this embodiment, the assessment device may randomly obtain one or more labeled entries from the corpus as the at least one entry to be assessed. The first initial label of an entry to be assessed may be the manual label of the corresponding labeled entry, and its accuracy remains to be assessed.
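As a minimal sketch of the random selection in S101, the following assumes the corpus is held in memory as a list of (text, label) pairs. The function name, the data layout, and the `seed` parameter are illustrative assumptions, not part of the described method.

```python
import random


def sample_entries_to_assess(corpus, k, seed=None):
    """Randomly pick up to k labeled entries from the corpus without replacement.

    corpus: list of (text, initial_label) pairs (assumed layout).
    seed:   optional seed so that a selection can be reproduced.
    """
    rng = random.Random(seed)
    k = min(k, len(corpus))  # never ask for more entries than exist
    return rng.sample(corpus, k)


# Hypothetical corpus contents, for illustration only.
corpus = [
    ("cannot withdraw", "withdrawal error"),
    ("credit point deduction reached the upper limit", "restore credit points"),
    ("how do I reset my password", "password reset"),
]
to_assess = sample_entries_to_assess(corpus, k=2, seed=0)
remaining = [entry for entry in corpus if entry not in to_assess]
```

Here `remaining` holds the entries that the next step compares each selected entry against.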
S102. Determine the first label of the entry to be assessed according to the entry to be assessed and the remaining entries in the corpus.
The remaining entries in the corpus are the labeled entries in the corpus other than the entry to be assessed. In this embodiment, the assessment device may calculate, one by one, the similarity between an entry to be assessed and each remaining entry in the corpus, take the remaining entries that are most similar to the entry to be assessed as its similar entries, and then determine the first label of the entry to be assessed from those similar entries.
Specifically, as shown in Fig. 3, S102 may include:
S1021. Determine the similarity between the entry to be assessed and each remaining entry in the corpus.
Current methods for calculating the similarity between corpus entries mainly include edit distance, the Jaccard index, term frequency (TF), term frequency-inverse document frequency (TF-IDF), and word vectors (Word2Vec).
The Word2Vec method takes the semantic information of the entries into account, so the similarity it yields is more accurate. In this embodiment, the assessment device may therefore preferably use the Word2Vec method to calculate the similarity between the entry to be assessed and each remaining entry in the corpus. Specifically, when similarity is calculated with the Word2Vec method, each entry is first segmented into words, and a word vector is obtained for each word of the entry. All word vectors of the entry are then added and averaged to obtain the sentence vector of the entry. Finally, the similarity of two entries is obtained as the cosine of the angle between their sentence vectors.
In one embodiment, S1021 may specifically include:
S1-1. Determine the first word vectors corresponding to each entry to be assessed, and the second word vectors corresponding to each remaining entry in the corpus.
S1-1 may specifically include:
S1-1-1. Split each entry to be assessed into multiple first character segments, and split each remaining entry in the corpus into multiple second character segments.
The assessment device may use a word segmentation tool such as jieba to segment each entry to be assessed and each remaining entry in the corpus, obtaining the first character segments of each entry to be assessed and the second character segments of each remaining entry.
S1-1-2. Determine the corresponding first keywords from the first character segments, and the corresponding second keywords from the second character segments.
The first and second character segments produced by word segmentation may contain stop words and non-text characters (such as punctuation marks and special characters). These usually carry no real meaning but occur very frequently. To save storage space and improve the efficiency of machine learning, stop words and non-text characters may therefore be removed from the first and second character segments to obtain the corresponding first keywords and second keywords.
S1-1-3. Determine the corresponding first word vectors from the first keywords, and the corresponding second word vectors from the second keywords.
The assessment device may use a trained word2vec word-vector tool to convert the first keywords of each entry to be assessed into the corresponding first word vectors, and the second keywords of each remaining entry in the corpus into the corresponding second word vectors.
S1-2. Determine the corresponding first sentence vector from the first word vectors, and the corresponding second sentence vector from the second word vectors.
The assessment device may construct the first sentence vector by taking a linearly weighted average of the first word vectors of each entry to be assessed, and construct the second sentence vector in the same way.
S1-3. Calculate the similarity between the corresponding entry to be assessed and the remaining entry from the first sentence vector and the second sentence vector.
The similarity between the entry to be assessed and each remaining entry in the corpus may be determined by calculating the cosine of the angle between the first sentence vector and the second sentence vector.
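The keyword step in S1-1-2 can be sketched as a filter over tokens that have already been segmented. The stop-word set below is a hypothetical English placeholder, and the rule "a token is text if it contains at least one letter or digit" is an assumption made for this sketch; neither comes from the source.

```python
import unicodedata

# Hypothetical stop-word list; a real system would load a curated list.
STOP_WORDS = {"the", "a", "is", "to"}


def is_text_token(token):
    """Return True if the token contains at least one letter or digit.

    Unicode general categories starting with 'L' are letters (this includes
    CJK characters) and those starting with 'N' are numbers; punctuation and
    symbols fall into other categories and are treated as non-text.
    """
    return any(unicodedata.category(ch)[0] in ("L", "N") for ch in token)


def extract_keywords(tokens, stop_words=STOP_WORDS):
    """Drop stop words and non-text tokens, keeping the original order."""
    return [t for t in tokens if t not in stop_words and is_text_token(t)]


tokens = ["the", "withdrawal", "is", "failing", "!", "..."]
print(extract_keywords(tokens))  # → ['withdrawal', 'failing']
```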
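Steps S1-2 and S1-3 can be illustrated with plain Python lists standing in for word vectors. The sketch uses an unweighted mean (equal weights) as the simplest case of averaging, and the two-dimensional vectors are made-up values rather than output of a trained word2vec model.

```python
import math


def sentence_vector(word_vectors):
    """Average the word vectors of one entry, component by component."""
    if not word_vectors:
        raise ValueError("an entry needs at least one word vector")
    dim = len(word_vectors[0])
    count = len(word_vectors)
    return [sum(vec[i] for vec in word_vectors) / count for i in range(dim)]


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Made-up 2-dimensional word vectors for two entries.
first_sentence = sentence_vector([[1.0, 0.0], [0.0, 1.0]])   # [0.5, 0.5]
second_sentence = sentence_vector([[2.0, 2.0]])              # [2.0, 2.0]
similarity = cosine_similarity(first_sentence, second_sentence)
```

The two sentence vectors in the example point in the same direction, so their similarity is close to 1 even though their lengths differ; cosine similarity ignores vector magnitude.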
S1022. Determine similar entries from the remaining entries according to the similarity.
For each entry to be assessed, after its similarity to each remaining entry in the corpus has been calculated, the assessment device may select from the remaining entries the labeled entries with the highest similarity as the similar entries of that entry to be assessed.
S1023. Obtain the second initial labels of the similar entries.
Each similar entry is a labeled entry in the corpus. The second initial label of a similar entry is the label of that labeled entry, which may specifically be its manual label.
S1024. Determine the first label of the entry to be assessed according to the second initial labels.
In this embodiment, for each entry to be assessed, the assessment device may determine the first label of the entry from the second initial labels of all of its similar entries. The first label of the entry to be assessed may or may not be the same as its first initial label. If the two differ, the accuracy of the first initial label is in question and an annotator needs to verify it.
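The text does not specify how many similar entries are kept or whether a similarity threshold is applied. The sketch below assumes a top-k selection with an optional minimum similarity; both parameters, and the function name, are illustrative assumptions.

```python
import heapq


def select_similar(similarities, k=10, min_similarity=0.0):
    """Return the ids of the k most similar remaining entries, best first.

    similarities:   dict mapping entry id -> similarity to the entry being assessed.
    k:              number of similar entries to keep (assumed parameter).
    min_similarity: entries below this value are ignored (assumed parameter).
    """
    candidates = [(entry_id, score) for entry_id, score in similarities.items()
                  if score >= min_similarity]
    top = heapq.nlargest(k, candidates, key=lambda pair: pair[1])
    return [entry_id for entry_id, _ in top]


scores = {"X1": 0.9, "X2": 0.2, "X3": 0.7}
print(select_similar(scores, k=2))  # → ['X1', 'X3']
```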
S1024 may specifically include:
S2-1. Group the similar entries that share the same second initial label, to obtain at least one similar-entry group.
For example, as shown in Fig. 4, an entry to be assessed has 10 similar entries, numbered X1 to X10. The similar entries X1, X2, X5, and X7 share the second initial label L11; the similar entries X3, X6, and X9 share the second initial label L12; the similar entries X4 and X10 share the second initial label L13; and the similar entry X8 has the second initial label L14. X1, X2, X5, and X7 are therefore placed in the first similar-entry group; X3, X6, and X9 in the second; X4 and X10 in the third; and X8 in the fourth, which gives four similar-entry groups.
S2-2. Count the number of similar entries in each similar-entry group.
Continuing the example, the first, second, third, and fourth similar-entry groups contain 4, 3, 2, and 1 similar entries respectively.
S2-3. Take the second initial label of the similar-entry group with the most entries as the first label of the entry to be assessed.
Continuing the example, as shown in Fig. 4, the group with the most similar entries is the first similar-entry group, whose second initial label is L11, so the first label of the entry to be assessed is L11.
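Steps S2-1 to S2-3 amount to a majority vote over the labels of the similar entries. A minimal sketch using the X1 to X10 example follows; the function name is an assumption, and the text does not say how ties are broken, so the sketch simply keeps whichever tied label `Counter` encountered first.

```python
from collections import Counter


def first_label_by_vote(similar_labels):
    """Return the label shared by the largest group of similar entries.

    similar_labels: iterable of the second initial labels of the similar entries.
    Ties are resolved in favour of the label seen first (an assumption).
    """
    counts = Counter(similar_labels)       # S2-1 and S2-2: group and count
    label, _ = counts.most_common(1)[0]    # S2-3: pick the largest group
    return label


# Second initial labels of the ten similar entries from the example.
labels = {
    "X1": "L11", "X2": "L11", "X3": "L12", "X4": "L13", "X5": "L11",
    "X6": "L12", "X7": "L11", "X8": "L14", "X9": "L12", "X10": "L13",
}
print(first_label_by_vote(labels.values()))  # → L11
```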
S103. Determine the second label of the entry to be assessed using the trained classification model.
In this embodiment, the assessment device may input the entries to be assessed one by one into the trained classification model to label them again, obtaining the second label of each entry to be assessed. The second label of an entry to be assessed may or may not be the same as its first initial label. If the two differ, the accuracy of the first initial label is in question and an annotator needs to verify it.
Specifically, to obtain the trained classification model, the following steps may also be performed before S103:
Step A. Obtain a corpus sample set and the third initial label of each corpus sample in the set.
In this embodiment, all labeled entries in the corpus may be used as corpus samples to form the corpus sample set. The third initial label of each corpus sample is the label of the corresponding labeled entry, which may specifically be its manual label.
Step B. Train a preset classification model with the corpus sample set and the third initial labels, to obtain the trained classification model.
Specifically, the assessment device may first perform feature extraction, such as keyword extraction or feature word extraction, on each corpus sample in the set to construct the feature vector of that sample, and then train the preset classification model on the feature vectors and third initial labels of all corpus samples in the set.
The training process of the preset classification model can be expressed by the following formula:
C_i = f(T_i);
where T_i is corpus i represented as a feature vector, C_i is the label of corpus i, and f is the classification model. In the training stage, a number of (T_i, C_i) pairs are known, and f can be learned from them by machine learning. In this embodiment, methods such as one-hot encoding or the n-gram language model may be used to extract features from the corpus samples to obtain their feature vectors; a method such as a support vector machine (SVM) may then be used to learn from these feature vectors to obtain the trained classification model.
After the trained classification model is obtained, S103 uses it to label the corpus to be assessed again to obtain the second label of that corpus. As in training, feature extraction is first performed on the corpus to be assessed to obtain its feature vector; then, according to the formula above, since f and T_i are now known, C_i can be calculated, and C_i is the second label of the corpus to be assessed.
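To make the train-then-predict use of the formula concrete, here is a small self-contained sketch. It is an illustration under stated assumptions only: a nearest-centroid classifier stands in for the SVM named in the text so that the example needs no third-party library, and the function names, tokens and labels are all hypothetical.

```python
from collections import defaultdict

def one_hot_features(tokens, vocabulary):
    """Binary vector: 1.0 where the vocabulary word occurs in tokens."""
    present = set(tokens)
    return [1.0 if word in present else 0.0 for word in vocabulary]

def train(samples):
    """samples: list of (tokens, label) pairs. Returns (vocabulary, centroids)."""
    vocabulary = sorted({tok for tokens, _ in samples for tok in tokens})
    sums, counts = {}, defaultdict(int)
    for tokens, label in samples:
        vec = one_hot_features(tokens, vocabulary)
        previous = sums.get(label, [0.0] * len(vocabulary))
        sums[label] = [s + v for s, v in zip(previous, vec)]
        counts[label] += 1
    # The centroid of a label is the mean feature vector of its samples.
    centroids = {lab: [s / counts[lab] for s in total] for lab, total in sums.items()}
    return vocabulary, centroids

def predict(tokens, vocabulary, centroids):
    """Plays the role of f: maps the feature vector T_i to a label C_i."""
    vec = one_hot_features(tokens, vocabulary)
    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(vec, centroid))
    return min(centroids, key=lambda lab: squared_distance(centroids[lab]))

# Hypothetical labeled samples (the known pairs used for training).
samples = [
    (["pin", "account"], "pinning"),
    (["pin", "top"], "pinning"),
    (["refund", "order"], "refund"),
    (["refund", "money"], "refund"),
]
vocabulary, centroids = train(samples)
print(predict(["pin", "all", "account"], vocabulary, centroids))
```

Words that never appeared in training (such as "all" above) are simply ignored by the one-hot encoding, which is one reason a real system would train on the full labeled corpus library.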
S104. Determine, according to the first label and the second label, the assessment result of the first initial label of the corresponding corpus to be assessed.
For each corpus to be assessed, the first label and the second label are secondary labels relative to its first initial label, and the three are obtained by three different labeling methods. The differences between the first and second labels and the first initial label can therefore be used to assess the accuracy of the first initial label of the corpus to be assessed.
Specifically, as shown in Fig. 5, S104 may include:
S1041. Judge whether the first initial label of the corpus to be assessed is the same as the corresponding first label and second label; if it is the same as both, execute S1042; if it is the same as only one of them, execute S1043; if it is the same as neither, execute S1044.
S1042. Take a result indicating "correct" as the assessment result of the first initial label of the corpus to be assessed.
S1043. Take a result indicating "suspected error" as the assessment result of the first initial label of the corpus to be assessed.
S1044. Take a result indicating "highly suspicious" as the assessment result of the first initial label of the corpus to be assessed.
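The branching in S1041 to S1044 reduces to counting how many of the two secondary labels agree with the initial label. A minimal sketch follows; the function name and the result strings are illustrative choices, not terms fixed by the source.

```python
def assess_initial_label(initial_label, first_label, second_label):
    """Classify the initial label by its agreement with the two secondary labels."""
    # Booleans add as integers, so this counts the number of agreements (0, 1 or 2).
    agreements = (initial_label == first_label) + (initial_label == second_label)
    if agreements == 2:
        return "correct"            # S1042: both secondary labels agree
    if agreements == 1:
        return "suspected error"    # S1043: exactly one secondary label agrees
    return "highly suspicious"      # S1044: neither secondary label agrees

print(assess_initial_label("L11", "L11", "L12"))
```

Note that the result depends only on agreement with the initial label: if the first and second labels agree with each other but not with the initial label, the outcome is still "highly suspicious".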
Further, after the assessment result of the first initial label of the corpus to be assessed is obtained, that corpus may be marked as an assessed corpus, and at least one labeled corpus that has not yet been assessed may be obtained from the corpus library as the next corpus to be assessed. S102, S103 and S104 are then executed again, and this cycle repeats until all labeled corpora in the corpus library have been marked as assessed.
In addition, after the labeling accuracy of all labeled corpora in the corpus library has been assessed, the user may send, from a user interface for corpus labeling review, a request to the assessment device to view the assessment results of the labeled corpora, so that the assessment device can send the assessment results to the user interface in response. Specifically, when reviewing the labeled corpora in the corpus library, corpus labeling personnel may choose to verify only those corpora whose assessment result is "suspected error" or "highly suspicious", and correct the labels of the corpora confirmed to be mislabeled. This greatly reduces the workload of corpus labeling review and improves its efficiency.
As can be seen from the above, the assessment method for corpus labeling provided in this embodiment obtains from the corpus library at least one corpus to be assessed and the first initial label of each corpus to be assessed, determines the first label of the corpus to be assessed according to that corpus and the remaining corpora in the corpus library, determines the second label of the corpus to be assessed using the trained classification model, and then determines the assessment result of the first initial label according to the first label and the second label. During manual review, the corpora whose labeling accuracy is low can therefore be selected according to the assessment results, so that the corpora in the corpus library need not all be checked one by one. This reduces the workload of corpus labeling personnel and improves review efficiency.
Fig. 6 is another schematic flowchart of the assessment method for corpus labeling provided by an embodiment of the present application. As shown in Fig. 6, the specific procedure of the method may be as follows:
S201. Obtain from the corpus library at least one corpus to be assessed and the first initial label of each corpus to be assessed.
For example, the assessment device for corpus labeling may randomly obtain at least one labeled corpus from the corpus library as the corpus to be assessed, or may first divide all labeled corpora in the corpus library into multiple parts and then take one part as the corpora to be assessed. The first initial label of a corpus to be assessed may be the manual label of the corresponding labeled corpus, whose accuracy remains to be assessed.
S202. Split each corpus to be assessed into multiple first character segments, and split each remaining corpus in the corpus library into multiple second character segments.
For example, if corpus a to be assessed is "That way only one official account can be pinned to the top; I want to pin all official accounts to the top together", the jieba word segmentation method can be used to segment it, yielding the first character segments "that way", "can only", "pin to top", "one", "official account", ",", "I", "want", " ", "all", "official account", "together" and "pin to top". The same method also applies to each remaining corpus in the corpus library.
S203. Determine the corresponding first keywords according to the first character segments, and determine the corresponding second keywords according to the second character segments.
Continuing the previous example, the first character segments obtained by segmenting corpus a include stop words and non-text characters that carry no practical meaning, such as the stop words "that way", "one" and "together", and the punctuation mark ",".
In this embodiment, a preset stop-word list may be used to remove stop words from the first and second character segments, eliminating function words, pronouns and the like. In addition, methods such as regular expressions may be used to filter out non-text characters from the first and second character segments, yielding the corresponding first and second keywords.
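The two filters described above (a stop-word list and a regular expression for non-text characters) can be sketched as follows. This is an illustration only: the tiny stop-word set, the pattern and the English glosses of the segments are assumptions made for the example, and a real system would apply a full stop-word list to the jieba output.

```python
import re

# Hypothetical mini stop-word list; a production list would also cover pronouns etc.
STOP_WORDS = {"that way", "one", "together"}

# Matches segments made up only of punctuation, symbols, underscores or whitespace.
NON_TEXT = re.compile(r"[\W_]+")

def extract_keywords(segments, stop_words=STOP_WORDS):
    """Drop stop words, empty or blank segments, and non-text segments."""
    return [
        seg for seg in segments
        if seg.strip()                      # skip empty or whitespace-only segments
        and seg not in stop_words           # stop-word removal
        and not NON_TEXT.fullmatch(seg)     # non-text character filtering
    ]

# English glosses of the segmented example sentence, used here as stand-ins.
segments = ["that way", "can only", "pin to top", "one", "official account", ",",
            "I", "want", " ", "all", "official account", "together", "pin to top"]
print(extract_keywords(segments))
```

Order and duplicates are preserved on purpose, since later steps (word vectors and n-grams) may depend on how often and where a keyword occurs.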
S204. Determine the corresponding first word vectors according to the first keywords, and determine the corresponding second word vectors according to the second keywords.
For example, the assessment device for corpus labeling may use a trained word2vec word-vector tool, in combination with the semantic information of the corpus, to convert the first keywords of each corpus to be assessed into the corresponding first word vectors, and to convert the second keywords of each remaining corpus in the corpus library into the corresponding second word vectors.
S205. Determine the corresponding first sentence vector according to the first word vectors, and determine the corresponding second sentence vector according to the second word vectors.
For example, the first sentence vector of each corpus to be assessed may be constructed by regression using a linearly weighted average of its first word vectors; likewise, the second sentence vector of each remaining corpus in the corpus library may be constructed by regression using a linearly weighted average of its second word vectors.
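The weighted-average part of this construction can be sketched as below. The sketch covers only the averaging, not any regression step, and the function name and the choice of weights are assumptions: the source does not say how the weights are obtained, so uniform weights are used by default.

```python
def sentence_vector(word_vectors, weights=None):
    """Linearly weighted average of equal-length word vectors."""
    if not word_vectors:
        raise ValueError("at least one word vector is required")
    if weights is None:
        weights = [1.0] * len(word_vectors)   # uniform weights give a plain mean
    if len(weights) != len(word_vectors):
        raise ValueError("one weight per word vector is required")
    total = sum(weights)
    if total == 0:
        raise ValueError("weights must not sum to zero")
    dim = len(word_vectors[0])
    # For each dimension, sum the weighted components and normalise by the weight total.
    return [
        sum(w * vec[d] for w, vec in zip(weights, word_vectors)) / total
        for d in range(dim)
    ]

print(sentence_vector([[1.0, 0.0], [0.0, 1.0]]))
print(sentence_vector([[1.0, 0.0], [0.0, 1.0]], weights=[3.0, 1.0]))
```

Because the result has the same dimension as the word vectors regardless of sentence length, sentences of different lengths become directly comparable.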
S206. Calculate the similarity between the corresponding corpus to be assessed and each remaining corpus according to the first sentence vector and the second sentence vector.
For example, the similarity between each corpus to be assessed and each remaining corpus in the corpus library can be determined by calculating the cosine distance between the first sentence vector and the second sentence vector.
S207. Determine similar corpora from the remaining corpora according to the similarity.
For example, for each corpus to be assessed, the labeled corpora among the remaining corpora whose similarity exceeds a preset threshold may be taken as the similar corpora of that corpus to be assessed. The similarity ranges from 0 to 1, and the closer it is to 1, the more similar the two corpora are. The preset threshold may be, for example, 0.8.
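A minimal sketch of S206 and S207 under stated assumptions: sentence vectors are plain lists of floats, the corpus identifiers are hypothetical, and the similarity is the raw cosine of the angle between two vectors. In general that cosine lies between -1 and 1; the sketch does not rescale it to the 0 to 1 range stated in the text.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention of this sketch: a zero vector is similar to nothing
    return dot / (norm_u * norm_v)

def similar_corpora(target_vector, remaining, threshold=0.8):
    """remaining: list of (corpus_id, sentence_vector). Keep ids above the threshold."""
    return [
        corpus_id for corpus_id, vec in remaining
        if cosine_similarity(target_vector, vec) > threshold
    ]

# Hypothetical remaining corpora with two-dimensional sentence vectors.
remaining = [("a", [1.0, 0.1]), ("b", [0.0, 1.0]), ("c", [1.0, 1.0])]
print(similar_corpora([1.0, 0.0], remaining))
```

With the threshold at 0.8, only vectors within roughly 37 degrees of the target are kept, which is why "c" (at 45 degrees) is excluded in this example.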
S208. Obtain the second initial labels of the similar corpora.
The similar corpora are labeled corpora in the corpus library; the second initial label of each similar corpus is the label of the corresponding labeled corpus, and may specifically be its manual label.
S209. Determine the first label of the corpus to be assessed according to the second initial labels.
For the specific implementation of S209 in this embodiment, refer to the specific implementation of S1024 in the previous method embodiment; details are not repeated here.
S210. Determine the corresponding feature vector according to the first keywords.
For example, the n-gram method may be used to extract features from the first keywords of each corpus to be assessed to obtain the feature vector of that corpus. Because the first keywords are obtained from the corpus to be assessed by successive word segmentation, stop-word removal and non-text character filtering, the dimension of the feature vector is effectively reduced, which in turn improves the classification efficiency of the classification model.
In addition, the n-gram method starts at the first character of the text, moves forward one character at a time, and takes a feature item of n characters at each position. For example, for the four-character phrase meaning "deduction upper limit" (扣分上限), the 3-gram method extracts the feature items "扣分上" and "分上限". Extracting features with the n-gram method therefore captures the context of the first keywords, that is, the word-order information of the corpus to be assessed.
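The sliding window described here (step of one character, window of n characters) can be sketched in a few lines. The function name is illustrative and the input string is a neutral placeholder rather than text from the source.

```python
def char_ngrams(text, n):
    """Character n-grams taken with a window of n characters and a step of 1."""
    if n <= 0:
        raise ValueError("n must be positive")
    # A text of length m yields m - n + 1 feature items (none if m < n).
    return [text[i:i + n] for i in range(len(text) - n + 1)]

print(char_ngrams("abcd", 3))  # ['abc', 'bcd']
```

A four-character text yields exactly two 3-grams, and adjacent feature items overlap by n - 1 characters, which is how the ordering of characters is retained in the features.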
S211. Classify the feature vector using the trained classification model to obtain the second label of the corresponding corpus to be assessed.
In this embodiment, inputting the feature vector of the corpus to be assessed into the trained classification model outputs the second label of that corpus. The trained classification model may be obtained by training on all labeled corpora in the corpus library.
S212. Determine, according to the first label and the second label, the assessment result of the first initial label of the corresponding corpus to be assessed.
In this embodiment, for each corpus to be assessed, the first label and the second label are secondary labels relative to its first initial label, and the three are obtained by three different labeling methods. The differences between the first and second labels and the first initial label can therefore be used to assess the accuracy of the first initial label of the corpus to be assessed. For the specific implementation of S212, refer to the specific implementation of S104 in the previous method embodiment; details are not repeated here.
In addition, S204 to S209, which obtain the first label of the corpus to be assessed, use unsupervised machine learning and take the semantic information of the corpus into account, whereas S210 and S211, which obtain the second label, use supervised machine learning and take the word-order information of the corpus into account. The assessment of the first initial label thus combines supervised and unsupervised machine learning and fully considers both the semantic and the word-order information of the corpus, which helps improve the accuracy of the assessment result.
On the basis of the method described in the above embodiments, this embodiment provides a further description from the perspective of the assessment device for corpus labeling. Referring to Fig. 7, which shows in detail the assessment device for corpus labeling provided by an embodiment of the present application, the device may include an obtaining module 110, a first determining module 120, a second determining module 130 and a third determining module 140, wherein:
(1) Obtaining module 110
The obtaining module 110 is configured to obtain from the corpus library at least one corpus to be assessed and the first initial label of each corpus to be assessed.
The corpus library can serve as the training corpus of a machine language understanding model and includes a number of labeled corpora that belong to the same or a similar application field, for example dialogue corpora from customer chat records. In this embodiment, the obtaining module 110 may obtain one or more labeled corpora from the corpus library to obtain at least one corpus to be assessed, where the first initial label of a corpus to be assessed may be the manual label of the corresponding labeled corpus, whose accuracy remains to be assessed.
(2) First determining module 120
The first determining module 120 is configured to determine the first label of the corpus to be assessed according to the corpus to be assessed and the remaining corpora in the corpus library.
In this embodiment, the remaining corpora in the corpus library are the labeled corpora in the corpus library other than the corpus to be assessed.
As shown in Fig. 8, the first determining module 120 may specifically include:
(a) a first determination unit 121, configured to determine the similarity between the corpus to be assessed and each remaining corpus in the corpus library.
At present, methods for calculating corpus similarity mainly include the edit distance calculation method, the Jaccard index calculation method, the term frequency (TF) calculation method, the term frequency-inverse document frequency (TF-IDF) calculation method and the word vector (Word2Vec) calculation method. Among these, the Word2Vec calculation method can take the semantic information of the corpus into account, so the resulting corpus similarity is more accurate. In this embodiment, the first determination unit 121 may therefore preferably use the Word2Vec calculation method to calculate the similarity between the corpus to be assessed and each remaining corpus in the corpus library.
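For contrast with the semantic Word2Vec approach preferred above, the Jaccard index is a purely lexical measure: the size of the intersection of two token sets divided by the size of their union. The sketch below is illustrative only; the function name and sample tokens are hypothetical, and returning 1.0 for two empty inputs is a convention chosen here.

```python
def jaccard_index(tokens_a, tokens_b):
    """|A intersect B| / |A union B| over the two token sets."""
    a, b = set(tokens_a), set(tokens_b)
    if not a and not b:
        return 1.0  # convention of this sketch: two empty sets count as identical
    return len(a & b) / len(a | b)

# Only the shared token "pin" counts; near-synonyms would contribute nothing.
print(jaccard_index(["pin", "account"], ["pin", "top"]))
```

Two sentences that express the same request with different words score low under this measure, which illustrates why a method that uses semantic information can give a more accurate similarity.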
In one embodiment, the first determination unit 121 may specifically include:
(a1) a first determining subunit, configured to determine the first word vectors corresponding to each corpus to be assessed and the second word vectors corresponding to each remaining corpus in the corpus library.
The first determining subunit may be specifically configured to:
split each corpus to be assessed into multiple first character segments, and split each remaining corpus in the corpus library into multiple second character segments;
determine the corresponding first keywords according to the first character segments, and determine the corresponding second keywords according to the second character segments;
determine the corresponding first word vectors according to the first keywords, and determine the corresponding second word vectors according to the second keywords.
(a2) a second determining subunit, configured to determine the corresponding first sentence vector according to the first word vectors, and determine the corresponding second sentence vector according to the second word vectors.
The second determining subunit may construct the first sentence vector of each corpus to be assessed by regression using a linearly weighted average of its first word vectors, and construct the corresponding second sentence vector in the same way.
(a3) a calculation subunit, configured to calculate the similarity between the corresponding corpus to be assessed and each remaining corpus according to the first sentence vector and the second sentence vector.
The calculation subunit may determine the similarity between the corresponding corpus to be assessed and each remaining corpus in the corpus library by calculating the cosine distance between the first sentence vector and the second sentence vector.
(b) a second determination unit 122, configured to determine similar corpora from the remaining corpora according to the similarity.
For each corpus to be assessed, after the similarity between that corpus and each remaining corpus in the corpus library has been calculated, the second determination unit 122 may select from the remaining corpora the labeled corpora with higher similarity as the similar corpora of that corpus to be assessed.
(c) an acquiring unit 123, configured to obtain the second initial labels of the similar corpora.
The similar corpora are labeled corpora in the corpus library; the second initial label of each similar corpus is the label of the corresponding labeled corpus, and may specifically be its manual label.
(d) a third determination unit 124, configured to determine the first label of the corpus to be assessed according to the second initial labels.
In this embodiment, for each corpus to be assessed, the third determination unit 124 may determine the first label of that corpus based on the second initial labels of all of its similar corpora. The first label of the corpus to be assessed may be the same as or different from its first initial label; if the two differ, the accuracy of the first initial label is in doubt and needs to be verified by corpus labeling personnel.
In one embodiment, the third determination unit 124 may be specifically configured to:
classify the similar corpora with identical second initial labels into one group, obtaining at least one similar corpus group;
count the number of similar corpora in each similar corpus group;
take the second initial label corresponding to the similar corpus group with the largest number of similar corpora as the first label of the corpus to be assessed.
(3) Second determining module 130
The second determining module 130 is configured to determine the second label of the corpus to be assessed using the trained classification model.
In this embodiment, the second determining module 130 may input the corpora to be assessed one by one into the trained classification model so that each is labeled again, yielding the second label of the corresponding corpus to be assessed. The second label of a corpus to be assessed may be the same as or different from its first initial label; if the two differ, the accuracy of the first initial label of that corpus is in doubt and needs to be verified by corpus labeling personnel.
The trained classification model may be obtained by training on all labeled corpora in the corpus library.
(4) Third determining module 140
The third determining module 140 is configured to determine, according to the first label and the second label, the assessment result of the first initial label of the corresponding corpus to be assessed.
In this embodiment, for each corpus to be assessed, the first label and the second label are secondary labels relative to its first initial label, and the three are obtained by three different labeling methods. The differences between the first and second labels and the first initial label can therefore be used to assess the accuracy of the first initial label of the corpus to be assessed.
The third determining module 140 may be specifically configured to:
judge whether the first initial label of the corpus to be assessed is the same as the corresponding first label and second label;
if the first initial label of the corpus to be assessed is the same as both the corresponding first label and second label, take a result indicating "correct" as the assessment result of the first initial label of the corpus to be assessed;
if the first initial label of the corpus to be assessed is the same as only one of the corresponding first label and second label, take a result indicating "suspected error" as the assessment result of the first initial label of the corpus to be assessed;
if the first initial label of the corpus to be assessed differs from both the corresponding first label and second label, take a result indicating "highly suspicious" as the assessment result of the first initial label of the corpus to be assessed.
Further, the assessment device for corpus labeling may also include a fourth determining module, which may be specifically configured to:
obtain a corpus sample set and the third initial label of each corpus sample in the corpus sample set;
train a preset classification model using the corpus sample set and the third initial labels to obtain the trained classification model.
Specifically, the fourth determining module may use all labeled corpora in the corpus library as corpus samples to obtain the corpus sample set, where the third initial label of each corpus sample is the label of the corresponding labeled corpus, and may specifically be its manual label. The fourth determining module may also first perform feature extraction, such as keyword extraction or feature-word extraction, on each corpus sample in the corpus sample set to construct the feature vector of that corpus sample, and then train the preset classification model on the feature vectors and third initial labels of all corpus samples in the set.
The training process of the preset classification model can be expressed by the following formula:
C_i = f(T_i);
where T_i is corpus i represented as a feature vector, C_i is the label of corpus i, and f is the classification model. In the training stage, a number of (T_i, C_i) pairs are known, and f can be learned from them by machine learning. In this embodiment, the fourth determining module may use methods such as one-hot encoding or the n-gram language model to extract features from the corpus samples to obtain their feature vectors; a method such as a support vector machine (SVM) may then be used to learn from these feature vectors to obtain the trained classification model.
After the trained classification model is obtained, the second determining module 130 uses it to label the corpus to be assessed again to obtain the second label of that corpus. As in training, the second determining module 130 first performs feature extraction on the corpus to be assessed to obtain its feature vector; then, according to the formula above, since f and T_i are now known, C_i can be calculated, and C_i is the second label of the corpus to be assessed.
In addition, after the third determining module 140 obtains the assessment result of the first initial label of the corpus to be assessed, that corpus may be marked as an assessed corpus, and the obtaining module 110 may be triggered to obtain from the corpus library at least one labeled corpus that has not yet been assessed as the next corpus to be assessed. The accuracy of the first initial label of that corpus is then assessed in turn, and this cycle repeats until all labeled corpora in the corpus library have been marked as assessed.
Further, after the assessment device has assessed the labeling accuracy of all labeled corpora in the corpus library, the user may send, from a user interface for corpus labeling review, a request to the assessment device to view the assessment results of the labeled corpora, so that the assessment device can send the assessment results to the user interface in response. Specifically, when reviewing the labeled corpora in the corpus library, corpus labeling personnel may choose to verify only those corpora whose assessment result is "suspected error" or "highly suspicious", and correct the labels of the corpora confirmed to be mislabeled. This greatly reduces the workload of corpus labeling review and improves its efficiency.
In specific implementation, each of the above subunits, units and modules may be implemented as an independent entity, or they may be combined arbitrarily and implemented as one or several entities. For the specific implementation of each subunit, unit and module, refer to the foregoing method embodiments; details are not repeated here.
As can be seen from the above, the assessment device for corpus labeling provided in this embodiment obtains from the corpus library at least one corpus to be assessed and the first initial label of each corpus to be assessed, determines the first label of the corpus to be assessed according to that corpus and the remaining corpora in the corpus library, determines the second label of the corpus to be assessed using the trained classification model, and then determines the assessment result of the first initial label according to the first label and the second label. During manual review, the corpora whose labeling accuracy is low can therefore be selected according to the assessment results, so that the corpora in the corpus library need not all be checked one by one. This reduces the workload of corpus labeling personnel and improves review efficiency.
Correspondingly, an embodiment of the present application also provides a server. Fig. 9 shows a schematic structural diagram of the server involved in the embodiment of the present application. Specifically:
The server may include a processor 401 having one or more processing cores, a memory 402 comprising one or more computer-readable storage media, a radio frequency (RF) circuit 403, a power supply 404, an input unit 405, a display unit 406 and other components. Those skilled in the art will understand that the server structure shown in Fig. 9 does not limit the server, which may include more or fewer components than shown, combine certain components, or use a different arrangement of components. Wherein:
The processor 401 is the control center of the server. It connects the various parts of the entire server through various interfaces and lines, and performs the functions of the server and processes data by running or executing the software programs and/or modules stored in the memory 402 and calling the data stored in the memory 402, thereby monitoring the server as a whole. Optionally, the processor 401 may include one or more processing cores. Preferably, the processor 401 may integrate an application processor and a modem processor, where the application processor mainly handles the operating system, the user interface, application programs and the like, and the modem processor mainly handles wireless communication. It will be understood that the modem processor may also not be integrated into the processor 401.
The memory 402 may be used to store software programs and modules, and the processor 401 executes various functional applications and data processing by running the software programs and modules stored in the memory 402. The memory 402 may mainly include a program storage area and a data storage area, where the program storage area may store the operating system, the application programs required by at least one function (such as a sound playback function or an image playback function) and the like, and the data storage area may store data created through the use of the server and the like. In addition, the memory 402 may include high-speed random access memory and may also include non-volatile memory, for example at least one magnetic disk storage device, a flash memory device, or another volatile solid-state storage device. Correspondingly, the memory 402 may also include a memory controller to provide the processor 401 with access to the memory 402.
The RF circuit 403 may be used to receive and send signals in the course of sending and receiving information. In particular, after receiving downlink information from a base station, it passes the information to one or more processors 401 for processing; it also sends uplink data to the base station. Typically, the RF circuit 403 includes, but is not limited to, an antenna, at least one amplifier, a tuner, one or more oscillators, a subscriber identity module (SIM) card, a transceiver, a coupler, a low-noise amplifier (LNA), a duplexer and the like. In addition, the RF circuit 403 may communicate with networks and other devices through wireless communication. The wireless communication may use any communication standard or protocol, including but not limited to Global System for Mobile Communications (GSM), General Packet Radio Service (GPRS), Code Division Multiple Access (CDMA), Wideband Code Division Multiple Access (WCDMA), Long Term Evolution (LTE), e-mail, Short Messaging Service (SMS) and the like.
Server further includes the power supply 404 (such as battery) powered to all parts, it is preferred that power supply 404 can pass through Power-supply management system and processor 401 are logically contiguous, to realize management charging, electric discharge, Yi Jigong by power-supply management system The functions such as consumption management.Power supply 404 can also include one or more direct current or AC power source, recharging system, power supply The random components such as fault detection circuit, power adapter or inverter, power supply status indicator.
The server may further include an input unit 405, which may be used to receive input numeric or character information and to generate keyboard, mouse, joystick, optical, or trackball signal input related to user settings and function control. Specifically, in one embodiment, the input unit 405 may include a touch-sensitive surface and other input devices. The touch-sensitive surface, also called a touch display screen or touchpad, collects touch operations by the user on or near it (for example, operations performed by the user on or near the touch-sensitive surface with a finger, a stylus, or any other suitable object or accessory) and drives the corresponding connected device according to a preset program. Optionally, the touch-sensitive surface may include two parts: a touch detection device and a touch controller. The touch detection device detects the position of the user's touch, detects the signal produced by the touch operation, and passes the signal to the touch controller; the touch controller receives the touch information from the touch detection device, converts it into contact coordinates, sends them to the processor 401, and can receive and execute commands sent by the processor 401. The touch-sensitive surface may be implemented in several types, such as resistive, capacitive, infrared, and surface acoustic wave. Besides the touch-sensitive surface, the input unit 405 may include other input devices. Specifically, the other input devices may include, but are not limited to, one or more of a physical keyboard, function keys (such as volume control keys and an on/off key), a trackball, a mouse, a joystick, and the like.
The server may further include a display unit 406, which may be used to display information entered by the user or provided to the user, as well as the various graphical user interfaces of the server; these graphical user interfaces may be composed of graphics, text, icons, video, and any combination thereof. The display unit 406 may include a display panel, which may optionally be configured in the form of a liquid crystal display (LCD), an organic light-emitting diode (OLED) display, or the like. Further, the touch-sensitive surface may cover the display panel. When the touch-sensitive surface detects a touch operation on or near it, it passes the operation to the processor 401 to determine the type of the touch event, and the processor 401 then provides the corresponding visual output on the display panel according to the type of the touch event. Although in Fig. 9 the touch-sensitive surface and the display panel are shown as two separate components implementing the input and output functions, in some embodiments the touch-sensitive surface and the display panel may be integrated to implement the input and output functions.
Although not shown, the server may also include a camera, a Bluetooth module, and the like, which are not described here. Specifically, in this embodiment, the processor 401 in the server loads the executable files corresponding to the processes of one or more application programs into the memory 402 according to the following instructions, and runs the application programs stored in the memory 402, thereby implementing various functions, as follows:
obtaining, from a corpus, at least one corpus entry to be assessed and a first initial label of each corpus entry to be assessed;
determining a first label of the corpus entry to be assessed according to the corpus entry to be assessed and the remaining corpus entries in the corpus;
determining a second label of the corpus entry to be assessed by using a trained classification model;
determining, according to the first label and the second label, an assessment result for the first initial label of the corresponding corpus entry to be assessed.
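The four steps above can be sketched as a small Python routine. This is only an illustrative sketch under stated assumptions, not the implementation of the embodiment: the function name `assess_labels`, the `(text, label)` tuple layout, and the two callables `label_from_neighbors` and `label_from_classifier` are hypothetical names introduced here. The sketch also assumes that the "remaining" entries are simply all entries other than the one currently being assessed, and it reduces the assessment result to a count of how many of the two derived labels agree with the initial label.

```python
from typing import Callable, List, Tuple

Entry = Tuple[str, str]  # (text, initial label); this layout is an assumption


def assess_labels(
    corpus: List[Entry],
    indices_to_assess: List[int],
    label_from_neighbors: Callable[[str, List[Entry]], str],
    label_from_classifier: Callable[[str], str],
) -> List[Tuple[int, str, str, str, int]]:
    """Return (index, initial, first, second, agreement) for each assessed entry."""
    results = []
    for i in indices_to_assess:
        # Step 1: obtain the entry to be assessed and its first initial label.
        text, initial_label = corpus[i]
        # Step 2: derive the first label from the remaining corpus entries.
        remaining = corpus[:i] + corpus[i + 1:]
        first_label = label_from_neighbors(text, remaining)
        # Step 3: derive the second label from a trained classification model.
        second_label = label_from_classifier(text)
        # Step 4: compare the initial label with the two derived labels.
        agreement = int(initial_label == first_label) + int(initial_label == second_label)
        results.append((i, initial_label, first_label, second_label, agreement))
    return results
```

Passing the two labeling strategies in as callables keeps the orchestration independent of how similarity is computed or which classifier is used.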
The server can achieve the beneficial effects achievable by any device for assessing corpus labeling provided in the embodiments of this application; for details, see the foregoing embodiments, which are not repeated here.
A person of ordinary skill in the art will understand that all or some of the steps of the methods in the above embodiments may be completed by a program instructing the relevant hardware. The program may be stored in a computer-readable storage medium, and the storage medium may include a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disc, or the like.
The method, device, and storage medium for assessing corpus labeling provided in the embodiments of this application have been described in detail above. Specific examples are used herein to explain the principles and implementations of this application, and the description of the above embodiments is intended only to help in understanding the method of this application and its core idea. Meanwhile, a person skilled in the art may, following the idea of this application, make changes to the specific implementations and the scope of application. In conclusion, the content of this specification should not be construed as limiting this application.

Claims (10)

1. A method for assessing corpus labeling, characterized by comprising:
obtaining, from a corpus, at least one corpus entry to be assessed and a first initial label of each corpus entry to be assessed;
determining a first label of the corpus entry to be assessed according to the corpus entry to be assessed and the remaining corpus entries in the corpus;
determining a second label of the corpus entry to be assessed by using a trained classification model;
determining, according to the first label and the second label, an assessment result for the first initial label of the corresponding corpus entry to be assessed.
2. The method according to claim 1, characterized in that determining the first label of the corpus entry to be assessed according to the corpus entry to be assessed and the remaining corpus entries in the corpus specifically comprises:
determining a similarity between the corpus entry to be assessed and each remaining corpus entry in the corpus;
determining similar corpus entries from the remaining corpus entries according to the similarity;
obtaining second initial labels of the similar corpus entries;
determining the first label of the corpus entry to be assessed according to the second initial labels.
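As a hedged illustration of selecting similar entries by similarity, the sketch below keeps the `k` highest-scoring remaining entries and optionally drops those below a threshold. The claim does not state how "similar" is decided; the top-`k` rule, the threshold, and the names `select_similar`, `k`, and `min_similarity` are assumptions made only for this example.

```python
import heapq
from typing import List, Tuple


def select_similar(
    scored: List[Tuple[float, str]],
    k: int = 3,
    min_similarity: float = 0.0,
) -> List[str]:
    """Return the initial labels of the most similar remaining entries.

    `scored` holds (similarity, initial label) pairs, one per remaining entry.
    """
    # Keep the k pairs with the highest similarity, in descending order.
    top = heapq.nlargest(k, scored, key=lambda pair: pair[0])
    # Optionally discard entries whose similarity is below the threshold.
    return [label for similarity, label in top if similarity >= min_similarity]
```

The returned labels correspond to the second initial labels from which the first label is then derived.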
3. The method according to claim 2, characterized in that determining the similarity between the corpus entry to be assessed and each remaining corpus entry in the corpus specifically comprises:
determining a first word vector corresponding to each corpus entry to be assessed, and determining a second word vector corresponding to each remaining corpus entry in the corpus;
determining a corresponding first sentence vector according to the first word vector, and determining a corresponding second sentence vector according to the second word vector;
calculating the similarity between the corresponding corpus entry to be assessed and the remaining corpus entry according to the first sentence vector and the second sentence vector.
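One common way to go from word vectors to a sentence-level similarity is to average the word vectors into a sentence vector and compare sentence vectors by cosine similarity. The claim does not name either technique, so both the averaging and the cosine measure are assumptions here, as are the function names and the dictionary-based `word_vectors` lookup.

```python
import math
from typing import Dict, List, Sequence


def sentence_vector(words: Sequence[str], word_vectors: Dict[str, List[float]]) -> List[float]:
    """Average the vectors of the known words (averaging is an assumption)."""
    known = [word_vectors[w] for w in words if w in word_vectors]
    if not known:
        return []  # no known word: no sentence vector can be built
    dim = len(known[0])
    return [sum(vec[d] for vec in known) / len(known) for d in range(dim)]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 for empty, mismatched, or zero vectors."""
    if not a or not b or len(a) != len(b):
        return 0.0
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)
```

In practice the word vectors would come from a pretrained embedding model; the toy dictionary lookup only stands in for that step.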
4. The method according to claim 3, characterized in that determining the first word vector corresponding to each corpus entry to be assessed, and determining the second word vector corresponding to each remaining corpus entry in the corpus, specifically comprises:
splitting each corpus entry to be assessed into a plurality of first character fields, and splitting each remaining corpus entry in the corpus into a plurality of second character fields;
determining corresponding first keywords according to the first character fields, and determining corresponding second keywords according to the second character fields;
determining the corresponding first word vector according to the first keywords, and determining the corresponding second word vector according to the second keywords.
5. The method according to claim 2, characterized in that determining the first label of the corpus entry to be assessed according to the second initial labels specifically comprises:
grouping the similar corpus entries that have the same second initial label into one group, to obtain at least one similar corpus group;
counting the number of similar corpus entries in each similar corpus group;
using the second initial label corresponding to the similar corpus group with the largest number of entries as the first label of the corpus entry to be assessed.
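Grouping by label, counting each group, and taking the largest group amounts to a majority vote. A minimal sketch using `collections.Counter` follows; the function name `majority_label` is hypothetical, and the tie-breaking behavior (the label seen first wins when counts are equal) is an assumption, since the claim does not say how ties are resolved.

```python
from collections import Counter
from typing import Iterable


def majority_label(similar_labels: Iterable[str]) -> str:
    """Return the label shared by the largest group of similar entries."""
    # Counter groups identical labels and counts the entries in each group.
    groups = Counter(similar_labels)
    if not groups:
        raise ValueError("at least one similar entry is required")
    # most_common(1) yields the (label, count) pair of the largest group.
    label, _count = groups.most_common(1)[0]
    return label
```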
6. The method according to claim 1, characterized in that, before determining the second label of the corpus entry to be assessed by using the trained classification model, the method further comprises:
obtaining a corpus sample set and a third initial label of each corpus sample in the corpus sample set;
training a preset classification model by using the corpus sample set and the third initial labels, to obtain the trained classification model.
7. The method according to claim 1, characterized in that determining, according to the first label and the second label, the assessment result for the first initial label of the corresponding corpus entry to be assessed specifically comprises:
judging whether the first initial label of the corpus entry to be assessed is identical to the corresponding first label and to the corresponding second label;
if the first initial label of the corpus entry to be assessed is identical to both the corresponding first label and the corresponding second label, using a result indicating correctness as the assessment result for the first initial label of the corpus entry to be assessed;
if the first initial label of the corpus entry to be assessed is identical to the corresponding first label or the corresponding second label, but not both, using a result indicating a suspected error as the assessment result for the first initial label of the corpus entry to be assessed;
if the first initial label of the corpus entry to be assessed differs from both the corresponding first label and the corresponding second label, using a result indicating high suspicion as the assessment result for the first initial label of the corpus entry to be assessed.
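The three outcomes above can be written as a small decision function. This is a sketch only: the function name and the result strings `"correct"`, `"suspected_error"`, and `"highly_suspicious"` are hypothetical, and the middle case is read as agreement with exactly one of the two derived labels.

```python
def assess_initial_label(initial_label: str, first_label: str, second_label: str) -> str:
    """Classify an initial label by how many derived labels agree with it."""
    # Booleans add as integers: 0, 1, or 2 derived labels match.
    matches = (initial_label == first_label) + (initial_label == second_label)
    if matches == 2:
        return "correct"  # both derived labels agree with the initial label
    if matches == 1:
        return "suspected_error"  # exactly one derived label agrees
    return "highly_suspicious"  # neither derived label agrees
```

Entries flagged as highly suspicious would be natural first candidates for manual review.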
8. A device for assessing corpus labeling, characterized by comprising:
an obtaining module, configured to obtain, from a corpus, at least one corpus entry to be assessed and a first initial label of each corpus entry to be assessed;
a first determining module, configured to determine a first label of the corpus entry to be assessed according to the corpus entry to be assessed and the remaining corpus entries in the corpus;
a second determining module, configured to determine a second label of the corpus entry to be assessed by using a trained classification model;
a third determining module, configured to determine, according to the first label and the second label, an assessment result for the first initial label of the corresponding corpus entry to be assessed.
9. The device according to claim 8, characterized in that the first determining module specifically comprises:
a first determining unit, configured to determine a similarity between the corpus entry to be assessed and each remaining corpus entry in the corpus;
a second determining unit, configured to determine similar corpus entries from the remaining corpus entries according to the similarity;
an obtaining unit, configured to obtain second initial labels of the similar corpus entries;
a third determining unit, configured to determine the first label of the corpus entry to be assessed according to the second initial labels.
10. A computer-readable storage medium, characterized in that the storage medium stores a plurality of instructions, the instructions being suitable for loading by a processor to perform the method for assessing corpus labeling according to any one of claims 1 to 7.
CN201910668462.3A 2019-07-23 2019-07-23 Appraisal procedure, device and the storage medium of corpus labeling Pending CN110427622A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910668462.3A CN110427622A (en) 2019-07-23 2019-07-23 Appraisal procedure, device and the storage medium of corpus labeling

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910668462.3A CN110427622A (en) 2019-07-23 2019-07-23 Appraisal procedure, device and the storage medium of corpus labeling

Publications (1)

Publication Number Publication Date
CN110427622A true CN110427622A (en) 2019-11-08

Family

ID=68412045

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910668462.3A Pending CN110427622A (en) 2019-07-23 2019-07-23 Appraisal procedure, device and the storage medium of corpus labeling

Country Status (1)

Country Link
CN (1) CN110427622A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111144088A (en) * 2019-12-09 2020-05-12 深圳市优必选科技股份有限公司 Corpus management method, corpus management device and electronic equipment
CN112329430A (en) * 2021-01-04 2021-02-05 恒生电子股份有限公司 Model training method, text similarity determination method and text similarity determination device
CN112925910A (en) * 2021-02-25 2021-06-08 中国平安人寿保险股份有限公司 Method, device and equipment for assisting corpus labeling and computer storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060074634A1 (en) * 2004-10-06 2006-04-06 International Business Machines Corporation Method and apparatus for fast semi-automatic semantic annotation
CN102662930A (en) * 2012-04-16 2012-09-12 乐山师范学院 Corpus tagging method and corpus tagging device
CN108170670A (en) * 2017-12-08 2018-06-15 东软集团股份有限公司 Distribution method, device, readable storage medium storing program for executing and the electronic equipment of language material to be marked
CN109739956A (en) * 2018-11-08 2019-05-10 第四范式(北京)技术有限公司 Corpus cleaning method, device, equipment and medium
CN109992763A (en) * 2017-12-29 2019-07-09 北京京东尚科信息技术有限公司 Language marks processing method, system, electronic equipment and computer-readable medium

Similar Documents

Publication Publication Date Title
JP6594534B2 (en) Text information processing method and device
TWI729472B (en) Method, device and server for determining feature words
US10990511B2 (en) Apparatus and application interface traversing method
US10803241B2 (en) System and method for text normalization in noisy channels
CN107741937A (en) A kind of data query method and device
CN110427622A (en) Appraisal procedure, device and the storage medium of corpus labeling
CN102830924B (en) A kind of method and device adjusting entering method keyboard
US9183598B2 (en) Identifying event-specific social discussion threads
CN104423623B (en) It is a kind of to select word treatment method and electronic equipment
CN105335653A (en) Abnormal data detection method and apparatus
EP3953853A1 (en) Leveraging a collection of training tables to accurately predict errors within a variety of tables
CN110069769A (en) Using label generating method, device and storage equipment
CN104102704A (en) System control displaying method and system control displaying device
CN111221690B (en) Model determination method and device for integrated circuit design and terminal
CN110059312A (en) Short phrase picking method, apparatus and electronic equipment
CN112817514B (en) Content extraction method and device
US20180275981A1 (en) Determining candidate patches for a computer software
CN110765237B (en) Document processing method and device, storage medium and electronic equipment
CN111737398A (en) Method and device for searching sensitive words in text, electronic equipment and storage medium
CN111753548A (en) Information acquisition method and device, computer storage medium and electronic equipment
CN109803173A (en) A kind of video transcoding method, device and storage equipment
CN110807330A (en) Semantic understanding model evaluation method and device and storage medium
JP6667452B2 (en) Method and apparatus for inputting text information
CN109800099A (en) A kind of restoring method, storage medium and the terminal device of user's operation behavior
CN114880242B (en) Test case extraction method, device, equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination