CN110209764A - Method and apparatus for generating an annotated corpus set, electronic device, and storage medium - Google Patents
Method and apparatus for generating an annotated corpus set, electronic device, and storage medium
- Publication number
- CN110209764A (application CN201811048957.8A / CN201811048957A)
- Authority
- CN
- China
- Prior art keywords
- query statement
- annotation result
- corpus
- annotated
- annotation
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3344—Query execution using natural language analysis
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/26—Speech to text systems
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Theoretical Computer Science (AREA)
- Acoustics & Sound (AREA)
- Human Computer Interaction (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Multimedia (AREA)
- Artificial Intelligence (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Health & Medical Sciences (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The present invention discloses a method and apparatus for generating an annotated corpus set, an electronic device, and a computer-readable storage medium. In the technical solution provided by the invention, a corpus set to be annotated is obtained from a query log, annotation results for the query statements in that corpus set are obtained from multiple parties, the query statements whose annotation results are similar are screened out, and an annotated corpus set is then formed from these query statements and their corresponding annotation results. Because the query statements in the annotated corpus set are those on which the parties' annotation results agree, the annotation results in the set are unlikely to be disputed and are therefore more accurate; using this more accurate annotated corpus set as a training set for data analysis models such as an intent recognition model improves the accuracy of those models.
Description
Technical field
The present invention relates to the field of computer technology, and in particular to a method and apparatus for generating an annotated corpus set, an electronic device, and a computer-readable storage medium.
Background technique
In the field of voice interaction, query statements entered by users are analyzed online by various data analysis models to identify user intent and provide accurate replies. A data analysis model is trained on a large number of annotated query statements (the training set). The accuracy of the annotation results for the query statements in the training set therefore directly affects the accuracy of the data analysis model and determines how intelligent the voice interaction function can be.
At present, query statements are mainly annotated manually by annotators, for example by labelling each query statement's intent (chit-chat intent, music-on-demand intent, weather-lookup intent, and so on). The annotation accuracy of a query statement is therefore determined by the annotator's own understanding.
Because an annotator's understanding may differ from that of ordinary users, or the annotator may misread a particular query statement, the query statements in the training set are easily annotated inaccurately. The resulting data analysis model then has a large error and cannot provide accurate replies to users.
Summary of the invention
To solve the problem in the related art that the annotation results of query statements in the training set are inaccurate because of deviations in annotators' understanding, the present invention provides a method for generating an annotated corpus set.
In one aspect, the present invention provides a method for generating an annotated corpus set, comprising:
obtaining a query log, the query log including query statements;
extracting query statements to be annotated from the query log to obtain a corpus set to be annotated;
obtaining annotation results from multiple parties for the query statements in the corpus set to be annotated;
screening, from the corpus set to be annotated and according to the parties' annotation results for the same query statement, the query statements whose annotation results are similar; and
generating an annotated corpus set from the query statements with similar annotation results and the corresponding annotation results.
In another aspect, the present invention provides an apparatus for generating an annotated corpus set, comprising:
a log acquisition module, configured to obtain a query log, the query log including query statements;
a corpus acquisition module, configured to extract query statements to be annotated from the query log to obtain a corpus set to be annotated;
a result acquisition module, configured to obtain annotation results from multiple parties for the query statements in the corpus set to be annotated;
a statement screening module, configured to screen, from the corpus set to be annotated and according to the parties' annotation results for the same query statement, the query statements whose annotation results are similar; and
an annotation set generation module, configured to generate an annotated corpus set from the query statements with similar annotation results and the corresponding annotation results.
Further, the present invention provides an electronic device, comprising:
a processor; and
a memory for storing instructions executable by the processor;
wherein the processor is configured to perform the above method for generating an annotated corpus set.
Further, the present invention provides a computer-readable storage medium storing a computer program which, when executed by a processor, performs the above method for generating an annotated corpus set.
The technical solution provided by the embodiments of the present invention can bring the following benefits.
A corpus set to be annotated is obtained from a query log, annotation results from multiple parties for the query statements in that corpus set are obtained, the query statements with identical annotation results are screened out, and an annotated corpus set is then formed from these query statements and their corresponding annotation results. Because the query statements in the annotated corpus set are those on which the parties' annotation results agree, the annotation results in the set are unlikely to be disputed and are therefore more accurate; using this more accurate annotated corpus set as a training set for data analysis models such as an intent recognition model improves the accuracy of those models.
It should be understood that the above general description and the following detailed description are merely exemplary and do not limit the invention.
Brief description of the drawings
The drawings herein are incorporated into and form part of this specification; they show embodiments consistent with the invention and, together with the specification, serve to explain the principles of the invention.
Fig. 1 is a schematic diagram of an implementation environment according to the present invention;
Fig. 2 is a block diagram of a server according to an exemplary embodiment;
Fig. 3 is a flowchart of a method for generating an annotated corpus set according to an exemplary embodiment;
Fig. 4 is a schematic diagram of the annotation results of several annotation tasks;
Fig. 5 is a schematic diagram of the principle for dividing the annotation sets;
Fig. 6 is a schematic curve showing the influence of each batch of the annotated corpus set on model performance;
Fig. 7 is a detailed flowchart of step 330 in the embodiment corresponding to Fig. 3;
Fig. 8 is a schematic diagram of the principle for generating an annotated corpus set according to an exemplary embodiment;
Fig. 9 is a detailed flowchart of step 350 in the embodiment corresponding to Fig. 3;
Fig. 10 is a detailed flowchart of step 370 in the embodiment corresponding to Fig. 3;
Fig. 11 is a flowchart of a method for generating an annotated corpus set based on the embodiment corresponding to Fig. 3;
Fig. 12 is a block diagram of an apparatus for generating an annotated corpus set according to an exemplary embodiment;
Fig. 13 is a detailed block diagram of the corpus acquisition module in the embodiment corresponding to Fig. 12;
Fig. 14 is a detailed block diagram of the result acquisition module in the embodiment corresponding to Fig. 12;
Fig. 15 is a detailed block diagram of the statement screening module in the embodiment corresponding to Fig. 12.
Specific embodiments
Exemplary embodiments are described in detail here, with examples illustrated in the accompanying drawings. Unless otherwise indicated, the same numerals in different drawings denote the same or similar elements when the following description refers to the drawings. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present invention; rather, they are merely examples of apparatuses and methods, consistent with some aspects of the invention, as detailed in the appended claims.
Fig. 1 is a schematic diagram of an implementation environment according to an exemplary embodiment of the present invention. The implementation environment involved in the present invention includes a server 110. A query log is stored in the server 110, so that the server 110 can use the method for generating an annotated corpus set provided by the present invention to generate an annotated corpus set from the query log, improving the accuracy of the annotation results of the query statements in the annotated corpus set.
As needed, the implementation environment may also include the source that provides the data, i.e. the query log. Specifically, in this implementation environment the data source may be an intelligent terminal 130. The server 110 can obtain the query log uploaded by the intelligent terminal 130 and then generate the annotated corpus set using the method provided by the present invention. The intelligent terminal 130 may be a smartphone, a smart speaker, or a tablet computer.
It should be noted that the method for generating an annotated corpus set of the present invention is not limited to processing logic deployed in the server 110; it may also be processing logic deployed in other machines, for example, processing logic for generating an annotated corpus set deployed in a terminal device with computing capability.
Referring to Fig. 2, Fig. 2 is a schematic diagram of a server structure provided by an embodiment of the present invention. The server 200 may vary considerably depending on configuration or performance, and may include one or more central processing units (CPUs) 222 (for example, one or more processors), a memory 232, and one or more storage media 230 (such as one or more mass storage devices) storing application programs 242 or data 244. The memory 232 and the storage medium 230 may be transient or persistent storage. The programs stored in the storage medium 230 may include one or more modules (not shown), and each module may include a series of instruction operations on the server 200. Further, the central processing unit 222 may be configured to communicate with the storage medium 230 and to execute, on the server 200, the series of instruction operations in the storage medium 230. The server 200 may also include one or more power supplies 226, one or more wired or wireless network interfaces 250, one or more input/output interfaces 258, and/or one or more operating systems 241, such as Windows Server(TM), Mac OS X(TM), Unix(TM), Linux(TM), FreeBSD(TM), and so on. The steps performed by the server in the embodiments shown in Fig. 3 and Figs. 7-11 below may be based on the server structure shown in Fig. 2.
Those of ordinary skill in the art will appreciate that all or part of the steps of the following embodiments may be implemented by hardware, or by a program instructing relevant hardware; the program may be stored in a computer-readable storage medium, and the storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disc, or the like.
Fig. 3 is a flowchart of a method for generating an annotated corpus set according to an exemplary embodiment. The method is applicable to, and executed by, a server, which may be the server 110 of the implementation environment shown in Fig. 1. As shown in Fig. 3, the method for generating an annotated corpus set may be executed by the server 110 and may include the following steps.
In step 310, a query log is obtained.
A query log is the record kept by an appliance of the query statements entered by users; the appliance may be a smart speaker, a mobile terminal, and so on. The query log may include the time, the query statement entered by the user, the query result returned to the user, etc. The query statement entered by the user may be in text or speech form. A query log may contain a large number of query statements entered by one or more users, so it can be regarded as a raw corpus of query statements. A raw corpus refers to query statements from real users in their original form, without any manual annotation.
In step 330, query statements to be annotated are extracted from the query log to obtain a corpus set to be annotated.
It should be noted that although the query log contains a large number of query statements, not all of them are useful: some are meaningless, unrepresentative inputs typed arbitrarily by users, some are too long or too short, and many are duplicates. If the annotation results of these query statements were put into the annotated corpus set, they would reduce the accuracy of the annotation results in the set and, in turn, the accuracy of any data analysis model trained with the annotated corpus set as training samples.
The present invention therefore extracts the query statements to be annotated from the query log according to a pre-configured strategy, and these query statements form the corpus set to be annotated. Extracting the query statements to be annotated may consist of analyzing the query log and, according to a configured table of useless/stop characters, removing query statements that contain useless/stop characters, removing meaningless query statements (such as a few unconnected characters entered arbitrarily), removing query statements that are too long or too short, removing duplicate query statements, and removing query statements that have already been annotated; the remaining query statements are the query statements to be annotated. A minimal sketch of this rule-based filtering is given below.
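The following Python snippet is a minimal sketch of this kind of pre-configured filtering strategy; the character table, the length bounds, and the helper names are illustrative assumptions rather than values taken from the patent.

```python
STOP_CHARS = set("~@#$%^&*")        # assumed useless/stop character table
MIN_LEN, MAX_LEN = 2, 50            # assumed bounds for "too short" / "too long"

def extract_candidates(query_log, already_annotated):
    """Filter a raw query log down to candidate statements for annotation."""
    seen, candidates = set(), []
    for query in query_log:
        q = query.strip()
        if not q or q in seen or q in already_annotated:
            continue                                   # drop duplicates and already-annotated queries
        if any(ch in STOP_CHARS for ch in q):
            continue                                   # drop queries containing useless/stop characters
        if not (MIN_LEN <= len(q) <= MAX_LEN):
            continue                                   # drop queries that are too short or too long
        seen.add(q)
        candidates.append(q)
    return candidates
```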
In step 350, annotation results from multiple parties for the query statements in the corpus set to be annotated are obtained.
The multiple parties may be several annotators, several annotation devices, or several annotation programs on one device; the term indicates that the annotation results for the query statements in the corpus set to be annotated have multiple sources. For ease of description, an annotator, annotation device or annotation program is hereinafter referred to as an annotating party. Each annotating party can annotate (in effect, "vote" on) the query statements in the corpus set to be annotated. Annotating means adding a classification label to a query statement in the corpus set to be annotated; the results of multiple "votes" then reflect the correct classification of the query statement.
An annotation result is the classification label added to a query statement by an annotating party. Depending on the annotation task, an annotation result may be an intent annotation result, an NER (Named Entity Recognition) annotation result, a slot annotation result, or a word-segmentation annotation result. An intent annotation result is an intent classification result: for example, for "I'm feeling down today", an annotating party's intent annotation result is "chit-chat intent"; for "please play me a relaxing song", the intent annotation result is "music-on-demand intent".
An NER annotation result marks the person names, place names, organization names, proper nouns, and so on in a query statement. A slot annotation result adds a slot label to each phrase in a query statement; in the weather domain, for example, the slot labels include time words, place words, weather keywords, weather-phenomenon words, interrogatives, etc. A word-segmentation annotation result splits a query statement into several phrases; the phrases together can be regarded as the segmentation annotation result, each phrase being one label.
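For concreteness, one possible record shape for a fully annotated query statement is sketched below; the concrete query text and label strings are illustrative, not taken from the patent, although the weather-domain slot labels follow the categories listed above.

```python
# one query statement with the four kinds of annotation results described above
annotated_example = {
    "query": "what is the weather in Beijing tomorrow",
    "intent": "weather_lookup",                                    # intent annotation result
    "ner": {"Beijing": "place_name"},                              # NER annotation result
    "slots": {"tomorrow": "time_word", "Beijing": "place_word",    # slot annotation result
              "weather": "weather_keyword", "what": "interrogative"},
    "segmentation": ["what", "is", "the", "weather", "in", "Beijing", "tomorrow"],
}
```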
As shown in Fig. 4, for the corpus set to be annotated, an annotating party can perform intent annotation, NER annotation, slot annotation or word-segmentation annotation, obtaining the annotation results of each annotation task. Specifically, each party may first perform intent annotation on the query statements in the corpus set to be annotated (according to an intent annotation guideline document), obtaining an intent annotation set containing the intent annotation results of the query statements. The query statements can then be divided into domains according to the intent annotation, and within each domain NER annotation (according to an NER annotation guideline document) and slot annotation (according to a slot annotation guideline document) are performed at the same time, yielding an NER annotation set containing the NER annotation results and a slot annotation set containing the slot annotation results. While performing intent annotation, each annotating party can also perform word-segmentation annotation on the corpus set to be annotated, obtaining a segmentation annotation set containing the segmentation annotation results.
The intent annotation set, slot annotation set, NER annotation set or segmentation annotation set can be stored in the storage medium of the server, and the server can obtain from the storage medium the multiple parties' annotation results for the query statements in the corpus set to be annotated.
In step 370, according to the multiple parties' annotation results for the same query statement, the query statements whose annotation results are similar are screened out from the corpus set to be annotated.
A query statement with similar annotation results is one for which the parties' annotation results are identical or similar; if the similarity of the parties' annotation results exceeds a preset value, for example 80% or 90%, the query statement can be regarded as having similar annotation results.
In one embodiment, assume the annotation results are intent annotation results. The server obtains the parties' intent annotation results for the query statements in the corpus set to be annotated and, for each query statement in turn, compares the parties' intent annotation results for that statement and judges whether they are consistent (a similarity above the preset value can be regarded as consistent), thereby screening out from the corpus set to be annotated the query statements on which the parties' annotation results agree.
Specifically, if the parties' annotation results for a query statement are all consistent, the statement is added to a single-label annotation set. If they are inconsistent, a reviewer is needed to examine the specific situation:
i) if more than half of the annotating parties agree, the label they agree on is taken as the annotation result and the statement is added to the single-label annotation set;
ii) if the inconsistent results are split 1:1, the statement may be a multi-label case (one statement may carry several labels); a reviewer confirms whether it is a multi-label case, and if so the statement is added to a multi-label annotation set;
iii) if the parties' annotation results are all different, the statement may be a multi-label sample or a hard sample; after review it is added to the multi-label annotation set or to a hard-sample set.
In this way, one annotation task finally yields three annotation sets: a single-label annotation set, a multi-label annotation set and a hard-sample set. The query statements in the single-label annotation set can be considered to have identical annotation results across the parties. The single-label annotation set can be regarded as a reliable annotation set and can be used as a training set, test set, etc. for an intent recognition model; a sketch of this vote-based routing is given below.
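The Python sketch below illustrates this vote-based routing under the assumption that each party contributes exactly one label per query statement; the set names and the example queries are illustrative, not taken from the patent.

```python
from collections import Counter

def route_by_votes(labels):
    """Decide which annotation set a query belongs to from its per-party labels (votes)."""
    counts = Counter(labels)
    top_label, top_count = counts.most_common(1)[0]
    if top_count == len(labels) or top_count > len(labels) / 2:
        return "single_label", top_label           # full or majority agreement: keep the majority label
    if top_count == len(labels) / 2:
        return "multi_label_review", None           # 1:1 split: reviewer checks for a multi-label case
    return "hard_or_multi_label_review", None       # no agreement: reviewer decides the set

single_label_set, review_queue = {}, []
votes = {
    "please play me a relaxing song": ["music_on_demand"] * 4,
    "I'm feeling down today": ["chitchat", "chitchat", "music_on_demand", "music_on_demand"],
}
for query, labels in votes.items():
    bucket, label = route_by_votes(labels)
    if bucket == "single_label":
        single_label_set[query] = label
    else:
        review_queue.append((query, bucket))
```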
Similarly, when the annotation results are NER annotation results, slot annotation results or word-segmentation annotation results, the query statements on which the parties' annotation results are identical can be screened out in the same way.
In step 390, an annotated corpus set is generated from the query statements with similar annotation results and the corresponding annotation results.
The annotated corpus set includes query statements and their corresponding annotation results, where the query statements are those screened out in step 370 as having similar annotation results across the parties. Using these query statements and their annotation results, the server generates the annotated corpus set consisting of the query statements and the annotation results.
As shown in Fig. 5, the query statements in the corpus set to be annotated are annotated by multiple annotating parties. The server obtains annotation result 1 from party 1, annotation result 2 from party 2, annotation result 3 from party 3, and annotation result 4 from party 4, each for the query statements in the corpus set to be annotated. The server merges annotation results 1, 2, 3 and 4, screens out the query statements for which the four annotation results are consistent, and adds them to the single-label annotation set. If more than half of the annotation results for a query statement are consistent, the parties' annotation results can also be regarded as agreeing; the result agreed on by the majority is taken as the annotation result of the query statement, and the statement is added to the single-label annotation set. The single-label annotation set is used as the annotated corpus set and can be merged with the corpus already annotated to serve as a training set or test set.
As shown in Fig. 5, assume that the annotation results 1, 2, 3 and 4 of some query statements are inconsistent and split 1:1; these statements are probably multi-label cases, a reviewer confirms this, and the statements are added to the multi-label annotation set. Assume that the annotation results 1, 2, 3 and 4 of some query statements are all different; they may be multi-label samples or hard samples, and these statements are added to the multi-label annotation set or the hard-sample annotation set.
It should be understood that the annotated corpus set is the single-label annotation set, and the query statements it contains are those on which the parties' annotation results are identical. In other words, there is no disagreement about the annotation results of the query statements in the annotated corpus set and the annotation results are highly accurate, so the annotated corpus set can be used as a high-accuracy training set or test set for training a data analysis model.
For example, if the annotation results are intent annotation results, the annotated corpus set contains query statements with identical intent annotation results and their corresponding intent annotation results, and can be used as a training set for training an intent recognition model. If the annotation results are NER annotation results, the annotated corpus set contains query statements with identical NER annotation results and their corresponding NER annotation results, and can be used as a training set for training a named entity recognition model. Similarly, if the annotation results are slot annotation results, the annotated corpus set can be used as a training set for training a slot labelling model; if they are word-segmentation annotation results, it can be used as a training set for training a word-segmentation model.
In the technical solution provided by the above exemplary embodiments of the present invention, a corpus set to be annotated is obtained from a query log, annotation results from multiple parties for the query statements in that corpus set are obtained, the query statements with identical annotation results are screened out, and an annotated corpus set is formed from these query statements and their corresponding annotation results. Because the query statements in the annotated corpus set are those on which the parties' annotation results agree, the annotation results in the set are unlikely to be disputed and are more accurate; using this more accurate annotated corpus set as a training set to train data analysis models such as an intent recognition model improves the accuracy of those models.
As needed, the annotated corpus set can be added to the existing training set incrementally, on top of it, and the data analysis model retrained; the performance of the model is tested with the same test set to assess the performance gain brought by the newly added annotated corpus set, which reflects the quality and value of the new annotated corpus set.
Take an intent annotation corpus set as an example. The performance of the recognition model on the test set is first taken as the benchmark. Each newly obtained batch of the annotated corpus set is then added to the training set, and the performance of the model trained after each batch is added is recorded. The curve in Fig. 6 records the performance of the model trained after each batch of the annotated corpus set is added; the sixth batch (s6) brings an obvious performance gain to the trained model, so that batch of the annotated corpus set can be selected for inclusion in the training data.
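A minimal sketch of this batch-wise assessment loop follows; `train_model` and `evaluate` are placeholders for whatever training routine and test-set metric are actually used, and the gain bookkeeping is only an assumption about how a curve like the one in Fig. 6 might be produced.

```python
def assess_batches(base_train_set, batches, test_set, train_model, evaluate):
    """Add annotated batches one at a time and record the test-set metric after each addition."""
    train_set = list(base_train_set)
    prev = evaluate(train_model(train_set), test_set)        # benchmark on the fixed test set
    history = [("baseline", prev, 0.0)]
    for i, batch in enumerate(batches, start=1):
        train_set.extend(batch)                              # superimpose the new batch on the training set
        metric = evaluate(train_model(train_set), test_set)
        history.append((f"s{i}", metric, metric - prev))     # gain attributable to this batch
        prev = metric
    return history
```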
In addition, if the annotating parties are human annotators, a staggered annotation schedule can be used in each cycle to prevent annotators from copying each other's results during annotation, as shown in Table 1 below.
Table 1 Task schedule for staggered annotation
|             | Day 1 | Day 2 | Day 3 | Day 4 | Day 5 |
| Annotator 1 | Doc 1 | Doc 5 | Doc 4 | Doc 3 | Doc 2 |
| Annotator 2 | Doc 2 | Doc 1 | Doc 5 | Doc 4 | Doc 3 |
| Annotator 3 | Doc 3 | Doc 2 | Doc 1 | Doc 5 | Doc 4 |
| Annotator 4 | Doc 4 | Doc 3 | Doc 2 | Doc 1 | Doc 5 |
Taking annotation by four annotators as an example, to prevent the annotators from referring to each other's annotation results, no two annotators work on the same content on the same day. Annotation can be arranged according to the schedule in Table 1, with five days as one cycle; the results are then collected and the consistent and inconsistent annotation results among the annotators are counted.
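The rotation in Table 1 can be generated mechanically; the sketch below reproduces it under the assumption that each annotator starts at their own document and steps backwards through the document list by one position per day (the names and document labels are illustrative).

```python
def staggered_schedule(annotators, documents):
    """Build a rotation in which no two annotators work on the same document on the same day."""
    days = len(documents)
    return {
        annotator: [documents[(idx - day) % days] for day in range(days)]
        for idx, annotator in enumerate(annotators)   # annotator k starts at document k, then steps back daily
    }

plan = staggered_schedule(
    ["Annotator 1", "Annotator 2", "Annotator 3", "Annotator 4"],
    ["Doc 1", "Doc 2", "Doc 3", "Doc 4", "Doc 5"],
)
# plan["Annotator 1"] == ["Doc 1", "Doc 5", "Doc 4", "Doc 3", "Doc 2"], matching the first row of Table 1
```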
In an exemplary embodiment, as shown in Fig. 7, the above step 330 specifically includes the following.
In step 331, the query statements in the query log that do not satisfy a preset condition are removed.
Query statements that do not satisfy the preset condition may include one or more of the following: query statements containing useless/stop characters, meaningless query statements, query statements that are too long or too short, duplicate query statements, and so on. Removing them avoids annotating worthless query statements later, which would both increase the workload and reduce the accuracy of the annotated corpus set.
In step 332, the query statements remaining in the query log are input into several pre-built label prediction models, and the label prediction results of the models for the same query statement are output; the label prediction models are obtained by training with different training sample sets.
Specifically, a label prediction model may be an intent recognition model used to identify the intent of a query statement, in which case the label prediction result is an intent recognition result. A label prediction model can be trained with a large number of query statements whose intent annotation results are known (i.e. a training sample set), and several label prediction models can be obtained by training with different training samples. For example, all the query statements with known intent annotation results can be divided into four batches, each batch trains a corresponding intent recognition model, and four intent recognition models are thus obtained.
After the unqualified query statements described above are removed, the query statements remaining in the query log are input into the four intent recognition models respectively, and the four models' intent recognition results for the same query statement are output.
It should be noted that, depending on the annotation task, a label prediction model may also be a named entity recognition model, a slot labelling model or a word-segmentation model; these models can be trained with a large number of query statements with known NER annotation results, known slot annotation results or known word-segmentation annotation results, respectively. Correspondingly, the label prediction result may be a named entity recognition result, a slot annotation result or a word-segmentation result. How a label prediction model is built belongs to the prior art and is not described here.
In step 333, according to the label prediction results of the label prediction models for the same query statement, the query statements on which the label prediction results are inconsistent are screened out of the remaining query statements to obtain the corpus set to be annotated.
Boundary samples are significant for the decision boundary of the trained model: if more samples are found whose intent has a certain probability distribution across different classes, adding such samples to the training set for model training improves the model's performance more than adding samples that can already be classified accurately.
The present invention therefore screens out, from the query statements remaining in the query log, the query statements on which the label prediction models disagree. In other words, the models' recognition accuracy on these query statements is low, so they are regarded as boundary samples; adding these boundary samples to the corpus set to be annotated and using them for model training improves the accuracy of the model.
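The disagreement filter of step 333 can be sketched as follows, assuming each label prediction model is available as a callable that returns a single label for a query; the callables themselves are placeholders.

```python
def select_disputed_queries(queries, predictors):
    """Keep only the queries on which the pre-built label prediction models disagree."""
    disputed = []
    for query in queries:
        predictions = {predict(query) for predict in predictors}   # one predicted label per model
        if len(predictions) > 1:                                   # models disagree: likely a boundary sample
            disputed.append(query)
    return disputed

# corpus_to_annotate = select_disputed_queries(remaining_queries, [model_1, model_2, model_3, model_4])
```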
In an exemplary embodiment, the above step 331 may include the following step:
classifying the query statements recorded in the query log with a pre-built classifier, and removing the query statements classified as meaningless.
A meaningless query statement is one without any specific intent, probably entered by the user by mistake or at random. A classifier is a classification model; its role here is to decide, for each query statement in the query log, whether the statement is meaningful or meaningless. The classifier can be trained with a large number of meaningful query statements and meaningless query statements; for example, the parameters of a logistic regression model can be trained with such data to obtain the classifier. "Classifier" is the general term in data mining for methods that classify samples; classifiers can be built with decision trees, logistic regression, naive Bayes, neural networks and other algorithms.
Specifically, the query statements in the query log are input into the trained classifier, which outputs a meaningful/meaningless judgment for each, and the meaningless query statements are then removed from the query log. Optionally, the query statements containing useless or stop characters can also be removed from the query log according to a configured table of useless or stop characters.
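As an illustration only (the patent does not prescribe a particular toolkit or feature set), a meaningful/meaningless classifier of this kind could be sketched with scikit-learn's logistic regression over character n-grams; the training examples here are toy data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# toy labelled data: 1 = meaningful query statement, 0 = meaningless input
texts = ["play a relaxing song", "what is the weather tomorrow", "asdf qwer", "kkkkk"]
labels = [1, 1, 0, 0]

clf = make_pipeline(
    TfidfVectorizer(analyzer="char", ngram_range=(1, 3)),   # character n-gram features
    LogisticRegression(),
)
clf.fit(texts, labels)

queries = ["play some light music", "zzzz qq"]
meaningful = [q for q, y in zip(queries, clf.predict(queries)) if y == 1]   # keep predicted-meaningful queries
```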
In another exemplary embodiment, the above step 331 may also include the following step:
removing from the query log, according to the set of query statements that have already been annotated, the annotated query statements and the query statements similar to them.
The set of annotated query statements is the set of query statements whose annotation results are known; it may be an annotated corpus set generated previously. According to the query statements contained in that set, the query statements belonging to the set can be removed from the query statements of the query log. An annotated query statement is a query statement in that set; query statements similar to an annotated query statement can be found by computing the similarity between query statements, so that the query statements in the query log with a high similarity to an annotated query statement are removed.
In other words, the query statements remaining in step 332 may be the query statements left in the query log after removing the meaningless query statements, the query statements containing useless or stop characters, the annotated query statements, and the query statements similar to annotated query statements.
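One way to realize this similarity-based removal is sketched below; the 0.9 cut-off and the use of a character-level sequence ratio are assumptions, not values given in the patent.

```python
from difflib import SequenceMatcher

def drop_annotated_and_near_duplicates(queries, annotated_queries, threshold=0.9):
    """Remove queries that are already annotated or highly similar to an annotated query."""
    annotated = list(set(annotated_queries))
    kept = []
    for q in queries:
        if q in annotated:
            continue                                    # exact match with an annotated query
        if any(SequenceMatcher(None, q, a).ratio() >= threshold for a in annotated):
            continue                                    # too similar to an annotated query
        kept.append(q)
    return kept
```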
In another exemplary embodiment, the above step 331 may also include the following step:
removing from the query log the query statements that contain only a single entity word, the query statements whose length exceeds a preset number of characters, and the duplicate query statements.
An entity word is the name of a real, specific thing, for example a song title or a singer's name. A query statement containing only one entity word is hard to assign an intent to, to segment, and so on, so it is not suitable for inclusion in the annotated corpus set for modelling. A query statement longer than the preset number of characters is a long query statement; such statements are difficult to annotate and, because of their length, increase the amount of computation when used for modelling, so they too are unsuitable for the annotated corpus set. Likewise, duplicate query statements in the query log do not need to be added to the annotated corpus set, so duplicates are removed; for example, if a query statement appears three times, two copies can be removed and only one kept.
In summary, the query statements remaining in step 332 may also be the query statements finally left in the query log after removing the query statements containing only a single entity word, the query statements longer than the preset number of characters, and the duplicate query statements.
As shown in Fig. 8, newly added query statements are pre-processed: meaningless query statements are removed, useless/stop characters are removed, single-entity-word query statements are removed, over-long and duplicate query statements are removed, and, according to the set of annotated query statements, the query statements that have been annotated and those highly similar to the annotated set are removed. Then, through steps 332 and 333 above, the query statements on which the label prediction results are inconsistent are screened out, and the screened query statements form the corpus set to be annotated. Next, according to the multiple parties' annotation results for the corpus set to be annotated, the query statements with similar annotation results can be screened out to generate the annotated corpus set, which can then be added to the set of annotated query statements and take part in model training.
In an exemplary embodiment, as shown in Fig. 9, the above step 350 specifically includes the following.
In step 351, an annotation task for the corpus set to be annotated is distributed to multiple parties; the distribution of the annotation task triggers the parties to execute the annotation task in parallel.
The annotation task may be an intent annotation task, an NER annotation task, a slot annotation task or a word-segmentation annotation task. For example, the multiple parties may be several annotation devices: the server issues the annotation task carrying the corpus set to be annotated to the annotation devices, triggering them to execute the task in parallel. It should be noted that an annotation device may be an intelligent annotation device trained in advance on a large amount of sample data; since each annotation device is trained with a different sample data set, the annotation precision of each device differs.
In one embodiment, the server may issue the annotation task carrying the corpus set to be annotated to the terminal devices of several annotators. An annotator's terminal device can display the corpus set to be annotated and prompt the annotation task. The user can perform intent annotation, NER annotation, slot annotation and word-segmentation annotation by clicking options or by selection gestures; the annotators' terminal devices obtain the annotation results from these clicking or selection operations and thereby complete the annotation task for the corpus set to be annotated.
In an exemplary embodiment, distributing the annotation task and triggering the parties to execute it in parallel specifically includes: the distribution of the annotation task triggers each party to input the corpus set to be annotated into the annotation model configured on that party, and to output its own annotation results for the corpus set to be annotated; the annotation models configured by the parties are obtained by training with different training sample sets.
That is, the multiple parties here may be several annotation devices or several annotation programs. Each party is configured with an annotation model; because the parties' annotation models are trained with different training sample sets, the annotation devices or annotation programs have different annotation precision. It should be noted that in this embodiment the training sample sets used by the parties' annotation models are also different from the training sample sets used by the label prediction models above. For example, all samples can be divided into 10 training sample sets, each of which trains a corresponding model; of the 10 models, some are used as label prediction models and some as annotation models. The label prediction models are used to screen out the query statements on which the label prediction results are inconsistent and thus obtain the corpus set to be annotated; the parties' annotation models are then used to compute the annotation results for the query statements in the corpus set to be annotated, yielding the multiple parties' annotation results for those query statements.
If the multiple parties are several annotation programs deployed in the server, the annotation programs can execute the following step in parallel: input the corpus set to be annotated into a pre-built annotation model and output the annotation results for the corpus set to be annotated. The annotation model can be built in the same way as the label prediction model.
In step 352, the annotation results returned by the parties after executing the annotation task in parallel are received.
The annotation devices or the annotators' terminal devices execute the annotation task in parallel, obtain the annotation results, and return them to the server; the server receives the annotation results returned by the annotation devices or the annotators' terminal devices. Corresponding to the annotation task, the annotation results may be intent annotation results, NER annotation results, slot annotation results or word-segmentation annotation results.
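The dispatch-and-collect pattern of steps 351-352 can be sketched as follows. The patent describes distribution to separate devices, terminals or programs; the thread pool here only illustrates parallel dispatch on a single machine, and the annotation models are placeholder callables that return one label per query.

```python
from concurrent.futures import ThreadPoolExecutor

def collect_annotations(corpus, annotation_models):
    """Trigger each party's annotation model in parallel and gather the returned annotation results."""
    def run_party(model):
        return {query: model(query) for query in corpus}      # one annotation result per query from this party

    with ThreadPoolExecutor(max_workers=len(annotation_models)) as pool:
        results = list(pool.map(run_party, annotation_models))
    return {f"party_{i + 1}": labels for i, labels in enumerate(results)}
```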
In an exemplary embodiment, the corpus set to be annotated includes several buried-point statements with known label information. A buried-point statement is a query statement whose accurate annotation result is already known; to distinguish it from the parties' annotation results for the buried-point statements, the known accurate annotation result of a buried-point statement is called its label information. As shown in Fig. 10, the above step 370 specifically includes the following.
In step 371, according to the parties' annotation results for the buried-point statements, the annotation results for those statements are compared with the corresponding label information, and the accuracy rate of each party's annotation results is calculated.
It should be noted that when screening the corpus set to be annotated according to the parties' annotation results, the annotation accuracy of each annotating party must first be judged, so that the annotation results supplied by low-accuracy parties can be removed.
The accuracy rate of a party's annotation results is the accuracy with which that party annotates the buried-point statements; computing it is used to assess the current party's annotation accuracy. During annotation, "buried points" are used to verify the accuracy of each annotating party. For example, a portion (such as 5%) of the query statements on which the annotators agreed in the data set completed in the previous batch can be extracted as the buried-point statements with known label information for the current batch. For each annotating party, its annotation results for the buried-point statements with known label information are compared with the known label information, the proportion of annotation results consistent with the label information is computed, and the party's annotation accuracy rate is obtained.
In step 372, according to the accuracy rates of the parties' annotation results, the annotation-result sources whose accuracy rate is not up to standard are removed from the multiple sources.
Specifically, a threshold can be set; according to each party's annotation accuracy rate, a party whose accuracy rate is below the threshold can be regarded as providing substandard annotation results, and the annotation results provided by such parties are deleted. Alternatively, all annotating parties can be ranked by annotation accuracy rate from high to low; the parties ranked last can be regarded as parties whose accuracy is not up to standard, and the annotation results they provide are removed.
In step 373, according to the annotation results from the remaining sources, the query statements whose annotation results are similar across the remaining sources are screened out from the corpus set to be annotated.
The annotation results from the remaining sources are the annotation results of the remaining parties for the corpus set to be annotated, after the annotation results provided by the substandard parties have been deleted. That is, when the query statements with similar annotation results are subsequently screened out of the corpus set to be annotated, the annotation results provided by the substandard parties are no longer used; according to the remaining, higher-accuracy parties' annotation results for the corpus set to be annotated, the query statements on which those parties' annotation results are similar are screened out.
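A compact sketch of steps 371-373 follows; the 0.9 accuracy threshold and the data layout (a mapping from each party to its per-query labels) are assumptions made for the illustration.

```python
def accuracy_on_buried_points(party_labels, buried_truth):
    """Fraction of buried-point statements this party labelled consistently with the known label."""
    hits = sum(1 for q, truth in buried_truth.items() if party_labels.get(q) == truth)
    return hits / len(buried_truth)

def screen_with_reliable_parties(all_labels, buried_truth, queries, min_accuracy=0.9):
    """Drop substandard parties, then keep the queries on which the remaining parties agree."""
    reliable = {party: labels for party, labels in all_labels.items()
                if accuracy_on_buried_points(labels, buried_truth) >= min_accuracy}
    screened = {}
    for q in queries:
        votes = [labels[q] for labels in reliable.values() if q in labels]
        if votes and len(set(votes)) == 1:               # remaining parties agree on this query
            screened[q] = votes[0]
    return screened
```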
In an exemplary embodiment, as shown in Fig. 11, the method for generating an annotated corpus set provided by the present invention further includes the following.
In step 1101, according to the parties' annotation results for the same query statement, the query statements whose annotation results are inconsistent are screened out from the corpus set to be annotated.
It should be noted that boundary samples are very useful for model optimization, as they help delineate clearer classification boundaries. Boundary samples can be screened from the samples on which the parties disagree. Specifically, according to the parties' annotation results for the same query statement, the server can screen out from the corpus set to be annotated the query statements on which the parties' annotation results are inconsistent.
In step 1102, the multi-label query statements are obtained from the query statements with inconsistent annotation results, yielding boundary samples used for optimizing the data analysis model.
From the query statements on which the annotators' results are inconsistent, a reviewer can pick out the multi-label query statements (query statements that can carry several annotation results); such multi-label query statements can be regarded as boundary samples. These query statements are hard to recognize, so if the model can accurately identify their intent, slots and so on, the accuracy of the model is greatly improved. The data analysis model may be an intent recognition model, a named entity recognition model, a slot labelling model, a word-segmentation model, etc.; optimizing the data analysis model with such query statements improves the model's recognition accuracy.
For example, for the query statement "I'm feeling down today, please play me a relaxing song", the intent includes both a chit-chat intent and a music-on-demand intent; this query statement is a boundary sample for intent classification and can help the model learn an accurate intent boundary during training.
The following are apparatus embodiments of the present invention, which can be used to execute the embodiments of the method for generating an annotated corpus set executed by the server 110 described above. For details not disclosed in the apparatus embodiments, please refer to the embodiments of the method for generating an annotated corpus set of the present invention.
Fig. 12 is a block diagram of an apparatus for generating an annotated corpus set according to an exemplary embodiment. The apparatus can be used in the server 110 of the implementation environment shown in Fig. 1 and executes all or part of the steps of the method for generating an annotated corpus set shown in any of Fig. 3 and Figs. 7-11. As shown in Fig. 12, the apparatus includes, but is not limited to, a log acquisition module 1210, a corpus acquisition module 1230, a result acquisition module 1250, a statement screening module 1270 and an annotation set generation module 1290.
The log acquisition module 1210 is configured to obtain a query log, the query log including query statements.
The corpus acquisition module 1230 is configured to extract the query statements to be annotated from the query log to obtain a corpus set to be annotated.
The result acquisition module 1250 is configured to obtain annotation results from multiple parties for the query statements in the corpus set to be annotated.
The statement screening module 1270 is configured to screen out from the corpus set to be annotated, according to the parties' annotation results for the same query statement, the query statements whose annotation results are similar.
The annotation set generation module 1290 is configured to generate an annotated corpus set from the query statements with similar annotation results and the corresponding annotation results.
The functions of the modules and how they are implemented are detailed in the implementation of the corresponding steps of the above method for generating an annotated corpus set and are not described again here.
The log acquisition module 1210 may be, for example, a physical structure such as the wired or wireless network interface 250 in Fig. 2.
The corpus acquisition module 1230, result acquisition module 1250, statement screening module 1270 and annotation set generation module 1290 may also be functional modules configured to execute the corresponding steps of the above method for generating an annotated corpus set. It is understood that these modules can be implemented in hardware, software, or a combination of both. When implemented in hardware, these modules can be implemented as one or more hardware modules, such as one or more application-specific integrated circuits. When implemented in software, these modules can be implemented as one or more computer programs executed on one or more processors, such as the programs stored in the memory 232 and executed by the central processing unit 222 of Fig. 2.
In an exemplary embodiment, as shown in Figure 13, the corpus obtaining module 1230 includes:
a sentence removal unit 1231, configured to remove the query statements in the query log that do not satisfy a preset condition;
a label prediction unit 1232, configured to input the query statements remaining in the query log into multiple constructed label prediction models, and to output the label prediction results of the multiple label prediction models for the same query statement, the multiple label prediction models being obtained by training with different training sample sets;
a sentence extraction unit 1233, configured to screen out, from the remaining query statements, the query statements whose label prediction results are inconsistent, according to the label prediction results of the multiple label prediction models for the same query statement, to obtain the corpus to be labeled.
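A hedged sketch of the label prediction unit and the sentence extraction unit follows: several classifiers are trained on different subsamples of previously labeled data, and only the query statements on which their predictions disagree are kept as the corpus to be labeled. scikit-learn and the function names here are purely illustrative; the patent does not name a specific model or library.

```python
import random

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline


def build_label_predictors(labeled_pool, n_models=3, sample_ratio=0.7, seed=0):
    """Train n_models classifiers, each on a different random subsample.

    labeled_pool is assumed to be a list of (query_text, label) pairs.
    """
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        sample = rng.sample(labeled_pool, int(len(labeled_pool) * sample_ratio))
        texts, labels = zip(*sample)
        model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
        model.fit(list(texts), list(labels))
        models.append(model)
    return models


def select_inconsistent_queries(queries, models):
    """Keep only the queries whose predicted labels differ across the models."""
    corpus_to_label = []
    for query in queries:
        predictions = {model.predict([query])[0] for model in models}
        if len(predictions) > 1:  # disagreement -> worth sending to manual labeling
            corpus_to_label.append(query)
    return corpus_to_label
```

Queries on which all models already agree are presumed easy and are left out, so the manual labeling budget is spent on the ambiguous ones.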
In an exemplary embodiment, the sentence removal unit 1231 includes:
a classification removal sub-unit, configured to classify the query statements recorded in the query log using a constructed classifier, and to remove the meaningless query statements identified by the classification.
In an exemplary embodiment, the sentence removal unit 1231 further includes:
a first removal sub-unit, configured to remove, according to a set of labeled query statements, the query statements in the query log that have already been labeled and the query statements that are similar to the labeled query statements.
In an exemplary embodiment, the sentence removal unit 1231 further includes:
a second removal sub-unit, configured to remove, from the query log, the query statements containing only a single entity word, the query statements whose sentence length is greater than a preset number of characters, and the duplicate query statements.
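The three removal sub-units can be pictured with the following illustrative filter. The predicates is_meaningless and is_single_entity are hypothetical stand-ins for whatever classifier and entity recognizer an actual implementation would use, and the 50-character limit and 0.9 similarity threshold are assumed values, not ones given by the patent.

```python
from difflib import SequenceMatcher


def remove_unqualified_queries(queries, labeled_queries, is_meaningless,
                               is_single_entity, max_chars=50, sim_threshold=0.9):
    """Drop query statements that fail the preset conditions described above."""
    seen = set()
    kept = []
    for query in queries:
        # Classification removal sub-unit: drop meaningless queries.
        if is_meaningless(query):
            continue
        # First removal sub-unit: drop already-labeled queries and near-duplicates of them.
        if query in labeled_queries or any(
                SequenceMatcher(None, query, done).ratio() >= sim_threshold
                for done in labeled_queries):
            continue
        # Second removal sub-unit: single-entity-only, over-long, or repeated queries.
        if is_single_entity(query) or len(query) > max_chars or query in seen:
            continue
        seen.add(query)
        kept.append(query)
    return kept
```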
In an exemplary embodiment, as shown in Figure 14, the result obtaining module 1250 includes:
a task dispatch unit 1251, configured to dispatch the labeling task for the corpus to be labeled to the multiple parties, the dispatching of the labeling task triggering the multiple parties to execute the labeling task in parallel;
a result receiving unit 1252, configured to receive the annotation results returned by the multiple parties after executing the labeling task in parallel.
The dispatching of the labeling task triggering the multiple parties to execute the labeling task in parallel includes:
the dispatching of the labeling task triggers each of the multiple parties to input the corpus to be labeled into the labeling model configured by that party and to output its own annotation results for the corpus to be labeled, where the labeling models configured by the multiple parties are obtained by training with different training sample sets.
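The dispatch-and-collect flow of the task dispatch unit 1251 and the result receiving unit 1252 might look like the sketch below, where each party is modeled as an object exposing a name attribute and an annotate() method backed by its own labeling model; this interface is an assumption made for illustration, not an API defined by the patent.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List


def dispatch_labeling_task(corpus: List[str], parties) -> Dict[str, Dict[str, str]]:
    """Send the corpus to every party in parallel and collect their annotation results.

    Returns a mapping of query statement -> {party name: annotation result}.
    """
    def run(party):
        # Each party applies its own labeling model to the whole corpus.
        return party.name, {query: party.annotate(query) for query in corpus}

    results: Dict[str, Dict[str, str]] = {query: {} for query in corpus}
    with ThreadPoolExecutor(max_workers=max(1, len(parties))) as pool:
        for name, labels in pool.map(run, parties):
            for query, label in labels.items():
                results[query][name] = label
    return results
```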
In an exemplary embodiment, the corpus to be labeled includes a plurality of buried-point statements whose label information is known.
As shown in Figure 15, the sentence screening module 1270 includes:
an accuracy computing unit 1271, configured to compare, according to the annotation results given by the multiple parties for the plurality of buried-point statements, whether the annotation results for the plurality of buried-point statements are consistent with the corresponding label information, and to compute the accuracy of each party's annotation results;
a source culling unit 1272, configured to reject, from the multiple sources, the annotation result sources whose accuracy does not reach the standard, according to the accuracy of the annotation results of the multiple parties;
a sentence screening unit 1273, configured to screen out, from the corpus to be labeled, the query statements whose annotation results from the multiple sources are similar, according to the annotation results of the remaining sources.
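Assuming the annotation results are collected per party as in the previous sketch, the three units above could be approximated as follows; the 0.9 accuracy threshold is an illustrative value rather than one prescribed by the patent.

```python
from typing import Dict


def screen_by_probe_accuracy(results: Dict[str, Dict[str, str]],
                             probe_labels: Dict[str, str],
                             min_accuracy: float = 0.9) -> Dict[str, str]:
    """results: query -> {party: label}; probe_labels: buried-point query -> known label."""
    # Accuracy computing unit 1271: score each party on the buried-point statements.
    parties = {party for labels in results.values() for party in labels}
    accuracy = {}
    for party in parties:
        probes = [q for q in probe_labels if party in results.get(q, {})]
        correct = sum(results[q][party] == probe_labels[q] for q in probes)
        accuracy[party] = correct / len(probes) if probes else 0.0

    # Source culling unit 1272: discard sources whose accuracy is below the bar.
    trusted = {party for party, acc in accuracy.items() if acc >= min_accuracy}

    # Sentence screening unit 1273: keep ordinary queries the trusted sources agree on.
    consistent = {}
    for query, labels in results.items():
        if query in probe_labels:
            continue
        votes = {label for party, label in labels.items() if party in trusted}
        if trusted and len(votes) == 1:
            consistent[query] = next(iter(votes))
    return consistent
```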
Optionally, the present invention also provides an electronic device, which may be used in the server 110 of the implementation environment shown in Fig. 1 to execute all or part of the steps of the method for generating a corpus labeling set shown in any of Fig. 3 and Fig. 7 to Fig. 11. The electronic device includes:
a processor; and
a memory for storing processor-executable instructions;
wherein the processor is configured to execute the method for generating a corpus labeling set described in the above exemplary embodiments.
The specific manner in which the processor of the electronic device performs operations in this embodiment has been described in detail in the embodiments of the method for generating a corpus labeling set, and is not elaborated here.
In an exemplary embodiment, a storage medium is further provided. The storage medium is a computer-readable storage medium, for example a temporary or non-transitory computer-readable storage medium that includes instructions. The storage medium stores a computer program, and the computer program can be executed by the central processing unit 222 of the server 200 to complete the above method for generating a corpus labeling set.
It should be understood that the present invention is not limited to the precise structures described above and shown in the accompanying drawings, and that various modifications and changes may be made without departing from its scope. The scope of the present invention is limited only by the appended claims.
Claims (15)
1. A method for generating a corpus labeling set, characterized by comprising:
obtaining a query log, the query log including query statements;
extracting query statements to be labeled from the query log to obtain a corpus to be labeled;
obtaining annotation results given by multiple parties for the query statements in the corpus to be labeled;
screening out, from the corpus to be labeled, query statements whose annotation results are similar, according to the annotation results given by the multiple parties for the same query statement; and
generating a corpus labeling set from the query statements with similar annotation results and the corresponding annotation results.
2. The method according to claim 1, characterized in that the extracting query statements to be labeled from the query log to obtain a corpus to be labeled comprises:
removing the query statements in the query log that do not satisfy a preset condition;
inputting the query statements remaining in the query log into multiple constructed label prediction models, and outputting the label prediction results of the multiple label prediction models for the same query statement, the multiple label prediction models being obtained by training with different training sample sets; and
screening out, from the remaining query statements, the query statements whose label prediction results are inconsistent, according to the label prediction results of the multiple label prediction models for the same query statement, to obtain the corpus to be labeled.
3. The method according to claim 2, characterized in that the removing the query statements in the query log that do not satisfy the preset condition comprises:
classifying the query statements recorded in the query log using a constructed classifier, and removing the meaningless query statements identified by the classification.
4. The method according to claim 2, characterized in that the removing the query statements in the query log that do not satisfy the preset condition comprises:
removing, according to a set of labeled query statements, the query statements in the query log that have already been labeled and the query statements that are similar to the labeled query statements.
5. The method according to claim 2, characterized in that the removing the query statements in the query log that do not satisfy the preset condition comprises:
removing, from the query log, the query statements containing only a single entity word, the query statements whose sentence length is greater than a preset number of characters, and the duplicate query statements.
6. The method according to claim 1, characterized in that the obtaining annotation results given by multiple parties for the query statements in the corpus to be labeled comprises:
dispatching the labeling task for the corpus to be labeled to the multiple parties, the dispatching of the labeling task triggering the multiple parties to execute the labeling task in parallel; and
receiving the annotation results returned by the multiple parties after executing the labeling task in parallel.
7. The method according to claim 6, characterized in that the dispatching of the labeling task triggering the multiple parties to execute the labeling task in parallel comprises:
the dispatching of the labeling task triggering each of the multiple parties to input the corpus to be labeled into the labeling model configured by that party and to output its own annotation results for the corpus to be labeled, wherein the labeling models configured by the multiple parties are obtained by training with different training sample sets.
8. The method according to claim 1, characterized in that the corpus to be labeled includes a plurality of buried-point statements whose label information is known; and the screening out, from the corpus to be labeled, of query statements whose annotation results are similar, according to the annotation results given by the multiple parties for the same query statement, comprises:
comparing, according to the annotation results given by the multiple parties for the plurality of buried-point statements, whether the annotation results for the plurality of buried-point statements are consistent with the corresponding label information, and computing the accuracy of each party's annotation results;
rejecting, from the multiple sources, the annotation result sources whose accuracy does not reach the standard, according to the accuracy of the annotation results of the multiple parties; and
screening out, from the corpus to be labeled, the query statements whose annotation results from the multiple sources are similar, according to the annotation results of the remaining sources.
9. An apparatus for generating a corpus labeling set, characterized by comprising:
a log acquisition module, configured to obtain a query log, the query log including query statements;
a corpus obtaining module, configured to extract query statements to be labeled from the query log to obtain a corpus to be labeled;
a result obtaining module, configured to obtain annotation results given by multiple parties for the query statements in the corpus to be labeled;
a sentence screening module, configured to screen out, from the corpus to be labeled, query statements whose annotation results are similar, according to the annotation results given by the multiple parties for the same query statement; and
a labeling set generation module, configured to generate a corpus labeling set from the query statements with similar annotation results and the corresponding annotation results.
10. The apparatus according to claim 9, characterized in that the corpus obtaining module comprises:
a sentence removal unit, configured to remove the query statements in the query log that do not satisfy a preset condition;
a label prediction unit, configured to input the query statements remaining in the query log into multiple constructed label prediction models and to output the label prediction results of the multiple label prediction models for the same query statement, the multiple label prediction models being obtained by training with different training sample sets; and
a sentence extraction unit, configured to screen out, from the remaining query statements, the query statements whose label prediction results are inconsistent, according to the label prediction results of the multiple label prediction models for the same query statement, to obtain the corpus to be labeled.
11. The apparatus according to claim 10, characterized in that the sentence removal unit comprises:
a classification removal sub-unit, configured to classify the query statements recorded in the query log using a constructed classifier, and to remove the meaningless query statements identified by the classification.
12. The apparatus according to claim 9, characterized in that the result obtaining module comprises:
a task dispatch unit, configured to dispatch the labeling task for the corpus to be labeled to the multiple parties, the dispatching of the labeling task triggering the multiple parties to execute the labeling task in parallel; and
a result receiving unit, configured to receive the annotation results returned by the multiple parties after executing the labeling task in parallel.
13. The apparatus according to claim 9, characterized in that the corpus to be labeled includes a plurality of buried-point statements whose label information is known, and the sentence screening module comprises:
an accuracy computing unit, configured to compare, according to the annotation results given by the multiple parties for the plurality of buried-point statements, whether the annotation results for the plurality of buried-point statements are consistent with the corresponding label information, and to compute the accuracy of each party's annotation results;
a source culling unit, configured to reject, from the multiple sources, the annotation result sources whose accuracy does not reach the standard, according to the accuracy of the annotation results of the multiple parties; and
a sentence screening unit, configured to screen out, from the corpus to be labeled, the query statements whose annotation results from the multiple sources are similar, according to the annotation results of the remaining sources.
14. An electronic device, characterized in that the electronic device comprises:
a processor; and
a memory for storing processor-executable instructions;
wherein the processor is configured to execute the method for generating a corpus labeling set according to any one of claims 1 to 8.
15. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program, and the computer program can be executed by a processor to complete the method for generating a corpus labeling set according to any one of claims 1 to 8.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811048957.8A CN110209764B (en) | 2018-09-10 | 2018-09-10 | Corpus annotation set generation method and device, electronic equipment and storage medium |
PCT/CN2019/100823 WO2020052405A1 (en) | 2018-09-10 | 2019-08-15 | Corpus annotation set generation method and apparatus, electronic device, and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811048957.8A CN110209764B (en) | 2018-09-10 | 2018-09-10 | Corpus annotation set generation method and device, electronic equipment and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110209764A true CN110209764A (en) | 2019-09-06 |
CN110209764B CN110209764B (en) | 2023-04-07 |
Family
ID=67779909
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811048957.8A Active CN110209764B (en) | 2018-09-10 | 2018-09-10 | Corpus annotation set generation method and device, electronic equipment and storage medium |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN110209764B (en) |
WO (1) | WO2020052405A1 (en) |
Cited By (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110675862A (en) * | 2019-09-25 | 2020-01-10 | 招商局金融科技有限公司 | Corpus acquisition method, electronic device and storage medium |
CN110852109A (en) * | 2019-11-11 | 2020-02-28 | 腾讯科技(深圳)有限公司 | Corpus generating method, corpus generating device, and storage medium |
CN111160044A (en) * | 2019-12-31 | 2020-05-15 | 出门问问信息科技有限公司 | Text-to-speech conversion method and device, terminal and computer readable storage medium |
CN111177412A (en) * | 2019-12-30 | 2020-05-19 | 成都信息工程大学 | Public logo bilingual parallel corpus system |
CN111179904A (en) * | 2019-12-31 | 2020-05-19 | 出门问问信息科技有限公司 | Mixed text-to-speech conversion method and device, terminal and computer readable storage medium |
CN111259134A (en) * | 2020-01-19 | 2020-06-09 | 出门问问信息科技有限公司 | Entity identification method, equipment and computer readable storage medium |
CN111785272A (en) * | 2020-06-16 | 2020-10-16 | 杭州云嘉云计算有限公司 | Online labeling method and system |
CN112163424A (en) * | 2020-09-17 | 2021-01-01 | 中国建设银行股份有限公司 | Data labeling method, device, equipment and medium |
CN112925910A (en) * | 2021-02-25 | 2021-06-08 | 中国平安人寿保险股份有限公司 | Method, device and equipment for assisting corpus labeling and computer storage medium |
CN113407713A (en) * | 2020-10-22 | 2021-09-17 | 腾讯科技(深圳)有限公司 | Corpus mining method and apparatus based on active learning and electronic device |
WO2021238337A1 (en) * | 2020-05-29 | 2021-12-02 | 华为技术有限公司 | Method and device for entity tagging |
CN114078470A (en) * | 2020-08-17 | 2022-02-22 | 阿里巴巴集团控股有限公司 | Model processing method and device, and voice recognition method and device |
CN114757267A (en) * | 2022-03-25 | 2022-07-15 | 北京爱奇艺科技有限公司 | Method and device for identifying noise query, electronic equipment and readable storage medium |
Families Citing this family (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113642329B (en) * | 2020-04-27 | 2024-10-29 | 阿里巴巴集团控股有限公司 | Method and device for establishing term identification model, and method and device for term identification |
CN114025216B (en) * | 2020-04-30 | 2023-11-17 | 网易(杭州)网络有限公司 | Media material processing method, device, server and storage medium |
CN111629267B (en) * | 2020-04-30 | 2023-06-09 | 腾讯科技(深圳)有限公司 | Audio labeling method, device, equipment and computer readable storage medium |
CN111611797B (en) * | 2020-05-22 | 2023-09-12 | 云知声智能科技股份有限公司 | Method, device and equipment for marking prediction data based on Albert model |
CN111651988B (en) * | 2020-06-03 | 2023-05-19 | 北京百度网讯科技有限公司 | Method, apparatus, device and storage medium for training model |
CN112052356B (en) * | 2020-08-14 | 2023-11-24 | 腾讯科技(深圳)有限公司 | Multimedia classification method, apparatus and computer readable storage medium |
CN112541070B (en) * | 2020-12-25 | 2024-03-22 | 北京百度网讯科技有限公司 | Mining method and device for slot updating corpus, electronic equipment and storage medium |
CN112700763B (en) * | 2020-12-26 | 2024-04-16 | 中国科学技术大学 | Voice annotation quality evaluation method, device, equipment and storage medium |
CN112686022A (en) * | 2020-12-30 | 2021-04-20 | 平安普惠企业管理有限公司 | Method and device for detecting illegal corpus, computer equipment and storage medium |
CN113255879B (en) * | 2021-01-13 | 2024-05-24 | 深延科技(北京)有限公司 | Deep learning labeling method, system, computer equipment and storage medium |
CN113569546A (en) * | 2021-06-16 | 2021-10-29 | 上海淇玥信息技术有限公司 | Intention labeling method and device and electronic equipment |
CN113722289A (en) * | 2021-08-09 | 2021-11-30 | 杭萧钢构股份有限公司 | Method, device, electronic equipment and medium for constructing data service |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR20020045343A (en) * | 2000-12-08 | 2002-06-19 | 오길록 | Method of information generation and retrieval system based on a standardized Representation format of sentences structures and meanings |
CN105912724B (en) * | 2016-05-10 | 2019-04-26 | 黑龙江工程学院 | A kind of time-based microblogging file extent method towards microblogging retrieval |
- 2018-09-10: CN CN201811048957.8A patent/CN110209764B/en active Active
- 2019-08-15: WO PCT/CN2019/100823 patent/WO2020052405A1/en active Application Filing
Patent Citations (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101105801A (en) * | 2007-04-20 | 2008-01-16 | 清华大学 | Automatic positioning method of network key resource page |
CN102541838A (en) * | 2010-12-24 | 2012-07-04 | 日电(中国)有限公司 | Method and equipment for optimizing emotional classifier |
CN103136210A (en) * | 2011-11-23 | 2013-06-05 | 北京百度网讯科技有限公司 | Method and device for mining query with similar requirements |
CN103530282A (en) * | 2013-10-23 | 2014-01-22 | 北京紫冬锐意语音科技有限公司 | Corpus tagging method and equipment |
CN105389340A (en) * | 2015-10-20 | 2016-03-09 | 北京云知声信息技术有限公司 | Information testing method and device |
CN106021461A (en) * | 2016-05-17 | 2016-10-12 | 深圳市中润四方信息技术有限公司 | Text classification method and text classification system |
CN106202177A (en) * | 2016-06-27 | 2016-12-07 | 腾讯科技(深圳)有限公司 | A kind of file classification method and device |
CN106372132A (en) * | 2016-08-25 | 2017-02-01 | 北京百度网讯科技有限公司 | Artificial intelligence-based query intention prediction method and apparatus |
US20180095989A1 (en) * | 2016-09-30 | 2018-04-05 | Adobe Systems Incorporated | Document replication based on distributional semantics |
US20180196873A1 (en) * | 2017-01-11 | 2018-07-12 | Siemens Medical Solutions Usa, Inc. | Visualization framework based on document representation learning |
CN107256267A (en) * | 2017-06-19 | 2017-10-17 | 北京百度网讯科技有限公司 | Querying method and device |
CN108334496A (en) * | 2018-01-30 | 2018-07-27 | 中国科学院自动化研究所 | Human-computer dialogue understanding method and system and relevant device for specific area |
Non-Patent Citations (1)
Title |
---|
XIAODONG ZHANG ETAL: "《A joint model of intent determination and slot filling for spoken language understanding》" * |
Cited By (18)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110675862A (en) * | 2019-09-25 | 2020-01-10 | 招商局金融科技有限公司 | Corpus acquisition method, electronic device and storage medium |
CN110852109A (en) * | 2019-11-11 | 2020-02-28 | 腾讯科技(深圳)有限公司 | Corpus generating method, corpus generating device, and storage medium |
CN111177412A (en) * | 2019-12-30 | 2020-05-19 | 成都信息工程大学 | Public logo bilingual parallel corpus system |
CN111177412B (en) * | 2019-12-30 | 2023-03-31 | 成都信息工程大学 | Public logo bilingual parallel corpus system |
CN111160044A (en) * | 2019-12-31 | 2020-05-15 | 出门问问信息科技有限公司 | Text-to-speech conversion method and device, terminal and computer readable storage medium |
CN111179904A (en) * | 2019-12-31 | 2020-05-19 | 出门问问信息科技有限公司 | Mixed text-to-speech conversion method and device, terminal and computer readable storage medium |
CN111259134A (en) * | 2020-01-19 | 2020-06-09 | 出门问问信息科技有限公司 | Entity identification method, equipment and computer readable storage medium |
CN111259134B (en) * | 2020-01-19 | 2023-08-08 | 出门问问信息科技有限公司 | Entity identification method, equipment and computer readable storage medium |
CN113743117B (en) * | 2020-05-29 | 2024-04-09 | 华为技术有限公司 | Method and device for entity labeling |
WO2021238337A1 (en) * | 2020-05-29 | 2021-12-02 | 华为技术有限公司 | Method and device for entity tagging |
CN113743117A (en) * | 2020-05-29 | 2021-12-03 | 华为技术有限公司 | Method and device for entity marking |
CN111785272A (en) * | 2020-06-16 | 2020-10-16 | 杭州云嘉云计算有限公司 | Online labeling method and system |
CN114078470A (en) * | 2020-08-17 | 2022-02-22 | 阿里巴巴集团控股有限公司 | Model processing method and device, and voice recognition method and device |
CN112163424A (en) * | 2020-09-17 | 2021-01-01 | 中国建设银行股份有限公司 | Data labeling method, device, equipment and medium |
CN113407713A (en) * | 2020-10-22 | 2021-09-17 | 腾讯科技(深圳)有限公司 | Corpus mining method and apparatus based on active learning and electronic device |
CN113407713B (en) * | 2020-10-22 | 2024-04-05 | 腾讯科技(深圳)有限公司 | Corpus mining method and device based on active learning and electronic equipment |
CN112925910A (en) * | 2021-02-25 | 2021-06-08 | 中国平安人寿保险股份有限公司 | Method, device and equipment for assisting corpus labeling and computer storage medium |
CN114757267A (en) * | 2022-03-25 | 2022-07-15 | 北京爱奇艺科技有限公司 | Method and device for identifying noise query, electronic equipment and readable storage medium |
Also Published As
Publication number | Publication date |
---|---|
WO2020052405A1 (en) | 2020-03-19 |
CN110209764B (en) | 2023-04-07 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110209764A (en) | The generation method and device of corpus labeling collection, electronic equipment, storage medium | |
CN111291570B (en) | Method and device for realizing element identification in judicial documents | |
CN107491432B (en) | Low-quality article identification method and device based on artificial intelligence, equipment and medium | |
CN110263157B (en) | Data risk prediction method, device and equipment | |
CN110968695A (en) | Intelligent labeling method, device and platform based on active learning of weak supervision technology | |
WO2019043379A1 (en) | Fact checking | |
CN114238573A (en) | Information pushing method and device based on text countermeasure sample | |
CN114648392B (en) | Product recommendation method and device based on user portrait, electronic equipment and medium | |
CN104794212A (en) | Context sentiment classification method and system based on user comment text | |
CN104090888A (en) | Method and device for analyzing user behavior data | |
CN109299271A (en) | Training sample generation, text data, public sentiment event category method and relevant device | |
CN109800354B (en) | Resume modification intention identification method and system based on block chain storage | |
Vysotska et al. | The commercial content digest formation and distributional process | |
CN111462752A (en) | Client intention identification method based on attention mechanism, feature embedding and BI-L STM | |
CN106537387B (en) | Retrieval/storage image associated with event | |
CN114547346B (en) | Knowledge graph construction method and device, electronic equipment and storage medium | |
CN102402717A (en) | Data analysis facility and method | |
CN117474507A (en) | Intelligent recruitment matching method and system based on big data application technology | |
CN116976321A (en) | Text processing method, apparatus, computer device, storage medium, and program product | |
CN114372532A (en) | Method, device, equipment, medium and product for determining label marking quality | |
CN109933784B (en) | Text recognition method and device | |
CN116362534A (en) | Emergency management method and system for violations and risks of online customer service contents in railway field | |
CN115221323A (en) | Cold start processing method, device, equipment and medium based on intention recognition model | |
CN117933260A (en) | Text quality analysis method, device, equipment and storage medium | |
CN113407718A (en) | Method and device for generating question bank, computer readable storage medium and processor |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||