CN110209764A - Method and apparatus for generating an annotated corpus set, electronic device, and storage medium - Google Patents
Method and apparatus for generating an annotated corpus set, electronic device, and storage medium
- Publication number
- CN110209764A (application CN201811048957.8A / CN201811048957A)
- Authority
- CN
- China
- Prior art keywords
- query statement
- annotation result
- corpus
- annotated
- annotation
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3344—Query execution using natural language analysis
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/26—Speech to text systems
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Theoretical Computer Science (AREA)
- Acoustics & Sound (AREA)
- Human Computer Interaction (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Multimedia (AREA)
- Artificial Intelligence (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Health & Medical Sciences (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The present invention discloses a method and apparatus for generating an annotated corpus set, an electronic device, and a computer-readable storage medium. In the technical solution provided by the invention, a corpus set to be annotated is obtained from a query log, annotation results for the query statements in that corpus set are obtained from multiple parties, the query statements whose annotation results are similar are screened out, and an annotated corpus set is then formed from these query statements and their corresponding annotation results. Because the query statements in the annotated corpus set are those on which the parties' annotation results agree, the annotation results in the set are unlikely to be disputed and are therefore more accurate; using this more accurate annotated corpus set as a training set for data analysis models such as an intent recognition model improves the accuracy of those models.
Description
Technical field
The present invention relates to the field of computer technology, and in particular to a method and apparatus for generating an annotated corpus set, an electronic device, and a computer-readable storage medium.
Background technique
In the field of voice interaction, query statements entered by users are analyzed online by various data analysis models to identify user intent and provide accurate replies. A data analysis model is trained on a large number of annotated query statements (the training set). The accuracy of the annotation results for the query statements in the training set therefore directly affects the accuracy of the data analysis model and determines how intelligent the voice interaction function can be.
At present, query statements are mainly annotated manually by annotators, for example by labelling each query statement's intent (chit-chat intent, music-on-demand intent, weather-lookup intent, and so on). The annotation accuracy of a query statement is therefore determined by the annotator's own understanding.
Because an annotator's understanding may differ from that of ordinary users, or the annotator may misread a particular query statement, the query statements in the training set are easily annotated inaccurately. The resulting data analysis model then has a large error and cannot provide accurate replies to users.
Summary of the invention
To solve the problem in the related art that the annotation results of query statements in the training set are inaccurate because of deviations in annotators' understanding, the present invention provides a method for generating an annotated corpus set.
In one aspect, the present invention provides a method for generating an annotated corpus set, comprising:
obtaining a query log, the query log including query statements;
extracting query statements to be annotated from the query log to obtain a corpus set to be annotated;
obtaining annotation results from multiple parties for the query statements in the corpus set to be annotated;
screening, from the corpus set to be annotated and according to the parties' annotation results for the same query statement, the query statements whose annotation results are similar; and
generating an annotated corpus set from the query statements with similar annotation results and the corresponding annotation results.
In another aspect, the present invention provides an apparatus for generating an annotated corpus set, comprising:
a log acquisition module, configured to obtain a query log, the query log including query statements;
a corpus acquisition module, configured to extract query statements to be annotated from the query log to obtain a corpus set to be annotated;
a result acquisition module, configured to obtain annotation results from multiple parties for the query statements in the corpus set to be annotated;
a statement screening module, configured to screen, from the corpus set to be annotated and according to the parties' annotation results for the same query statement, the query statements whose annotation results are similar; and
an annotation set generation module, configured to generate an annotated corpus set from the query statements with similar annotation results and the corresponding annotation results.
Further, the present invention provides an electronic device, comprising:
a processor; and
a memory for storing instructions executable by the processor;
wherein the processor is configured to perform the above method for generating an annotated corpus set.
Further, the present invention provides a computer-readable storage medium storing a computer program which, when executed by a processor, performs the above method for generating an annotated corpus set.
The technical solution provided by the embodiments of the present invention can bring the following benefits.
A corpus set to be annotated is obtained from a query log, annotation results from multiple parties for the query statements in that corpus set are obtained, the query statements with identical annotation results are screened out, and an annotated corpus set is then formed from these query statements and their corresponding annotation results. Because the query statements in the annotated corpus set are those on which the parties' annotation results agree, the annotation results in the set are unlikely to be disputed and are therefore more accurate; using this more accurate annotated corpus set as a training set for data analysis models such as an intent recognition model improves the accuracy of those models.
It should be understood that the above general description and the following detailed description are merely exemplary and do not limit the invention.
Brief description of the drawings
The drawings herein are incorporated into and form part of this specification; they show embodiments consistent with the invention and, together with the specification, serve to explain the principles of the invention.
Fig. 1 is a schematic diagram of an implementation environment according to the present invention;
Fig. 2 is a block diagram of a server according to an exemplary embodiment;
Fig. 3 is a flowchart of a method for generating an annotated corpus set according to an exemplary embodiment;
Fig. 4 is a schematic diagram of the annotation results of several annotation tasks;
Fig. 5 is a schematic diagram of the principle for dividing the annotation sets;
Fig. 6 is a schematic curve showing the influence of each batch of the annotated corpus set on model performance;
Fig. 7 is a detailed flowchart of step 330 in the embodiment corresponding to Fig. 3;
Fig. 8 is a schematic diagram of the principle for generating an annotated corpus set according to an exemplary embodiment;
Fig. 9 is a detailed flowchart of step 350 in the embodiment corresponding to Fig. 3;
Fig. 10 is a detailed flowchart of step 370 in the embodiment corresponding to Fig. 3;
Fig. 11 is a flowchart of a method for generating an annotated corpus set based on the embodiment corresponding to Fig. 3;
Fig. 12 is a block diagram of an apparatus for generating an annotated corpus set according to an exemplary embodiment;
Fig. 13 is a detailed block diagram of the corpus acquisition module in the embodiment corresponding to Fig. 12;
Fig. 14 is a detailed block diagram of the result acquisition module in the embodiment corresponding to Fig. 12;
Fig. 15 is a detailed block diagram of the statement screening module in the embodiment corresponding to Fig. 12.
Specific embodiments
Exemplary embodiments are described in detail here, with examples illustrated in the accompanying drawings. Unless otherwise indicated, the same numerals in different drawings denote the same or similar elements when the following description refers to the drawings. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present invention; rather, they are merely examples of apparatuses and methods, consistent with some aspects of the invention, as detailed in the appended claims.
Fig. 1 is a schematic diagram of an implementation environment according to an exemplary embodiment of the present invention. The implementation environment involved in the present invention includes a server 110. A query log is stored in the server 110, so that the server 110 can use the method for generating an annotated corpus set provided by the present invention to generate an annotated corpus set from the query log, improving the accuracy of the annotation results of the query statements in the annotated corpus set.
As needed, the implementation environment may also include the source that provides the data, i.e. the query log. Specifically, in this implementation environment the data source may be an intelligent terminal 130. The server 110 can obtain the query log uploaded by the intelligent terminal 130 and then generate the annotated corpus set using the method provided by the present invention. The intelligent terminal 130 may be a smartphone, a smart speaker, or a tablet computer.
It should be noted that the method for generating an annotated corpus set of the present invention is not limited to processing logic deployed in the server 110; it may also be processing logic deployed in other machines, for example, processing logic for generating an annotated corpus set deployed in a terminal device with computing capability.
Referring to Fig. 2, Fig. 2 is a schematic diagram of a server structure provided by an embodiment of the present invention. The server 200 may vary considerably depending on configuration or performance, and may include one or more central processing units (CPUs) 222 (for example, one or more processors), a memory 232, and one or more storage media 230 (such as one or more mass storage devices) storing application programs 242 or data 244. The memory 232 and the storage medium 230 may be transient or persistent storage. The programs stored in the storage medium 230 may include one or more modules (not shown), and each module may include a series of instruction operations on the server 200. Further, the central processing unit 222 may be configured to communicate with the storage medium 230 and to execute, on the server 200, the series of instruction operations in the storage medium 230. The server 200 may also include one or more power supplies 226, one or more wired or wireless network interfaces 250, one or more input/output interfaces 258, and/or one or more operating systems 241, such as Windows Server(TM), Mac OS X(TM), Unix(TM), Linux(TM), FreeBSD(TM), and so on. The steps performed by the server in the embodiments shown in Fig. 3 and Figs. 7-11 below may be based on the server structure shown in Fig. 2.
Those of ordinary skill in the art will appreciate that all or part of the steps of the following embodiments may be implemented by hardware, or by a program instructing relevant hardware; the program may be stored in a computer-readable storage medium, and the storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disc, or the like.
Fig. 3 is a flowchart of a method for generating an annotated corpus set according to an exemplary embodiment. The method is applicable to, and executed by, a server, which may be the server 110 of the implementation environment shown in Fig. 1. As shown in Fig. 3, the method for generating an annotated corpus set may be executed by the server 110 and may include the following steps.
In step 310, a query log is obtained.
A query log is the record kept by an appliance of the query statements entered by users; the appliance may be a smart speaker, a mobile terminal, and so on. The query log may include the time, the query statement entered by the user, the query result returned to the user, etc. The query statement entered by the user may be in text or speech form. A query log may contain a large number of query statements entered by one or more users, so it can be regarded as a raw corpus of query statements. A raw corpus refers to query statements from real users in their original form, without any manual annotation.
In step 330, query statements to be annotated are extracted from the query log to obtain a corpus set to be annotated.
It should be noted that although the query log contains a large number of query statements, not all of them are useful: some are meaningless, unrepresentative inputs typed arbitrarily by users, some are too long or too short, and many are duplicates. If the annotation results of these query statements were put into the annotated corpus set, they would reduce the accuracy of the annotation results in the set and, in turn, the accuracy of any data analysis model trained with the annotated corpus set as training samples.
The present invention therefore extracts the query statements to be annotated from the query log according to a pre-configured strategy, and these query statements form the corpus set to be annotated. Extracting the query statements to be annotated may consist of analyzing the query log and, according to a configured table of useless/stop characters, removing query statements that contain useless/stop characters, removing meaningless query statements (such as a few unconnected characters entered arbitrarily), removing query statements that are too long or too short, removing duplicate query statements, and removing query statements that have already been annotated; the remaining query statements are the query statements to be annotated. A minimal sketch of this rule-based filtering is given below.
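The following Python snippet is a minimal sketch of this kind of pre-configured filtering strategy; the character table, the length bounds, and the helper names are illustrative assumptions rather than values taken from the patent.

```python
STOP_CHARS = set("~@#$%^&*")        # assumed useless/stop character table
MIN_LEN, MAX_LEN = 2, 50            # assumed bounds for "too short" / "too long"

def extract_candidates(query_log, already_annotated):
    """Filter a raw query log down to candidate statements for annotation."""
    seen, candidates = set(), []
    for query in query_log:
        q = query.strip()
        if not q or q in seen or q in already_annotated:
            continue                                   # drop duplicates and already-annotated queries
        if any(ch in STOP_CHARS for ch in q):
            continue                                   # drop queries containing useless/stop characters
        if not (MIN_LEN <= len(q) <= MAX_LEN):
            continue                                   # drop queries that are too short or too long
        seen.add(q)
        candidates.append(q)
    return candidates
```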
In step 350, annotation results from multiple parties for the query statements in the corpus set to be annotated are obtained.
The multiple parties may be several annotators, several annotation devices, or several annotation programs on one device; the term indicates that the annotation results for the query statements in the corpus set to be annotated have multiple sources. For ease of description, an annotator, annotation device or annotation program is hereinafter referred to as an annotating party. Each annotating party can annotate (in effect, "vote" on) the query statements in the corpus set to be annotated. Annotating means adding a classification label to a query statement in the corpus set to be annotated; the results of multiple "votes" then reflect the correct classification of the query statement.
An annotation result is the classification label added to a query statement by an annotating party. Depending on the annotation task, an annotation result may be an intent annotation result, an NER (Named Entity Recognition) annotation result, a slot annotation result, or a word-segmentation annotation result. An intent annotation result is an intent classification result: for example, for "I'm feeling down today", an annotating party's intent annotation result is "chit-chat intent"; for "please play me a relaxing song", the intent annotation result is "music-on-demand intent".
An NER annotation result marks the person names, place names, organization names, proper nouns, and so on in a query statement. A slot annotation result adds a slot label to each phrase in a query statement; in the weather domain, for example, the slot labels include time words, place words, weather keywords, weather-phenomenon words, interrogatives, etc. A word-segmentation annotation result splits a query statement into several phrases; the phrases together can be regarded as the segmentation annotation result, each phrase being one label.
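For concreteness, one possible record shape for a fully annotated query statement is sketched below; the concrete query text and label strings are illustrative, not taken from the patent, although the weather-domain slot labels follow the categories listed above.

```python
# one query statement with the four kinds of annotation results described above
annotated_example = {
    "query": "what is the weather in Beijing tomorrow",
    "intent": "weather_lookup",                                    # intent annotation result
    "ner": {"Beijing": "place_name"},                              # NER annotation result
    "slots": {"tomorrow": "time_word", "Beijing": "place_word",    # slot annotation result
              "weather": "weather_keyword", "what": "interrogative"},
    "segmentation": ["what", "is", "the", "weather", "in", "Beijing", "tomorrow"],
}
```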
As shown in Fig. 4, for the corpus set to be annotated, an annotating party can perform intent annotation, NER annotation, slot annotation or word-segmentation annotation, obtaining the annotation results of each annotation task. Specifically, each party may first perform intent annotation on the query statements in the corpus set to be annotated (according to an intent annotation guideline document), obtaining an intent annotation set containing the intent annotation results of the query statements. The query statements can then be divided into domains according to the intent annotation, and within each domain NER annotation (according to an NER annotation guideline document) and slot annotation (according to a slot annotation guideline document) are performed at the same time, yielding an NER annotation set containing the NER annotation results and a slot annotation set containing the slot annotation results. While performing intent annotation, each annotating party can also perform word-segmentation annotation on the corpus set to be annotated, obtaining a segmentation annotation set containing the segmentation annotation results.
The intent annotation set, slot annotation set, NER annotation set or segmentation annotation set can be stored in the storage medium of the server, and the server can obtain from the storage medium the multiple parties' annotation results for the query statements in the corpus set to be annotated.
In step 370, according to the multiple parties' annotation results for the same query statement, the query statements whose annotation results are similar are screened out from the corpus set to be annotated.
A query statement with similar annotation results is one for which the parties' annotation results are identical or similar; if the similarity of the parties' annotation results exceeds a preset value, for example 80% or 90%, the query statement can be regarded as having similar annotation results.
In one embodiment, assume the annotation results are intent annotation results. The server obtains the parties' intent annotation results for the query statements in the corpus set to be annotated and, for each query statement in turn, compares the parties' intent annotation results for that statement and judges whether they are consistent (a similarity above the preset value can be regarded as consistent), thereby screening out from the corpus set to be annotated the query statements on which the parties' annotation results agree.
Specifically, if the parties' annotation results for a query statement are all consistent, the statement is added to a single-label annotation set. If they are inconsistent, a reviewer is needed to examine the specific situation:
i) if more than half of the annotating parties agree, the label they agree on is taken as the annotation result and the statement is added to the single-label annotation set;
ii) if the inconsistent results are split 1:1, the statement may be a multi-label case (one statement may carry several labels); a reviewer confirms whether it is a multi-label case, and if so the statement is added to a multi-label annotation set;
iii) if the parties' annotation results are all different, the statement may be a multi-label sample or a hard sample; after review it is added to the multi-label annotation set or to a hard-sample set.
In this way, one annotation task finally yields three annotation sets: a single-label annotation set, a multi-label annotation set and a hard-sample set. The query statements in the single-label annotation set can be considered to have identical annotation results across the parties. The single-label annotation set can be regarded as a reliable annotation set and can be used as a training set, test set, etc. for an intent recognition model; a sketch of this vote-based routing is given below.
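The Python sketch below illustrates this vote-based routing under the assumption that each party contributes exactly one label per query statement; the set names and the example queries are illustrative, not taken from the patent.

```python
from collections import Counter

def route_by_votes(labels):
    """Decide which annotation set a query belongs to from its per-party labels (votes)."""
    counts = Counter(labels)
    top_label, top_count = counts.most_common(1)[0]
    if top_count == len(labels) or top_count > len(labels) / 2:
        return "single_label", top_label           # full or majority agreement: keep the majority label
    if top_count == len(labels) / 2:
        return "multi_label_review", None           # 1:1 split: reviewer checks for a multi-label case
    return "hard_or_multi_label_review", None       # no agreement: reviewer decides the set

single_label_set, review_queue = {}, []
votes = {
    "please play me a relaxing song": ["music_on_demand"] * 4,
    "I'm feeling down today": ["chitchat", "chitchat", "music_on_demand", "music_on_demand"],
}
for query, labels in votes.items():
    bucket, label = route_by_votes(labels)
    if bucket == "single_label":
        single_label_set[query] = label
    else:
        review_queue.append((query, bucket))
```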
Similarly, when the annotation results are NER annotation results, slot annotation results or word-segmentation annotation results, the query statements on which the parties' annotation results are identical can be screened out in the same way.
In step 390, an annotated corpus set is generated from the query statements with similar annotation results and the corresponding annotation results.
The annotated corpus set includes query statements and their corresponding annotation results, where the query statements are those screened out in step 370 as having similar annotation results across the parties. Using these query statements and their annotation results, the server generates the annotated corpus set consisting of the query statements and the annotation results.
As shown in Fig. 5, the query statements in the corpus set to be annotated are annotated by multiple annotating parties. The server obtains annotation result 1 from party 1, annotation result 2 from party 2, annotation result 3 from party 3, and annotation result 4 from party 4, each for the query statements in the corpus set to be annotated. The server merges annotation results 1, 2, 3 and 4, screens out the query statements for which the four annotation results are consistent, and adds them to the single-label annotation set. If more than half of the annotation results for a query statement are consistent, the parties' annotation results can also be regarded as agreeing; the result agreed on by the majority is taken as the annotation result of the query statement, and the statement is added to the single-label annotation set. The single-label annotation set is used as the annotated corpus set and can be merged with the corpus already annotated to serve as a training set or test set.
As shown in Fig. 5, assume that the annotation results 1, 2, 3 and 4 of some query statements are inconsistent and split 1:1; these statements are probably multi-label cases, a reviewer confirms this, and the statements are added to the multi-label annotation set. Assume that the annotation results 1, 2, 3 and 4 of some query statements are all different; they may be multi-label samples or hard samples, and these statements are added to the multi-label annotation set or the hard-sample annotation set.
It should be understood that the annotated corpus set is the single-label annotation set, and the query statements it contains are those on which the parties' annotation results are identical. In other words, there is no disagreement about the annotation results of the query statements in the annotated corpus set and the annotation results are highly accurate, so the annotated corpus set can be used as a high-accuracy training set or test set for training a data analysis model.
For example, if the annotation results are intent annotation results, the annotated corpus set contains query statements with identical intent annotation results and their corresponding intent annotation results, and can be used as a training set for training an intent recognition model. If the annotation results are NER annotation results, the annotated corpus set contains query statements with identical NER annotation results and their corresponding NER annotation results, and can be used as a training set for training a named entity recognition model. Similarly, if the annotation results are slot annotation results, the annotated corpus set can be used as a training set for training a slot labelling model; if they are word-segmentation annotation results, it can be used as a training set for training a word-segmentation model.
In the technical solution provided by the above exemplary embodiments of the present invention, a corpus set to be annotated is obtained from a query log, annotation results from multiple parties for the query statements in that corpus set are obtained, the query statements with identical annotation results are screened out, and an annotated corpus set is formed from these query statements and their corresponding annotation results. Because the query statements in the annotated corpus set are those on which the parties' annotation results agree, the annotation results in the set are unlikely to be disputed and are more accurate; using this more accurate annotated corpus set as a training set to train data analysis models such as an intent recognition model improves the accuracy of those models.
As needed, the annotated corpus set can be added to the existing training set incrementally, on top of it, and the data analysis model retrained; the performance of the model is tested with the same test set to assess the performance gain brought by the newly added annotated corpus set, which reflects the quality and value of the new annotated corpus set.
Take an intent annotation corpus set as an example. The performance of the recognition model on the test set is first taken as the benchmark. Each newly obtained batch of the annotated corpus set is then added to the training set, and the performance of the model trained after each batch is added is recorded. The curve in Fig. 6 records the performance of the model trained after each batch of the annotated corpus set is added; the sixth batch (s6) brings an obvious performance gain to the trained model, so that batch of the annotated corpus set can be selected for inclusion in the training data.
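A minimal sketch of this batch-wise assessment loop follows; `train_model` and `evaluate` are placeholders for whatever training routine and test-set metric are actually used, and the gain bookkeeping is only an assumption about how a curve like the one in Fig. 6 might be produced.

```python
def assess_batches(base_train_set, batches, test_set, train_model, evaluate):
    """Add annotated batches one at a time and record the test-set metric after each addition."""
    train_set = list(base_train_set)
    prev = evaluate(train_model(train_set), test_set)        # benchmark on the fixed test set
    history = [("baseline", prev, 0.0)]
    for i, batch in enumerate(batches, start=1):
        train_set.extend(batch)                              # superimpose the new batch on the training set
        metric = evaluate(train_model(train_set), test_set)
        history.append((f"s{i}", metric, metric - prev))     # gain attributable to this batch
        prev = metric
    return history
```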
In addition, if the annotating parties are human annotators, a staggered annotation schedule can be used in each cycle to prevent annotators from copying each other's results during annotation, as shown in Table 1 below.
Table 1 Task schedule for staggered annotation
|             | Day 1 | Day 2 | Day 3 | Day 4 | Day 5 |
| Annotator 1 | Doc 1 | Doc 5 | Doc 4 | Doc 3 | Doc 2 |
| Annotator 2 | Doc 2 | Doc 1 | Doc 5 | Doc 4 | Doc 3 |
| Annotator 3 | Doc 3 | Doc 2 | Doc 1 | Doc 5 | Doc 4 |
| Annotator 4 | Doc 4 | Doc 3 | Doc 2 | Doc 1 | Doc 5 |
Taking annotation by four annotators as an example, to prevent the annotators from referring to each other's annotation results, no two annotators work on the same content on the same day. Annotation can be arranged according to the schedule in Table 1, with five days as one cycle; the results are then collected and the consistent and inconsistent annotation results among the annotators are counted.
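The rotation in Table 1 can be generated mechanically; the sketch below reproduces it under the assumption that each annotator starts at their own document and steps backwards through the document list by one position per day (the names and document labels are illustrative).

```python
def staggered_schedule(annotators, documents):
    """Build a rotation in which no two annotators work on the same document on the same day."""
    days = len(documents)
    return {
        annotator: [documents[(idx - day) % days] for day in range(days)]
        for idx, annotator in enumerate(annotators)   # annotator k starts at document k, then steps back daily
    }

plan = staggered_schedule(
    ["Annotator 1", "Annotator 2", "Annotator 3", "Annotator 4"],
    ["Doc 1", "Doc 2", "Doc 3", "Doc 4", "Doc 5"],
)
# plan["Annotator 1"] == ["Doc 1", "Doc 5", "Doc 4", "Doc 3", "Doc 2"], matching the first row of Table 1
```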
In an exemplary embodiment, as shown in Fig. 7, the above step 330 specifically includes the following.
In step 331, the query statements in the query log that do not satisfy a preset condition are removed.
Query statements that do not satisfy the preset condition may include one or more of the following: query statements containing useless/stop characters, meaningless query statements, query statements that are too long or too short, duplicate query statements, and so on. Removing them avoids annotating worthless query statements later, which would both increase the workload and reduce the accuracy of the annotated corpus set.
In step 332, the query statements remaining in the query log are input into several pre-built label prediction models, and the label prediction results of the models for the same query statement are output; the label prediction models are obtained by training with different training sample sets.
Specifically, a label prediction model may be an intent recognition model used to identify the intent of a query statement, in which case the label prediction result is an intent recognition result. A label prediction model can be trained with a large number of query statements whose intent annotation results are known (i.e. a training sample set), and several label prediction models can be obtained by training with different training samples. For example, all the query statements with known intent annotation results can be divided into four batches, each batch trains a corresponding intent recognition model, and four intent recognition models are thus obtained.
After the unqualified query statements described above are removed, the query statements remaining in the query log are input into the four intent recognition models respectively, and the four models' intent recognition results for the same query statement are output.
It should be noted that, depending on the annotation task, a label prediction model may also be a named entity recognition model, a slot labelling model or a word-segmentation model; these models can be trained with a large number of query statements with known NER annotation results, known slot annotation results or known word-segmentation annotation results, respectively. Correspondingly, the label prediction result may be a named entity recognition result, a slot annotation result or a word-segmentation result. How a label prediction model is built belongs to the prior art and is not described here.
In step 333, according to the label prediction results of the label prediction models for the same query statement, the query statements on which the label prediction results are inconsistent are screened out of the remaining query statements to obtain the corpus set to be annotated.
Boundary samples are significant for the decision boundary of the trained model: if more samples are found whose intent has a certain probability distribution across different classes, adding such samples to the training set for model training improves the model's performance more than adding samples that can already be classified accurately.
The present invention therefore screens out, from the query statements remaining in the query log, the query statements on which the label prediction models disagree. In other words, the models' recognition accuracy on these query statements is low, so they are regarded as boundary samples; adding these boundary samples to the corpus set to be annotated and using them for model training improves the accuracy of the model.
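The disagreement filter of step 333 can be sketched as follows, assuming each label prediction model is available as a callable that returns a single label for a query; the callables themselves are placeholders.

```python
def select_disputed_queries(queries, predictors):
    """Keep only the queries on which the pre-built label prediction models disagree."""
    disputed = []
    for query in queries:
        predictions = {predict(query) for predict in predictors}   # one predicted label per model
        if len(predictions) > 1:                                   # models disagree: likely a boundary sample
            disputed.append(query)
    return disputed

# corpus_to_annotate = select_disputed_queries(remaining_queries, [model_1, model_2, model_3, model_4])
```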
In an exemplary embodiment, the above step 331 may include the following step:
classifying the query statements recorded in the query log with a pre-built classifier, and removing the query statements classified as meaningless.
A meaningless query statement is one without any specific intent, probably entered by the user by mistake or at random. A classifier is a classification model; its role here is to decide, for each query statement in the query log, whether the statement is meaningful or meaningless. The classifier can be trained with a large number of meaningful query statements and meaningless query statements; for example, the parameters of a logistic regression model can be trained with such data to obtain the classifier. "Classifier" is the general term in data mining for methods that classify samples; classifiers can be built with decision trees, logistic regression, naive Bayes, neural networks and other algorithms.
Specifically, the query statements in the query log are input into the trained classifier, which outputs a meaningful/meaningless judgment for each, and the meaningless query statements are then removed from the query log. Optionally, the query statements containing useless or stop characters can also be removed from the query log according to a configured table of useless or stop characters.
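As an illustration only (the patent does not prescribe a particular toolkit or feature set), a meaningful/meaningless classifier of this kind could be sketched with scikit-learn's logistic regression over character n-grams; the training examples here are toy data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# toy labelled data: 1 = meaningful query statement, 0 = meaningless input
texts = ["play a relaxing song", "what is the weather tomorrow", "asdf qwer", "kkkkk"]
labels = [1, 1, 0, 0]

clf = make_pipeline(
    TfidfVectorizer(analyzer="char", ngram_range=(1, 3)),   # character n-gram features
    LogisticRegression(),
)
clf.fit(texts, labels)

queries = ["play some light music", "zzzz qq"]
meaningful = [q for q, y in zip(queries, clf.predict(queries)) if y == 1]   # keep predicted-meaningful queries
```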
In another exemplary embodiment, the above step 331 may also include the following step:
removing from the query log, according to the set of query statements that have already been annotated, the annotated query statements and the query statements similar to them.
The set of annotated query statements is the set of query statements whose annotation results are known; it may be an annotated corpus set generated previously. According to the query statements contained in that set, the query statements belonging to the set can be removed from the query statements of the query log. An annotated query statement is a query statement in that set; query statements similar to an annotated query statement can be found by computing the similarity between query statements, so that the query statements in the query log with a high similarity to an annotated query statement are removed.
In other words, the query statements remaining in step 332 may be the query statements left in the query log after removing the meaningless query statements, the query statements containing useless or stop characters, the annotated query statements, and the query statements similar to annotated query statements.
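One way to realize this similarity-based removal is sketched below; the 0.9 cut-off and the use of a character-level sequence ratio are assumptions, not values given in the patent.

```python
from difflib import SequenceMatcher

def drop_annotated_and_near_duplicates(queries, annotated_queries, threshold=0.9):
    """Remove queries that are already annotated or highly similar to an annotated query."""
    annotated = list(set(annotated_queries))
    kept = []
    for q in queries:
        if q in annotated:
            continue                                    # exact match with an annotated query
        if any(SequenceMatcher(None, q, a).ratio() >= threshold for a in annotated):
            continue                                    # too similar to an annotated query
        kept.append(q)
    return kept
```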
In another exemplary embodiment, the above step 331 may also include the following step:
removing from the query log the query statements that contain only a single entity word, the query statements whose length exceeds a preset number of characters, and the duplicate query statements.
An entity word is the name of a real, specific thing, for example a song title or a singer's name. A query statement containing only one entity word is hard to assign an intent to, to segment, and so on, so it is not suitable for inclusion in the annotated corpus set for modelling. A query statement longer than the preset number of characters is a long query statement; such statements are difficult to annotate and, because of their length, increase the amount of computation when used for modelling, so they too are unsuitable for the annotated corpus set. Likewise, duplicate query statements in the query log do not need to be added to the annotated corpus set, so duplicates are removed; for example, if a query statement appears three times, two copies can be removed and only one kept.
In summary, the query statements remaining in step 332 may also be the query statements finally left in the query log after removing the query statements containing only a single entity word, the query statements longer than the preset number of characters, and the duplicate query statements.
As shown in Fig. 8, newly added query statements are pre-processed: meaningless query statements are removed, useless/stop characters are removed, single-entity-word query statements are removed, over-long and duplicate query statements are removed, and, according to the set of annotated query statements, the query statements that have been annotated and those highly similar to the annotated set are removed. Then, through steps 332 and 333 above, the query statements on which the label prediction results are inconsistent are screened out, and the screened query statements form the corpus set to be annotated. Next, according to the multiple parties' annotation results for the corpus set to be annotated, the query statements with similar annotation results can be screened out to generate the annotated corpus set, which can then be added to the set of annotated query statements and take part in model training.
In an exemplary embodiment, as shown in Fig. 9, the above step 350 specifically includes the following.
In step 351, an annotation task for the corpus set to be annotated is distributed to multiple parties; the distribution of the annotation task triggers the parties to execute the annotation task in parallel.
The annotation task may be an intent annotation task, an NER annotation task, a slot annotation task or a word-segmentation annotation task. For example, the multiple parties may be several annotation devices: the server issues the annotation task carrying the corpus set to be annotated to the annotation devices, triggering them to execute the task in parallel. It should be noted that an annotation device may be an intelligent annotation device trained in advance on a large amount of sample data; since each annotation device is trained with a different sample data set, the annotation precision of each device differs.
In one embodiment, the server may issue the annotation task carrying the corpus set to be annotated to the terminal devices of several annotators. An annotator's terminal device can display the corpus set to be annotated and prompt the annotation task. The user can perform intent annotation, NER annotation, slot annotation and word-segmentation annotation by clicking options or by selection gestures; the annotators' terminal devices obtain the annotation results from these clicking or selection operations and thereby complete the annotation task for the corpus set to be annotated.
In an exemplary embodiment, distributing the annotation task and triggering the parties to execute it in parallel specifically includes: the distribution of the annotation task triggers each party to input the corpus set to be annotated into the annotation model configured on that party, and to output its own annotation results for the corpus set to be annotated; the annotation models configured by the parties are obtained by training with different training sample sets.
That is, the multiple parties here may be several annotation devices or several annotation programs. Each party is configured with an annotation model; because the parties' annotation models are trained with different training sample sets, the annotation devices or annotation programs have different annotation precision. It should be noted that in this embodiment the training sample sets used by the parties' annotation models are also different from the training sample sets used by the label prediction models above. For example, all samples can be divided into 10 training sample sets, each of which trains a corresponding model; of the 10 models, some are used as label prediction models and some as annotation models. The label prediction models are used to screen out the query statements on which the label prediction results are inconsistent and thus obtain the corpus set to be annotated; the parties' annotation models are then used to compute the annotation results for the query statements in the corpus set to be annotated, yielding the multiple parties' annotation results for those query statements.
If the multiple parties are several annotation programs deployed in the server, the annotation programs can execute the following step in parallel: input the corpus set to be annotated into a pre-built annotation model and output the annotation results for the corpus set to be annotated. The annotation model can be built in the same way as the label prediction model.
In step 352, the annotation results returned by the parties after executing the annotation task in parallel are received.
The annotation devices or the annotators' terminal devices execute the annotation task in parallel, obtain the annotation results, and return them to the server; the server receives the annotation results returned by the annotation devices or the annotators' terminal devices. Corresponding to the annotation task, the annotation results may be intent annotation results, NER annotation results, slot annotation results or word-segmentation annotation results.
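The dispatch-and-collect pattern of steps 351-352 can be sketched as follows. The patent describes distribution to separate devices, terminals or programs; the thread pool here only illustrates parallel dispatch on a single machine, and the annotation models are placeholder callables that return one label per query.

```python
from concurrent.futures import ThreadPoolExecutor

def collect_annotations(corpus, annotation_models):
    """Trigger each party's annotation model in parallel and gather the returned annotation results."""
    def run_party(model):
        return {query: model(query) for query in corpus}      # one annotation result per query from this party

    with ThreadPoolExecutor(max_workers=len(annotation_models)) as pool:
        results = list(pool.map(run_party, annotation_models))
    return {f"party_{i + 1}": labels for i, labels in enumerate(results)}
```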
In an exemplary embodiment, the corpus set to be annotated includes several buried-point statements with known label information. A buried-point statement is a query statement whose accurate annotation result is already known; to distinguish it from the parties' annotation results for the buried-point statements, the known accurate annotation result of a buried-point statement is called its label information. As shown in Fig. 10, the above step 370 specifically includes the following.
In step 371, according to the parties' annotation results for the buried-point statements, the annotation results for those statements are compared with the corresponding label information, and the accuracy rate of each party's annotation results is calculated.
It should be noted that when screening the corpus set to be annotated according to the parties' annotation results, the annotation accuracy of each annotating party must first be judged, so that the annotation results supplied by low-accuracy parties can be removed.
The accuracy rate of a party's annotation results is the accuracy with which that party annotates the buried-point statements; computing it is used to assess the current party's annotation accuracy. During annotation, "buried points" are used to verify the accuracy of each annotating party. For example, a portion (such as 5%) of the query statements on which the annotators agreed in the data set completed in the previous batch can be extracted as the buried-point statements with known label information for the current batch. For each annotating party, its annotation results for the buried-point statements with known label information are compared with the known label information, the proportion of annotation results consistent with the label information is computed, and the party's annotation accuracy rate is obtained.
In step 372, according to the accuracy rates of the parties' annotation results, the annotation-result sources whose accuracy rate is not up to standard are removed from the multiple sources.
Specifically, a threshold can be set; according to each party's annotation accuracy rate, a party whose accuracy rate is below the threshold can be regarded as providing substandard annotation results, and the annotation results provided by such parties are deleted. Alternatively, all annotating parties can be ranked by annotation accuracy rate from high to low; the parties ranked last can be regarded as parties whose accuracy is not up to standard, and the annotation results they provide are removed.
In step 373, according to the annotation results from the remaining sources, the query statements whose annotation results are similar across the remaining sources are screened out from the corpus set to be annotated.
The annotation results from the remaining sources are the annotation results of the remaining parties for the corpus set to be annotated, after the annotation results provided by the substandard parties have been deleted. That is, when the query statements with similar annotation results are subsequently screened out of the corpus set to be annotated, the annotation results provided by the substandard parties are no longer used; according to the remaining, higher-accuracy parties' annotation results for the corpus set to be annotated, the query statements on which those parties' annotation results are similar are screened out.
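A compact sketch of steps 371-373 follows; the 0.9 accuracy threshold and the data layout (a mapping from each party to its per-query labels) are assumptions made for the illustration.

```python
def accuracy_on_buried_points(party_labels, buried_truth):
    """Fraction of buried-point statements this party labelled consistently with the known label."""
    hits = sum(1 for q, truth in buried_truth.items() if party_labels.get(q) == truth)
    return hits / len(buried_truth)

def screen_with_reliable_parties(all_labels, buried_truth, queries, min_accuracy=0.9):
    """Drop substandard parties, then keep the queries on which the remaining parties agree."""
    reliable = {party: labels for party, labels in all_labels.items()
                if accuracy_on_buried_points(labels, buried_truth) >= min_accuracy}
    screened = {}
    for q in queries:
        votes = [labels[q] for labels in reliable.values() if q in labels]
        if votes and len(set(votes)) == 1:               # remaining parties agree on this query
            screened[q] = votes[0]
    return screened
```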
In an exemplary embodiment, as shown in Fig. 11, the method for generating an annotated corpus set provided by the present invention further includes the following.
In step 1101, according to the parties' annotation results for the same query statement, the query statements whose annotation results are inconsistent are screened out from the corpus set to be annotated.
It should be noted that boundary samples are very useful for model optimization, as they help delineate clearer classification boundaries. Boundary samples can be screened from the samples on which the parties disagree. Specifically, according to the parties' annotation results for the same query statement, the server can screen out from the corpus set to be annotated the query statements on which the parties' annotation results are inconsistent.
In step 1102, the multi-label query statements are obtained from the query statements with inconsistent annotation results, yielding boundary samples used for optimizing the data analysis model.
From the query statements on which the annotators' results are inconsistent, a reviewer can pick out the multi-label query statements (query statements that can carry several annotation results); such multi-label query statements can be regarded as boundary samples. These query statements are hard to recognize, so if the model can accurately identify their intent, slots and so on, the accuracy of the model is greatly improved. The data analysis model may be an intent recognition model, a named entity recognition model, a slot labelling model, a word-segmentation model, etc.; optimizing the data analysis model with such query statements improves the model's recognition accuracy.
For example, for the query statement "I'm feeling down today, please play me a relaxing song", the intent includes both a chit-chat intent and a music-on-demand intent; this query statement is a boundary sample for intent classification and can help the model learn an accurate intent boundary during training.
The following are apparatus embodiments of the present invention, which can be used to execute the embodiments of the method for generating an annotated corpus set executed by the server 110 described above. For details not disclosed in the apparatus embodiments, please refer to the embodiments of the method for generating an annotated corpus set of the present invention.
Fig. 12 is a block diagram of an apparatus for generating an annotated corpus set according to an exemplary embodiment. The apparatus can be used in the server 110 of the implementation environment shown in Fig. 1 and executes all or part of the steps of the method for generating an annotated corpus set shown in any of Fig. 3 and Figs. 7-11. As shown in Fig. 12, the apparatus includes, but is not limited to, a log acquisition module 1210, a corpus acquisition module 1230, a result acquisition module 1250, a statement screening module 1270 and an annotation set generation module 1290.
The log acquisition module 1210 is configured to obtain a query log, the query log including query statements.
The corpus acquisition module 1230 is configured to extract the query statements to be annotated from the query log to obtain a corpus set to be annotated.
The result acquisition module 1250 is configured to obtain annotation results from multiple parties for the query statements in the corpus set to be annotated.
The statement screening module 1270 is configured to screen out from the corpus set to be annotated, according to the parties' annotation results for the same query statement, the query statements whose annotation results are similar.
The annotation set generation module 1290 is configured to generate an annotated corpus set from the query statements with similar annotation results and the corresponding annotation results.
The functions of the modules and how they are implemented are detailed in the implementation of the corresponding steps of the above method for generating an annotated corpus set and are not described again here.
The log acquisition module 1210 may be, for example, a physical structure such as the wired or wireless network interface 250 in Fig. 2.
The corpus acquisition module 1230, result acquisition module 1250, statement screening module 1270 and annotation set generation module 1290 may also be functional modules configured to execute the corresponding steps of the above method for generating an annotated corpus set. It is understood that these modules can be implemented in hardware, software, or a combination of both. When implemented in hardware, these modules can be implemented as one or more hardware modules, such as one or more application-specific integrated circuits. When implemented in software, these modules can be implemented as one or more computer programs executed on one or more processors, such as the programs stored in the memory 232 and executed by the central processing unit 222 of Fig. 2.
In an exemplary embodiment, as shown in Figure 13, the corpus obtaining module 1230 includes:
a sentence removal unit 1231, configured to remove the query statements in the query log that do not satisfy a preset condition;
a label prediction unit 1232, configured to input the query statements remaining in the query log into multiple constructed label prediction models, and to output the label prediction results of the multiple label prediction models for the same query statement, the multiple label prediction models being obtained by training with different training sample sets;
a sentence extraction unit 1233, configured to screen out, from the remaining query statements, the query statements whose label prediction results are inconsistent, according to the label prediction results of the multiple label prediction models for the same query statement, to obtain the corpus to be labeled.
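A hedged sketch of the label prediction unit and the sentence extraction unit follows: several classifiers are trained on different subsamples of previously labeled data, and only the query statements on which their predictions disagree are kept as the corpus to be labeled. scikit-learn and the function names here are purely illustrative; the patent does not name a specific model or library.

```python
import random

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline


def build_label_predictors(labeled_pool, n_models=3, sample_ratio=0.7, seed=0):
    """Train n_models classifiers, each on a different random subsample.

    labeled_pool is assumed to be a list of (query_text, label) pairs.
    """
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        sample = rng.sample(labeled_pool, int(len(labeled_pool) * sample_ratio))
        texts, labels = zip(*sample)
        model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
        model.fit(list(texts), list(labels))
        models.append(model)
    return models


def select_inconsistent_queries(queries, models):
    """Keep only the queries whose predicted labels differ across the models."""
    corpus_to_label = []
    for query in queries:
        predictions = {model.predict([query])[0] for model in models}
        if len(predictions) > 1:  # disagreement -> worth sending to manual labeling
            corpus_to_label.append(query)
    return corpus_to_label
```

Queries on which all models already agree are presumed easy and are left out, so the manual labeling budget is spent on the ambiguous ones.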
In an exemplary embodiment, the sentence removal unit 1231 includes:
a classification removal sub-unit, configured to classify the query statements recorded in the query log using a constructed classifier, and to remove the meaningless query statements identified by the classification.
In an exemplary embodiment, the sentence removal unit 1231 further includes:
a first removal sub-unit, configured to remove, according to a set of labeled query statements, the query statements in the query log that have already been labeled and the query statements that are similar to the labeled query statements.
In an exemplary embodiment, the sentence removal unit 1231 further includes:
a second removal sub-unit, configured to remove, from the query log, the query statements containing only a single entity word, the query statements whose sentence length is greater than a preset number of characters, and the duplicate query statements.
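The three removal sub-units can be pictured with the following illustrative filter. The predicates is_meaningless and is_single_entity are hypothetical stand-ins for whatever classifier and entity recognizer an actual implementation would use, and the 50-character limit and 0.9 similarity threshold are assumed values, not ones given by the patent.

```python
from difflib import SequenceMatcher


def remove_unqualified_queries(queries, labeled_queries, is_meaningless,
                               is_single_entity, max_chars=50, sim_threshold=0.9):
    """Drop query statements that fail the preset conditions described above."""
    seen = set()
    kept = []
    for query in queries:
        # Classification removal sub-unit: drop meaningless queries.
        if is_meaningless(query):
            continue
        # First removal sub-unit: drop already-labeled queries and near-duplicates of them.
        if query in labeled_queries or any(
                SequenceMatcher(None, query, done).ratio() >= sim_threshold
                for done in labeled_queries):
            continue
        # Second removal sub-unit: single-entity-only, over-long, or repeated queries.
        if is_single_entity(query) or len(query) > max_chars or query in seen:
            continue
        seen.add(query)
        kept.append(query)
    return kept
```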
In an exemplary embodiment, as shown in Figure 14, the result obtaining module 1250 includes:
a task dispatch unit 1251, configured to dispatch the labeling task for the corpus to be labeled to the multiple parties, the dispatching of the labeling task triggering the multiple parties to execute the labeling task in parallel;
a result receiving unit 1252, configured to receive the annotation results returned by the multiple parties after executing the labeling task in parallel.
The dispatching of the labeling task triggering the multiple parties to execute the labeling task in parallel includes:
the dispatching of the labeling task triggers each of the multiple parties to input the corpus to be labeled into the labeling model configured by that party and to output its own annotation results for the corpus to be labeled, where the labeling models configured by the multiple parties are obtained by training with different training sample sets.
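The dispatch-and-collect flow of the task dispatch unit 1251 and the result receiving unit 1252 might look like the sketch below, where each party is modeled as an object exposing a name attribute and an annotate() method backed by its own labeling model; this interface is an assumption made for illustration, not an API defined by the patent.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List


def dispatch_labeling_task(corpus: List[str], parties) -> Dict[str, Dict[str, str]]:
    """Send the corpus to every party in parallel and collect their annotation results.

    Returns a mapping of query statement -> {party name: annotation result}.
    """
    def run(party):
        # Each party applies its own labeling model to the whole corpus.
        return party.name, {query: party.annotate(query) for query in corpus}

    results: Dict[str, Dict[str, str]] = {query: {} for query in corpus}
    with ThreadPoolExecutor(max_workers=max(1, len(parties))) as pool:
        for name, labels in pool.map(run, parties):
            for query, label in labels.items():
                results[query][name] = label
    return results
```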
In an exemplary embodiment, the corpus to be labeled includes a plurality of buried-point statements whose label information is known.
As shown in Figure 15, the sentence screening module 1270 includes:
an accuracy computing unit 1271, configured to compare, according to the annotation results given by the multiple parties for the plurality of buried-point statements, whether the annotation results for the plurality of buried-point statements are consistent with the corresponding label information, and to compute the accuracy of each party's annotation results;
a source culling unit 1272, configured to reject, from the multiple sources, the annotation result sources whose accuracy does not reach the standard, according to the accuracy of the annotation results of the multiple parties;
a sentence screening unit 1273, configured to screen out, from the corpus to be labeled, the query statements whose annotation results from the multiple sources are similar, according to the annotation results of the remaining sources.
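Assuming the annotation results are collected per party as in the previous sketch, the three units above could be approximated as follows; the 0.9 accuracy threshold is an illustrative value rather than one prescribed by the patent.

```python
from typing import Dict


def screen_by_probe_accuracy(results: Dict[str, Dict[str, str]],
                             probe_labels: Dict[str, str],
                             min_accuracy: float = 0.9) -> Dict[str, str]:
    """results: query -> {party: label}; probe_labels: buried-point query -> known label."""
    # Accuracy computing unit 1271: score each party on the buried-point statements.
    parties = {party for labels in results.values() for party in labels}
    accuracy = {}
    for party in parties:
        probes = [q for q in probe_labels if party in results.get(q, {})]
        correct = sum(results[q][party] == probe_labels[q] for q in probes)
        accuracy[party] = correct / len(probes) if probes else 0.0

    # Source culling unit 1272: discard sources whose accuracy is below the bar.
    trusted = {party for party, acc in accuracy.items() if acc >= min_accuracy}

    # Sentence screening unit 1273: keep ordinary queries the trusted sources agree on.
    consistent = {}
    for query, labels in results.items():
        if query in probe_labels:
            continue
        votes = {label for party, label in labels.items() if party in trusted}
        if trusted and len(votes) == 1:
            consistent[query] = next(iter(votes))
    return consistent
```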
Optionally, the present invention also provides an electronic device, which may be used in the server 110 of the implementation environment shown in Fig. 1 to execute all or part of the steps of the method for generating a corpus labeling set shown in any of Fig. 3 and Fig. 7 to Fig. 11. The electronic device includes:
a processor; and
a memory for storing processor-executable instructions;
wherein the processor is configured to execute the method for generating a corpus labeling set described in the above exemplary embodiments.
The specific manner in which the processor of the electronic device performs operations in this embodiment has been described in detail in the embodiments of the method for generating a corpus labeling set, and is not elaborated here.
In an exemplary embodiment, a storage medium is further provided. The storage medium is a computer-readable storage medium, for example a temporary or non-transitory computer-readable storage medium that includes instructions. The storage medium stores a computer program, and the computer program can be executed by the central processing unit 222 of the server 200 to complete the above method for generating a corpus labeling set.
It should be understood that the present invention is not limited to the precise structures described above and shown in the accompanying drawings, and that various modifications and changes may be made without departing from its scope. The scope of the present invention is limited only by the appended claims.
Claims (15)
1. A method for generating a corpus labeling set, characterized by comprising:
obtaining a query log, the query log including query statements;
extracting query statements to be labeled from the query log to obtain a corpus to be labeled;
obtaining annotation results given by multiple parties for the query statements in the corpus to be labeled;
screening out, from the corpus to be labeled, query statements whose annotation results are similar, according to the annotation results given by the multiple parties for the same query statement; and
generating a corpus labeling set from the query statements with similar annotation results and the corresponding annotation results.
2. The method according to claim 1, characterized in that the extracting query statements to be labeled from the query log to obtain a corpus to be labeled comprises:
removing the query statements in the query log that do not satisfy a preset condition;
inputting the query statements remaining in the query log into multiple constructed label prediction models, and outputting the label prediction results of the multiple label prediction models for the same query statement, the multiple label prediction models being obtained by training with different training sample sets; and
screening out, from the remaining query statements, the query statements whose label prediction results are inconsistent, according to the label prediction results of the multiple label prediction models for the same query statement, to obtain the corpus to be labeled.
3. The method according to claim 2, characterized in that the removing the query statements in the query log that do not satisfy the preset condition comprises:
classifying the query statements recorded in the query log using a constructed classifier, and removing the meaningless query statements identified by the classification.
4. The method according to claim 2, characterized in that the removing the query statements in the query log that do not satisfy the preset condition comprises:
removing, according to a set of labeled query statements, the query statements in the query log that have already been labeled and the query statements that are similar to the labeled query statements.
5. The method according to claim 2, characterized in that the removing the query statements in the query log that do not satisfy the preset condition comprises:
removing, from the query log, the query statements containing only a single entity word, the query statements whose sentence length is greater than a preset number of characters, and the duplicate query statements.
6. The method according to claim 1, characterized in that the obtaining annotation results given by multiple parties for the query statements in the corpus to be labeled comprises:
dispatching the labeling task for the corpus to be labeled to the multiple parties, the dispatching of the labeling task triggering the multiple parties to execute the labeling task in parallel; and
receiving the annotation results returned by the multiple parties after executing the labeling task in parallel.
7. The method according to claim 6, characterized in that the dispatching of the labeling task triggering the multiple parties to execute the labeling task in parallel comprises:
the dispatching of the labeling task triggering each of the multiple parties to input the corpus to be labeled into the labeling model configured by that party and to output its own annotation results for the corpus to be labeled, wherein the labeling models configured by the multiple parties are obtained by training with different training sample sets.
8. The method according to claim 1, characterized in that the corpus to be labeled includes a plurality of buried-point statements whose label information is known; and the screening out, from the corpus to be labeled, of query statements whose annotation results are similar, according to the annotation results given by the multiple parties for the same query statement, comprises:
comparing, according to the annotation results given by the multiple parties for the plurality of buried-point statements, whether the annotation results for the plurality of buried-point statements are consistent with the corresponding label information, and computing the accuracy of each party's annotation results;
rejecting, from the multiple sources, the annotation result sources whose accuracy does not reach the standard, according to the accuracy of the annotation results of the multiple parties; and
screening out, from the corpus to be labeled, the query statements whose annotation results from the multiple sources are similar, according to the annotation results of the remaining sources.
9. An apparatus for generating a corpus labeling set, characterized by comprising:
a log acquisition module, configured to obtain a query log, the query log including query statements;
a corpus obtaining module, configured to extract query statements to be labeled from the query log to obtain a corpus to be labeled;
a result obtaining module, configured to obtain annotation results given by multiple parties for the query statements in the corpus to be labeled;
a sentence screening module, configured to screen out, from the corpus to be labeled, query statements whose annotation results are similar, according to the annotation results given by the multiple parties for the same query statement; and
a labeling set generation module, configured to generate a corpus labeling set from the query statements with similar annotation results and the corresponding annotation results.
10. The apparatus according to claim 9, characterized in that the corpus obtaining module comprises:
a sentence removal unit, configured to remove the query statements in the query log that do not satisfy a preset condition;
a label prediction unit, configured to input the query statements remaining in the query log into multiple constructed label prediction models and to output the label prediction results of the multiple label prediction models for the same query statement, the multiple label prediction models being obtained by training with different training sample sets; and
a sentence extraction unit, configured to screen out, from the remaining query statements, the query statements whose label prediction results are inconsistent, according to the label prediction results of the multiple label prediction models for the same query statement, to obtain the corpus to be labeled.
11. The apparatus according to claim 10, characterized in that the sentence removal unit comprises:
a classification removal sub-unit, configured to classify the query statements recorded in the query log using a constructed classifier, and to remove the meaningless query statements identified by the classification.
12. The apparatus according to claim 9, characterized in that the result obtaining module comprises:
a task dispatch unit, configured to dispatch the labeling task for the corpus to be labeled to the multiple parties, the dispatching of the labeling task triggering the multiple parties to execute the labeling task in parallel; and
a result receiving unit, configured to receive the annotation results returned by the multiple parties after executing the labeling task in parallel.
13. The apparatus according to claim 9, characterized in that the corpus to be labeled includes a plurality of buried-point statements whose label information is known, and the sentence screening module comprises:
an accuracy computing unit, configured to compare, according to the annotation results given by the multiple parties for the plurality of buried-point statements, whether the annotation results for the plurality of buried-point statements are consistent with the corresponding label information, and to compute the accuracy of each party's annotation results;
a source culling unit, configured to reject, from the multiple sources, the annotation result sources whose accuracy does not reach the standard, according to the accuracy of the annotation results of the multiple parties; and
a sentence screening unit, configured to screen out, from the corpus to be labeled, the query statements whose annotation results from the multiple sources are similar, according to the annotation results of the remaining sources.
14. An electronic device, characterized in that the electronic device comprises:
a processor; and
a memory for storing processor-executable instructions;
wherein the processor is configured to execute the method for generating a corpus labeling set according to any one of claims 1 to 8.
15. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program, and the computer program can be executed by a processor to complete the method for generating a corpus labeling set according to any one of claims 1 to 8.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811048957.8A CN110209764B (en) | 2018-09-10 | 2018-09-10 | Corpus annotation set generation method and device, electronic equipment and storage medium |
PCT/CN2019/100823 WO2020052405A1 (en) | 2018-09-10 | 2019-08-15 | Corpus annotation set generation method and apparatus, electronic device, and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811048957.8A CN110209764B (en) | 2018-09-10 | 2018-09-10 | Corpus annotation set generation method and device, electronic equipment and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110209764A true CN110209764A (en) | 2019-09-06 |
CN110209764B CN110209764B (en) | 2023-04-07 |
Family
ID=67779909
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811048957.8A Active CN110209764B (en) | 2018-09-10 | 2018-09-10 | Corpus annotation set generation method and device, electronic equipment and storage medium |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN110209764B (en) |
WO (1) | WO2020052405A1 (en) |
Cited By (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110675862A (en) * | 2019-09-25 | 2020-01-10 | 招商局金融科技有限公司 | Corpus acquisition method, electronic device and storage medium |
CN110852109A (en) * | 2019-11-11 | 2020-02-28 | 腾讯科技(深圳)有限公司 | Corpus generating method, corpus generating device, and storage medium |
CN111160044A (en) * | 2019-12-31 | 2020-05-15 | 出门问问信息科技有限公司 | Text-to-speech conversion method and device, terminal and computer readable storage medium |
CN111177412A (en) * | 2019-12-30 | 2020-05-19 | 成都信息工程大学 | Public logo bilingual parallel corpus system |
CN111179904A (en) * | 2019-12-31 | 2020-05-19 | 出门问问信息科技有限公司 | Mixed text-to-speech conversion method and device, terminal and computer readable storage medium |
CN111259134A (en) * | 2020-01-19 | 2020-06-09 | 出门问问信息科技有限公司 | Entity identification method, equipment and computer readable storage medium |
CN111785272A (en) * | 2020-06-16 | 2020-10-16 | 杭州云嘉云计算有限公司 | Online labeling method and system |
CN112163424A (en) * | 2020-09-17 | 2021-01-01 | 中国建设银行股份有限公司 | Data labeling method, device, equipment and medium |
CN112925910A (en) * | 2021-02-25 | 2021-06-08 | 中国平安人寿保险股份有限公司 | Method, device and equipment for assisting corpus labeling and computer storage medium |
CN113407713A (en) * | 2020-10-22 | 2021-09-17 | 腾讯科技(深圳)有限公司 | Corpus mining method and apparatus based on active learning and electronic device |
WO2021238337A1 (en) * | 2020-05-29 | 2021-12-02 | 华为技术有限公司 | Method and device for entity tagging |
CN114078470A (en) * | 2020-08-17 | 2022-02-22 | 阿里巴巴集团控股有限公司 | Model processing method and device, and voice recognition method and device |
CN114757267A (en) * | 2022-03-25 | 2022-07-15 | 北京爱奇艺科技有限公司 | Method and device for identifying noise query, electronic equipment and readable storage medium |
Families Citing this family (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113642329B (en) * | 2020-04-27 | 2024-10-29 | 阿里巴巴集团控股有限公司 | Method and device for establishing term identification model, and method and device for term identification |
CN114025216B (en) * | 2020-04-30 | 2023-11-17 | 网易(杭州)网络有限公司 | Media material processing method, device, server and storage medium |
CN111629267B (en) * | 2020-04-30 | 2023-06-09 | 腾讯科技(深圳)有限公司 | Audio labeling method, device, equipment and computer readable storage medium |
CN111611797B (en) * | 2020-05-22 | 2023-09-12 | 云知声智能科技股份有限公司 | Method, device and equipment for marking prediction data based on Albert model |
CN111651988B (en) * | 2020-06-03 | 2023-05-19 | 北京百度网讯科技有限公司 | Method, apparatus, device and storage medium for training model |
CN112052356B (en) * | 2020-08-14 | 2023-11-24 | 腾讯科技(深圳)有限公司 | Multimedia classification method, apparatus and computer readable storage medium |
CN112541070B (en) * | 2020-12-25 | 2024-03-22 | 北京百度网讯科技有限公司 | Mining method and device for slot updating corpus, electronic equipment and storage medium |
CN112700763B (en) * | 2020-12-26 | 2024-04-16 | 中国科学技术大学 | Voice annotation quality evaluation method, device, equipment and storage medium |
CN112686022A (en) * | 2020-12-30 | 2021-04-20 | 平安普惠企业管理有限公司 | Method and device for detecting illegal corpus, computer equipment and storage medium |
CN113255879B (en) * | 2021-01-13 | 2024-05-24 | 深延科技(北京)有限公司 | Deep learning labeling method, system, computer equipment and storage medium |
CN113569546A (en) * | 2021-06-16 | 2021-10-29 | 上海淇玥信息技术有限公司 | Intention labeling method and device and electronic equipment |
CN113722289A (en) * | 2021-08-09 | 2021-11-30 | 杭萧钢构股份有限公司 | Method, device, electronic equipment and medium for constructing data service |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR20020045343A (en) * | 2000-12-08 | 2002-06-19 | 오길록 | Method of information generation and retrieval system based on a standardized Representation format of sentences structures and meanings |
CN105912724B (en) * | 2016-05-10 | 2019-04-26 | 黑龙江工程学院 | A kind of time-based microblogging file extent method towards microblogging retrieval |
- 2018-09-10: CN CN201811048957.8A patent/CN110209764B/en active Active
- 2019-08-15: WO PCT/CN2019/100823 patent/WO2020052405A1/en active Application Filing
Patent Citations (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101105801A (en) * | 2007-04-20 | 2008-01-16 | 清华大学 | Automatic positioning method of network key resource page |
CN102541838A (en) * | 2010-12-24 | 2012-07-04 | 日电(中国)有限公司 | Method and equipment for optimizing emotional classifier |
CN103136210A (en) * | 2011-11-23 | 2013-06-05 | 北京百度网讯科技有限公司 | Method and device for mining query with similar requirements |
CN103530282A (en) * | 2013-10-23 | 2014-01-22 | 北京紫冬锐意语音科技有限公司 | Corpus tagging method and equipment |
CN105389340A (en) * | 2015-10-20 | 2016-03-09 | 北京云知声信息技术有限公司 | Information testing method and device |
CN106021461A (en) * | 2016-05-17 | 2016-10-12 | 深圳市中润四方信息技术有限公司 | Text classification method and text classification system |
CN106202177A (en) * | 2016-06-27 | 2016-12-07 | 腾讯科技(深圳)有限公司 | A kind of file classification method and device |
CN106372132A (en) * | 2016-08-25 | 2017-02-01 | 北京百度网讯科技有限公司 | Artificial intelligence-based query intention prediction method and apparatus |
US20180095989A1 (en) * | 2016-09-30 | 2018-04-05 | Adobe Systems Incorporated | Document replication based on distributional semantics |
US20180196873A1 (en) * | 2017-01-11 | 2018-07-12 | Siemens Medical Solutions Usa, Inc. | Visualization framework based on document representation learning |
CN107256267A (en) * | 2017-06-19 | 2017-10-17 | 北京百度网讯科技有限公司 | Querying method and device |
CN108334496A (en) * | 2018-01-30 | 2018-07-27 | 中国科学院自动化研究所 | Human-computer dialogue understanding method and system and relevant device for specific area |
Non-Patent Citations (1)
Title |
---|
XIAODONG ZHANG ETAL: "《A joint model of intent determination and slot filling for spoken language understanding》" * |
Cited By (18)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110675862A (en) * | 2019-09-25 | 2020-01-10 | 招商局金融科技有限公司 | Corpus acquisition method, electronic device and storage medium |
CN110852109A (en) * | 2019-11-11 | 2020-02-28 | 腾讯科技(深圳)有限公司 | Corpus generating method, corpus generating device, and storage medium |
CN111177412A (en) * | 2019-12-30 | 2020-05-19 | 成都信息工程大学 | Public logo bilingual parallel corpus system |
CN111177412B (en) * | 2019-12-30 | 2023-03-31 | 成都信息工程大学 | Public logo bilingual parallel corpus system |
CN111160044A (en) * | 2019-12-31 | 2020-05-15 | 出门问问信息科技有限公司 | Text-to-speech conversion method and device, terminal and computer readable storage medium |
CN111179904A (en) * | 2019-12-31 | 2020-05-19 | 出门问问信息科技有限公司 | Mixed text-to-speech conversion method and device, terminal and computer readable storage medium |
CN111259134A (en) * | 2020-01-19 | 2020-06-09 | 出门问问信息科技有限公司 | Entity identification method, equipment and computer readable storage medium |
CN111259134B (en) * | 2020-01-19 | 2023-08-08 | 出门问问信息科技有限公司 | Entity identification method, equipment and computer readable storage medium |
CN113743117B (en) * | 2020-05-29 | 2024-04-09 | 华为技术有限公司 | Method and device for entity labeling |
WO2021238337A1 (en) * | 2020-05-29 | 2021-12-02 | 华为技术有限公司 | Method and device for entity tagging |
CN113743117A (en) * | 2020-05-29 | 2021-12-03 | 华为技术有限公司 | Method and device for entity marking |
CN111785272A (en) * | 2020-06-16 | 2020-10-16 | 杭州云嘉云计算有限公司 | Online labeling method and system |
CN114078470A (en) * | 2020-08-17 | 2022-02-22 | 阿里巴巴集团控股有限公司 | Model processing method and device, and voice recognition method and device |
CN112163424A (en) * | 2020-09-17 | 2021-01-01 | 中国建设银行股份有限公司 | Data labeling method, device, equipment and medium |
CN113407713A (en) * | 2020-10-22 | 2021-09-17 | 腾讯科技(深圳)有限公司 | Corpus mining method and apparatus based on active learning and electronic device |
CN113407713B (en) * | 2020-10-22 | 2024-04-05 | 腾讯科技(深圳)有限公司 | Corpus mining method and device based on active learning and electronic equipment |
CN112925910A (en) * | 2021-02-25 | 2021-06-08 | 中国平安人寿保险股份有限公司 | Method, device and equipment for assisting corpus labeling and computer storage medium |
CN114757267A (en) * | 2022-03-25 | 2022-07-15 | 北京爱奇艺科技有限公司 | Method and device for identifying noise query, electronic equipment and readable storage medium |
Also Published As
Publication number | Publication date |
---|---|
WO2020052405A1 (en) | 2020-03-19 |
CN110209764B (en) | 2023-04-07 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110209764A (en) | The generation method and device of corpus labeling collection, electronic equipment, storage medium | |
CN111291570B (en) | Method and device for realizing element identification in judicial documents | |
CN107491432B (en) | Low-quality article identification method and device based on artificial intelligence, equipment and medium | |
CN110263157B (en) | Data risk prediction method, device and equipment | |
CN110968695A (en) | Intelligent labeling method, device and platform based on active learning of weak supervision technology | |
WO2019043379A1 (en) | Fact checking | |
CN114238573A (en) | Information pushing method and device based on text countermeasure sample | |
CN114648392B (en) | Product recommendation method and device based on user portrait, electronic equipment and medium | |
CN104794212A (en) | Context sentiment classification method and system based on user comment text | |
CN104090888A (en) | Method and device for analyzing user behavior data | |
CN109299271A (en) | Training sample generation, text data, public sentiment event category method and relevant device | |
CN109800354B (en) | Resume modification intention identification method and system based on block chain storage | |
Vysotska et al. | The commercial content digest formation and distributional process | |
CN111462752A (en) | Client intention identification method based on attention mechanism, feature embedding and BI-L STM | |
CN106537387B (en) | Retrieval/storage image associated with event | |
CN114547346B (en) | Knowledge graph construction method and device, electronic equipment and storage medium | |
CN102402717A (en) | Data analysis facility and method | |
CN117474507A (en) | Intelligent recruitment matching method and system based on big data application technology | |
CN116976321A (en) | Text processing method, apparatus, computer device, storage medium, and program product | |
CN114372532A (en) | Method, device, equipment, medium and product for determining label marking quality | |
CN109933784B (en) | Text recognition method and device | |
CN116362534A (en) | Emergency management method and system for violations and risks of online customer service contents in railway field | |
CN115221323A (en) | Cold start processing method, device, equipment and medium based on intention recognition model | |
CN117933260A (en) | Text quality analysis method, device, equipment and storage medium | |
CN113407718A (en) | Method and device for generating question bank, computer readable storage medium and processor |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||