CN110209764B - Corpus annotation set generation method and device, electronic equipment and storage medium - Google Patents

Corpus annotation set generation method and device, electronic equipment and storage medium

Info

Publication number
CN110209764B
CN110209764B (application CN201811048957.8A)
Authority
CN
China
Prior art keywords
query
labeling
corpus
sentences
results
Prior art date
Legal status
Active
Application number
CN201811048957.8A
Other languages
Chinese (zh)
Other versions
CN110209764A (en)
Inventor
陆笛 (Lu Di)
Current Assignee
Tencent Technology Beijing Co Ltd
Original Assignee
Tencent Technology Beijing Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Beijing Co Ltd filed Critical Tencent Technology Beijing Co Ltd
Priority to CN201811048957.8A priority Critical patent/CN110209764B/en
Priority to PCT/CN2019/100823 priority patent/WO2020052405A1/en
Publication of CN110209764A publication Critical patent/CN110209764A/en
Application granted granted Critical
Publication of CN110209764B publication Critical patent/CN110209764B/en

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 - Querying
    • G06F16/3331 - Query processing
    • G06F16/334 - Query execution
    • G06F16/3344 - Query execution using natural language analysis
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/26 - Speech to text systems
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Theoretical Computer Science (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a corpus labeling set generation method and device, electronic equipment and a computer-readable storage medium. In the technical scheme provided by the invention, a corpus set to be labeled is obtained from a query log, labeling results for the query sentences in the corpus set are obtained from multiple parties, the query sentences whose labeling results are similar across those parties are screened out, and the corpus labeling set is formed from these query sentences and their corresponding labeling results. Because every query sentence in the corpus labeling set received similar labeling results from multiple parties, the labeling results in the set are unlikely to diverge and their accuracy is high; using such a high-accuracy corpus labeling set as a training set for data analysis models such as an intention recognition model therefore improves the accuracy of those models.

Description

Corpus annotation set generation method and device, electronic equipment and storage medium
Technical Field
The present invention relates to the field of computer technologies, and in particular, to a method and an apparatus for generating a corpus tagging set, an electronic device, and a computer-readable storage medium.
Background
In the field of voice interaction, query sentences input by users are analyzed online by various data analysis models, which recognize user intentions and provide accurate responses. A data analysis model is obtained by training on a large number of labeled query sentences (referred to as a training set). Therefore, the accuracy of the labeling results of the query sentences in the training set directly influences the accuracy of the data analysis model and determines how intelligent the voice interaction function can be.
At present, query sentences are mainly labeled manually by annotators. For example, the query intention of a query statement (chat intention, music-on-demand intention, weather query intention, etc.) is labeled by hand. The cognitive level of the annotator therefore determines the labeling accuracy of the query statement.
Because an annotator's understanding may differ from that of an ordinary user, or may be biased for certain query sentences, the query sentences in the training set are easily labeled inaccurately; the data analysis model trained on them then has larger errors and cannot provide accurate responses to users.
Disclosure of Invention
The invention provides a corpus annotation set generation method to solve the problem in the related art that biased annotator cognition leads to inaccurate labeling results for the query sentences in the training set.
In one aspect, the present invention provides a method for generating a corpus tagging set, including:
acquiring a query log; the query log comprises query statements;
extracting query sentences to be labeled from the query log to obtain a corpus set to be labeled;
acquiring labeling results from multiple parties for the query sentences in the corpus to be labeled;
screening out query sentences with similar labeling results from the corpus to be labeled according to the labeling results of the multiple parties for the same query sentence;
and generating a corpus tagging set from the query sentences with similar tagging results and their corresponding tagging results.
In another aspect, the present invention provides an apparatus for generating a corpus annotation set, including:
the log acquisition module is used for acquiring the query log; the query log comprises query statements;
a corpus obtaining module, configured to extract query statements to be labeled from the query log to obtain a corpus to be labeled;
a result acquisition module, configured to acquire labeling results from multiple parties for the query sentences in the corpus to be labeled;
a statement screening module, configured to screen out query statements with similar labeling results from the corpus to be labeled according to the labeling results of the multiple parties for the same query statement;
and a labeling set generating module, configured to generate a corpus labeling set from the query sentences with similar labeling results and their corresponding labeling results.
Further, the present invention provides an electronic device, comprising:
a processor;
a memory for storing processor-executable instructions;
the processor is configured to execute the generating method of the corpus tagging set.
Further, the present invention provides a computer-readable storage medium storing a computer program that can be executed by a processor to complete the corpus annotation set generation method.
The technical scheme provided by the embodiment of the invention can have the following beneficial effects:
according to the technical scheme provided by the invention, the corpus to be labeled is obtained from the query log, the labeling results of a plurality of users on the query sentences in the corpus are obtained, the query sentences with the same labeling results are screened out, and the query sentences and the corresponding labeling results form the corpus labeling set. Because the query sentences of the corpus tagging set belong to the query sentences with the same multi-part tagging results, the tagging results of the query sentences in the corpus tagging set are less likely to diverge, the accuracy of the tagging results is higher, and the corpus tagging set with higher accuracy is used as a training set to train data analysis models such as an intention recognition model, so that the accuracy of the data analysis models can be improved.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention, as claimed.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention.
FIG. 1 is a schematic illustration of an implementation environment in accordance with the present invention;
FIG. 2 is a block diagram illustrating a server in accordance with an exemplary embodiment;
FIG. 3 is a flow diagram illustrating a method for generating corpus annotation sets in accordance with an exemplary embodiment;
FIG. 4 is a schematic diagram of the labeling results of various labeling tasks;
FIG. 5 is a schematic diagram illustrating a division principle of a corpus tagging set;
FIG. 6 is a schematic diagram of an influence curve of corpus tagging sets of each batch on model performance;
FIG. 7 is a detailed flowchart of step 330 in the corresponding embodiment of FIG. 3;
FIG. 8 is a schematic diagram illustrating the generation of corpus annotation sets in accordance with an exemplary embodiment;
FIG. 9 is a detailed flowchart of step 350 in the corresponding embodiment of FIG. 3;
FIG. 10 is a detailed flowchart of step 370 in a corresponding embodiment of FIG. 3;
FIG. 11 is a flowchart illustrating a method for generating corpus annotation sets according to the embodiment shown in FIG. 3;
FIG. 12 is a block diagram illustrating an apparatus for generating corpus annotation sets in accordance with an exemplary embodiment;
FIG. 13 is a detailed block diagram of the corpus obtaining module in the corresponding embodiment of FIG. 12;
FIG. 14 is a detailed block diagram of the result acquisition module in the corresponding embodiment of FIG. 12;
FIG. 15 is a detailed block diagram of the statement screening module in the corresponding embodiment of FIG. 12.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present invention; rather, they are merely examples of apparatus and methods consistent with certain aspects of the invention, as detailed in the appended claims.
FIG. 1 is a schematic diagram of an implementation environment to which the present invention relates, according to an exemplary embodiment. The implementation environment includes a server 110 in which the query log is stored, so the server 110 can generate the corpus tagging set from the query log using the method provided by the present invention, thereby improving the accuracy of the labeling results of the query sentences in the set.
The implementation environment also includes a data source that provides the needed data, i.e., the query logs. Specifically, in this implementation environment the data source may be an intelligent terminal 130. The server 110 may obtain the query log uploaded by the intelligent terminal 130 and then generate the corpus tagging set using the method provided by the present invention. The intelligent terminal 130 may be a smartphone, a smart speaker, or a tablet computer.
It should be noted that the corpus annotation set generation method according to the present invention is not limited to deploying the corresponding processing logic in the server 110; the logic may also be deployed in other machines, for example in a terminal device with sufficient computing capability.
Referring to FIG. 2, FIG. 2 is a schematic diagram of a server structure according to an embodiment of the present invention. The server 200 may vary significantly depending on configuration or performance, and may include one or more central processing units (CPUs) 222 (e.g., one or more processors), memory 232, and one or more storage media 230 (e.g., one or more mass storage devices) storing applications 242 or data 244. The memory 232 and the storage medium 230 may be transient or persistent storage. The program stored in the storage medium 230 may include one or more modules (not shown), each of which may include a series of instruction operations for the server 200. Still further, the central processing unit 222 may be configured to communicate with the storage medium 230 and execute, on the server 200, the series of instruction operations in the storage medium 230. The server 200 may also include one or more power supplies 226, one or more wired or wireless network interfaces 250, one or more input/output interfaces 258, and/or one or more operating systems 241, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, and so on. The steps performed by the server in the embodiments of FIGs. 3, 7, 8, 10-12 below may be based on the server structure shown in FIG. 2.
It will be understood by those skilled in the art that all or part of the steps for implementing the embodiments described below may be implemented by hardware, or may be implemented by a program instructing relevant hardware, and the program may be stored in a computer-readable storage medium, such as a read-only memory, a magnetic disk or an optical disk.
FIG. 3 is a flowchart illustrating a method for generating a corpus annotation set according to an exemplary embodiment. The method may be executed by a server, for example the server 110 of the implementation environment shown in FIG. 1. As shown in FIG. 3, the method may include the following steps.
In step 310, a query log is obtained;
the query log refers to records collected by equipment, wherein the records are input by a user to query sentences, and the equipment can be an intelligent sound box, a mobile terminal and the like. The query log may include points in time, query statements entered by the user, query results returned to the user, and the like. The query sentence input by the user can be in a text or voice form. The query log may include a large number of query statements input by one or more users, so the query log may be said to be a corpus containing a large number of query statements. The raw corpus refers to query sentences belonging to original real users without manual labeling.
In step 330, extracting query statements to be labeled from the query log to obtain a corpus set to be labeled;
it should be noted that, because the query log includes a large number of query sentences, but not all query sentences are effective, some query sentences may be input by the user at will and do not represent any meaning, some query sentences may be too long or too short, and many query sentences may be repeated.
Therefore, the invention can extract the query statement to be labeled from the query log according to the preset strategy, and the query statement to be labeled forms the corpus to be labeled. The extraction of the query sentence to be labeled can be performed by analyzing the query log, removing the query sentence containing useless/disabled characters, removing meaningless query sentences (for example, randomly input characters without coherence), removing overlong or overlong query sentences or overlong query sentences, removing repeated query sentences, removing the labeled query sentence, and obtaining the last remaining query sentence as the query sentence to be labeled according to the configured useless/disabled character library.
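Expressed as code, this preset strategy is a simple filtering pipeline. The sketch below is illustrative only: the stop-character set, the length bounds, and the is_meaningless placeholder are assumptions standing in for the configured character library and the classifier described later, not values fixed by this disclosure.

```python
import re

STOP_CHARS = set("@#$%^&*")    # assumed useless/stop character library
MIN_LEN, MAX_LEN = 2, 50       # assumed sentence-length bounds

def is_meaningless(query):
    # Placeholder for the classifier of step 331: here we merely flag
    # queries with no alphanumeric or CJK content at all.
    return not re.search(r"[\w\u4e00-\u9fff]", query)

def extract_corpus_to_label(query_log, labeled_set):
    """Apply the preset strategy: drop bad-length, stop-character,
    meaningless, duplicate, and already-labeled query sentences."""
    seen, corpus = set(), []
    for query in query_log:
        query = query.strip()
        if not (MIN_LEN <= len(query) <= MAX_LEN):
            continue                        # overlong or overly short
        if any(ch in STOP_CHARS for ch in query):
            continue                        # useless/stop characters
        if is_meaningless(query):
            continue                        # meaningless query
        if query in seen or query in labeled_set:
            continue                        # duplicate or already labeled
        seen.add(query)
        corpus.append(query)
    return corpus
```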
In step 350, obtaining labeling results of the query sentences in the corpus to be labeled from multiple parties;
the multiple parties may be multiple annotators, multiple annotating devices, or multiple annotating programs in one device, and are used to indicate that multiple sources exist in the annotation result of the corpus query statement to be annotated. Each labeling party can label the query sentence in the corpus to be labeled (referred to as "voting"). The labeling means that a classification label is added to the query statement in the corpus to be labeled, and the correct classification of the query statement can be reflected only by a plurality of voting results.
The labeling result is the classification label that a labeling party adds to a query statement. Depending on the labeling task, the labeling result may be an intention labeling result, an NER (Named Entity Recognition) labeling result, a slot labeling result, or a word segmentation labeling result. An intention labeling result is an intention classification. For example, for the query statement "bad mood today", the labeling party labels the intention as "chat intention"; for "please play me a relaxing song", the labeling party labels the intention as "music-on-demand intention".
The NER labeling result labels the person names, place names, organization names, proper nouns, and the like in the query sentence. The slot labeling result adds slot labels to the phrases in the query sentence; for example, in the weather service domain the slot labels include time words, place words, weather service keywords, weather phenomenon words, question words, and so on. The word segmentation labeling result divides the query sentence into several phrases, each of which can be regarded as a classification label.
As shown in FIG. 4, for the corpus to be labeled, a labeling party may perform intention labeling, NER labeling, slot labeling, or word segmentation labeling to obtain the labeling results of each labeling task. Specifically, each party may perform intention labeling on the query statements in the corpus to be labeled (according to the intention labeling document specification) to obtain an intention labeling set containing the intention labeling results. The domain of each query statement can then be determined from its intention, and within that domain NER labeling (according to the NER labeling document specification) and slot labeling (according to the slot labeling document specification) are performed, yielding an NER labeling set containing the NER labeling results and a slot labeling set containing the slot labeling results. While performing intention labeling, each labeling party can also perform word segmentation labeling on the corpus to be labeled to obtain a word segmentation labeling set containing the word segmentation labeling results.
The intention labeling set, slot labeling set, NER labeling set, and word segmentation labeling set can be stored in a storage medium of the server, from which the server obtains the multi-party labeling results of the query sentences in the corpus to be labeled.
In step 370, according to the labeling results of multiple parties to the same query statement, screening out query statements with similar labeling results from the corpus to be labeled;
the query sentences with similar labeling results refer to query sentences with consistent or similar multi-part labeling results, the similarity of the multi-part labeling results is greater than a preset value, the query sentences with similar labeling results can be regarded as the query sentences with similar labeling results, and the preset value can be 80% or 90%.
In an embodiment, assuming the labeling result is an intention labeling result, the server obtains the multi-party intention labeling results of the query sentences in the corpus to be labeled, compares the multi-party results for each query sentence in turn, judges whether the multi-party labeling results of the query sentence are consistent (they are considered consistent if their similarity is greater than the preset value), and thereby screens out the query sentences with consistent labeling results from the corpus to be labeled.
Specifically, for the same query statement, if the labeling results of all parties are consistent, the statement is added to the single-label labeling set. If not, a final adjudicator is required to review the details of the inconsistency:
i) if more than half of the labeling parties give the same label, that majority label is taken as the labeling result and the statement is added to the single-label labeling set;
ii) if the inconsistent results are distributed 1:1, this may be a multi-label case (the statement may legitimately carry multiple labels); if the reviewer confirms the multi-label case, the statement is added to the multi-label labeling set;
iii) if the multi-party labeling results are entirely inconsistent, the statement may be a multi-label sample or a problem sample; after review, it is added to the multi-label labeling set or the problem sample set.
Thus one annotation task ultimately yields three labeling sets: a single-label labeling set, a multi-label labeling set, and a difficult (problem) sample set. A query statement in the single-label labeling set is one that multiple parties labeled identically; the single-label labeling set can therefore be regarded as a reliable labeling set and used as the training set, test set, etc. of an intention recognition model.
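For illustration, this three-way adjudication can be sketched as follows; the function and parameter names are hypothetical, and the reviewer callback stands in for the human adjudicator described above.

```python
from collections import Counter

def adjudicate(query, labels, reviewer=None):
    """Route one query's multi-party labels into the single-label,
    multi-label, or problem-sample set. `reviewer(query, counts) -> bool`
    is a hypothetical callback standing in for the human adjudicator."""
    counts = Counter(labels)
    top_label, top_votes = counts.most_common(1)[0]
    if len(counts) == 1 or top_votes * 2 > len(labels):
        return "single", top_label          # unanimous or majority label
    if top_votes * 2 == len(labels) and len(counts) == 2:
        # 1:1 split: possible multi-label case, the reviewer decides
        if reviewer is None or reviewer(query, counts):
            return "multi", sorted(counts)
    # fully divergent: multi-label or problem sample after review
    return "problem", sorted(counts)
```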
Similarly, assuming that the labeling result is an NER labeling result, a slot labeling result or a participle labeling result, the query sentences with the same labeling results can be screened out.
In step 390, a corpus tagging set is generated from the query sentences with similar tagging results and the corresponding tagging results.
The corpus tagging set includes the query statements and their corresponding tagging results, where the query statements are those screened out in step 370 whose multi-party tagging results are similar. The server generates the corpus tagging set from these screened query statements and their tagging results.
As shown in FIG. 5, multiple labeling parties label the corpus to be labeled. The server obtains labeling result 1 from labeling party 1, labeling result 2 from labeling party 2, labeling result 3 from labeling party 3, and labeling result 4 from labeling party 4. The server merges the four labeling results, screens out the query sentences whose four labeling results are consistent, and adds them to the single-label labeling set. If more than half of the labeling results of a query sentence are consistent, the multi-party labeling results can be considered the same: the majority label is taken as the labeling result of the query sentence and the sentence is added to the single-label labeling set. The single-label labeling set is taken as the corpus labeling set and can be merged with the already-labeled corpus to serve as a training set and test set.
As also shown in FIG. 5, if the labeling results 1 to 4 of some query statements are inconsistent and the inconsistent results are distributed 1:1, these query statements may be multi-label cases; after a reviewer confirms this, they are added to the multi-label labeling set. If the labeling results 1 to 4 are entirely inconsistent, the statements may be multi-label samples or problem samples and are added to the multi-label labeling set or the problem sample set accordingly.
It should be explained that the corpus labeling set is the single-label labeling set and contains query sentences whose multi-party labeling results are the same. That is, there is no divergence in the labeling results of the query sentences in the corpus labeling set and their accuracy is high, so the set can serve as a high-accuracy training set or test set for data analysis models.
For example, assuming that the labeling result is an intention labeling result, and the corpus labeling set includes query sentences with the same intention labeling result and corresponding intention labeling results, the corpus labeling set may be used as a training set to perform training of the intention recognition model. Assuming that the labeling result is an NER labeling result, and the corpus labeling set includes query sentences with the same NER labeling result and NER labeling results corresponding to the query sentences, the corpus labeling set can be used as a training set for training the named entity recognition model. Similarly, assuming that the labeling result is a slot labeling result, the corpus labeling set can be used as a training set to train the slot labeling model, and assuming that the labeling result is a participle labeling result, the corpus labeling set can be used as a training set to train the participle labeling model.
According to the technical scheme of the above exemplary embodiment, the corpus to be labeled is obtained from the query log, labeling results for the query sentences in the corpus are obtained from multiple parties, the query sentences with similar labeling results are screened out, and the corpus labeling set is formed from these query sentences and their corresponding labeling results. Because the query sentences in the corpus labeling set are those whose multi-party labeling results are similar, the labeling results in the set are unlikely to diverge and their accuracy is high; using such a high-accuracy corpus labeling set as a training set for data analysis models such as an intention recognition model therefore improves the accuracy of those models.
As required, the corpus labeling set can be added incrementally to the existing training set, the data analysis model retrained, and the model's performance tested with the same test set, so as to evaluate the performance improvement brought by the newly added corpus labeling set and thereby reflect its quality and value.
Take the intention corpus labeling set as an example. The performance of the intention recognition model on the test set is used as a benchmark. The newly obtained corpus labeling sets are then added to the model training set batch by batch, and the performance index of the model trained after each batch is recorded. The curve shown in FIG. 6 records the performance of the model trained after each batch of the corpus labeling set is added; the performance gain from the sixth batch (s6) is obvious, so that batch can be selected and added to the training data.
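This batch-by-batch evaluation can be sketched as a simple loop; train_fn and score_fn are assumed hooks for the reader's own model and metric, not interfaces defined by this disclosure.

```python
def evaluate_batches(base_train, batches, test_set, train_fn, score_fn):
    """Add corpus-labeling-set batches to the training set one by one
    and record the trained model's score on a fixed test set.

    train_fn(train_data) -> model and score_fn(model, test_set) -> float
    are assumed hooks for the reader's own intention recognition model
    and metric."""
    train = list(base_train)
    baseline = score_fn(train_fn(train), test_set)   # benchmark performance
    history = []
    for i, batch in enumerate(batches, start=1):
        train.extend(batch)                  # incremental superposition
        score = score_fn(train_fn(train), test_set)  # same test set each time
        history.append((f"s{i}", score, score - baseline))
    return baseline, history   # batches with a clear gain (e.g. s6) are kept
```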
In addition, if the labeling parties are human annotators, labeling can be scheduled in staggered periods to prevent annotators from consulting each other's results during labeling, as shown in Table 1 below.
Table 1. Staggered labeling task schedule

          Day 1       Day 2       Day 3       Day 4       Day 5
Person 1  Document 1  Document 5  Document 4  Document 3  Document 2
Person 2  Document 2  Document 1  Document 5  Document 4  Document 3
Person 3  Document 3  Document 2  Document 1  Document 5  Document 4
Person 4  Document 4  Document 3  Document 2  Document 1  Document 5
Taking four annotators as an example: to prevent the annotators from consulting each other's results, each annotator labels different content on the same day. Labeling can be arranged according to the schedule in Table 1, with five days as one period; the results are then collected and the consistent and inconsistent labeling results among the annotators are counted.
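The rotation in Table 1 is a cyclic shift, so such a schedule can be generated mechanically. The sketch below reproduces Table 1 and is purely illustrative.

```python
def staggered_schedule(n_annotators: int, n_docs: int):
    """Cyclic rotation behind Table 1: on any given day no two annotators
    hold the same document, and over one period each annotator sees every
    document exactly once."""
    return [
        [f"Document {(person - day) % n_docs + 1}" for day in range(n_docs)]
        for person in range(n_annotators)
    ]

# Reproduces Table 1: four annotators over a five-day period.
for p, row in enumerate(staggered_schedule(4, 5), start=1):
    print(f"Person {p}: " + " | ".join(row))
```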
In an exemplary embodiment, as shown in fig. 7, the step 330 specifically includes:
in step 331, removing query statements that do not satisfy preset conditions from the query log;
the query statement which does not satisfy the preset condition may include one or more of the following forms: the query sentences containing useless/disabled characters, meaningless query sentences, overlong or overlong query sentences, repeated query sentences and the like, so that the mark of the worthless query sentences is avoided, the workload is increased, and the accuracy of the corpus mark set is influenced.
In step 332, inputting the rest query statements in the query log into the constructed multiple label prediction models, and outputting the label prediction results of the multiple label prediction models on the same query statement; the label prediction models are obtained by training by adopting different training sample sets;
in particular, the tag prediction model may be an intent recognition model for recognizing the intent of the query statement. Accordingly, the tag prediction result may be an intention recognition result. The label prediction model may be trained using a large number of query statements (i.e., a training sample set) of known intent-to-label results. The plurality of label prediction models can be obtained by training different training samples. For example, all query sentences of which the intention labeling results are known are divided into 4 batches, and each batch of query sentences is trained to obtain a corresponding intention recognition model, so that 4 intention recognition models can be obtained.
After removing the query sentences that do not meet the requirements, the remaining query sentences in the query log are respectively input into the 4 intention recognition models, and the intention recognition results of the 4 models for the same query sentence are output.
It should be noted that, depending on the labeling task, the label prediction model may also be a named entity recognition model, a slot labeling model, or a word segmentation model, trained respectively on large numbers of query sentences with known NER, slot, or word segmentation labeling results. Correspondingly, the label prediction result can be a named entity recognition result, a slot labeling result, or a word segmentation result. Building such label prediction models belongs to the prior art and is not described again here.
In step 333, according to the tag prediction results of the same query statement by the multiple tag prediction models, a query statement with inconsistent tag prediction results is screened from the remaining query statements, and the corpus to be labeled is obtained.
Boundary sample points are highly informative about the decision boundary of the model being trained. If sample points that carry multiple intentions, or that have appreciable probability under several different classes, are found and added to the training set for model training, they help improve model performance far more than adding sample points that the models can already classify accurately.
Accordingly, the method screens, from the remaining query sentences in the query log, the query sentences whose label prediction results from the multiple label prediction models are inconsistent. In other words, the models recognize these query sentences with low accuracy, so they can be regarded as boundary sample points; adding them to the corpus to be labeled, and eventually to model training, improves the accuracy of the model.
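This screening is essentially disagreement-based sample selection (query-by-committee style) and can be sketched as follows; `model.predict` is an assumed interface, not one defined by this disclosure.

```python
def screen_boundary_queries(queries, models):
    """Steps 332-333: keep the queries on which the label prediction
    models disagree. `model.predict(query) -> label` is an assumed
    interface returning one label per query."""
    corpus_to_label = []
    for query in queries:
        predictions = {model.predict(query) for model in models}
        if len(predictions) > 1:            # inconsistent predictions:
            corpus_to_label.append(query)   # likely a boundary sample
    return corpus_to_label
```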
In an exemplary embodiment, the step 331 may include the steps of:
and classifying the query sentences recorded in the query log through the constructed classifier, and removing the nonsense query sentences obtained by classification.
A meaningless query statement is a statement with no specific intention; it may have been input by the user by mistake or at random. The classifier, i.e., a classification model, identifies which query statements in the query log are meaningful and which are meaningless. Specifically, the classifier can be trained on large numbers of meaningful and meaningless query sentences; for example, the parameters of a logistic regression model can be trained this way. "Classifier" is the general term in data mining for a method that classifies samples; classifiers can be built with algorithms such as decision trees, logistic regression, naive Bayes, and neural networks.
Specifically, the query statements in the query log are input into the trained classifier, which outputs a meaningful-or-meaningless judgment, so that the meaningless query statements can be removed from the query log. Optionally, query statements containing useless characters or stop characters can also be removed according to the configured useless/stop character library.
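As an illustration of one of the algorithms named above, a logistic regression classifier over character n-grams might look like the following; the tiny training lists are stand-in assumptions, not training data from this disclosure.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Stand-in training data; a real deployment would use large corpora.
meaningful = ["what's the weather tomorrow", "play a relaxing song"]
meaningless = ["asdf qwer", "zzzzzz"]

classifier = make_pipeline(
    TfidfVectorizer(analyzer="char", ngram_range=(1, 3)),
    LogisticRegression(max_iter=1000),
)
classifier.fit(meaningful + meaningless, [1, 1, 0, 0])  # 1 = meaningful

# Queries the classifier judges meaningless (label 0) are removed.
kept = [q for q in ["play music", "qqqq"] if classifier.predict([q])[0] == 1]
```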
In another exemplary embodiment, the step 331 may further include the steps of:
and removing the labeled query sentences and the query sentences similar to the labeled query sentences in the query log according to the labeled query sentence set.
The labeled query statement set is the set of query statements whose labeling results are already known; it may be a corpus labeling set that has already been generated. The query statements belonging to this set are removed from the query log. In addition, by computing the similarity between query statements, the query statements in the query log that are highly similar to already-labeled query statements can be found and removed as well, as sketched below.
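The disclosure does not fix a similarity measure; one plausible sketch uses TF-IDF cosine similarity, and the 0.9 threshold is an assumption.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def remove_near_labeled(queries, labeled_queries, threshold=0.9):
    """Drop queries whose best cosine similarity to any already-labeled
    query exceeds `threshold` (0.9 is an assumed value)."""
    vectorizer = TfidfVectorizer(analyzer="char", ngram_range=(2, 3))
    labeled_vecs = vectorizer.fit_transform(labeled_queries)
    query_vecs = vectorizer.transform(queries)
    best_sims = cosine_similarity(query_vecs, labeled_vecs).max(axis=1)
    return [q for q, s in zip(queries, best_sims) if s < threshold]
```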
That is, the remaining query statements in step 332 may be the query statements left in the query log after removing meaningless query statements, query statements containing useless or stop characters, already-labeled query statements, and query statements similar to the labeled ones.
In another exemplary embodiment, the step 331 may further include the steps of:
and removing the query sentences only containing single entity words, the query sentences with the sentence length larger than the preset character number or repeated query sentences in the query log.
Entity words are the names of real, specific things, such as song titles and singer names. A query sentence containing only a single entity word is hard to assign an intention, a word segmentation, and so on, so it is not suitable for adding to the corpus labeling set to participate in modeling. A query sentence whose length exceeds the preset number of characters is difficult to label, and its length increases the computation when it participates in modeling, so it is also unsuitable. Likewise, duplicate query statements in the query log need not all be added to the corpus labeling set; for example, if a query statement is repeated three times, two copies can be removed and one kept.
In summary, the remaining query sentences in step 332 may also be those left in the query log after removing the query sentences containing only a single entity word, the query sentences longer than the preset number of characters, and the duplicate query sentences.
As shown in FIG. 8, newly added query sentences are first preprocessed: meaningless query sentences, useless/stop characters, query sentences consisting of a single entity word, and overlong or duplicate query sentences are removed; then, according to the labeled query statement set, already-labeled query sentences and query sentences highly similar to the labeled set are removed. Next, query sentences with inconsistent label prediction results are screened out through steps 332 and 333, and the screened sentences form the corpus to be labeled. Query sentences with similar multi-party labeling results can then be screened out according to the multi-party labeling results of the corpus to be labeled, generating the corpus labeling set. Finally, the corpus labeling set can be added to the labeled query statement set to participate in model training.
In an exemplary embodiment, as shown in fig. 9, the step 350 specifically includes:
in step 351, dispatching a labeling task for the corpus to be labeled to multiple parties, wherein the dispatching of the labeling task triggers the multiple parties to execute the labeling task in parallel;
the annotation task can be an intention annotation task, an NER annotation task, a slot annotation task or a word segmentation annotation task. For example, multiple parties may be multiple labeling devices, and the server issues a labeling task carrying a corpus to be labeled to the multiple labeling devices and triggers the multiple labeling devices to execute the labeling task in parallel. It should be noted that the annotation device may be an intelligent annotation device obtained by training a large amount of sample data in advance. Each marking device adopts different sample data sets for training, so the marking precision of each marking device is different.
In an embodiment, the server may dispatch a labeling task carrying the corpus to be labeled to the terminal devices of several annotators. Each terminal device displays the corpus to be labeled and prompts the labeling task. The annotator performs intention labeling, NER labeling, slot labeling, and word segmentation labeling by clicking options or marking text, and the terminal devices obtain the labeling results from these operations, completing the labeling task for the corpus to be labeled.
In an exemplary embodiment, the dispatch of the labeling task triggers the multiple parties to execute the labeling task in parallel, which specifically includes: the dispatch of the labeling task triggers the multiple parties to input the corpus to be labeled into their own configured labeling models in parallel and to output their respective labeling results; the labeling models configured by the multiple parties are trained on different training sample sets.
That is, the multiple parties here may be multiple labeling devices or multiple labeling programs. Each labeling party is configured with a labeling model, and the models are trained on different training sample sets, so the labeling devices or programs have different labeling precision. It should be noted that in this embodiment the training sample sets used by the labeling models differ from those used by the label prediction models above. For example, all samples can be divided into 10 training sample sets, each of which trains a corresponding model, yielding 10 models; part of them serve as label prediction models and the rest as labeling models. The label prediction models first screen out the query sentences with inconsistent label predictions to obtain the corpus to be labeled, and the labeling models then compute the labeling results of the query sentences in that corpus.
Assuming the multiple parties are multiple labeling programs deployed in the server, the programs execute the following steps in parallel: input the corpus to be labeled into a pre-built labeling model and output the labeling results for the corpus. The labeling model can be built in the same way as the label prediction model.
In step 352, the annotation result returned by the multi-party parallel execution of the annotation task is received.
And the server receives the labeling results returned by the plurality of labeling devices or the terminal devices to which the plurality of labeling personnel belong. Corresponding to the labeling task, the labeling result can be an intention labeling result, an NER labeling result, a slot position labeling result or a word segmentation labeling result.
In an exemplary embodiment, the corpus to be labeled includes a number of buried point sentences with known tag information. A buried point ("sentinel") sentence is a query statement whose accurate labeling result is already known in advance; to distinguish it from the multi-party labeling results, this known accurate labeling result is called the tag information. As shown in FIG. 10, step 370 then specifically includes:
in step 371, according to the labeling results of the multiple buried point sentences by multiple parties, comparing whether the labeling results of the multiple buried point sentences are consistent with the corresponding label information, and calculating the accuracy of the multiple party labeling results;
it should be noted that, when the corpus to be labeled is screened according to the multiparty labeling result, the accuracy of the labeling result of each labeling party needs to be determined first, so as to remove the labeling result provided by the labeling party with lower accuracy.
The accuracy of a party's labeling results is the accuracy with which that labeling party labels the buried point sentences; it is computed to evaluate the labeling quality of the current labeling party. During labeling, each labeling party is verified through such "buried points". For example, 5% of the query statements on which all annotators agreed in the previous labeled batch can be extracted to serve as the current batch's buried point sentences with known tag information. For each labeling party, whether its labeling results for the buried point sentences match the known tag information is checked, and the proportion of matches is computed, yielding the accuracy of that party's labeling results.
In step 372, according to the accuracy of the multi-party labeling result, the labeling result sources with accuracy not up to the standard are removed from the multi-party sources.
Specifically, a threshold can be set; according to the accuracy of each labeling party, a party whose accuracy is below the threshold is considered to provide substandard labeling results, and the labeling results it provides are deleted.
Alternatively, all labeling parties can be ranked by accuracy from high to low, and the lowest-ranked parties treated as having substandard accuracy; the labeling results they provide are removed.
In step 373, query sentences with similar multi-source labeling results are screened out from the corpus to be labeled according to the labeling results of the rest sources.
The labeling results of the remaining sources are the labeling results of the remaining labeling parties after the parties whose accuracy does not reach the standard have been removed. That is, the labeling results provided by substandard parties are not used when subsequently screening query sentences with similar labeling results from the corpus to be labeled; the screening is performed according to the labeling results of the remaining, higher-accuracy labeling parties.
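Steps 371 and 372 can be sketched as follows; the data shapes and the 0.9 accuracy bar are assumptions made for this illustration.

```python
def filter_by_buried_points(party_labels, buried_points, min_accuracy=0.9):
    """Steps 371-372: score each labeling party on the buried point
    ('sentinel') sentences and drop parties below the bar.

    `party_labels` maps party -> {query: label}; `buried_points` maps
    query -> known tag information. Both shapes and the 0.9 threshold
    are assumptions made for this sketch."""
    kept = {}
    for party, labels in party_labels.items():
        hits = sum(labels.get(q) == tag for q, tag in buried_points.items())
        accuracy = hits / len(buried_points)
        if accuracy >= min_accuracy:       # remove substandard sources
            kept[party] = labels
    return kept        # step 373 then screens consensus among these parties
```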
In an exemplary embodiment, as shown in fig. 11, the method for generating a corpus annotation set provided in the present invention further includes:
in step 1101, according to the labeling results of multiple parties to the same query statement, the query statements with inconsistent labeling results are screened out from the corpus to be labeled;
it should be noted that the boundary samples greatly help model optimization and describe clearer classification boundaries. The boundary samples can be screened from samples with different directions for use. Specifically, the server can collectively screen out query sentences with inconsistent labeling results from multiple parties from the corpus to be labeled according to the labeling results of the multiple parties to the same query sentence.
In step 1102, obtaining multi-label query statements from the query statements with inconsistent labeling results, and obtaining boundary sample points for optimizing the data analysis model.
From the query sentences with inconsistent labeling results, multi-label query sentences (i.e., query sentences with several valid labeling results) can be obtained after review by an auditor; these can be regarded as boundary sample points. Such query sentences are hard to recognize, so if the model can accurately identify their intentions, slots, and so on, its accuracy improves greatly. The data analysis model may be an intention recognition model, a named entity recognition model, a slot labeling model, a word segmentation model, and the like; optimizing it with these query sentences improves its recognition accuracy.
For example, "bad mood today, please give me a relaxing song. The intention of the query sentence includes chatting intention and music on demand intention, and the sample belongs to a boundary sample of intention classification and can help the model to train an accurate intention boundary.
The following is an embodiment of the apparatus of the present invention, which can be used to execute an embodiment of the method for generating the corpus annotation set executed by the server 110 according to the present invention. For details that are not disclosed in the embodiments of the present invention, please refer to the embodiments of the method for generating corpus annotation sets of the present invention.
Fig. 12 is a block diagram illustrating a corpus annotation set generation apparatus according to an exemplary embodiment, which may be used in the server 110 in the implementation environment shown in fig. 1 to perform all or part of the steps of the corpus annotation set generation method shown in any one of fig. 3 and 7 to 11. As shown in fig. 12, the apparatus includes, but is not limited to: a log obtaining module 1210, a corpus obtaining module 1230, a result obtaining module 1250, a statement screening module 1270, and an annotation set generating module 1290.
A log obtaining module 1210, configured to obtain a query log; the query log comprises query statements;
a corpus obtaining module 1230, configured to extract query statements to be labeled from the query log to obtain a corpus to be labeled;
a result obtaining module 1250, configured to obtain a labeling result of the query statement in the corpus to be labeled from multiple parties;
the statement screening module 1270 is used for screening query statements with similar marking results from the corpus to be marked according to the marking results of multiple parties on the same query statement;
and a labeling set generating module 1290, configured to generate a corpus labeling set from the query sentences with similar labeling results and their corresponding labeling results.
The implementation process of the function and action of each module in the apparatus is detailed in the implementation process of the corresponding step in the method for generating the corpus tagging set, and is not described in detail here.
The log obtaining module 1210 can be, for example, one of the physical structures of the wired or wireless network interface 250 in fig. 2.
The corpus acquiring module 1230, the result acquiring module 1250, the sentence screening module 1270, and the annotation set generating module 1290 may also be functional modules, which are configured to execute corresponding steps in the method for generating a corpus annotation set. It is understood that these modules may be implemented in hardware, software, or a combination of both. When implemented in hardware, these modules may be implemented as one or more hardware modules, such as one or more application specific integrated circuits. When implemented in software, the modules may be implemented as one or more computer programs executing on one or more processors, such as programs stored in memory 232 for execution by central processor 222 of FIG. 2.
In an exemplary embodiment, as shown in fig. 13, the corpus obtaining module 1230 includes:
a statement removing unit 1231, configured to remove query statements that do not satisfy a preset condition in the query log;
a tag prediction unit 1232, configured to input the rest query statements in the query log into the constructed multiple tag prediction models, and output tag prediction results of the multiple tag prediction models for the same query statement; the label prediction models are obtained by training by adopting different training sample sets;
and a statement extracting unit 1233, configured to screen, according to the tag prediction results of the multiple tag prediction models for the same query statement, query statements with inconsistent tag prediction results from the remaining query statements, so as to obtain the corpus to be labeled.
In an exemplary embodiment, the statement removal unit 1231 includes:
and the classification removal subunit is used for classifying the query sentences recorded in the query log through the constructed classifier and removing the classified meaningless query sentences.
In an exemplary embodiment, the statement removal unit 1231 further includes:
and the first removal subunit is used for removing the labeled query sentences and the query sentences similar to the labeled query sentences in the query log according to the labeled query sentence set.
In an exemplary embodiment, the statement removal unit 1231 further includes:
and the second removal subunit is used for removing the query sentences only containing single entity words, the query sentences with the sentence length larger than the preset character number or repeated query sentences in the query log.
In an exemplary embodiment, as shown in fig. 14, the result obtaining module 1250 includes:
the task dispatching unit 1251 is configured to dispatch, to multiple parties, a labeling task for the corpus to be labeled, where the dispatch of the labeling task triggers the multiple parties to execute the labeling task in parallel;
and the result receiving unit 1252 is configured to receive an annotation result returned by the multi-party executing the annotation task in parallel.
The dispatch of the labeling task triggers the multiple parties to execute the labeling task in parallel, which includes:
the dispatch of the labeling task triggers the multiple parties to input the corpus to be labeled into their own configured labeling models in parallel and to output their respective labeling results; the labeling models configured by the multiple parties are trained on different training sample sets.
In an exemplary embodiment, the corpus set to be labeled includes a plurality of buried-point sentences with known label information; as shown in FIG. 15, the sentence screening module 1270 includes:
an accuracy calculation unit 1271, configured to compare, according to the multiple parties' labeling results for the buried-point sentences, whether each party's labeling results are consistent with the corresponding label information, and to calculate the accuracy of each party's labeling results;
a source rejection unit 1272, configured to reject, from the multiple sources and according to the accuracy of their labeling results, any labeling result source whose accuracy does not meet the standard;
and a sentence screening unit 1273, configured to screen out, from the corpus set to be labeled and according to the labeling results of the remaining sources, the query sentences whose labeling results from those sources are similar.
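The buried-point check works like gold-standard questions in crowdsourcing: each source is scored against the sentences with known labels before its other answers are trusted. The sketch below assumes dict-shaped inputs and an illustrative 0.8 accuracy bar, and simplifies "similar" to exact agreement; all three are assumptions of this example.

```python
# Sketch: score each party on the buried-point sentences, drop low-accuracy
# sources, then keep queries on which the surviving sources agree.
def screen_with_buried_points(results, gold, corpus, min_accuracy=0.8):
    # results: {party: {query: label}}; gold: {buried_query: true_label}
    def accuracy(labels):
        return sum(labels.get(q) == t for q, t in gold.items()) / len(gold)

    trusted = {p: r for p, r in results.items() if accuracy(r) >= min_accuracy}

    consistent = []
    for query in corpus:
        labels = {r[query] for r in trusted.values() if query in r}
        if len(labels) == 1:  # remaining sources gave the same label
            consistent.append((query, labels.pop()))
    return consistent
```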
Optionally, the present invention further provides an electronic device, which may be used in the server 110 of the implementation environment shown in FIG. 1 to execute all or part of the steps of the corpus annotation set generation method shown in any one of FIG. 3 and FIG. 7 to FIG. 11. The electronic device includes:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to execute the corpus annotation set generation method according to the above exemplary embodiment.
The specific manner in which the processor of the electronic device performs operations in this embodiment has been described in detail in the embodiments of the corpus annotation set generation method and will not be repeated here.
In an exemplary embodiment, a storage medium is also provided. The storage medium is a computer-readable storage medium, for example a transitory or non-transitory computer-readable storage medium including instructions. The storage medium stores a computer program, which can be executed by the central processing unit 222 of the server 200 to implement the corpus annotation set generation method.
It will be understood that the invention is not limited to the precise arrangements that have been described above and shown in the drawings, and that various modifications and changes may be made without departing from the scope thereof. The scope of the invention is limited only by the appended claims.

Claims (16)

1. A method for generating a corpus annotation set, characterized by comprising the following steps:
acquiring a query log, the query log comprising query sentences;
extracting query sentences to be labeled from the query log to obtain a corpus set to be labeled, the corpus set to be labeled comprising a plurality of buried-point sentences with known label information;
acquiring labeling results from multiple parties for the query sentences in the corpus set to be labeled, the labeling results being classification labels added by the multiple parties to the query sentences in the corpus set to be labeled;
comparing, according to the multiple parties' labeling results for the buried-point sentences, whether each party's labeling results are consistent with the corresponding label information, and calculating the accuracy of each party's labeling results;
removing, from the multiple sources and according to the accuracy of their labeling results, any labeling result source whose accuracy does not meet the standard;
screening out, from the corpus set to be labeled and according to the labeling results of the remaining sources, query sentences whose labeling results from those sources are similar;
and generating a corpus annotation set from the query sentences with similar labeling results and the corresponding labeling results.
2. The method according to claim 1, wherein the extracting of query sentences to be labeled from the query log to obtain a corpus set to be labeled comprises:
removing query sentences in the query log that do not satisfy a preset condition;
inputting the remaining query sentences in the query log into a plurality of constructed label prediction models, and outputting the label prediction results of the plurality of label prediction models for the same query sentence, the label prediction models being trained on different training sample sets;
and screening out, from the remaining query sentences and according to the label prediction results of the plurality of label prediction models for the same query sentence, the query sentences whose label prediction results are inconsistent, to obtain the corpus set to be labeled.
3. The method of claim 2, wherein the removing of query sentences in the query log that do not satisfy the preset condition comprises:
classifying the query sentences recorded in the query log through a constructed classifier, and removing the query sentences classified as meaningless.
4. The method of claim 2, wherein the removing of query sentences in the query log that do not satisfy the preset condition comprises:
removing, according to a set of labeled query sentences, the labeled query sentences in the query log and the query sentences similar to them.
5. The method of claim 2, wherein the removing of query sentences in the query log that do not satisfy the preset condition comprises:
removing, from the query log, query sentences containing only a single entity word, query sentences whose length exceeds a preset number of characters, or repeated query sentences.
6. The method according to claim 1, wherein the acquiring of labeling results from multiple parties for the query sentences in the corpus set to be labeled comprises:
dispatching a labeling task for the corpus set to be labeled to multiple parties, the dispatch of the labeling task triggering the multiple parties to execute the labeling task in parallel;
and receiving the labeling results returned by the multiple parties executing the labeling task in parallel.
7. The method of claim 6, wherein the dispatch of the labeling task triggering the multiple parties to execute the labeling task in parallel comprises:
the dispatch of the labeling task triggering each party to input, in parallel, the corpus set to be labeled into its own configured labeling model and to output its labeling results for the corpus set to be labeled, the labeling models configured by the multiple parties being trained on different training sample sets.
8. A corpus annotation set generation device, characterized by comprising:
a log acquisition module, configured to acquire a query log, the query log comprising query sentences;
a corpus acquisition module, configured to extract query sentences to be labeled from the query log to obtain a corpus set to be labeled;
a result acquisition module, configured to acquire labeling results from multiple parties for the query sentences in the corpus set to be labeled;
a sentence screening module, configured to screen out, from the corpus set to be labeled and according to the multiple parties' labeling results for the same query sentence, query sentences with similar labeling results;
an annotation set generation module, configured to generate a corpus annotation set from the query sentences with similar labeling results and the corresponding labeling results;
wherein the corpus set to be labeled comprises a plurality of buried-point sentences with known label information, and the sentence screening module comprises:
an accuracy calculation unit, configured to compare, according to the multiple parties' labeling results for the buried-point sentences, whether each party's labeling results are consistent with the corresponding label information, and to calculate the accuracy of each party's labeling results;
a source rejection unit, configured to reject, from the multiple sources and according to the accuracy of their labeling results, any labeling result source whose accuracy does not meet the standard;
and a sentence screening unit, configured to screen out, from the corpus set to be labeled and according to the labeling results of the remaining sources, the query sentences whose labeling results from those sources are similar.
9. The apparatus of claim 8, wherein the corpus acquisition module comprises:
a sentence removal unit, configured to remove query sentences in the query log that do not satisfy a preset condition;
a label prediction unit, configured to input the remaining query sentences in the query log into a plurality of constructed label prediction models and to output the label prediction results of the plurality of label prediction models for the same query sentence, the label prediction models being trained on different training sample sets;
and a sentence extraction unit, configured to screen out, from the remaining query sentences and according to the label prediction results of the plurality of label prediction models for the same query sentence, the query sentences whose label prediction results are inconsistent, to obtain the corpus set to be labeled.
10. The apparatus according to claim 9, wherein the sentence removal unit comprises:
a classification removal subunit, configured to classify the query sentences recorded in the query log through a constructed classifier and to remove the query sentences classified as meaningless.
11. The apparatus according to claim 9, wherein the sentence removal unit comprises:
a first removal subunit, configured to remove, according to a set of labeled query sentences, the labeled query sentences in the query log and the query sentences similar to them.
12. The apparatus according to claim 9, wherein the sentence removal unit comprises:
a second removal subunit, configured to remove, from the query log, query sentences containing only a single entity word, query sentences whose length exceeds a preset number of characters, or repeated query sentences.
13. The apparatus of claim 8, wherein the result obtaining module comprises:
a task dispatch unit, configured to dispatch a labeling task for the corpus set to be labeled to multiple parties, the dispatch of the labeling task triggering the multiple parties to execute the labeling task in parallel;
and a result receiving unit, configured to receive the labeling results returned by the multiple parties executing the labeling task in parallel.
14. The apparatus according to claim 13, wherein the task dispatch unit is specifically configured to:
dispatch the labeling task, the dispatch triggering each party to input, in parallel, the corpus set to be labeled into its own configured labeling model and to output its labeling results for the corpus set to be labeled, the labeling models configured by the multiple parties being trained on different training sample sets.
15. An electronic device, characterized in that the electronic device comprises:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to perform the method for generating corpus annotation sets according to any one of claims 1 to 7.
16. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program executable by a processor to perform the corpus annotation set generation method according to any one of claims 1 to 7.
CN201811048957.8A 2018-09-10 2018-09-10 Corpus annotation set generation method and device, electronic equipment and storage medium Active CN110209764B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201811048957.8A CN110209764B (en) 2018-09-10 2018-09-10 Corpus annotation set generation method and device, electronic equipment and storage medium
PCT/CN2019/100823 WO2020052405A1 (en) 2018-09-10 2019-08-15 Corpus annotation set generation method and apparatus, electronic device, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811048957.8A CN110209764B (en) 2018-09-10 2018-09-10 Corpus annotation set generation method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN110209764A CN110209764A (en) 2019-09-06
CN110209764B true CN110209764B (en) 2023-04-07

Family

ID=67779909

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811048957.8A Active CN110209764B (en) 2018-09-10 2018-09-10 Corpus annotation set generation method and device, electronic equipment and storage medium

Country Status (2)

Country Link
CN (1) CN110209764B (en)
WO (1) WO2020052405A1 (en)

Families Citing this family (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110675862A (en) * 2019-09-25 2020-01-10 招商局金融科技有限公司 Corpus acquisition method, electronic device and storage medium
CN111177412B (en) * 2019-12-30 2023-03-31 成都信息工程大学 Public logo bilingual parallel corpus system
CN111179904B (en) * 2019-12-31 2022-12-09 出门问问创新科技有限公司 Mixed text-to-speech conversion method and device, terminal and computer readable storage medium
CN111160044A (en) * 2019-12-31 2020-05-15 出门问问信息科技有限公司 Text-to-speech conversion method and device, terminal and computer readable storage medium
CN111259134B (en) * 2020-01-19 2023-08-08 出门问问信息科技有限公司 Entity identification method, equipment and computer readable storage medium
CN113642329A (en) * 2020-04-27 2021-11-12 阿里巴巴集团控股有限公司 Method and device for establishing term recognition model and method and device for recognizing terms
CN114025216B (en) * 2020-04-30 2023-11-17 网易(杭州)网络有限公司 Media material processing method, device, server and storage medium
CN111629267B (en) * 2020-04-30 2023-06-09 腾讯科技(深圳)有限公司 Audio labeling method, device, equipment and computer readable storage medium
CN111611797B (en) * 2020-05-22 2023-09-12 云知声智能科技股份有限公司 Method, device and equipment for marking prediction data based on Albert model
CN113743117B (en) * 2020-05-29 2024-04-09 华为技术有限公司 Method and device for entity labeling
CN111651988B (en) * 2020-06-03 2023-05-19 北京百度网讯科技有限公司 Method, apparatus, device and storage medium for training model
CN111785272B (en) * 2020-06-16 2021-06-11 杭州云嘉云计算有限公司 Online labeling method and system
CN112052356B (en) * 2020-08-14 2023-11-24 腾讯科技(深圳)有限公司 Multimedia classification method, apparatus and computer readable storage medium
CN112163424A (en) * 2020-09-17 2021-01-01 中国建设银行股份有限公司 Data labeling method, device, equipment and medium
CN113407713B (en) * 2020-10-22 2024-04-05 腾讯科技(深圳)有限公司 Corpus mining method and device based on active learning and electronic equipment
CN112541070B (en) * 2020-12-25 2024-03-22 北京百度网讯科技有限公司 Mining method and device for slot updating corpus, electronic equipment and storage medium
CN112700763B (en) * 2020-12-26 2024-04-16 中国科学技术大学 Voice annotation quality evaluation method, device, equipment and storage medium
CN113255879B (en) * 2021-01-13 2024-05-24 深延科技(北京)有限公司 Deep learning labeling method, system, computer equipment and storage medium
CN112925910A (en) * 2021-02-25 2021-06-08 中国平安人寿保险股份有限公司 Method, device and equipment for assisting corpus labeling and computer storage medium
CN113569546A (en) * 2021-06-16 2021-10-29 上海淇玥信息技术有限公司 Intention labeling method and device and electronic equipment
CN113722289A (en) * 2021-08-09 2021-11-30 杭萧钢构股份有限公司 Method, device, electronic equipment and medium for constructing data service
CN114757267B (en) * 2022-03-25 2024-06-21 北京爱奇艺科技有限公司 Method, device, electronic equipment and readable storage medium for identifying noise query

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102541838A (en) * 2010-12-24 2012-07-04 日电(中国)有限公司 Method and equipment for optimizing emotional classifier
CN105389340A (en) * 2015-10-20 2016-03-09 北京云知声信息技术有限公司 Information testing method and device
CN106021461A (en) * 2016-05-17 2016-10-12 深圳市中润四方信息技术有限公司 Text classification method and text classification system
CN106372132A (en) * 2016-08-25 2017-02-01 北京百度网讯科技有限公司 Artificial intelligence-based query intention prediction method and apparatus

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20020045343A (en) * 2000-12-08 2002-06-19 오길록 Method of information generation and retrieval system based on a standardized Representation format of sentences structures and meanings
CN100507918C (en) * 2007-04-20 2009-07-01 清华大学 Automatic positioning method of network key resource page
CN103136210A (en) * 2011-11-23 2013-06-05 北京百度网讯科技有限公司 Method and device for mining query with similar requirements
CN103530282B (en) * 2013-10-23 2016-07-13 北京紫冬锐意语音科技有限公司 Corpus labeling method and equipment
CN105912724B (en) * 2016-05-10 2019-04-26 黑龙江工程学院 A kind of time-based microblogging file extent method towards microblogging retrieval
CN106202177B (en) * 2016-06-27 2017-12-15 腾讯科技(深圳)有限公司 A kind of file classification method and device
US10614043B2 (en) * 2016-09-30 2020-04-07 Adobe Inc. Document replication based on distributional semantics
US11176188B2 (en) * 2017-01-11 2021-11-16 Siemens Healthcare Gmbh Visualization framework based on document representation learning
CN107256267B (en) * 2017-06-19 2020-07-24 北京百度网讯科技有限公司 Query method and device
CN108334496B (en) * 2018-01-30 2020-06-12 中国科学院自动化研究所 Man-machine conversation understanding method and system for specific field and related equipment

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102541838A (en) * 2010-12-24 2012-07-04 日电(中国)有限公司 Method and equipment for optimizing emotional classifier
CN105389340A (en) * 2015-10-20 2016-03-09 北京云知声信息技术有限公司 Information testing method and device
CN106021461A (en) * 2016-05-17 2016-10-12 深圳市中润四方信息技术有限公司 Text classification method and text classification system
CN106372132A (en) * 2016-08-25 2017-02-01 北京百度网讯科技有限公司 Artificial intelligence-based query intention prediction method and apparatus

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Xiaodong Zhang et al. "A joint model of intent determination and slot filling for spoken language understanding." Twenty-Fifth International Joint Conference on Artificial Intelligence, 2016, pp. 2993-2995. *

Also Published As

Publication number Publication date
WO2020052405A1 (en) 2020-03-19
CN110209764A (en) 2019-09-06

Similar Documents

Publication Publication Date Title
CN110209764B (en) Corpus annotation set generation method and device, electronic equipment and storage medium
US20230385704A1 (en) Systems and method for performing contextual classification using supervised and unsupervised training
Merten et al. Software feature request detection in issue tracking systems
CN109872162B (en) Wind control classification and identification method and system for processing user complaint information
CN111291570A (en) Method and device for realizing element identification in judicial documents
CN110968695A (en) Intelligent labeling method, device and platform based on active learning of weak supervision technology
WO2020237872A1 (en) Method and apparatus for testing accuracy of semantic analysis model, storage medium, and device
CN108549723B (en) Text concept classification method and device and server
CN105653547B (en) Method and device for extracting text keywords
Aisopos et al. Using n-gram graphs for sentiment analysis: an extended study on Twitter
CN110807086A (en) Text data labeling method and device, storage medium and electronic equipment
CN110889275A (en) Information extraction method based on deep semantic understanding
CN111143556A (en) Software function point automatic counting method, device, medium and electronic equipment
CN111190946A (en) Report generation method and device, computer equipment and storage medium
CN110910175A (en) Tourist ticket product portrait generation method
Defersha et al. Detection of hate speech text in afan oromo social media using machine learning approach
KR102280490B1 (en) Training data construction method for automatically generating training data for artificial intelligence model for counseling intention classification
CN110019556B (en) Topic news acquisition method, device and equipment thereof
CN113591489A (en) Voice interaction method and device and related equipment
CN110852082B (en) Synonym determination method and device
CN113052544A (en) Method and device for intelligently adapting workflow according to user behavior and storage medium
CN114842982B (en) Knowledge expression method, device and system for medical information system
Heidari et al. Financial footnote analysis: developing a text mining approach
Eken et al. Predicting defects with latent and semantic features from commit logs in an industrial setting
CN115329754A (en) Text theme extraction method, device and equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant