CN112395873B - Method and device for generating white character labeling model and electronic equipment
- Publication number: CN112395873B (application number CN202011104779.3A)
- Authority: CN (China)
- Prior art keywords: role, statement, training, confidence, labeling
- Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G—PHYSICS; G06—COMPUTING; CALCULATING OR COUNTING; G06F—ELECTRIC DIGITAL DATA PROCESSING; G06F40/00—Handling natural language data; G06F40/20—Natural language analysis; G06F40/279—Recognition of textual entities
- G—PHYSICS; G06—COMPUTING; CALCULATING OR COUNTING; G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS; G06N20/00—Machine learning
Abstract
The application discloses a method and a device for generating a white character labeling model, and an electronic device, relating to artificial intelligence fields such as deep learning, natural language processing and speech. The specific implementation scheme is as follows: acquiring an initial data set comprising a text to be labeled and a test set and a training set corresponding to the text to be labeled; training an initial role labeling model based on the training set to generate a first labeling model; testing the first labeling model based on the test set, and extracting incremental data from the text to be labeled when the test accuracy is smaller than a first threshold; and expanding the training set with the labeling data corresponding to the incremental data, and continuing to train the first labeling model with the expanded training set until the test accuracy of the labeling model generated by training is greater than or equal to the first threshold. The method for generating a white character labeling model thus greatly reduces the scale of labeling data required, lowering the labor cost and time cost of training the white character labeling model.
Description
Technical Field
The application relates to the field of computer technology, in particular to artificial intelligence fields such as deep learning, natural language processing and speech, and provides a method and a device for generating a white character labeling model, and an electronic device.
Background
With the development of artificial intelligence (AI) technology, role labeling of white text (that is, dialogue text) is being applied more and more widely; for example, AI multi-role reading technology has been applied to multi-role audiobook novels. Such applications need to identify the characters in the white text accurately and quickly.
In the related art, role labeling of white text is generally realized by a role labeling model based on deep learning. Training such a role labeling model requires annotators to label a large amount of randomly selected data, after which the labeled data is used for modeling and learning to generate the role labeling model. However, generating a high-accuracy role labeling model often requires labeling data at a very large scale (on the order of millions to tens of millions of items), which increases the labor cost and time cost of training a white character labeling model and limits the large-scale application of white character labeling.
Disclosure of Invention
A method, an apparatus, an electronic device, a storage medium, and a computer program product for generating a white character labeling model are provided.
According to an aspect of the present application, a method for generating a white character labeling model is provided, including: acquiring an initial data set, where the initial data set includes a text to be labeled and a test set and a training set corresponding to the text to be labeled; training an initial role labeling model based on the training set to generate a first labeling model; testing the first labeling model based on the test set, and extracting incremental data from the text to be labeled when the test accuracy of the first labeling model is smaller than a first threshold; and expanding the training set with the labeling data corresponding to the incremental data, and continuing to train the first labeling model with the expanded training set until the test accuracy of the labeling model generated by training is greater than or equal to the first threshold.
According to another aspect of the present application, an apparatus for generating a white character labeling model is provided, including: an acquisition module configured to acquire an initial data set, where the initial data set includes a text to be labeled and a test set and a training set corresponding to the text to be labeled; a first training module configured to train an initial role labeling model based on the training set to generate a first labeling model; a test module configured to test the first labeling model based on the test set and to extract incremental data from the text to be labeled when the test accuracy of the first labeling model is smaller than a first threshold; and a second training module configured to expand the training set with the labeling data corresponding to the incremental data, so as to continue training the first labeling model with the expanded training set until the test accuracy of the labeling model generated by training is greater than or equal to the first threshold.
According to still another aspect of the present application, an electronic device is provided, including: at least one processor; and a memory communicatively coupled to the at least one processor; where the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method for generating a white character labeling model described above.
According to yet another aspect of the present application, a non-transitory computer-readable storage medium is provided, storing computer instructions for causing a computer to perform the method for generating a white character labeling model described above.
According to yet another aspect of the present application, a computer program product is provided, comprising a computer program which, when executed by a processor, implements the method for generating a white character labeling model described above.
The technical solution of the present application solves the problem in the related art that generating a high-accuracy role labeling model often requires very large-scale labeling data, which increases the labor cost and time cost of training a white character labeling model and limits the large-scale application of white character labeling. An initial role labeling model is trained on a training set containing a small amount of training data to generate a first labeling model; the first labeling model is tested on the test set; when its test accuracy does not meet the requirement, a small amount of incremental data that can best improve model performance is extracted from the text to be labeled in the initial data set; the training set is then expanded with the labeling data corresponding to the incremental data, and the first labeling model continues to be trained on the expanded training set until the test accuracy of the labeling model generated by training meets the requirement. In this iterative training mode, after each training round a small amount of incremental data that can best improve model performance is selected from the data to be labeled for manual labeling to expand the training set, which greatly reduces the scale of labeling data, lowers the labor cost and time cost of training the white character labeling model, and promotes the large-scale application of white character labeling.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present application, nor do they limit the scope of the present application. Other features of the present application will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not intended to limit the present application. Wherein:
fig. 1 is a schematic flowchart of a method for generating a white-character labeling model according to an embodiment of the present application;
fig. 2 is a schematic flowchart of another method for generating a white-character labeling model according to an embodiment of the present application;
fig. 3 is a schematic flowchart of yet another method for generating a white-character labeling model according to an embodiment of the present application;
fig. 4 is a schematic structural diagram of a device for generating a white character tagging model according to an embodiment of the present application;
fig. 5 is a block diagram of an electronic device for implementing a method for generating a white-character labeling model according to an embodiment of the present application.
Detailed Description
The following describes exemplary embodiments of the present application with reference to the accompanying drawings, including various details of the embodiments to aid understanding; these details are to be considered exemplary only. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of the present application. Descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
The following briefly describes the technical field to which the solution of the present application relates:
Artificial intelligence is the discipline that studies how to make computers simulate certain human thought processes and intelligent behaviors (such as learning, reasoning, thinking and planning); it involves technologies at both the hardware level and the software level. Artificial intelligence technologies generally include computer vision, speech recognition, natural language processing, machine learning/deep learning, big data processing, knowledge graph technologies, and the like.
Deep learning is a new research direction in the field of machine learning; it was introduced into machine learning to bring machine learning closer to its original goal, artificial intelligence. Deep learning learns the intrinsic laws and representation levels of sample data, and the information obtained in the learning process is very helpful for interpreting data such as text, images and sound. Its ultimate aim is to enable machines to have human-like analysis and learning abilities and to recognize data such as text, images and sound. Deep learning has achieved many results in search technology, data mining, machine learning, machine translation, natural language processing, multimedia learning, speech, recommendation and personalization technologies, and other related fields.
Natural language processing is an important direction in the fields of computer science and artificial intelligence. It studies theories and methods that enable effective communication between humans and computers in natural language, and it is a science integrating linguistics, computer science and mathematics. Research in this field therefore involves natural language, the language people use every day, so it is closely related to linguistics, though with important differences. Natural language processing does not study natural language in general; rather, it develops computer systems, particularly the software systems within them, that can effectively realize natural language communication. It is thus a part of computer science.
Aiming at the problem in the related art that generating a high-accuracy role labeling model often requires extremely large-scale labeling data, which increases the labor cost and time cost of white character labeling and limits its large-scale application, the embodiments of the present application provide a method for generating a white character labeling model.
The following describes a method, an apparatus, an electronic device, a storage medium, and a computer program product for generating a white-character labeling model provided in the present application in detail with reference to the accompanying drawings.
Fig. 1 is a schematic flowchart of a method for generating a white-character labeling model according to an embodiment of the present application.
As shown in fig. 1, the method for generating a white character labeling model includes the following steps:
It should be noted that the method for generating a white character labeling model according to the embodiments of the present application may be executed by the apparatus for generating a white character labeling model according to the embodiments of the present application, and this apparatus may be configured in any electronic device to execute the method.
Step 101, acquiring an initial data set, where the initial data set includes a text to be labeled and a test set and a training set corresponding to the text to be labeled.
The text to be labeled may include, but is not limited to, novels, news, scripts and other literary works containing a large amount of dialogue.
The test set is a set containing a plurality of pieces of test data which are extracted from the text to be labeled, subjected to manual labeling and used for testing the initial role labeling model.
The training set is a set containing a plurality of pieces of training data which are extracted from the text to be labeled, are labeled manually and are used for training the initial role labeling model.
In the embodiments of the present application, articles and books containing a large amount of dialogue (including but not limited to novels, news, scripts and other literary works) can be used as the text to be labeled. The dialogue sentences are extracted from the text to be labeled; a small number of them (for example, 500) are selected as training sentences and labeled manually, and the selected training sentences and their corresponding labeling data form the training set; a small number of dialogue sentences are likewise selected as test sentences and labeled manually, and the selected test sentences and their corresponding labeling data form the test set. The text to be labeled, the test set and the training set then form the initial data set.
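As a minimal sketch of this data-set construction, the following Python snippet splits a list of extracted dialogue sentences into a training set, a test set and a pool of sentences left to be labeled. The 500-sentence training size comes from the example above; the test size, seed and function name are illustrative assumptions.

```python
import random

def build_initial_dataset(dialogue_sentences, n_train=500, n_test=200, seed=42):
    """Split extracted dialogue sentences into a small training set, a small
    test set, and a large pool of sentences still to be labeled. n_test and
    seed are assumptions; the source only says "a small number"."""
    sentences = list(dialogue_sentences)
    random.Random(seed).shuffle(sentences)
    train = sentences[:n_train]                 # to be labeled manually
    test = sentences[n_train:n_train + n_test]  # to be labeled manually
    pool = sentences[n_train + n_test:]         # remaining text to be labeled
    return train, test, pool
```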
Step 102, training the initial role labeling model based on the training set to generate a first labeling model.
The initial role labeling model can be a pre-constructed deep learning model with natural language understanding capability and label classification capability. In actual use, the specific structure of the initial role labeling model can be designed according to actual needs and specific application scenarios, which is not limited in the embodiments of the present application.
In the embodiments of the present application, each training sentence in the training set may be input into the initial role labeling model to perform role labeling on it and generate the labeled role corresponding to each training sentence. A loss value is then determined according to the difference between the labeled role corresponding to each training sentence and the labeling data; when the loss value is greater than a loss-value threshold, the parameters of the initial role labeling model are updated, and the training process is repeated with the updated model until the generated loss value is less than or equal to the loss-value threshold. The current training round is then complete, and the role labeling model produced at the end of this round is determined as the first labeling model.
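The following PyTorch-style sketch shows one such training round under stated assumptions: `model` maps encoded sentences to per-role scores, `encode` is a hypothetical helper that turns raw sentences into model inputs, and the loss threshold and epoch cap are illustrative values not given in the source.

```python
import torch
from torch import nn

def train_round(model, optimizer, sentences, labels, encode,
                loss_threshold=0.1, max_epochs=100):
    """One training round: label the training sentences, compare against the
    manual labels, and update parameters until the loss reaches the threshold
    (the epoch cap guards against non-convergence; both values are assumptions)."""
    criterion = nn.CrossEntropyLoss()
    for _ in range(max_epochs):
        optimizer.zero_grad()
        logits = model(encode(sentences))  # role scores for each sentence
        loss = criterion(logits, labels)
        if loss.item() <= loss_threshold:
            break  # this training round is complete
        loss.backward()
        optimizer.step()
    return model
```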
Step 103, testing the first labeling model based on the test set, and extracting incremental data from the text to be labeled when the test accuracy of the first labeling model is smaller than a first threshold.
In the embodiments of the present application, after the first labeling model is generated, it may be tested with the test set to determine whether its performance meets the requirement, and whether training needs to continue is then decided according to the test result.
Specifically, each test sentence in the test set may be input into the first labeling model to perform role labeling on it and generate the labeled role corresponding to each test sentence; the test accuracy of the first labeling model is then determined according to the difference between the labeled role corresponding to each test sentence and the labeling data. When the test accuracy of the first labeling model is greater than or equal to the first threshold, the first labeling model is determined to meet the performance requirement, so the training process ends and the first labeling model is determined as the generated white character labeling model.
If the test accuracy of the first labeling model is smaller than the first threshold, the first labeling model is determined not to meet the performance requirement, so the training process continues: a small number (for example, 1,000) of unlabeled sentences can be extracted from the text to be labeled in the initial data set as incremental data, and the incremental data is used to further train the first labeling model.
As a possible implementation, when determining the test accuracy of the first labeling model, the ratio of the number of test sentences whose labeled role matches the labeling data to the total number of test sentences in the test set may be determined as the test accuracy of the first labeling model.
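Continuing the sketch above, this accuracy is a one-liner (again assuming the hypothetical `encode` helper and tensor-valued labels):

```python
import torch

def test_accuracy(model, encode, sentences, labels):
    """Accuracy = number of test sentences whose predicted role matches the
    manual label, divided by the total number of test sentences."""
    with torch.no_grad():
        predicted = model(encode(sentences)).argmax(dim=-1)
    return (predicted == labels).float().mean().item()
```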
In the embodiments of the present application, when the test accuracy of the first labeling model is determined to be smaller than the first threshold, a preset number of unlabeled sentences can be selected at random from the text to be labeled as the incremental data; alternatively, the first labeling model may be used to predict each unlabeled sentence in the text to be labeled, and a preset number of sentences with the poorest prediction results selected as the incremental data according to those predictions. The embodiments of the present application do not limit this choice.
It should be noted that, in actual use, the manner of determining the test accuracy of the first labeling model, the specific value of the first threshold, and the amount of incremental data extracted each time may be preset according to actual needs and specific application scenarios, which is not limited in the embodiments of the present application.
Step 104, expanding the training set with the labeling data corresponding to the incremental data, and continuing to train the first labeling model with the expanded training set until the test accuracy of the labeling model generated by training is greater than or equal to the first threshold.
In the embodiments of the present application, after the incremental data is extracted from the text to be labeled, it can be labeled by annotators to generate the corresponding labeling data, and the incremental data together with its labeling data is added to the training set to expand it. Each training sentence in the expanded training set can then be input into the first labeling model to perform role labeling and generate the labeled role corresponding to each training sentence; a loss value is determined according to the difference between the labeled role corresponding to each training sentence and the labeling data; when the loss value is greater than the loss-value threshold, the parameters of the first labeling model are updated and the training process is repeated with the updated model until the generated loss value is less than or equal to the loss-value threshold, completing the current round of training.
After the current round of training ends, the updated first labeling model is tested according to the method in step 103. When the test accuracy of the updated first labeling model is smaller than the first threshold, incremental data continues to be extracted from the text to be labeled and, together with its labeling data, is used to expand the training set; the training process is then repeated with the expanded training set, and the extraction of incremental data continues, until the test accuracy of the labeling model generated by training is greater than or equal to the first threshold. The labeling model generated by training then meets the performance requirement, and the training process of the labeling model is complete.
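Tying the pieces together, the whole method is a small active-learning-style loop. This sketch reuses `train_round` and `test_accuracy` from above and a selection helper such as the lowest-confidence `select_increment` sketched under the fig. 2 embodiment below; `request_labels` stands in for the manual annotation step, and the 0.9 accuracy threshold is an assumed value for the unspecified first threshold.

```python
def generate_labeling_model(model, optimizer, encode, pool,
                            train_sents, train_labels, test_sents, test_labels,
                            request_labels, acc_threshold=0.9, n_increment=1000):
    """Iterative loop of steps 101-104: train, test, expand, repeat."""
    while True:
        model = train_round(model, optimizer, train_sents, train_labels, encode)
        if test_accuracy(model, encode, test_sents, test_labels) >= acc_threshold:
            return model  # performance requirement met
        increment = select_increment(model, encode, pool, n_increment)
        train_sents = train_sents + increment                     # expand training set
        train_labels = torch.cat([train_labels, request_labels(increment)])
        pool = [s for s in pool if s not in set(increment)]
```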
According to the technical solution of this embodiment, an initial role labeling model is trained on a training set containing a small amount of training data to generate a first labeling model, and the first labeling model is tested on the test set, so that when its test accuracy does not meet the requirement, a small amount of incremental data that can best improve model performance can be extracted from the text to be labeled in the initial data set; the training set is then expanded with the labeling data corresponding to the incremental data, and the first labeling model continues to be trained on the expanded training set until the test accuracy of the labeling model generated by training meets the requirement. In this iterative training mode, after each training round a small amount of incremental data that can best improve model performance is selected from the data to be labeled for manual labeling to expand the training set, which greatly reduces the scale of labeling data, lowers the labor cost and time cost of training the white character labeling model, and promotes the large-scale application of white character labeling.
In a possible implementation of the method, when the first labeling model performs role labeling, it can output both the labeled role and the confidence corresponding to the labeled role, so that high-quality incremental data can be selected to expand the training set according to the labeling confidences over the sentences to be labeled in the text. This allows the white character labeling model to converge as soon as possible and further improves training efficiency.
The method for generating a white-character labeling model provided in the embodiment of the present application is further described below with reference to fig. 2.
Fig. 2 is a schematic flowchart of another method for generating a white-character labeling model according to an embodiment of the present application.
As shown in fig. 2, the method for generating a white character labeling model includes the following steps:
Step 201, acquiring an initial data set, where the initial data set includes a text to be labeled and a test set and a training set corresponding to the text to be labeled.
Step 202, training the initial role labeling model based on the training set to generate a first labeling model.
The detailed implementation process and principle of steps 201-202 may refer to the detailed description of the above embodiments and are not repeated here.
Step 203, testing the first labeling model based on the test set, and acquiring the set of unlabeled sentences in the text to be labeled when the test accuracy of the first labeling model is smaller than the first threshold.
In the embodiments of the present application, if the test accuracy of the first labeling model is smaller than the first threshold, it can be determined that incremental data needs to be obtained from the text to be labeled to expand the training set, so as to further train the first labeling model and improve its performance. However, the text to be labeled may be a coherent text from which only a small number of dialogue sentences have been removed, so it may need to be segmented to determine the set of unlabeled sentences it contains.
As a possible implementation, the text to be labeled may be segmented according to the punctuation marks in it, and the dialogue text inside each pair of quotation marks determined as one unlabeled sentence, thereby generating the set of unlabeled sentences.
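A minimal sketch of this segmentation, assuming dialogue is delimited by Chinese-style or ASCII double quotes (other conventions would need additional patterns):

```python
import re

def unlabeled_sentence_set(text_to_label):
    """Treat the dialogue inside each pair of quotation marks as one
    unlabeled sentence; the quote conventions here are assumptions."""
    pattern = r'“([^”]+)”|"([^"]+)"'
    return [a or b for a, b in re.findall(pattern, text_to_label)]
```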
Step 204, performing role labeling on each unlabeled sentence in the set of unlabeled sentences by using the first labeling model, to obtain the confidence of the labeled role corresponding to each unlabeled sentence.
In the embodiments of the present application, after the set of unlabeled sentences is obtained, each unlabeled sentence may be input into the first labeling model, so that the first labeling model performs role labeling on it and generates the labeled role corresponding to each unlabeled sentence together with that role's confidence.
Step 205, extracting incremental data from the set of unlabeled sentences in ascending order of confidence.
In the embodiments of the present application, the higher the confidence of the labeled role corresponding to an unlabeled sentence, the higher the accuracy of the first labeling model in labeling that sentence; conversely, the lower the confidence, the lower the accuracy. Training the first labeling model on unlabeled sentences whose role labeling accuracy is low yields a better learning effect and faster convergence than training on sentences whose labeling is already accurate. The unlabeled sentences can therefore be sorted in ascending order of the confidence of their labeled roles, and the preset number of unlabeled sentences with the smallest confidences selected as incremental data. This ensures that the incremental data is high-quality data that can improve model accuracy, promotes convergence as soon as possible, and further improves training efficiency.
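A sketch of this ascending-confidence selection, continuing the PyTorch-style assumptions from the fig. 1 sketches (`encode` and the pool of unlabeled sentences are hypothetical inputs):

```python
import torch

def select_increment(model, encode, pool, n_increment=1000):
    """Score every unlabeled sentence, take the confidence of its labeled
    (top-scoring) role, and return the n_increment least confident sentences."""
    with torch.no_grad():
        probs = torch.softmax(model(encode(pool)), dim=-1)
    confidence = probs.max(dim=-1).values  # confidence of the labeled role
    order = torch.argsort(confidence)      # smallest confidence first
    return [pool[i] for i in order[:n_increment].tolist()]
```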
Step 206, expanding the training set with the labeling data corresponding to the incremental data, and continuing to train the first labeling model with the expanded training set until the test accuracy of the labeling model generated by training is greater than or equal to the first threshold.
The detailed implementation process and principle of step 206 may refer to the detailed description of the above embodiments and are not repeated here.
According to the technical solution of this embodiment, an initial role labeling model is trained on a training set containing a small amount of training data to generate a first labeling model, and the first labeling model is tested on the test set. When its test accuracy does not meet the requirement, the first labeling model performs role labeling on each unlabeled sentence in the text to be labeled to obtain the confidence of the labeled role corresponding to each unlabeled sentence; the unlabeled sentences with the smallest confidences are then extracted as incremental data, the training set is expanded with the labeling data corresponding to the incremental data, and the first labeling model continues to be trained on the expanded training set until the test accuracy of the labeling model generated by training meets the requirement. In this iterative training mode, after each training round a small amount of incremental data that can best improve model performance is selected from the unlabeled sentences for manual labeling to expand the training set, which greatly reduces the scale of labeling data, promotes convergence of the white character labeling model as soon as possible, further improves training efficiency, and further lowers the labor cost and time cost of training the white character labeling model.
In a possible implementation of the method, factors such as the uncertainty of the role, the role dispersion degree, and the importance of the role in the text to be labeled when the first labeling model labels an unlabeled sentence can be considered together when selecting incremental data from the sentences to be labeled, further improving the quality of the incremental data and the training efficiency of the model.
The method for generating a white-character labeling model provided in the embodiment of the present application is further described below with reference to fig. 3.
Fig. 3 is a schematic flowchart of yet another method for generating a white-character labeling model according to an embodiment of the present application.
As shown in fig. 3, the method for generating a white character labeling model includes the following steps:
Step 301, acquiring an initial data set, where the initial data set includes a text to be labeled and a test set and a training set corresponding to the text to be labeled.
Step 302, training the initial role labeling model based on the training set to generate a first labeling model.
Step 303, testing the first labeling model based on the test set, and acquiring the set of unlabeled sentences in the text to be labeled when the test accuracy of the first labeling model is smaller than the first threshold.
The detailed implementation process and principle of steps 301-303 may refer to the detailed description of the above embodiments and are not repeated here.
Step 304, performing role labeling on each unlabeled sentence in the set of unlabeled sentences by using the first labeling model, to obtain the confidence of each candidate role corresponding to each unlabeled sentence.
The candidate roles may refer to the M roles with the largest output confidences when the first labeling model performs role labeling on an unlabeled sentence, where M is a positive integer greater than 1.
In the embodiments of the present application, when the first labeling model performs role labeling on an unlabeled sentence, it may output the M candidate roles with the largest confidences together with the confidence corresponding to each of the M candidate roles, and the candidate role with the largest confidence may be determined as the labeled role corresponding to that sentence.
Step 305, determining the role dispersion degree of each unlabeled sentence according to the confidence of each candidate role corresponding to it.
The role dispersion degree of an unlabeled sentence can be measured by the confidences of the candidate roles corresponding to it. Specifically, the closer together the confidences of the candidate roles corresponding to the unlabeled sentence are, the higher its role dispersion degree; conversely, the smaller its role dispersion degree. Moreover, the role dispersion degree of an unlabeled sentence also reflects the reliability of the first labeling model in labeling it: the greater the role dispersion degree, the lower the reliability of the first labeling model's role labeling of that sentence; conversely, the smaller the dispersion degree, the higher the reliability.
In the embodiments of the present application, the role dispersion degree of each unlabeled sentence can be determined according to the confidences of its candidate roles, so that the role dispersion degree serves as one dimension for measuring the accuracy of the first labeling model in labeling that sentence.
As a possible implementation, step 305 may include:
extracting K reference confidences from the confidences corresponding to each unlabeled sentence in descending order of confidence, where K is a positive integer greater than 1;
normalizing the K reference confidences to obtain K normalized confidences;
and calculating the role dispersion degree of each unlabeled sentence according to its K normalized confidences.
In a possible implementation form of the embodiments of the present application, for an unlabeled sentence, K confidences (K being a positive integer greater than 1 and less than or equal to M) are selected as reference confidences from the confidences corresponding to the M candidate roles of the sentence. The K reference confidences are then normalized to determine the K normalized confidences. Specifically, the normalization may be performed by formula (1):
\bar{P}_i = P_i / \sum_{j=1}^{K} P_j    (1)
where \bar{P}_i is the normalized i-th confidence, P_i is the i-th reference confidence, P_j is the j-th reference confidence, K is the number of reference confidences, and i and j are indices over the reference confidences.
After the K reference confidences are normalized, the role dispersion degree of the unlabeled sentence can be calculated from its K normalized confidences. As a possible implementation, the entropy of the K normalized confidences may be determined as the role dispersion degree of the unlabeled sentence, i.e. the role dispersion degree may be determined by formula (2):
D = -\sum_{i=1}^{K} \bar{P}_i \log \bar{P}_i    (2)
where D is the role dispersion degree of the unlabeled sentence, \bar{P}_i is its normalized i-th confidence, K is the number of reference confidences, and i is the index of the confidence.
As another possible implementation, the role dispersion degree of the unlabeled sentence may be determined according to any one normalized confidence, so as to reduce computational complexity. That is, in a possible implementation form of the embodiments of the present application, after normalizing the K reference confidences to obtain the K normalized confidences, the method may further include:
calculating the role dispersion degree of each unlabeled sentence according to any one of its K normalized confidences.
Optionally, after the K reference confidences corresponding to an unlabeled sentence are normalized, any one confidence \bar{P}_i may be selected from the K normalized confidences, and the role dispersion degree of the unlabeled sentence determined from \bar{P}_i.
Further, it is when the K normalized confidences are relatively close to one another that the role dispersion degree of an unlabeled sentence may be determined from any single normalized confidence, so as to ensure the reliability of the determined role dispersion degree. That is, in a possible implementation form of the embodiments of the present application, before calculating the role dispersion degree of each unlabeled sentence according to any one of its K normalized confidences, the method may further include:
determining that the difference between every two of the K normalized confidences is smaller than a second threshold.
In the embodiments of the present application, when the K normalized confidences of an unlabeled sentence are close to one another, any one confidence can be used to determine its role dispersion degree while still ensuring the accuracy of the result. Therefore, before determining the role dispersion degree of a sentence in this way, it may first be determined whether the difference between every two of the K normalized confidences is smaller than the second threshold. If so, the K normalized confidences are determined to be relatively close, and any one of them can be used to determine the role dispersion degree, simplifying the calculation; if not, there may be large differences among the K normalized confidences, and the entropy of the K normalized confidences can be determined as the role dispersion degree, ensuring its accuracy.
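A sketch of the dispersion computation under stated assumptions: the top-K truncation, the pairwise-closeness test against the second threshold, and the entropy of formula (2) follow the text, while the simplified single-confidence branch is an assumed form (the source does not give its exact expression), and k and second_threshold are illustrative values.

```python
import math

def role_dispersion(confidences, k=3, second_threshold=0.05):
    """Role dispersion of one unlabeled sentence from its candidate-role
    confidences: normalize the K largest (formula (1)); if the normalized
    values all lie within second_threshold of each other, derive the
    dispersion from a single value, otherwise use the entropy (formula (2))."""
    ref = sorted(confidences, reverse=True)[:k]   # K reference confidences
    total = sum(ref)
    norm = [p / total for p in ref]               # formula (1)
    if max(norm) - min(norm) < second_threshold:  # near-uniform case
        p = norm[0]                               # any single normalized confidence
        return -k * p * math.log(p)               # assumption: simplified form
    return -sum(p * math.log(p) for p in norm)    # formula (2): entropy
```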
Step 306, determining the weight of each unlabeled sentence according to its role dispersion degree and the confidence of its corresponding labeled role.
The weight of an unlabeled sentence can reflect both the accuracy of the first labeling model in labeling it and the importance of that sentence for improving the accuracy of the first labeling model.
In the embodiments of the present application, since both the confidence of the labeled role corresponding to an unlabeled sentence and its role dispersion degree can be used to measure the accuracy of the first labeling model in labeling it, the weight of each unlabeled sentence can be determined according to these two quantities once they are known. That is, the weight may be defined as S = F(C, D), where S is the weight of an unlabeled sentence, C is the confidence of its labeled role, D is its role dispersion degree, and F is a function operating on C and D.
It can be understood that, since the confidence of the labeled role corresponding to an unlabeled sentence is positively correlated with the accuracy of the first labeling model in labeling it, while its role dispersion degree is negatively correlated with that accuracy, the function F can be defined as a decreasing function of the confidence C and an increasing function of the role dispersion degree D, so that the weight S of an unlabeled sentence is negatively correlated with the labeling accuracy of the first labeling model on it. For example, the function may be defined as F(C, D) = (1 - C) × D.
Furthermore, since the more important the labeled role corresponding to an unlabeled sentence is in the text to be labeled, the more important that sentence is for training the first labeling model, the weight of the unlabeled sentence can be corrected according to the importance of its labeled role, further improving the rationality and accuracy of incremental data extraction. That is, in a possible implementation form of the embodiments of the present application, after step 306, the method may further include:
acquiring the labeled role corresponding to each unlabeled sentence and the number of occurrences of each role in the text to be labeled;
determining the weight of each role according to its number of occurrences;
and correcting the weight of each unlabeled sentence according to the weight of its corresponding labeled role.
In the embodiments of the present application, since the more times a role appears in the text to be labeled, the more important it is, the number of occurrences of each role in the text to be labeled can be counted and determined as that role's weight; alternatively, the ratio of each role's occurrences to the occurrences of all roles can be determined as its weight, so that a role's weight is positively correlated with its number of occurrences.
As a possible implementation, after the weight of each role is determined, the weight of the labeled role corresponding to each unlabeled sentence can be determined, and the weight of the unlabeled sentence corrected using it. Optionally, S = F(C, D, I) may be defined, where S is the weight of an unlabeled sentence, C is the confidence of its labeled role, D is its role dispersion degree, I is the weight of its labeled role, and F is a function operating on C, D and I.
It can be understood that the confidence of the labeled role corresponding to an unlabeled sentence is positively correlated with the accuracy of the first labeling model in labeling it, its role dispersion degree is negatively correlated with that accuracy, and the weight I of its labeled role is positively correlated with the importance of the sentence to the first labeling model. The function F can therefore be defined as a decreasing function of the confidence C, an increasing function of the role dispersion degree D, and an increasing function of the weight I of the labeled role, so that the weight S of an unlabeled sentence is negatively correlated with the labeling accuracy of the first labeling model on it and positively correlated with the sentence's importance for training the first labeling model. For example, the function may be defined as F(C, D, I) = (1 - C) × D × I.
Step 307, extracting the incremental data from the set of unlabeled sentences in descending order of weight.
In the embodiments of the present application, the lower the role labeling accuracy of an unlabeled sentence (i.e., the smaller the confidence of its labeled role and the greater its role dispersion degree) and the more important its labeled role (i.e., the greater that role's weight), the better the effect of training the first labeling model on it and the greater the resulting improvement in the first labeling model's accuracy. Unlabeled sentences with small confidence, large role dispersion degree and a heavily weighted labeled role can therefore be selected as incremental data. And since the weight of an unlabeled sentence is a decreasing function of the confidence of its labeled role and an increasing function of the role dispersion degree and the labeled role's weight, the unlabeled sentences can be sorted in descending order of weight, and the preset number of unlabeled sentences with the largest weights extracted as incremental data.
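A sketch of this weighted selection, reusing the earlier PyTorch-style assumptions and the role_dispersion sketch above. One simplification is flagged in the comments: the text counts role occurrences over the text to be labeled, while this sketch approximates them from the model's predictions over the pool.

```python
import torch
from collections import Counter

def select_increment_weighted(model, encode, pool, n_increment=1000, k=3):
    """Weight each unlabeled sentence by S = (1 - C) * D * I and return the
    n_increment sentences with the largest weights as incremental data."""
    with torch.no_grad():
        probs = torch.softmax(model(encode(pool)), dim=-1)
    roles = probs.argmax(dim=-1).tolist()      # labeled role per sentence
    counts = Counter(roles)                    # occurrence counts (approximation:
                                               # predicted, not ground-truth, roles)
    weights = []
    for sent_probs, role in zip(probs.tolist(), roles):
        c = max(sent_probs)                    # confidence C of the labeled role
        d = role_dispersion(sent_probs, k=k)   # role dispersion degree D
        i = counts[role] / len(roles)          # role weight I from occurrence share
        weights.append((1 - c) * d * i)        # S = F(C, D, I)
    order = sorted(range(len(pool)), key=weights.__getitem__, reverse=True)
    return [pool[j] for j in order[:n_increment]]
```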
Step 308, expanding the training set with the labeling data corresponding to the incremental data, and continuing to train the first labeling model with the expanded training set until the test accuracy of the labeling model generated by training is greater than or equal to the first threshold.
The detailed implementation process and principle of step 308 may refer to the detailed description of the above embodiments and are not repeated here.
According to the technical solution of this embodiment, an initial role labeling model is trained on a training set containing a small amount of training data to generate a first labeling model, and the first labeling model is tested on the test set. When its test accuracy does not meet the requirement, the first labeling model labels each unlabeled sentence in the text to be labeled; the weight of each unlabeled sentence is determined from parameters such as the confidence of its labeled role, its role dispersion degree and the weight of its labeled role; the unlabeled sentences with the largest weights are then extracted as incremental data, the training set is expanded with the corresponding labeling data, and the first labeling model continues to be trained on the expanded training set until the test accuracy of the labeling model generated by training meets the requirement. In this iterative training mode, after each training round, factors such as the uncertainty of the role, the role dispersion degree and the importance of the role in the text to be labeled are considered together when selecting high-quality incremental data from the sentences to be labeled, which greatly reduces the scale of labeling data, further improves the quality of the incremental data and the training efficiency of the model, and further lowers the labor cost and time cost of training the white character labeling model.
In order to implement the above embodiments, the present application further provides a device for generating a white character labeling model.
Fig. 4 is a schematic structural diagram of a device for generating a white character labeling model according to an embodiment of the present application.
As shown in fig. 4, the apparatus 40 for generating a white character labeling model includes:
an acquisition module 41, configured to acquire an initial data set, where the initial data set includes a text to be labeled and a test set and a training set corresponding to the text to be labeled;
a first training module 42, configured to train the initial role labeling model based on the training set to generate a first labeling model;
a test module 43, configured to test the first labeling model based on the test set, and to extract incremental data from the text to be labeled when the test accuracy of the first labeling model is smaller than a first threshold;
and a second training module 44, configured to expand the training set with the labeling data corresponding to the incremental data, so as to continue training the first labeling model with the expanded training set until the test accuracy of the labeling model generated by training is greater than or equal to the first threshold.
In practical use, the apparatus for generating a white character labeling model provided in the embodiments of the present application may be configured in any electronic device to execute the method for generating a white character labeling model.
According to the technical solution of this embodiment, an initial role labeling model is trained on a training set containing a small amount of training data to generate a first labeling model, and the first labeling model is tested on the test set, so that when its test accuracy does not meet the requirement, a small amount of incremental data that can best improve model performance can be extracted from the text to be labeled in the initial data set; the training set is then expanded with the labeling data corresponding to the incremental data, and the first labeling model continues to be trained on the expanded training set until the test accuracy of the labeling model generated by training meets the requirement. In this iterative training mode, after each training round a small amount of incremental data that can best improve model performance is selected from the data to be labeled for manual labeling to expand the training set, which greatly reduces the scale of labeling data, lowers the labor cost and time cost of training the white character labeling model, and promotes the large-scale application of white character labeling.
In a possible implementation form of the present application, the test module 43 includes:
the first acquisition unit is used for acquiring the set of unlabeled sentences in the text to be labeled;
the second acquisition unit is used for performing role labeling on each unlabeled sentence in the set by using the first labeling model, so as to obtain the confidence of the labeled role corresponding to each unlabeled sentence;
and the first extraction unit is used for extracting the incremental data from the set of unlabeled sentences in ascending order of confidence.
Further, in another possible implementation form of the present application, the test module 43 includes:
the third acquisition unit is used for acquiring the set of unlabeled sentences in the text to be labeled;
the fourth acquisition unit is used for performing role labeling on each unlabeled sentence in the set by using the first labeling model, so as to obtain the confidence of each candidate role corresponding to each unlabeled sentence;
the first determining unit is used for determining the role dispersion degree of each unlabeled sentence according to the confidence of each of its candidate roles;
the second determining unit is used for determining the weight of each unlabeled sentence according to its role dispersion degree and the confidence of its labeled role;
and the second extraction unit is used for extracting the incremental data from the set of unlabeled sentences in descending order of weight.
Further, in another possible implementation form of the present application, the test module 43 further includes:
the fifth acquisition unit is used for acquiring the labeled role corresponding to each unlabeled sentence and the number of occurrences of each role in the text to be labeled;
the third determining unit is used for determining the weight of each role according to its number of occurrences;
and the correction unit is used for correcting the weight of each unlabeled sentence according to the weight of its labeled role.
Further, in another possible implementation form of the present application, the first determining unit includes:
an extraction subunit, configured to extract K reference confidences from the confidences corresponding to each unlabeled statement in descending order of confidence, where K is a positive integer greater than 1;
a normalization subunit, configured to normalize the K reference confidences to obtain K normalized confidences; and
a first calculating subunit, configured to calculate the role dispersion degree of each unlabeled statement according to the K normalized confidences corresponding to that statement.
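A sketch of these three subunits in sequence. The claims specify the top-K extraction and the normalization but not the dispersion formula itself; entropy of the normalized confidences is assumed here, and the default K = 3 is likewise illustrative:

```python
import math

def role_dispersion(confidences, k=3):
    """Role dispersion degree of one statement: extract the K largest
    candidate-role confidences (K > 1), normalize them to sum to one, and
    measure how the probability mass is spread across the K roles. The
    claims leave the measure open; entropy is used here as one natural
    choice, maximal when the K candidate roles are equally likely."""
    top_k = sorted(confidences, reverse=True)[:k]   # K reference confidences
    total = sum(top_k)
    normalized = [c / total for c in top_k]         # K normalized confidences
    return -sum(c * math.log(c) for c in normalized if c > 0)
```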
Further, in another possible implementation form of the present application, the first determining unit further includes:
a second calculating subunit, configured to calculate the role dispersion degree of each unlabeled statement according to any one of the K normalized confidences corresponding to that statement.
Further, in another possible implementation form of the present application, the first determining unit further includes:
a determining subunit, configured to determine that the difference between every two of the K normalized confidences is smaller than a second threshold.
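These two subunits mirror claims 5 and 6: the dispersion may be computed from any single one of the K normalized confidences, but only after checking that every pairwise difference is below the second threshold, i.e. that the distribution is nearly uniform. A sketch under those assumptions; the single-value entropy formula is purely illustrative, since no formula is given:

```python
import math
from itertools import combinations

def dispersion_if_nearly_uniform(normalized, second_threshold):
    """If the difference between every two of the K normalized confidences is
    below the second threshold, the distribution is close to uniform, so any
    single confidence c characterizes it; treat it as K copies of c and return
    the corresponding entropy (an illustrative formula only). Otherwise return
    None so the caller falls back to the full dispersion computation."""
    if all(abs(a - b) < second_threshold
           for a, b in combinations(normalized, 2)):
        c = normalized[0]          # "any one" of the K normalized confidences
        k = len(normalized)
        return -k * c * math.log(c) if c > 0 else 0.0
    return None
```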
It should be noted that the foregoing explanation of the embodiments of the method for generating a white character labeling model shown in fig. 1, fig. 2, and fig. 3 also applies to the apparatus 40 for generating a white character labeling model of this embodiment, and is not repeated here.
According to the technical solution of the embodiments of the present application, an initial role labeling model is trained on a training set containing a small amount of training data to generate a first labeling model, and the first labeling model is tested on the test set. When the test accuracy of the first labeling model does not meet the requirement, the first labeling model is used to label each unlabeled statement in the text to be labeled; the weight of each unlabeled statement is determined from parameters such as the confidence of its labeled role, its role dispersion degree, and the weight of its labeled role; the unlabeled statements with the largest weights are extracted as incremental data; the training set is then expanded with the labeled data corresponding to the incremental data; and the first labeling model continues to be trained on the expanded training set until the test accuracy of the labeling model generated by training meets the requirement. In this iterative training mode, factors such as the uncertainty with which the labeling model labels the roles of unlabeled statements, the dispersion of the candidate roles, and the importance of each role in the text to be labeled are weighed together according to the result of each training round, and high-quality incremental data are selected from the statements to be labeled. This greatly reduces the scale of the labeled data, improves the quality of the incremental data and the training efficiency of the model, and further lowers the labor and time cost of training the white character labeling model.
There is also provided, in accordance with an embodiment of the present application, an electronic device, a readable storage medium, and a computer program product.
Fig. 5 is a block diagram of an electronic device for the method for generating a white character labeling model according to an embodiment of the present application. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are meant to be examples only, and are not meant to limit implementations of the present application described and/or claimed herein.
As shown in fig. 5, the electronic device includes: one or more processors 501, a memory 502, and interfaces for connecting the various components, including high-speed interfaces and low-speed interfaces. The various components are interconnected using different buses and may be mounted on a common motherboard or in other manners as desired. The processor may process instructions for execution within the electronic device, including instructions stored in or on the memory to display graphical information of a GUI on an external input/output apparatus (such as a display device coupled to the interface). In other embodiments, multiple processors and/or multiple buses may be used, along with multiple memories, as desired. Also, multiple electronic devices may be connected, with each device providing part of the necessary operations (e.g., as a server array, a group of blade servers, or a multi-processor system). In fig. 5, one processor 501 is taken as an example.
The memory 502 may include a storage program area and a storage data area, wherein the storage program area may store an operating system and an application program required for at least one function, and the storage data area may store data created according to the use of the electronic device for the method for generating a white character labeling model, and the like. Further, the memory 502 may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid-state storage device. In some embodiments, the memory 502 optionally includes memory located remotely from the processor 501, and these remote memories may be connected via a network to the electronic device that performs the method for generating a white character labeling model. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The electronic device for the method for generating a white character labeling model may further include: an input device 503 and an output device 504. The processor 501, the memory 502, the input device 503, and the output device 504 may be connected by a bus or other means; in fig. 5, connection by a bus is taken as an example.
The input device 503 may receive input numeric or character information and generate key signal inputs related to user settings and function control of the electronic device for the method for generating a white character labeling model, and may be, for example, a touch screen, a keypad, a mouse, a track pad, a touch pad, a pointing stick, one or more mouse buttons, a track ball, a joystick, or the like. The output device 504 may include a display device, auxiliary lighting devices (e.g., LEDs), haptic feedback devices (e.g., vibrating motors), and the like. The display device may include, but is not limited to, a liquid crystal display (LCD), a light-emitting diode (LED) display, and a plasma display. In some implementations, the display device may be a touch screen.
Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, application-specific integrated circuits (ASICs), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special-purpose or general-purpose and which receives data and instructions from, and transmits data and instructions to, a storage system, at least one input device, and at least one output device.
These computer programs (also known as programs, software applications, or code) include machine instructions for a programmable processor, and may be implemented using high-level procedural and/or object-oriented programming languages, and/or assembly/machine languages. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (e.g., magnetic discs, optical disks, memory, Programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Network (LAN), Wide Area Network (WAN), Internet.
The computer system may include clients and servers. A client and a server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, also called a cloud computing server or a cloud host, which is a host product in a cloud computing service system and overcomes the shortcomings of difficult management and weak service scalability of conventional physical hosts and VPS (Virtual Private Server) services.
It should be understood that the various forms of flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present application may be executed in parallel, sequentially, or in different orders, as long as the desired results of the technical solutions disclosed in the present application can be achieved; no limitation is imposed herein.
The above-described embodiments should not be construed as limiting the scope of the present application. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present application shall be included in the protection scope of the present application.
Claims (14)
1. A method for generating a white character labeling model comprises the following steps:
acquiring an initial data set, wherein the initial data set comprises a text to be labeled, a test set and a training set corresponding to the text to be labeled;
training an initial role labeling model based on the training set to generate a first labeling model;
testing the first labeling model based on the test set, and extracting incremental data from the text to be labeled under the condition that the test accuracy of the first labeling model is smaller than a first threshold value;
expanding the training set by using the labeled data corresponding to the incremental data, and continuing to train the first labeling model by using the expanded training set until the test accuracy of the labeling model generated by training is greater than or equal to the first threshold;
wherein the extracting of the incremental data from the text to be labeled comprises:
acquiring an unlabeled statement set in the text to be labeled;
performing role labeling on each unlabeled statement in the unlabeled statement set by using the first labeling model, so as to obtain the confidence of each candidate role corresponding to each unlabeled statement;
determining the role dispersion degree of each unlabeled statement according to the confidence of each candidate role corresponding to each unlabeled statement;
determining the weight of each unlabeled statement according to the role dispersion degree of each unlabeled statement and the confidence of the corresponding labeled role; and
extracting the incremental data from the unlabeled statement set in descending order of the weights.
2. The method of claim 1, wherein the extracting of the incremental data from the text to be labeled comprises:
acquiring an unlabeled statement set in the text to be labeled;
performing role labeling on each unlabeled statement in the unlabeled statement set by using the first labeling model, so as to obtain the confidence of the labeled role corresponding to each unlabeled statement; and
extracting the incremental data from the unlabeled statement set in ascending order of the confidences.
3. The method of claim 1, wherein after the determining of the weight of each unlabeled statement, the method further comprises:
acquiring the labeled role corresponding to each unlabeled statement and the number of occurrences of each role in the text to be labeled;
determining the weight of each role according to the number of occurrences of each role; and
correcting the weight of each unlabeled statement according to the weight of the labeled role corresponding to that statement.
4. The method of claim 1, wherein the determining of the role dispersion degree of each unlabeled statement according to the confidence of each candidate role corresponding to each unlabeled statement comprises:
extracting K reference confidences from the confidences corresponding to each unlabeled statement in descending order of confidence, wherein K is a positive integer greater than 1;
normalizing the K reference confidences to obtain K normalized confidences; and
calculating the role dispersion degree of each unlabeled statement according to the K normalized confidences corresponding to that statement.
5. The method of claim 4, wherein after the normalizing of the K reference confidences to obtain the K normalized confidences, the method further comprises:
calculating the role dispersion degree of each unlabeled statement according to any one of the K normalized confidences corresponding to that statement.
6. The method of claim 5, wherein before the calculating of the role dispersion degree of each unlabeled statement according to any one of the K normalized confidences corresponding to that statement, the method further comprises:
determining that the difference between every two of the K normalized confidences is smaller than a second threshold.
7. An apparatus for generating a white character labeling model, comprising:
an obtaining module, configured to obtain an initial data set, wherein the initial data set comprises a text to be labeled, and a test set and a training set corresponding to the text to be labeled;
a first training module, configured to train an initial role labeling model based on the training set, so as to generate a first labeling model;
a test module, configured to test the first labeling model based on the test set, and to extract incremental data from the text to be labeled when the test accuracy of the first labeling model is smaller than a first threshold; and
a second training module, configured to expand the training set by using the labeled data corresponding to the incremental data, so as to continue training the first labeling model by using the expanded training set until the test accuracy of the labeling model generated by training is greater than or equal to the first threshold;
wherein the test module comprises:
a third obtaining unit, configured to obtain an unlabeled statement set in the text to be labeled;
a fourth obtaining unit, configured to perform role labeling on each unlabeled statement in the unlabeled statement set by using the first labeling model, so as to obtain the confidence of each candidate role corresponding to each unlabeled statement;
a first determining unit, configured to determine the role dispersion degree of each unlabeled statement according to the confidence of each candidate role corresponding to each unlabeled statement;
a second determining unit, configured to determine the weight of each unlabeled statement according to the role dispersion degree of each unlabeled statement and the confidence of the corresponding labeled role; and
a second extraction unit, configured to extract the incremental data from the unlabeled statement set in descending order of the weights.
8. The apparatus of claim 7, wherein the test module comprises:
a first obtaining unit, configured to obtain an unlabeled statement set in the text to be labeled;
a second obtaining unit, configured to perform role labeling on each unlabeled statement in the unlabeled statement set by using the first labeling model, so as to obtain the confidence of the labeled role corresponding to each unlabeled statement; and
a first extraction unit, configured to extract the incremental data from the unlabeled statement set in ascending order of the confidences.
9. The apparatus of claim 7, wherein the test module further comprises:
a fifth obtaining unit, configured to obtain the labeled role corresponding to each unlabeled statement and the number of occurrences of each role in the text to be labeled;
a third determining unit, configured to determine the weight of each role according to the number of occurrences of each role; and
a correction unit, configured to correct the weight of each unlabeled statement according to the weight of the labeled role corresponding to that statement.
10. The apparatus of claim 7, wherein the first determining unit comprises:
an extraction subunit, configured to extract K reference confidences from the confidences corresponding to each unlabeled statement in descending order of confidence, wherein K is a positive integer greater than 1;
a normalization subunit, configured to normalize the K reference confidences to obtain K normalized confidences; and
a first calculating subunit, configured to calculate the role dispersion degree of each unlabeled statement according to the K normalized confidences corresponding to that statement.
11. The apparatus of claim 10, wherein the first determining unit further comprises:
a second calculating subunit, configured to calculate the role dispersion degree of each unlabeled statement according to any one of the K normalized confidences corresponding to that statement.
12. The apparatus of claim 11, wherein the first determining unit further comprises:
a determining subunit, configured to determine that the difference between every two of the K normalized confidences is smaller than a second threshold.
13. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-6.
14. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any one of claims 1-6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011104779.3A CN112395873B (en) | 2020-10-15 | 2020-10-15 | Method and device for generating white character labeling model and electronic equipment |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112395873A CN112395873A (en) | 2021-02-23 |
CN112395873B true CN112395873B (en) | 2022-02-01 |
Family
ID=74595559
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011104779.3A Active CN112395873B (en) | 2020-10-15 | 2020-10-15 | Method and device for generating white character labeling model and electronic equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112395873B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113673633B (en) * | 2021-10-22 | 2022-01-07 | 武汉楚精灵医疗科技有限公司 | Training method and device of image recognition model, server and storage medium |
CN114818929B (en) * | 2022-04-27 | 2024-10-15 | 杭州卓印智能科技有限公司 | Self-learning annotation-based annotation model training method and annotation method |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20140059418A1 (en) * | 2012-03-02 | 2014-02-27 | Realtek Semiconductor Corp. | Multimedia annotation editing system and related method and computer program product |
CN110019770A (en) * | 2017-07-24 | 2019-07-16 | 华为技术有限公司 | The method and apparatus of train classification models |
CN110503154A (en) * | 2019-08-27 | 2019-11-26 | 携程计算机技术(上海)有限公司 | Method, system, electronic equipment and the storage medium of image classification |
CN111475650A (en) * | 2020-04-02 | 2020-07-31 | 中国人民解放军国防科技大学 | Russian semantic role labeling method, system, device and storage medium |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104809176B (en) * | 2015-04-13 | 2018-08-07 | 中央民族大学 | Tibetan language entity relation extraction method |
Non-Patent Citations (2)

Title |
---|
Dialog Acts Classification with Semantic and Structural Information; Hanqian Wu et al.; 2019 International Conference on Intelligent Computing, Automation and Systems (ICICAS); 2019-12-06; pp. 438-442 *
Negative Emotion Detection in Intelligent Human-Machine Dialogue (in Chinese); Luo Guanzhu et al.; Information Security Research; November 2019; vol. 5, no. 11, pp. 981-987 *
Legal Events

Date | Code | Title | Description
---|---|---|---
| PB01 | Publication |
| SE01 | Entry into force of request for substantive examination |
| GR01 | Patent grant |