CN116150376A - Sample data distribution optimization method, device and storage medium - Google Patents

Sample data distribution optimization method, device and storage medium

Info

Publication number
CN116150376A
Authority
CN
China
Prior art keywords
training
keyword
sample set
training sample
intention
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310204314.2A
Other languages
Chinese (zh)
Inventor
毛宇
黄凯
徐伟
林昊
邬稳
邓文强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Merchants Union Consumer Finance Co Ltd
Original Assignee
Merchants Union Consumer Finance Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Merchants Union Consumer Finance Co Ltd filed Critical Merchants Union Consumer Finance Co Ltd
Priority to CN202310204314.2A priority Critical patent/CN116150376A/en
Publication of CN116150376A publication Critical patent/CN116150376A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G06F40/30 Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiment of the application provides a sample data distribution optimization method, a sample data distribution optimization device, and a storage medium. The method is an end-to-end optimization method: first, target keywords with a high overfitting risk are adaptively searched for and used as the optimization targets of a downstream negative-sample sampling task; second, an improved word frequency-inverse document frequency calculation method is constructed to explicitly express the intra-class and inter-class co-occurrence relationships of keywords in a multi-intention corpus; finally, a matching negative-sample sampling method is obtained by processing the target keywords and by filtering new training texts containing the target keywords, and the training text distribution is optimized by adding the new negative-sample corpus. By adopting the embodiment of the application, overfitting of the whole modeling process can be optimized at the data source.

Description

Sample data distribution optimization method, device and storage medium
Technical Field
The present disclosure relates to the field of the internet, and in particular, to a method, an apparatus, and a storage medium for optimizing sample data distribution.
Background
Text is one of the most important information carriers today and enters the network through various social platforms, news media, and the like. Text information varies widely in format, subject, content, and length, so how to reasonably apply and process it is an urgent need. Text classification is a very important task in intention recognition, and its application scenarios are very wide.
In practical applications, a multi-classification model used for intention recognition may itself be too complex, so that it fits noise in the training sample set; the training samples may also be too few, lack representativeness, or be disturbed by noise, causing the model to fit that noise. As a result, positive samples of certain types with large data volumes are over-fitted, those types appear too often in the prediction results of the model, and the richness of the ranking or classification results for other types of content is reduced.
In general, the model overfitting problem is solved from two aspects: the first aspect is to optimize the model network architecture, such as adding regularization, reducing model training parameters, and optimizing training rounds; the second aspect is to optimize the training data, such as the positive-to-negative ratio of the labeled training samples and the text-semantic richness of the labeled data.
If optimization is performed from the model angle, it takes longer and may still not solve the problem. If optimization is performed from the training data angle, existing methods require manually labeling texts to increase richness, and adjusting the positive-to-negative proportion or the content of the training data requires resampling; however, because the sampling is random, the result obtained is unlikely to be exactly what the training data lacks.
Therefore, how to solve model overfitting is an urgent problem to be solved.
Disclosure of Invention
The embodiment of the application provides a sample data distribution optimization method, a sample data distribution optimization device, and a storage medium, which can optimize the overfitting of the whole modeling process at the data source.
In a first aspect, an embodiment of the present application provides a sample data distribution optimization method, which is a highly versatile method capable of optimizing training texts in the field of multi-classification tasks, where the method includes:
acquiring a first training sample set, wherein the first training sample set comprises a plurality of training texts, and each training text in the plurality of training texts is marked with an intention;
extracting keywords of a plurality of training texts in the first training sample set to obtain a keyword set;
acquiring the word frequency and inverse document frequency of a first keyword in a second training sample set, wherein the first keyword is any keyword in the keyword set, and the second training sample set is the set formed by the training texts with the same intention as the training text to which the first keyword belongs; the word frequency is the proportion of the second training sample set accounted for by the training texts containing the first keyword, and the inverse document frequency is used for representing the frequency of occurrence of the first keyword in the second training sample set;
calculating, according to the word frequency and inverse document frequency of each keyword in the keyword set, a concentration degree score of each keyword corresponding to each intention, wherein the concentration degree score of a keyword corresponding to any intention is used for representing the concentration degree of the keyword in the training texts of that intention;
determining target keywords according to the concentration degree score of each keyword in the keyword set corresponding to each intention;
performing intention labeling on a third training sample set containing the target keywords and not labeled with intention to obtain a negative sample set;
and processing the first training sample set according to the negative sample set to obtain an updated first training sample set, wherein the first training sample set is used for training to obtain a multi-classification model, and the model is used for predicting the intention classification of the input samples.
The key point of the embodiment of the application is to optimize the training data of a classification model so as to solve the problem of model overfitting. Common causes of overfitting include the following: the model itself is so complex that it fits the noise in the training sample set, in which case a simpler model needs to be selected or the model needs to be pruned; the training samples are too few or lack representativeness, in which case the number of samples or the diversity of the samples needs to be increased; noise in the training samples interferes and the model fits that noise, in which case the noisy data needs to be removed or a model insensitive to noise needs to be chosen.
In a multi-classification task, if a word (keyword) appears almost only in one intention, the model will consider that any text in which the word appears is most likely to belong to that intention. Such an unbalanced distribution of the training data leads to serious model overfitting and greatly reduces the recognition effect: an extremely flexible model can greatly reduce the error on the training data (training MSE), but the cost of this extreme flexibility is that errors in the training data are absorbed during model estimation, which harms the prediction capability on new data (verification data), i.e. the test MSE. Similarly, when the fit to the regularities of the data is inadequate, the test MSE will also be relatively large, which is underfitting.
In general, solving the problem of model overfitting is done from two aspects:
the first is to optimize training data, such as labeling positive and negative proportions of training samples, labeling the richness of text semantics of the data, and the like.
The second is to optimize the model network architecture, such as adding regularization, reducing model training parameters, optimizing training rounds, etc.
For the same training sample set, a more excellent model and better training tuning yield an algorithm model with better effect. However, during modeling, the training data determine the upper limit of the model effect. Good training texts enable the model to converge faster, learn more knowledge, and reduce the degree of overfitting.
The core of the method is to search the training samples for the source of overfitting, namely a word whose corresponding labels are overly concentrated. For example, training texts containing the word "profit" generally correspond to the label, or intention, "finance", but a text containing "profit" may also correspond to "internet" or "real estate". If the number of training texts containing "profit" under the "finance" label is far greater than the number under the other labels, a model trained on these training texts is likely to directly attach the "finance" label whenever it predicts an input text containing "profit"; this is the model overfitting problem. Therefore, by balancing the number of training texts containing "profit" under the "finance" label against those under the other labels, the model can also consider intentions other than "finance" when an input text containing the keyword "profit" is predicted.
Specifically, the existing labeled first training sample set is split, that is, words are segmented from all training texts in the first training sample set to obtain a keyword set; the concentration degree of each keyword in the keyword set in the training texts corresponding to a certain intention is then determined, that is, whether a situation similar to the above example occurs in which the number of training texts containing "profit" under the label or intention "finance" far exceeds the number of training texts containing "profit" under labels such as "internet" or "real estate"; by finding and correcting such imbalances, the model overfitting problem is solved.
In a further possible implementation manner of the first aspect, the extracting keywords of a plurality of training texts in the first training sample set to obtain a keyword set includes:
word segmentation is carried out on a plurality of training texts in the first training sample set through a word segmentation tool so as to obtain a word set;
Constructing a business keyword library, wherein the business keyword library comprises a plurality of keywords related to business;
and screening the word set according to the business keyword library to obtain the keyword set.
The key point of this embodiment is to determine the keywords in the first training sample set, where a keyword is used to represent the semantics of the training texts it belongs to. This step is necessary because the method provided in this embodiment searches the first training sample set for words that are concentrated in the training texts of certain labels; however, some auxiliary words and connecting words occur very frequently, and computing the related statistics for every word in the training texts consumes a lot of time and is inefficient. Therefore, the training texts in the first training sample set are screened to obtain keywords capable of representing text semantics, which improves the efficiency of subsequent operations; furthermore, by constructing a business keyword library, keywords strongly related to the business are obtained from the first training sample set, which saves time.
In a further possible implementation manner of the first aspect, the determining the target keyword according to the concentration score of each keyword in the keyword set corresponding to a respective intention includes:
ranking the concentration scores of the first keyword in the keyword set over its corresponding intentions to obtain the target intention for which the first keyword has the highest concentration score;
obtaining a target score ratio according to the concentration scores of the intentions corresponding to the first keyword, wherein the target score ratio is the proportion of the concentration score of the first keyword in the target intention to the sum of the concentration scores of the first keyword over its corresponding intentions;
and determining the first keyword with the target score ratio higher than a preset threshold value as the target keyword.
In this embodiment, the concentration score is based on the training texts of an intent label and represents the concentration degree of the first keyword in the training texts corresponding to that intent label; the higher the concentration score, the more frequently the first keyword appears in the training texts corresponding to the intent label. The application therefore also constructs an improved word frequency-inverse document frequency calculation method for the intra-class and inter-class co-occurrence relationships of keywords in a multi-intention corpus, with the calculation formula:
CR=freq*idf
where CR is the concentration score, freq is the word frequency, and idf is the inverse document frequency.
However, it should be noted that a high concentration degree of the first keyword in the training texts corresponding to one intent tag does not by itself mean that the first keyword is a source of model overfitting; its concentration degree in the training texts corresponding to the other intent tags must also be examined. If the concentration degree of the first keyword in the training texts of the other intent tags is far lower than that in the training texts of this intent tag, that is, the target score ratio exceeds the preset threshold, then the first keyword may be a source of model overfitting. The process of determining the target keyword is therefore the key to the implementation of the application.
In a further possible implementation manner of the first aspect, the performing intent labeling on the third training sample set containing the target keyword and not labeled with intent to obtain a negative sample set includes:
constructing an unlabeled third training sample set, wherein the third training sample set comprises a plurality of training samples;
searching training texts containing the target keywords in the unlabeled third training sample set;
inputting the training text into a multi-classification model trained according to the first training sample set for prediction;
and checking the prediction result of the multi-classification model to obtain a negative sample set.
After the target keywords are determined, they are searched for in a historical corpus, namely the third training sample set, to obtain training texts containing the target keywords. It should be noted that the training texts in the unlabeled third training sample set have no corresponding labels, and manual labeling would require a lot of manpower and material resources; the training texts are therefore input into the multi-classification model trained on the first training sample set for prediction, and the prediction results are then checked. The negative samples are those whose intention is inconsistent with the intention labels of the training texts containing the target keyword in the first training sample set, that is, the samples required by the method; they are finally summarized into a negative sample set and added to the first training sample set. This process is repeated until the negative sample sets of all target keywords have been determined and the optimization of the distribution of positive and negative samples of the first training sample set is completed.
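As an illustration of the sampling logic just described, the sketch below collects candidate negative samples for one target keyword; the predict callable stands in for the multi-classification model trained on the first training sample set, and its interface, like the helper name collect_negative_samples, is an assumption rather than something defined by the embodiment.

```python
# Sketch of negative-sample collection for one target keyword. `predict` stands in for
# the multi-classification model trained on the first training sample set; its interface
# (a callable returning an intention label) is an assumption.
def collect_negative_samples(target_keyword, concentrated_intent, unlabeled_texts, predict):
    candidates = [t for t in unlabeled_texts if target_keyword in t]  # texts with the keyword
    negatives = []
    for text in candidates:
        predicted_intent = predict(text)
        if predicted_intent != concentrated_intent:  # differs from the over-represented intention
            negatives.append((text, predicted_intent))  # to be checked before joining the set
    return negatives

# Usage with a toy predictor (for illustration only):
toy_predict = lambda text: "internet" if "platform" in text else "finance"
print(collect_negative_samples("profit", "finance",
                               ["profit of the platform rose", "profit hit a record"],
                               toy_predict))
```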
In a further possible implementation manner of the first aspect, after processing the first training sample set according to the negative sample set to obtain an updated first training sample set, the method further includes:
training the multi-classification model trained by the first training sample set according to the updated first training sample set to obtain an updated multi-classification model;
testing the updated multi-classification model according to test data to obtain a test result, wherein the test result comprises a mean square error value, and the mean square error value is the mean square error between the result output by the multi-classification model for the test data and the result output for the training data;
judging whether the updated first training sample set causes an over-fitting problem of the updated multi-classification model according to the preset threshold value and the mean square error value;
and if the multi-classification model has the over-fitting problem, optimizing the sample distribution of the updated first training sample set.
In this embodiment, the degree of overfitting of the model is represented by the mean square error value; the multi-classification model obtained by training on the distribution-optimized first training sample set is therefore verified, omissions in the optimized distribution of the first training sample set are checked for and filled, and if a problem appears it is corrected in time.
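A minimal sketch of such a verification step is shown below; treating a large gap between the test and training mean squared errors, relative to the preset threshold, as a sign of overfitting is one possible reading of the comparison described above, not the only one.

```python
# Sketch of the verification step: compare the mean squared error of the updated model
# on test data with its error on training data; the decision rule (gap vs. the preset
# threshold) is an assumed reading of the comparison described above.
def mean_squared_error(y_true, y_pred):
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)

def shows_overfitting(train_true, train_pred, test_true, test_pred, preset_threshold):
    train_mse = mean_squared_error(train_true, train_pred)
    test_mse = mean_squared_error(test_true, test_pred)
    return (test_mse - train_mse) > preset_threshold  # large gap -> re-optimize the distribution

print(shows_overfitting([1, 0, 1], [1, 0, 1], [1, 0, 1], [0, 1, 1], preset_threshold=0.3))  # True
```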
In a further possible implementation of the first aspect,
the calculation formula of the inverse document frequency is as follows:
idf = (1/(m-1)) * Σ_{j≠i} log(Y_j / (Y_j^w + 1))
where idf is the inverse document frequency of the keyword with respect to intention i, m is the number of intentions, Y_j is the total number of training texts of the jth intention, and Y_j^w is the number of training texts of the jth intention that contain the keyword.
The inverse document frequency of this embodiment is calculated differently from the common inverse document frequency: for a keyword and an intention i, an inverse document frequency is calculated for each of the m-1 intentions other than intention i, and the results are averaged to obtain the inter-class inverse document statistic.
In a further possible implementation manner of the first aspect, the calculation formula of the target score ratio is as follows:
score = max(CR_1, CR_2, ..., CR_n) / (CR_1 + CR_2 + ... + CR_n)
where score is the target score ratio, n is the number of intentions in the first training sample set, and max(CR_1, CR_2, ..., CR_n) is the highest concentration score of the first keyword among its corresponding intentions.
In this embodiment, the n intentions are traversed and the operation is repeated n times to obtain the concentration score of the first keyword for each intention; the scores are then ranked from large to small, and the top-1 intention is the intention in which the first keyword is most concentrated.
The concentration score of the highest-scoring <intention, first keyword> pair is then divided by the sum of the n concentration scores of the first keyword, that is, the ratio of the top-1 score to the total concentration score, to obtain the target score ratio, which indicates whether the first keyword appears overwhelmingly in one intention.
In a second aspect, an embodiment of the present application provides a sample data distribution optimizing apparatus, where the apparatus includes at least a first obtaining unit, an extracting unit, a second obtaining unit, a calculating unit, a determining unit, a labeling unit, and a processing unit. The sample data distribution optimizing device is used for implementing the method described in any implementation manner of the first aspect, wherein the first acquisition unit, the extraction unit, the second acquisition unit, the calculation unit, the determination unit, the labeling unit and the processing unit are described as follows:
the first acquisition unit is used for acquiring a first training sample set, wherein the first training sample set comprises a plurality of training texts, and each training text in the plurality of training texts is marked with an intention;
the extraction unit is used for extracting keywords of a plurality of training texts in the first training sample set to obtain a keyword set;
the second acquisition unit is used for acquiring the word frequency and inverse document frequency of a first keyword in the keyword set in a second training sample set, wherein the first keyword is any keyword in the keyword set, and the second training sample set is the set formed by the training texts with the same intention as the training text to which the first keyword belongs; the word frequency is the proportion of the second training sample set accounted for by the training texts containing the first keyword, and the inverse document frequency is used for representing the frequency of occurrence of the first keyword in the second training sample set;
the computing unit is used for computing, according to the word frequency and inverse document frequency of each keyword in the keyword set, a concentration degree score of each keyword corresponding to each intention, wherein the concentration degree score of a keyword corresponding to any intention is used for representing the concentration degree of the keyword in the training texts of that intention;
a determining unit, configured to determine a target keyword according to a concentration score of each keyword in the keyword set corresponding to each intention;
The labeling unit is used for labeling the intention of a third training sample set which contains the target keywords and is not labeled with the intention so as to obtain a negative sample set;
the processing unit is used for processing the first training sample set according to the negative sample set to obtain an updated first training sample set, wherein the first training sample set is used for training to obtain a multi-classification model, and the model is used for predicting the intention classification of the input samples.
The key point of the embodiment of the application is to optimize the training data of a classification model so as to solve the problem of model overfitting. Common causes of overfitting include the following: the model itself is so complex that it fits the noise in the training sample set, in which case a simpler model needs to be selected or the model needs to be pruned; the training samples are too few or lack representativeness, in which case the number of samples or the diversity of the samples needs to be increased; noise in the training samples interferes and the model fits that noise, in which case the noisy data needs to be removed or a model insensitive to noise needs to be chosen.
In a multi-classification task, if a word (keyword) appears almost only in one intention, the model will consider that any text in which the word appears is most likely to belong to that intention. Such an unbalanced distribution of the training data leads to serious model overfitting and greatly reduces the recognition effect: an extremely flexible model can greatly reduce the error on the training data (training MSE), but the cost of this extreme flexibility is that errors in the training data are absorbed during model estimation, which harms the prediction capability on new data (verification data), i.e. the test MSE. Similarly, when the fit to the regularities of the data is inadequate, the test MSE will also be relatively large, which is underfitting.
In general, solving the problem of model overfitting is done from two aspects:
the first is to optimize training data, such as labeling positive and negative proportions of training samples, labeling the richness of text semantics of the data, and the like.
The second is to optimize the model network architecture, such as adding regularization, reducing model training parameters, optimizing training rounds, etc.
For the same training sample set, a more excellent model and better training tuning yield an algorithm model with better effect. However, during modeling, the training data determine the upper limit of the model effect. Good training texts enable the model to converge faster, learn more knowledge, and reduce the degree of overfitting.
The core of the method is to search the training samples for the source of overfitting, namely a word whose corresponding labels are overly concentrated. For example, training texts containing the word "profit" generally correspond to the label, or intention, "finance", but a text containing "profit" may also correspond to "internet" or "real estate". If the number of training texts containing "profit" under the "finance" label is far greater than the number under the other labels, a model trained on these training texts is likely to directly attach the "finance" label whenever it predicts an input text containing "profit"; this is the model overfitting problem. Therefore, by balancing the number of training texts containing "profit" under the "finance" label against those under the other labels, the model can also consider intentions other than "finance" when an input text containing the keyword "profit" is predicted.
Specifically, the existing labeled first training sample set is split, that is, words are segmented from all training texts in the first training sample set to obtain a keyword set; the concentration degree of each keyword in the keyword set in the training texts corresponding to a certain intention is then determined, that is, whether a situation similar to the above example occurs in which the number of training texts containing "profit" under the label or intention "finance" far exceeds the number of training texts containing "profit" under labels such as "internet" or "real estate"; by finding and correcting such imbalances, the model overfitting problem is solved.
In a further possible implementation manner of the second aspect, the extraction unit is specifically configured to:
word segmentation is carried out on a plurality of training texts in the first training sample set through a word segmentation tool so as to obtain a word set;
constructing a business keyword library, wherein the business keyword library comprises a plurality of keywords related to business;
and screening the word set according to the business keyword library to obtain the keyword set.
The key point of this embodiment is to determine the keywords in the first training sample set, where a keyword is used to represent the semantics of the training texts it belongs to. This step is necessary because the method provided in this embodiment searches the first training sample set for words that are concentrated in the training texts of certain labels; however, some auxiliary words and connecting words occur very frequently, and computing the related statistics for every word in the training texts consumes a lot of time and is inefficient. Therefore, the training texts in the first training sample set are screened to obtain keywords capable of representing text semantics, which improves the efficiency of subsequent operations; furthermore, by constructing a business keyword library, keywords strongly related to the business are obtained from the first training sample set, which saves time.
In a further possible implementation manner of the second aspect, the determining unit is specifically configured to:
ranking the concentration scores of the first keyword in the keyword set over its corresponding intentions to obtain the target intention for which the first keyword has the highest concentration score;
obtaining a target score ratio according to the concentration scores of the intentions corresponding to the first keyword, wherein the target score ratio is the proportion of the concentration score of the first keyword in the target intention to the sum of the concentration scores of the first keyword over its corresponding intentions;
and determining the first keyword with the target score ratio higher than a preset threshold value as the target keyword.
In this embodiment, the concentration score is based on the training texts of an intent label and represents the concentration degree of the first keyword in the training texts corresponding to that intent label; the higher the concentration score, the more frequently the first keyword appears in the training texts corresponding to the intent label. The application therefore also constructs an improved word frequency-inverse document frequency calculation method for the intra-class and inter-class co-occurrence relationships of keywords in a multi-intention corpus.
However, it should be noted that a high concentration degree of the first keyword in the training texts corresponding to one intent tag does not by itself mean that the first keyword is a source of model overfitting; its concentration degree in the training texts corresponding to the other intent tags must also be examined. If the concentration degree of the first keyword in the training texts of the other intent tags is far lower than that in the training texts of this intent tag, that is, the target score ratio exceeds the preset threshold, then the first keyword may be a source of model overfitting. The process of determining the target keyword is therefore the key to the implementation of the application.
In a further possible implementation manner of the second aspect, the processing unit is specifically configured to:
constructing an unlabeled third training sample set, wherein the third training sample set comprises a plurality of training samples;
searching training texts containing the target keywords in the unlabeled third training sample set;
inputting the training text into a multi-classification model trained according to the first training sample set for prediction;
and checking the prediction result of the multi-classification model to obtain a negative sample set.
After the target keywords are determined, they are searched for in a historical corpus, namely the third training sample set, to obtain training texts containing the target keywords. It should be noted that the training texts in the unlabeled third training sample set have no corresponding labels, and manual labeling would require a lot of manpower and material resources; the training texts are therefore input into the multi-classification model trained on the first training sample set for prediction, and the prediction results are then checked. The negative samples are those whose intention is inconsistent with the intention labels of the training texts containing the target keyword in the first training sample set, that is, the samples required by the method; they are finally summarized into a negative sample set and added to the first training sample set. This process is repeated until the negative sample sets of all target keywords have been determined and the optimization of the distribution of positive and negative samples of the first training sample set is completed.
In a further possible implementation manner of the second aspect, the apparatus further includes:
the training unit is used for training the multi-classification model trained by the first training sample set according to the updated first training sample set so as to obtain an updated multi-classification model;
the test unit is used for testing the updated multi-classification model according to test data to obtain a test result, wherein the test result comprises a mean square error value, and the mean square error value is the mean square error between the result output by the multi-classification model for the test data and the result output for the training data;
the judging unit is used for judging whether the updated first training sample set causes the over-fitting problem of the updated multi-classification model according to the preset threshold value and the numerical value of the mean square error;
and the optimizing unit is used for optimizing the sample distribution of the updated first training sample set if the multi-classification model has the over-fitting problem.
In this embodiment, the degree of overfitting of the model is represented by the mean square error value; the multi-classification model obtained by training on the distribution-optimized first training sample set is therefore verified, omissions in the optimized distribution of the first training sample set are checked for and filled, and if a problem appears it is corrected in time.
In a third aspect, embodiments of the present application provide a sample data distribution optimization apparatus, including a processor, a memory, and a communication interface; a memory having a computer program stored therein; the communication interface is for transmitting and/or receiving data when the processor executes a computer program, the sample data distribution optimizing device being operable to perform the method described in the first aspect or any of the possible implementations of the first aspect.
It should be noted that the processor included in the sample data distribution optimizing apparatus described in the third aspect may be a processor dedicated to performing the methods (referred to as a dedicated processor for convenience of distinction), or may be a processor that performs the methods by calling a computer program, such as a general-purpose processor. In the alternative, the at least one processor may also include both special purpose and general purpose processors.
Alternatively, the above-mentioned computer program may be stored in a memory. For example, the memory may be a non-transitory memory, such as a read-only memory (ROM), which may be integrated on the same device as the processor or separately disposed on different devices; the type of the memory and the manner in which the memory and the processor are disposed are not limited in the embodiments of the present application.
In a possible embodiment, the at least one memory is located outside the sample data distribution optimizing device.
In yet another possible embodiment, the at least one memory is located within the sample data distribution optimizing device.
In a further possible embodiment, a part of the memory of the at least one memory is located inside the sample data distribution optimizing device and another part of the memory is located outside the sample data distribution optimizing device.
In this application, the processor and the memory may also be integrated in one device, i.e. the processor and the memory may also be integrated together.
In a fourth aspect, embodiments of the present application provide a computer-readable storage medium having a computer program stored therein, which when executed on at least one processor, implements the method described in the foregoing first aspect or any of the alternatives of the first aspect.
In a fifth aspect, the present application provides a computer program product comprising a computer program for implementing the method of the first aspect or any of the alternatives of the first aspect, when said program is run on at least one processor.
Alternatively, the computer program product may be a software installation package, which may be downloaded and executed on a computing device in case the aforementioned method is required.
The technical solutions provided in the third to fifth aspects of the present application may refer to the beneficial effects of the technical solutions in the first aspect and the second aspect, and are not described herein again.
Drawings
The drawings that are used in the description of the embodiments will be briefly described below.
FIG. 1 is a schematic architecture diagram of a sample data distribution optimization system according to an embodiment of the present application;
FIG. 2 is a schematic flow chart of a sample data distribution optimization method according to an embodiment of the present application;
FIG. 3 is a schematic flow chart of a sample data optimization verification method according to an embodiment of the present application;
FIG. 4 is a schematic structural diagram of a sample data distribution optimizing device according to an embodiment of the present application;
fig. 5 is a schematic structural diagram of a sample data distribution optimizing device according to an embodiment of the present application.
Detailed Description
Embodiments of the present application will be described in detail below with reference to the accompanying drawings.
The terms "first," "second," "third," and "fourth" and the like in the description and in the claims of this application and in the drawings, are used for distinguishing between different objects and not for describing a particular sequential order. Furthermore, the terms "comprise" and "have," as well as any variations thereof, are intended to cover a non-exclusive inclusion. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those listed steps or elements but may include other steps or elements not listed or inherent to such process, method, article, or apparatus.
The following describes a system architecture applied to the embodiment of the present application. It should be noted that, the system architecture and the service scenario described in the present application are for more clearly describing the technical solution of the present application, and do not constitute a limitation on the technical solution provided in the present application, and those skilled in the art can know that, with the evolution of the system architecture and the appearance of the new service scenario, the technical solution provided in the present application is also applicable to similar technical problems.
First, terms related to one or more embodiments of the present invention will be explained.
Word frequency is defined herein as the proportion of all labeled corpora corresponding to a certain intent in which a word appears. For example, if intent A has 2000 labeled text corpora and keyword a appears in 1600 of the 2000 texts, the word frequency is 0.8.
The inverse document frequency (Inverse Document Frequency, IDF) represents how widely a word appears across all texts (referred to here as the text set corresponding to an intention); if a word appears in many texts, it should have a low IDF value.
Overfitting: because the training data contain sampling errors, a complex model also fits these sampling errors while its parameters are being fitted; the algorithm then performs well on the training set but has poor generalization performance on the test set.
A classification model is a kind of supervised learning model that classifies samples into different categories; in essence, the model predicts the labels of samples by learning a series of prior features.
Referring to fig. 1, fig. 1 is a schematic architecture diagram of a sample data distribution optimization system provided in an embodiment of the present application, where the system includes a server 101 and a terminal 102, where the terminal 102 communicates with the server 101 through a network.
The server 101 may be implemented as a stand-alone server or a server cluster composed of a plurality of servers, which is not particularly limited in this application. After receiving the indication information from the terminal 102, the server 101 first obtains a first training sample set from the database, and performs word segmentation and screening on each training text in the first training sample set to obtain a keyword set. Then, the server 101 calculates the importance degree of the corresponding keyword in the corresponding second training sample set according to the word frequency and the inverse document frequency of any keyword in the keyword set in the second training sample set, and evaluates whether the keyword is the target keyword according to the importance degree. If the keyword is a target keyword, downloading unlabeled corpus containing the keyword from a database, inputting the corpus into a multi-classification model obtained through training of the first training sample set, processing a model output result to obtain a negative sample set, adjusting the number of training texts comprising each target keyword in the negative sample set, summarizing the adjusted negative sample set and the first training sample set to obtain an updated first training sample set, and finally transmitting the updated first training sample set to the database.
The training samples, e.g., the first training sample set, and the corpus, e.g., the third training sample set, are all stored in a database, which may be located on server 101 or may exist independently of server 101.
The terminal 102 may be, but not limited to, various personal computers, notebook computers, smartphones, tablet computers and portable wearable devices, which is not particularly limited in this application, and is mainly used for controlling the server 101, and optionally, for verifying the output result obtained after inputting the corpus into the multi-classification model obtained through the training of the first training sample set.
Referring to fig. 2, fig. 2 is a flowchart of a sample data distribution optimization method according to an embodiment of the present application, where the sample data distribution optimization method may be implemented based on the system architecture shown in fig. 1, or may be implemented based on other architectures, and the method includes, but is not limited to, the following steps:
step S201: a first training sample set is acquired.
Wherein the first training sample set includes a plurality of training texts, each of the plurality of training texts labeled with an intent.
It will be appreciated that the first training sample set may be a collection of texts in different scenarios. Illustratively, in an intelligent multi-turn dialogue scenario, a training text may be a sentence of dialogue information between the user and the robot, for example "modify bank card password", "i want to modify mobile phone number", or "modify binding mobile phone number", and each sentence segment is a corpus sample. The training texts are labeled manually in advance, that is, each training text is associated with one corpus label, and the corpus label is the intention of the training text.
Because the embodiment of the application targets the problem of model overfitting, there may optionally be a certain problem with the proportion of positive and negative samples in the first training sample set.
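For illustration only, a first training sample set in the dialogue scenario above could be organized as follows; the intention names and the grouping by intention are assumptions, used as the layout for the later word frequency and inverse document frequency calculations.

```python
# Hypothetical illustration of a first training sample set: each training text
# carries one manually labeled intention, as described in step S201.
from collections import defaultdict

first_training_sample_set = [
    ("modify bank card password", "modify password"),
    ("i want to modify mobile phone number", "modify phone number"),
    ("modify binding mobile phone number", "modify phone number"),
]

# Group the texts by intention; this per-intention view is what the later word
# frequency / inverse document frequency calculations operate on.
texts_by_intent = defaultdict(list)
for text, intent in first_training_sample_set:
    texts_by_intent[intent].append(text)

print(dict(texts_by_intent))
```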
Step S202: and extracting keywords of a plurality of training texts in the first training sample set to obtain a keyword set.
The keywords are words in the training texts that can represent those training texts, and the keyword set is the collection of one or more keywords. Text refers to a representation of written language, typically a sentence or a combination of sentences with a complete, systematic meaning; a text may be a sentence, a paragraph, or a chapter.
In an alternative embodiment, the keyword set is obtained through a word segmentation tool and a keyword library, which is specifically as follows:
firstly, word segmentation is performed on the plurality of training texts in the first training sample set through a word segmentation tool to obtain a word set. In this embodiment, the word segmentation tool is jieba, a Chinese word segmentation library commonly used in the NLP field that supports simple, parallel, and command-line word segmentation as well as keyword extraction, part-of-speech tagging, word position query, and the like. Optionally, keys are set between the split words and the training texts they belong to, so that the corresponding training text can be traced back from a word at a later stage, avoiding confusion when a plurality of keywords are found in one training text; the split words are summarized to obtain the word set.
Secondly, a business keyword library is constructed according to requirements, wherein the business keyword library comprises a plurality of keywords related to the business. Taking banking business as an example, the corresponding business keyword library includes various banking vocabulary items, such as: bank card, password, withdrawal, remittance, identity card, mobile phone number, and the like. This can improve the keyword extraction efficiency and reduce the time consumed by model operation;
and finally, the word set is screened according to the business keyword library to obtain the keyword set.
In this process, the words in the word set are compared with the business keyword library one by one, and the words that match successfully are added to the keyword set until all words in the word set have been compared.
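A sketch of this step is given below, assuming the jieba package mentioned above is installed; the contents of business_keyword_library are merely example banking terms, and the word-to-text mapping corresponds to the optional keys described earlier.

```python
# Sketch of step S202: jieba word segmentation followed by filtering against a
# business keyword library. The library below only holds example banking terms.
import jieba  # pip install jieba

business_keyword_library = {"银行卡", "密码", "取款", "汇款", "身份证", "手机号"}

def extract_keyword_set(training_texts):
    keyword_set = set()
    word_to_texts = {}  # optional keys: trace a word back to the texts it came from
    for text in training_texts:
        for word in jieba.lcut(text):  # simple (precise-mode) segmentation
            if word in business_keyword_library:
                keyword_set.add(word)
                word_to_texts.setdefault(word, []).append(text)
    return keyword_set, word_to_texts

keywords, traceback_keys = extract_keyword_set(["修改银行卡密码", "我要修改手机号"])
print(keywords)  # the segmented words that also appear in the business keyword library
```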
Step S203: and acquiring word frequency and inverse document frequency of the first keyword in the keyword set in the second training sample set.
The first keyword is any keyword in the keyword set, and the second training sample set is the set formed by the training texts with the same intention as the training text to which the first keyword belongs; the word frequency is the proportion of the second training sample set accounted for by the training texts containing the first keyword, and the inverse document frequency is used for representing the frequency of occurrence of the first keyword in the second training sample set.
It should be noted that, in step S203, the word frequency and the inverse document frequency of each keyword in the keyword set need to be calculated. The second training sample sets are associated with keywords, at least one second training sample set corresponds to each keyword, and the second training sample sets corresponding to any keyword are different.
It should be noted that, in the present application, the word frequency and inverse document frequency are calculated by taking each intention corresponding to a keyword as a unit, so as to obtain the concentration degree of the keyword in each corresponding intention. For example, the training texts corresponding to the keyword "profit" carry several intention labels, including "finance", "real estate", "internet", and so on; the word frequency is calculated by traversing each intention label and computing, according to the calculation formula, the proportion of training texts containing "profit" among the training texts corresponding to that intention label, for example the intention "finance". The calculation formula is as follows:
freq = X_w / X
where freq is the word frequency, X_w is the number of training texts of the ith intention that contain the keyword, and X is the total number of training texts corresponding to the ith intention; the ith intention can be any intention labeled on the training texts corresponding to the keyword.
Still taking the keyword "profit" as an example, the calculation logic of the inverse document frequency calculates and averages the inverse document frequencies for m-1 intents other than the intent i, respectively, to obtain the inverse document frequency of the keyword "profit" among the intents other than the intent i.
It will be appreciated that the higher the word frequency, the more important the keyword can be initially judged to be for training text.
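The word frequency calculation can be sketched directly from the formula above; the intention-to-texts dictionary layout is an assumption, and the toy data reproduce the 2000-text/1600-text example given in the term definitions (word frequency 0.8).

```python
# Word frequency of a keyword under one intention, per the formula freq = X_w / X.
# texts_by_intent maps an intention label to its list of training texts (assumed layout).
def word_frequency(keyword, intent, texts_by_intent):
    texts = texts_by_intent[intent]
    containing = sum(1 for t in texts if keyword in t)  # X_w: texts containing the keyword
    return containing / len(texts)                      # X: total texts of this intention

# Toy data reproducing the example in the term definitions: 2000 labeled texts,
# 1600 of which contain the keyword, giving a word frequency of 0.8.
demo = {"finance": ["profit report"] * 1600 + ["exchange rate"] * 400}
print(word_frequency("profit", "finance", demo))  # 0.8
```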
The calculation formula of the inverse document frequency is as follows:
idf = (1/(m-1)) * Σ_{j≠i} log(Y_j / (Y_j^w + 1))
where idf is the inverse document frequency of the keyword with respect to intention i, m is the number of intentions, Y_j is the total number of training texts of the jth intention, and Y_j^w is the number of training texts of the jth intention that contain the keyword.
It will be appreciated that the higher the inverse document frequency, the less important the keyword can be initially judged to be to training text.
When the word frequency of the keyword profit is calculated, the corresponding inverse document frequency is calculated correspondingly, and the word frequency and the inverse document frequency are inseparable.
Alternatively, after the word frequency and inverse document frequency calculation of the keyword "profit" in all intents is completed, the calculation of the word frequency and inverse document frequency of the next keyword is started.
In general, TF-IDF (term frequency-inverse document frequency) is a statistical method commonly used for text processing, and can evaluate the importance of a word in a document. The method can be used for extracting the document keywords simply, and is calculated according to word frequency and inverse document frequency.
However, in the embodiment of the present application, the inverse document frequency is calculated differently from the common inverse document frequency. The common inverse document frequency is based on the occurrence of a keyword in all training texts, whereas the inverse document frequency of the present application is calculated over the training texts of a single intention; since each keyword is associated with multiple intentions, each keyword has multiple inverse document frequencies, which are used in this embodiment to determine the concentration degree of the keywords in the first training sample set.
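A sketch of this per-intention, averaged inverse document frequency follows; the dictionary layout and the inner log(Y_j / (Y_j^w + 1)) term, which follows the usual IDF convention, are assumptions rather than wording taken from the embodiment.

```python
import math

# Inter-class inverse document frequency of a keyword with respect to intention i:
# a standard IDF term is computed for each of the other m-1 intentions and averaged.
# The inner form log(Y_j / (Y_j_w + 1)) is an assumption following the usual IDF
# convention; Y_j is the number of texts of intention j, Y_j_w those containing the keyword.
def inverse_document_frequency(keyword, intent_i, texts_by_intent):
    other_intents = [j for j in texts_by_intent if j != intent_i]
    total = 0.0
    for j in other_intents:
        y_j = len(texts_by_intent[j])
        y_j_w = sum(1 for t in texts_by_intent[j] if keyword in t)
        total += math.log(y_j / (y_j_w + 1))
    return total / len(other_intents)  # average over the m-1 other intentions

demo = {
    "finance": ["profit is up"] * 8 + ["rate cut"] * 2,
    "real estate": ["housing price"] * 9 + ["profit margin"],
    "internet": ["app update"] * 10,
}
print(inverse_document_frequency("profit", "finance", demo))
```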
Step S204: and calculating the concentration degree score of each keyword corresponding to each intention according to the word frequency and the inverse document frequency of each keyword in the keyword set.
The concentration degree score of a keyword corresponding to any intention is used for representing the concentration degree of the keyword in the training texts of that intention, thereby describing the intra-class and inter-class co-occurrence relationships of the keyword; the calculation formula of the concentration score is as follows:
CR=freq*idf
where CR is the concentration score, freq is the word frequency, and idf is the inverse document frequency.
It will be appreciated that the higher the concentration score, the more important the keyword can be initially judged to be for the training texts of that intention rather than for other training texts. A concentration degree calculation method at the intention-keyword dimension is constructed for the characteristics of multi-classification training data sets, which alleviates the model overfitting caused by the uneven distribution that may occur in the original training data set. In practical tests, with the same classification algorithm, a training set obtained by the method can improve the final test precision of the model by about 2%.
Optionally, each keyword in the keyword set has a plurality of concentration scores, one per intention corresponding to the keyword. Taking the keyword "profit" as an example, suppose the training texts corresponding to "profit" carry three intention labels: "finance", "real estate", and "internet"; then the keyword "profit" has three concentration scores, corresponding to the three intention labels respectively.
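Combining the two quantities gives the concentration score; the sketch below computes CR = freq * idf for one keyword-intention pair from an intention-to-texts dictionary (an assumed layout), with the same caveat as above about the inner IDF form.

```python
import math

# Concentration score CR = freq * idf for one (keyword, intention) pair, computed
# directly from an intention -> texts mapping (assumed layout; IDF inner form as above).
def concentration_score(keyword, intent_i, texts_by_intent):
    texts_i = texts_by_intent[intent_i]
    freq = sum(1 for t in texts_i if keyword in t) / len(texts_i)
    other_intents = [j for j in texts_by_intent if j != intent_i]
    idf = sum(
        math.log(len(texts_by_intent[j])
                 / (sum(1 for t in texts_by_intent[j] if keyword in t) + 1))
        for j in other_intents
    ) / len(other_intents)
    return freq * idf

# One concentration score per intention for a given keyword:
# {intent: concentration_score("profit", intent, texts_by_intent) for intent in texts_by_intent}
```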
Step S205: and determining target keywords according to the concentration degree scores of each keyword in the keyword set corresponding to the intentions.
Specifically, the concentration scores of the first keyword in the keyword set over its corresponding intentions are ranked to obtain the target intention for which the first keyword has the highest concentration score. Still taking the keyword "profit" as an example, suppose the training texts corresponding to "profit" carry three intention labels: the concentration score of "profit" in the intention "finance" is 90 points, in the intention "real estate" 5 points, and in the intention "internet" 5 points. The concentration score in "finance" is the highest, so sorting and comparison show that the target intention is "finance".
A target score ratio is then obtained according to the concentration scores of the intentions corresponding to the first keyword; the target score ratio is the proportion of the concentration score of the first keyword in the target intention to the sum of the concentration scores of the first keyword over its corresponding intentions.
The calculation formula of the target score ratio score is as follows:
score = max(CR_1, CR_2, ..., CR_n) / (CR_1 + CR_2 + ... + CR_n)
where score is the target score ratio, n is the number of intentions in the first training sample set, and max(CR_1, CR_2, ..., CR_n) is the highest concentration score of the first keyword among its corresponding intentions.
In the above example, the target score ratio of the keyword "profit" is 90/100 = 0.9. With a preset threshold of 0.85 in this embodiment, the target score ratio of "profit" exceeds the threshold, so "profit" can be determined as a target keyword, i.e. a word that carries a high risk of model overfitting; the specific value of the preset threshold should be set flexibly for different task requirements.
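A minimal sketch of this selection logic, assuming the score dictionary produced by the earlier sketch; the helper name and the handling of a degenerate score sum are illustrative assumptions, and the threshold remains task dependent.

```python
def select_target_keywords(scores, threshold=0.85):
    """Pick keywords whose concentration mass sits in a single intention.

    scores    -- dict keyword -> {intention: concentration score}
    threshold -- preset threshold on the target score ratio (task dependent)
    Returns a dict: target keyword -> its target intention.
    """
    targets = {}
    for kw, per_intent in scores.items():
        total = sum(per_intent.values())
        if total <= 0:
            continue  # no usable score mass for this keyword
        # The intention with the highest concentration score is the target intention.
        target_intent, best = max(per_intent.items(), key=lambda item: item[1])
        ratio = best / total          # e.g. 90 / (90 + 5 + 5) = 0.9 for "profit"
        if ratio > threshold:
            targets[kw] = target_intent
    return targets
```

The keywords returned here are the ones judged to carry a high overfitting risk; they drive the negative sample sampling described next.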
Step S206: and labeling the intention of a third training sample set which contains the target keywords and is not labeled with the intention, so as to obtain a negative sample set.
The third training sample set comprises a plurality of training samples, each of which contains the target keyword; there may be one or more target keywords, and if there is more than one, each target keyword has its own corresponding third training sample set.
In an alternative embodiment, an unlabeled third training sample set is first constructed.
In practical applications, the third training sample set can be obtained in various ways. For example, an operator may send an instruction to the execution body to obtain the third training sample set, and the execution body, for example a server, starts to obtain the third training sample set after receiving the instruction; the server may also obtain the third training sample set automatically at preset intervals, for example after a preset time length has elapsed; or, after a preset time length, a terminal with an original-corpus extraction function may obtain the third training sample set automatically. The manner in which the third training sample set is obtained is not limited in this specification.
In addition, the third training sample set may be documents in any format, such as documents in DOC format, txt format, image format, or PDF (Portable Document Format), which is not limited in this specification.
After the third training sample set is obtained, its text content may be extracted: a text box extraction tool is selected according to the format of the third training sample set, and text boxes containing the characters or texts that make up the text content are extracted from the third training sample set by the tool. Selecting a text box extraction tool matched to the format of the third training sample set in this way improves both the accuracy and the speed of extracting the text content.
For example, if the obtained third training sample set is in PDF format, a pdfminer tool corresponding to the PDF format is selected and an extraction operation is performed on the third training sample set, so that at least one text box containing text content is extracted and the text content of the third training sample set is obtained. For another example, if the obtained third training sample set is in an image format, an optical character recognition (OCR) tool corresponding to the image format is selected and an extraction operation is performed on the third training sample set, so that at least one text box containing text content is extracted and the text content of the third training sample set is obtained.
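The format-specific extraction could look like the sketch below. extract_text (from the pdfminer.six package) and pytesseract.image_to_string are real library calls, but the extension-based dispatch and the function name are assumptions of this sketch.

```python
from pdfminer.high_level import extract_text   # pdfminer.six
from PIL import Image
import pytesseract

def extract_sample_text(path):
    """Pull raw text out of one third-sample document, dispatching on its format."""
    lower = path.lower()
    if lower.endswith(".pdf"):
        return extract_text(path)                              # PDF via pdfminer
    if lower.endswith((".png", ".jpg", ".jpeg")):
        return pytesseract.image_to_string(Image.open(path))   # image via OCR
    with open(path, encoding="utf-8") as fh:                   # txt and other plain text
        return fh.read()
```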
Searching for training texts containing the target keywords in the unlabeled third training sample set. If there is only one target keyword, no further arrangement is needed; if there are several target keywords, the third training sample set is arranged so that each target keyword has a corresponding third training sample set composed of the training texts containing that keyword.
And inputting the training texts containing the target keywords into the multi-classification model trained on the first training sample set for prediction, and checking the prediction results of the multi-classification model. Because the multi-classification model was trained on the first training sample set, its outputs for training texts containing the target keywords still exhibit overfitting; the outputs are therefore checked, the texts predicted as the target intention are taken out and their labels verified, and this checking can be performed manually or completed by a preset checking program.
And finally, after checking, the training texts whose labels were corrected are gathered to form the negative sample set, while the training texts that needed no correction can be used to supplement the positive texts of the first training sample set, which is convenient for later corpus expansion.
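Putting the searching, prediction and checking steps together, the filtering logic might be sketched as follows; the predict interface on the multi-classification model and the function name are assumptions, and the label check itself is left to a human reviewer or a preset program as described above.

```python
def mine_negative_candidates(unlabeled_texts, target_keyword, target_intent, model):
    """Collect candidate negative samples for one target keyword.

    unlabeled_texts -- iterable of raw, unlabeled training texts
    target_keyword  -- keyword judged to carry a high overfitting risk
    target_intent   -- intention in which that keyword is over-concentrated
    model           -- multi-classification model trained on the first sample set,
                       assumed to expose predict(text) -> intention label
    """
    candidates = []
    for text in unlabeled_texts:
        if target_keyword not in text:
            continue                      # keep only texts containing the target keyword
        if model.predict(text) == target_intent:
            # Prediction likely driven by the keyword alone (possible overfitting);
            # queue the text for label checking and possible correction.
            candidates.append(text)
    return candidates
```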
Step S207: and processing the first training sample set according to the negative sample set to obtain an updated first training sample set.
The first training sample set is used for training to obtain a multi-classification model, and the model is used for predicting the intention classification of the input samples.
And injecting the negative sample set into the first training sample set according to the ratio of positive to negative texts in the first training sample set, so as to obtain an updated first training sample set. In general, the ratio of positive to negative texts in the training data is 1:3, i.e. the number of negative samples far exceeds the number of positive samples; if the number of samples in the negative sample set is insufficient, the negative sample set is padded by methods such as word replacement and noise injection.
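As an illustration of this injection step, the sketch below tops up a short negative set with a caller-supplied augmentation function and mixes it into the positive texts at roughly the 1:3 ratio mentioned above; the function names and the padding strategy are assumptions of this sketch.

```python
import random

def inject_negatives(positives, negatives, neg_per_pos=3, augment=None, seed=0):
    """Return an updated training set with negatives injected at ~1:neg_per_pos.

    positives -- list of (text, intention label) positive samples
    negatives -- list of (text, intention label) samples from the negative set
    augment   -- optional callable text -> text (e.g. word replacement or noise
                 injection) used to pad the negative set when it is too small
    """
    rng = random.Random(seed)
    needed = neg_per_pos * len(positives)
    padded = list(negatives)
    while augment is not None and negatives and len(padded) < needed:
        base_text, label = rng.choice(negatives)
        padded.append((augment(base_text), label))   # fill the shortfall by augmentation
    updated = positives + padded[:needed]
    rng.shuffle(updated)                              # avoid ordering bias in training
    return updated
```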
In an alternative implementation, after the first training sample set is processed according to the negative sample set to obtain the updated first training sample set, the optimized distribution of the first training sample set is verified. Fig. 3 is a schematic flow chart of a sample data optimization verification method provided in an embodiment of the present application, and the method includes:
Step S301: And training the multi-classification model trained by the first training sample set according to the updated first training sample set to obtain an updated multi-classification model.
Step S302: and testing the updated multi-classification model according to the test data to obtain a test result.
The test data are training texts that contain the target keywords but differ from the training texts in the first training sample set. The test result includes a mean square error value, which is the value of the mean square error between the results output by the multi-classification model on the test data and the results it outputs on the training data;
the formula for calculating the mean square error MSE is as follows:
Figure BDA0004110248950000141
Step S303: And judging, according to the preset threshold and the value of the mean square error, whether the updated first training sample set causes an overfitting problem in the updated multi-classification model.
Mean square error (MSE) is frequently used as a loss function in machine learning to characterize how far predicted values deviate from actual values. The MSE can be decomposed into the variance of the point estimate plus the square of its bias; in general, a model that trains to a small bias but a large variance is prone to overfitting.
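A small sketch of the verification check in steps S302-S303, assuming numeric model outputs; the function names and the direction of the threshold comparison are assumptions of this sketch.

```python
def mean_squared_error(y_true, y_pred):
    """Mean square error between actual and predicted values."""
    assert len(y_true) == len(y_pred) and len(y_true) > 0
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def causes_overfitting(y_true, y_pred, preset_threshold):
    """Flag the updated training set when the test MSE exceeds the preset threshold."""
    return mean_squared_error(y_true, y_pred) > preset_threshold
```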
Step S304: And if the multi-classification model has an overfitting problem, optimizing the sample distribution of the updated first training sample set.
Optionally, if the overfitting problem occurs again, a further negative sample set is obtained, and it is determined according to the actual situation whether to inject it into the first training sample set or to replace the existing negative samples in the first training sample set with it. The training texts of this negative sample set should be different from those of the negative sample set described above with respect to fig. 2.
In summary, the embodiment of the application provides an end-to-end optimization method: target keywords with a high overfitting risk are adaptively searched for and taken as the optimization target of the downstream negative sample sampling task; secondly, an improved word frequency-inverse document frequency calculation method is constructed to clearly express the co-occurrence relationships of keywords within and between classes in a multi-intention corpus; finally, through the processing of the target keywords and the filtering logic for new training texts containing them, a matched negative sample sampling method is obtained, and the training text distribution is optimized by adding a new negative sample corpus.
The foregoing details the method of embodiments of the present application, and the apparatus of embodiments of the present application is provided below.
Referring to fig. 4, fig. 4 is a schematic structural diagram of a sample data distribution optimizing apparatus 40 according to an embodiment of the present application, where the apparatus 40 may be a device in the aforementioned server, and the apparatus 40 may include a first obtaining unit 401, an extracting unit 402, a second obtaining unit 403, a calculating unit 404, a determining unit 405, a labeling unit 406, and a processing unit 407, where the respective units are described in detail below.
A first obtaining unit 401, configured to obtain a first training sample set, where the first training sample set includes a plurality of training texts, and each training text in the plurality of training texts is labeled with an intention;
an extracting unit 402, configured to extract keywords of a plurality of training texts in the first training sample set, to obtain a keyword set;
a second obtaining unit 403, configured to obtain word frequency and inverse document frequency of a first keyword in the keyword set in a second training sample set, where the first keyword is any keyword in the keyword set, and the second training sample set is a set formed by training texts with the same intention as a training text to which the first keyword belongs; the word frequency is the number of training texts containing the first keywords in the second training sample set and accounts for the proportion of the second training sample set, and the inverse document frequency is used for representing the occurrence frequency of the first keywords in the second training sample set;
a calculating unit 404, configured to calculate, according to the word frequency and the inverse document frequency of each keyword in the keyword set, a concentration score of each keyword corresponding to each intention, where the concentration score of each keyword corresponding to any intention is used to characterize the concentration degree of the keyword in a training text containing the any intention;
A determining unit 405, configured to determine a target keyword according to a concentration score of each keyword in the keyword set corresponding to each intention;
a labeling unit 406, configured to perform intent labeling on a third training sample set containing the target keyword and having no intent labeled, so as to obtain a negative sample set;
the processing unit 407 is configured to process the first training sample set according to the negative sample set to obtain an updated first training sample set, where the first training sample set is used for training to obtain a multi-classification model, and the model is used for predicting intent classification of the input samples.
In a possible implementation manner, the extracting unit 402 is specifically configured to:
word segmentation is carried out on a plurality of training texts in the first training sample set through a word segmentation tool so as to obtain a word set;
constructing a business keyword library, wherein the business keyword library comprises a plurality of keywords related to business;
and screening the word set according to the service keyword library to obtain the keyword set.
In a possible implementation manner, the determining unit 405 is specifically configured to:
ranking the concentration scores of the first keywords in the keyword set corresponding to the intentions to obtain target intentions corresponding to the first keywords with the highest concentration scores;
Obtaining a target score duty ratio according to the concentration score of the intention corresponding to the first keyword, wherein the target score duty ratio is the concentration score of the first keyword in the target intention and accounts for the proportion of the sum of the concentration scores of the first keyword corresponding to the intentions;
and determining the first keyword with the target score ratio higher than a preset threshold value as the target keyword.
In a possible implementation manner, the processing unit 407 is specifically configured to:
constructing an unlabeled third training sample set, wherein the third training sample set comprises a plurality of training samples;
searching training texts containing the target keywords in the unlabeled third training sample set;
inputting the training text into a multi-classification model trained according to the first training sample set for prediction;
and checking the prediction result of the multi-classification model to obtain a negative sample set.
In one possible embodiment, the apparatus 40 further comprises:
the training unit is used for training the multi-classification model trained by the first training sample set according to the updated first training sample set so as to obtain an updated multi-classification model;
The test unit is used for testing the updated multi-classification model according to test data to obtain a test result, wherein the test result comprises a mean square error value, and the mean square error value is the value of the mean square error between the result output by the multi-classification model according to the test data and the result output by the training data;
the judging unit is used for judging whether the updated first training sample set causes the over-fitting problem of the updated multi-classification model according to the preset threshold value and the numerical value of the mean square error;
and the optimizing unit is used for optimizing the sample distribution of the updated first training sample set if the multi-classification model has the fitting problem.
Referring to fig. 5, fig. 5 is a schematic structural diagram of a sample data distribution optimizing apparatus 50 according to an embodiment of the present application, where the sample data distribution optimizing apparatus 50 includes: a processor 501, a communication interface 502 and a memory 503. The processor 501, the communication interface 502, and the memory 503 may be connected by a bus or other means, which is exemplified in the embodiment of the present application.
The processor 501 is a computing core and a control core of the sample data distribution optimizing device 50, and may parse various instructions in the sample data distribution optimizing device 50 and various data of the sample data distribution optimizing device 50, for example: the processor 501 may be a central processing unit (Central Processing Unit, CPU) that may transfer various types of interaction data between internal structures of the sample data distribution optimizing device 50, and so on. Communication interface 502 may optionally include a standard wired interface, a wireless interface (e.g., WI-FI, mobile communication interface, etc.), and may be controlled by processor 501 to receive and transmit data; the communication interface 502 may also be used for transmission or interaction of signaling or instructions within the sample data distribution optimization device 50. The Memory 503 (Memory) is a Memory device in the sample data distribution optimizing apparatus 50 for storing programs and data. It will be appreciated that the memory 503 herein may include either a built-in memory of the sample data distribution optimizing device 50 or an extended memory supported by the sample data distribution optimizing device 50. The memory 503 provides a storage space storing the operating system of the sample data distribution optimizing device 50, and also storing program codes or instructions required by the processor to perform the corresponding operations, and optionally, storing related data generated after the processor performs the corresponding operations.
In the present embodiment, the processor 501 executes executable program code in the memory 503 for performing the following operations:
acquiring a first training sample set, wherein the first training sample set comprises a plurality of training texts, and each training text in the plurality of training texts is marked with an intention;
extracting keywords of a plurality of training texts in the first training sample set to obtain a keyword set;
acquiring word frequency and inverse document frequency of a first keyword in a second training sample set, wherein the first keyword is any keyword in the keyword set, and the second training sample set is a set formed by training texts with the same intention as the training text to which the first keyword belongs; the word frequency is the number of training texts containing the first keywords in the second training sample set and accounts for the proportion of the second training sample set, and the inverse document frequency is used for representing the occurrence frequency of the first keywords in the second training sample set;
according to word frequency and inverse document frequency of each keyword in the keyword set, calculating to obtain a concentration degree score of each keyword corresponding to each intention, wherein the concentration degree score of each keyword corresponding to any intention is used for representing the concentration degree of the keywords in training texts containing the any intention;
Determining target keywords according to the concentration degree score of each keyword in the keyword set corresponding to each intention;
performing intention labeling on a third training sample set containing the target keywords and not labeled with intention to obtain a negative sample set;
and processing the first training sample set according to the negative sample set to obtain an updated first training sample set, wherein the first training sample set is used for training to obtain a multi-classification model, and the model is used for predicting the intention classification of the input samples.
In an alternative, in the aspect of extracting keywords of a plurality of training texts in the first training sample set to obtain a keyword set, the processor 501 is specifically configured to:
word segmentation is carried out on a plurality of training texts in the first training sample set through a word segmentation tool so as to obtain a word set;
constructing a business keyword library, wherein the business keyword library comprises a plurality of keywords related to business;
and screening the word set according to the service keyword library to obtain the keyword set.
In an alternative, in determining the target keyword according to the concentration score of each keyword in the keyword set corresponding to the respective intent, the processor 501 is specifically configured to:
Ranking the concentration scores of the first keywords in the keyword set corresponding to the intentions to obtain target intentions corresponding to the first keywords with the highest concentration scores;
obtaining a target score duty ratio according to the concentration score of the intention corresponding to the first keyword, wherein the target score duty ratio is the concentration score of the first keyword in the target intention and accounts for the proportion of the sum of the concentration scores of the first keyword corresponding to the intentions;
and determining the first keyword with the target score ratio higher than a preset threshold value as the target keyword.
In an alternative, in the case of intent labeling the third training sample set containing the target keyword and having no intent to obtain a negative sample set, the processor 501 is specifically configured to:
constructing an unlabeled third training sample set, wherein the third training sample set comprises a plurality of training samples;
searching training texts containing the target keywords in the unlabeled third training sample set;
inputting the training text into a multi-classification model trained according to the first training sample set for prediction;
And checking the prediction result of the multi-classification model to obtain a negative sample set.
In an alternative, the processor 501 is further configured to:
training the multi-classification model trained by the first training sample set according to the updated first training sample set to obtain an updated multi-classification model;
testing the updated multi-classification model according to test data to obtain a test result, wherein the test result comprises a mean square error value, and the mean square error value is the value of the mean square error between the result output by the multi-classification model according to the test data and the result output according to the training data;
judging whether the updated first training sample set causes the fitting problem of the updated multi-classification model or not according to the preset threshold value and the numerical value of the mean square error;
and if the multi-classification model has the fitting problem, optimizing the sample distribution of the updated first training sample set.
Embodiments of the present application provide a computer readable storage medium storing a computer program comprising program instructions that, when executed by a processor, cause the processor to perform operations performed by a server in embodiments.
Embodiments of the present application also provide a computer program product that, when run on a processor, implements the operations performed by the server in embodiments.
Those skilled in the art will appreciate that all or part of the methods of the above embodiments may be implemented by a program instructing the relevant hardware; the program may be stored in a computer-readable storage medium and, when executed, may include the flows of the above method embodiments. The aforementioned storage medium includes various media capable of storing program code, such as ROM, RAM, magnetic disks or optical disks.

Claims (10)

1. A method for optimizing sample data distribution, the method comprising:
acquiring a first training sample set, wherein the first training sample set comprises a plurality of training texts, and each training text in the plurality of training texts is marked with an intention;
extracting keywords of a plurality of training texts in the first training sample set to obtain a keyword set;
acquiring word frequency and inverse document frequency of a first keyword in a second training sample set, wherein the first keyword is any keyword in the keyword set, and the second training sample set is a set formed by training texts with the same intention as the training text to which the first keyword belongs; the word frequency is the number of training texts containing the first keywords in the second training sample set and accounts for the proportion of the second training sample set, and the inverse document frequency is used for representing the occurrence frequency of the first keywords in the second training sample set;
According to word frequency and inverse document frequency of each keyword in the keyword set, calculating to obtain a concentration degree score of each keyword corresponding to each intention, wherein the concentration degree score of each keyword corresponding to any intention is used for representing the concentration degree of the keywords in training texts containing the any intention;
determining target keywords according to the concentration degree score of each keyword in the keyword set corresponding to each intention;
performing intention labeling on a third training sample set containing the target keywords and not labeled with intention to obtain a negative sample set;
and processing the first training sample set according to the negative sample set to obtain an updated first training sample set, wherein the first training sample set is used for training to obtain a multi-classification model, and the model is used for predicting the intention classification of the input samples.
2. The method of claim 1, wherein extracting keywords of a plurality of training texts in the first training sample set to obtain a keyword set comprises:
word segmentation is carried out on a plurality of training texts in the first training sample set through a word segmentation tool so as to obtain a word set;
Constructing a business keyword library, wherein the business keyword library comprises a plurality of keywords related to business;
and screening the word set according to the service keyword library to obtain the keyword set.
3. The method of claim 1, wherein the determining the target keyword based on the concentration score for each keyword in the set of keywords corresponding to a respective intent comprises:
ranking the concentration scores of the first keywords in the keyword set corresponding to the intentions to obtain target intentions corresponding to the first keywords with the highest concentration scores;
obtaining a target score duty ratio according to the concentration score of the intention corresponding to the first keyword, wherein the target score duty ratio is the concentration score of the first keyword in the target intention and accounts for the proportion of the sum of the concentration scores of the first keyword corresponding to the intentions;
and determining the first keyword with the target score ratio higher than a preset threshold value as the target keyword.
4. The method of claim 1, wherein the intent labeling of the third training sample set containing unlabeled intent of the target keyword to obtain a negative sample set comprises:
Constructing an unlabeled third training sample set, wherein the third training sample set comprises a plurality of training samples;
searching training texts containing the target keywords in the unlabeled third training sample set;
inputting the training text into a multi-classification model trained according to the first training sample set for prediction;
and checking the prediction result of the multi-classification model to obtain a negative sample set.
5. The method of claim 1, further comprising, after processing the first training sample set according to the negative sample set to obtain an updated first training sample set:
training the multi-classification model trained by the first training sample set according to the updated first training sample set to obtain an updated multi-classification model;
testing the updated multi-classification model according to test data to obtain a test result, wherein the test result comprises a mean square error value, and the mean square error value is the value of the mean square error between the result output by the multi-classification model according to the test data and the result output according to the training data;
Judging whether the updated first training sample set causes the fitting problem of the updated multi-classification model or not according to the preset threshold value and the numerical value of the mean square error;
and if the multi-classification model has the fitting problem, optimizing the sample distribution of the updated first training sample set.
6. The method of claim 1, wherein,
the calculation formula of the inverse document frequency is as follows:
idf = log( Y_j / (y_j + 1) ),  j = 1, 2, …, m
where idf is the inverse document frequency of the keyword for the j-th intention, m is the number of intentions, Y_j is the total number of training texts of the j-th intention, and y_j is the number of training texts of the j-th intention that contain the keyword.
7. The method of claim 3, wherein,
the calculation formula of the target score ratio is as follows:
score = max(CR1, CR2, …, CRn) / (CR1 + CR2 + … + CRn)
where score is the target score ratio, n is the number of intentions in the first training sample set, CRi is the concentration score of the first keyword under the i-th intention, and max(CR1, CR2, …, CRn) is the highest concentration score of the first keyword in the corresponding intentions.
8. A sample data distribution optimizing apparatus, the apparatus comprising:
the first acquisition unit is used for acquiring a first training sample set, wherein the first training sample set comprises a plurality of training texts, and each training text in the plurality of training texts is marked with an intention;
The extraction unit is used for extracting keywords of a plurality of training texts in the first training sample set to obtain a keyword set;
the second acquisition unit is used for acquiring word frequency and inverse document frequency of a first keyword in the keyword set in a second training sample set, wherein the first keyword is any keyword in the keyword set, and the second training sample set is a set formed by training texts with the same intention as the training text to which the first keyword belongs; the word frequency is the number of training texts containing the first keywords in the second training sample set and accounts for the proportion of the second training sample set, and the inverse document frequency is used for representing the occurrence frequency of the first keywords in the second training sample set;
the computing unit is used for computing a concentration degree score of each keyword corresponding to each intention according to the word frequency and the inverse document frequency of each keyword in the keyword set, wherein the concentration degree score of each keyword corresponding to any intention is used for representing the concentration degree of the keywords in training texts containing the any intention;
a determining unit, configured to determine a target keyword according to a concentration score of each keyword in the keyword set corresponding to each intention;
The labeling unit is used for labeling the intention of a third training sample set which contains the target keywords and is not labeled with the intention so as to obtain a negative sample set;
the processing unit is used for processing the first training sample set according to the negative sample set to obtain an updated first training sample set, wherein the first training sample set is used for training to obtain a multi-classification model, and the model is used for predicting the intention classification of the input samples.
9. A sample data distribution optimizing device, characterized in that the sample data distribution optimizing device comprises at least one processor, a communication interface for transmitting and/or receiving data, and a memory for storing a computer program, the at least one processor being adapted to invoke the computer program stored in the at least one memory for implementing the method according to any of claims 1-7.
10. A computer readable storage medium, characterized in that the computer readable storage medium has stored therein a computer program which, when run on a processor, implements the method according to any of claims 1-7.
CN202310204314.2A 2023-02-22 2023-02-22 Sample data distribution optimization method, device and storage medium Pending CN116150376A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310204314.2A CN116150376A (en) 2023-02-22 2023-02-22 Sample data distribution optimization method, device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310204314.2A CN116150376A (en) 2023-02-22 2023-02-22 Sample data distribution optimization method, device and storage medium

Publications (1)

Publication Number Publication Date
CN116150376A true CN116150376A (en) 2023-05-23

Family

ID=86358189

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310204314.2A Pending CN116150376A (en) 2023-02-22 2023-02-22 Sample data distribution optimization method, device and storage medium

Country Status (1)

Country Link
CN (1) CN116150376A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116821647A (en) * 2023-08-25 2023-09-29 中国电子科技集团公司第十五研究所 Optimization method, device and equipment for data annotation based on sample deviation evaluation
CN116821647B (en) * 2023-08-25 2023-12-05 中国电子科技集团公司第十五研究所 Optimization method, device and equipment for data annotation based on sample deviation evaluation

Similar Documents

Publication Publication Date Title
CN109815487B (en) Text quality inspection method, electronic device, computer equipment and storage medium
EP3819785A1 (en) Feature word determining method, apparatus, and server
CN111368043A (en) Event question-answering method, device, equipment and storage medium based on artificial intelligence
CN110516063A (en) A kind of update method of service system, electronic equipment and readable storage medium storing program for executing
CN107102993B (en) User appeal analysis method and device
CN111782793A (en) Intelligent customer service processing method, system and equipment
CN112686053A (en) Data enhancement method and device, computer equipment and storage medium
CN113011156A (en) Quality inspection method, device and medium for audit text and electronic equipment
CN112579729A (en) Training method and device for document quality evaluation model, electronic equipment and medium
CN115098061A (en) Software development document optimization method and device, computer equipment and storage medium
CN116150376A (en) Sample data distribution optimization method, device and storage medium
CN117390170B (en) Method and device for matching data standards, electronic equipment and readable storage medium
CN114385694A (en) Data processing method and device, computer equipment and storage medium
CN116644183B (en) Text classification method, device and storage medium
CN117592470A (en) Low-cost gazette data extraction method driven by large language model
CN112487263A (en) Information processing method, system, equipment and computer readable storage medium
CN114647739B (en) Entity chain finger method, device, electronic equipment and storage medium
CN116361428A (en) Question-answer recall method, device and storage medium
CN115906797A (en) Text entity alignment method, device, equipment and medium
CN115328945A (en) Data asset retrieval method, electronic device and computer-readable storage medium
CN114595309A (en) Training device implementation method and system
CN111708862B (en) Text matching method and device and electronic equipment
CN114971833A (en) Tax information processing method and related equipment
CN114443000A (en) Internet + -based software engineering development system
CN113792131A (en) Keyword extraction method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Country or region after: China

Address after: 518000 18th floor, building A4, Kexing Science Park, Nanshan District, Shenzhen City, Guangdong Province

Applicant after: Zhaolian Consumer Finance Co.,Ltd.

Address before: 518000 18th floor, building A4, Kexing Science Park, Nanshan District, Shenzhen City, Guangdong Province

Applicant before: MERCHANTS UNION CONSUMER FINANCE Co.,Ltd.

Country or region before: China
