WO2024021526A1 - Training sample generation method, apparatus, device and storage medium - Google Patents

Training sample generation method, apparatus, device and storage medium Download PDF

Info

Publication number
WO2024021526A1
WO2024021526A1 (PCT/CN2022/143938)
Authority
WO
WIPO (PCT)
Prior art keywords
sample
samples
information entropy
probability distribution
distribution data
Prior art date
Application number
PCT/CN2022/143938
Other languages
English (en)
French (fr)
Inventor
郭顺
陈成才
Original Assignee
上海智臻智能网络科技股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 上海智臻智能网络科技股份有限公司
Publication of WO2024021526A1 publication Critical patent/WO2024021526A1/zh

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/906Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning

Definitions

  • the present application relates to the field of computer data processing technology, and specifically to a training sample generation method, device, equipment and storage medium.
  • This application provides a training sample generation method, device, equipment and storage medium to improve the efficiency of sample labeling to a certain extent.
  • This application provides a method for generating training samples, including: generating probability distribution data of a sample relative to multiple sample categories, wherein the probability distribution data represents the probabilities that the sample belongs to the different sample categories; determining the information entropy value of the sample according to the probability distribution data; and, when the information entropy value meets a preset condition, adding to the sample a category label of the sample category determined according to the probability distribution data, to generate a training sample.
  • the present application provides a device for generating training samples, including: a generation module for generating probability distribution data of a sample relative to multiple sample categories, wherein the probability distribution data represents the probabilities that the sample belongs to the different sample categories; a determination module for determining the information entropy value of the sample based on the probability distribution data; and an adding module for adding to the sample, when the information entropy value meets a preset condition, a category label of the sample category determined based on the probability distribution data, to generate a training sample.
  • the present application provides a computer device, which includes a memory and a processor.
  • the memory stores a computer program.
  • when the processor executes the computer program, the above method for generating training samples is implemented.
  • the present application provides a computer-readable storage medium on which a computer program is stored.
  • when the computer program is executed by a processor, the above-mentioned method for generating training samples is implemented.
  • This application determines the information entropy corresponding to the sample through the probability distribution data of the sample.
  • when the information entropy of the sample meets the preset condition, labeling information is automatically added to the sample, which improves the efficiency of sample labeling to a certain extent.
  • Figure 1 is a schematic diagram of the interaction between different ends in a scenario example provided by this application.
  • Figure 2 is a schematic diagram of the interaction between different ends in a scenario example provided by this application.
  • Figure 3 is a schematic flow chart of the training sample generation method provided by this application.
  • Figure 4 is a schematic diagram of a training sample generating device provided by this application.
  • Figure 5 is a schematic diagram of a computer device provided by this application.
  • each sample needs to be displayed to the annotator through a monitor. After viewing the sample, the annotator labels the sample based on personal experience. Specifically, the sample category of the sample can be marked. This requires more labor costs and time costs, making the sample labeling efficiency lower.
  • the information entropy threshold is obtained based on the labeled samples and classification model.
  • the probability distribution data of the samples can be generated through a classification model, and the information entropy value can be further calculated.
  • when the information entropy value of a sample is less than the information entropy threshold, the sample category predicted by the classification model is used as the label of the training sample, which addresses the technical problem of low sample labeling efficiency.
  • the system may include at least one server and client.
  • the server may run a method for generating training samples.
  • the client may be used to receive user annotation information for some samples.
  • the server can obtain large amounts of text data through crawler technology. A large number of unlabeled samples can be obtained by preprocessing the text data. The server can then divide the samples into sample subsets corresponding to different sample categories through a clustering method. Corresponding to each sample subset, the server can randomly select a small number of samples to obtain initial samples.
  • the server can then send the initial sample to the client.
  • the client can provide it to the user, and further receive the user's annotation information on the initial sample to form an initial training sample.
  • the client can then send the initial training samples to the server.
  • the initial training samples may include multiple sample categories.
  • the server can train the machine learning model through the initial training samples.
  • the machine learning model can be used to predict the sample category of the sample.
  • during training, a validation set does not need to be specified, and iteration can be stopped when the output loss on the initial training samples no longer decreases.
  • the server can then predict the initial training samples through the machine learning model trained by the initial training samples to determine the initial training samples that can be accurately predicted.
  • the server can generate probability distribution data of the initial training sample through a machine learning model.
  • the probability distribution data is used to represent the probability that the initial training sample belongs to different sample categories.
  • based on the maximum probability value in the normalized probability distribution data, the predicted sample category is determined. Comparing the predicted sample categories with the labels can identify the initial training samples that are predicted accurately.
  • for the accurately predicted initial training samples, the information entropy of their probability distribution data can be calculated, and the average of these information entropy values can be selected as the information entropy threshold.
  • for unlabeled samples, the machine learning model can be used to calculate probability distribution data indicating the probabilities that the samples belong to different sample categories. When the information entropy value of a sample's probability distribution data is less than the information entropy threshold, the sample category corresponding to the maximum probability in the probability distribution data can be used as the label of the sample to generate a training sample. When the information entropy value of the sample's probability distribution data is greater than the information entropy threshold, the sample may be sent to the client to receive manual annotation information to generate a training sample.
  • the server can retrain the machine learning model through manually labeled training samples and computer automatically labeled training samples.
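  • Purely as an illustrative, non-authoritative sketch of the scenario above (the scikit-learn/TF-IDF/logistic-regression choices, the cluster count, the per-cluster sample count and the helper functions load_unlabeled_texts, ask_annotators, add_training_sample and send_to_client_for_annotation are all assumptions, not taken from this application), the server-side flow could look roughly like this in Python:

      import numpy as np
      from sklearn.cluster import KMeans
      from sklearn.feature_extraction.text import TfidfVectorizer
      from sklearn.linear_model import LogisticRegression

      def entropy(p):
          # Shannon entropy of one probability distribution (Formula 1)
          p = np.clip(p, 1e-12, 1.0)
          return float(-(p * np.log(p)).sum())

      # 1. Vectorize the unlabeled dialogue texts and cluster them into subsets.
      texts = load_unlabeled_texts()                      # hypothetical helper
      features = TfidfVectorizer().fit_transform(texts)
      clusters = KMeans(n_clusters=4, n_init=10).fit_predict(features)

      # 2. Randomly pick a few samples per cluster and have them labeled manually.
      initial_idx = np.concatenate([
          np.random.choice(np.where(clusters == c)[0], size=5, replace=False)
          for c in range(4)])
      initial_labels = np.asarray(ask_annotators(texts, initial_idx))   # hypothetical client call

      # 3. Train a classifier (the "machine learning model") on the initial training samples.
      model = LogisticRegression(max_iter=1000).fit(features[initial_idx], initial_labels)

      # 4. Derive the information entropy threshold from the accurately predicted initial samples.
      probs = model.predict_proba(features[initial_idx])
      correct = model.classes_[probs.argmax(axis=1)] == initial_labels
      threshold = float(np.mean([entropy(p) for p in probs[correct]]))

      # 5. Auto-label confident samples; route uncertain ones to manual annotation.
      for i in range(len(texts)):
          if i in initial_idx:
              continue
          p = model.predict_proba(features[i])[0]
          if entropy(p) < threshold:
              add_training_sample(texts[i], model.classes_[p.argmax()])   # automatic label
          else:
              send_to_client_for_annotation(texts[i])                     # manual label
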
  • the training sample generation system may include a client and a server.
  • the client can be used to display sample information to the user, and can be used to receive the user's annotation information for the training sample.
  • the client may be an electronic device with network access capabilities.
  • the client can be a desktop computer, a tablet computer, a laptop computer, a smartphone, a digital assistant, a smart wearable device, a shopping guide terminal, a television, a smart speaker, a microphone, etc.
  • smart wearable devices include but are not limited to smart bracelets, smart watches, smart glasses, smart helmets, smart necklaces, etc.
  • the client may also be software capable of running in the electronic device.
  • the server may be used to execute a method for generating training samples.
  • the server can be an electronic device with certain computing processing capabilities. It may have a network communication module, a processor, a memory, etc. Of course, the server may also refer to software running in the electronic device.
  • the server can also be a distributed server, which can be a system with multiple processors, memories, network communication modules, etc. that work together.
  • the server may also be a server cluster formed by several servers.
  • the server may also be a new technical means that can realize the corresponding functions of the embodiments of the specification. For example, it can be a new form of "server" based on quantum computing.
  • This application provides a method for generating training samples.
  • the training sample generating method can be applied to the server.
  • the method for generating training samples may include the following steps.
  • Step S101 Generate probability distribution data of samples relative to multiple sample categories; wherein the probability distribution data represents the probability that a sample belongs to different sample categories.
  • the probability distribution data of a sample relative to multiple sample categories can be used to determine the confidence level that the sample belongs to different sample categories. Furthermore, if the confidence level is high, labels can be added to the samples to save the cost of manual labeling to a certain extent.
  • the sample may represent data that can be used to generate training samples.
  • the sample may only include characterization data without a label indicating the sample category of the sample.
  • the characterizing data may be characteristics of the sample.
  • the sample may be conversation text.
  • the representation data may be the text content of the dialogue text.
  • the representation data may also be a feature vector obtained based on the text content.
  • the sample category may be used to classify the sample.
  • the sample categories can be manually defined. For example, for text data, different emotion categories can be tagged for the text data.
  • the sample categories can also be automatically generated by the server. For example, a clustering algorithm can be used to classify the sample into multiple categories according to the characteristics of the sample.
  • the probability distribution data can be used to represent the probability that the samples belong to different sample categories.
  • the plurality of probability data included in the probability distribution data may respectively correspond to different sample categories.
  • the probability distribution data may be normalized data or unnormalized data. Specifically, for example, corresponding probability distribution data can be generated based on a sample.
  • the probability distribution data can be represented by a four-dimensional vector: (0.2, 0.05, 0.1, 0.65). Among them, each dimension of the four-dimensional vector can correspond to a sample category. The value of each dimension of the four-dimensional vector can be used to represent the probability that the sample belongs to the sample category represented by the dimension. Among them, the probability value of the fourth dimension can be 65%, which means that the probability that the sample belongs to the sample category represented by the fourth dimension is 65%.
  • the method of generating probability distribution data of a sample with respect to multiple sample categories may be based on predicting the representation data of the sample based on a machine learning model.
  • the sample may be text data.
  • the machine learning model can be used to predict emotional information of the text data.
  • the machine learning model can output a probability distribution data including three probability data.
  • each probability data can correspond to different emotion categories. For example, they can correspond to "happy", "sad" and "angry" respectively.
  • the method of generating probability distribution data of samples relative to multiple sample categories can also be determined based on preset rules.
  • the server may be preset with a dictionary formed based on words representing different emotion categories. The words belonging to different emotional categories in the text data are counted, and the probability distribution data can be generated according to the number of words belonging to different emotional categories in the text data.
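  • As an illustration of the dictionary-based generation just described (and not the application's actual implementation; the emotion word lists below are invented placeholders), a rule-based probability distribution could be sketched as:

      # Hedged sketch: derive a probability distribution over emotion categories
      # by counting how many words of the text fall in each category's word list.
      EMOTION_DICTIONARY = {                       # invented example word lists
          "happy": {"glad", "great", "thanks"},
          "sad": {"unfortunately", "sorry", "miss"},
          "angry": {"terrible", "annoyed", "unacceptable"},
      }

      def rule_based_distribution(text):
          words = text.lower().split()
          counts = {cat: sum(w in vocab for w in words)
                    for cat, vocab in EMOTION_DICTIONARY.items()}
          total = sum(counts.values())
          if total == 0:                           # no dictionary hit: uniform fallback
              return {cat: 1.0 / len(counts) for cat in counts}
          return {cat: c / total for cat, c in counts.items()}

      print(rule_based_distribution("thanks , that is great"))
      # -> {'happy': 1.0, 'sad': 0.0, 'angry': 0.0}
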
  • Step S102 Determine the information entropy value of the sample according to the probability distribution data.
  • probability distribution data can be used to characterize the probability that a sample belongs to different categories. However, it is not straightforward to quantify, directly from the probability distribution data, how certain it is that a sample belongs to a particular sample category. Therefore, the certainty that a sample belongs to a certain sample category can be determined through the information entropy value of the probability distribution data.
  • determining the information entropy value of the sample based on the probability distribution data can be done by calculating the information entropy. Specifically, the information entropy value of the sample can be calculated from the probability distribution data by Formula 1: H(X) = -∑_{i=1}^{k} p(x_i)·log p(x_i).
  • p( xi ) can represent the probability data of the i-th dimension in the probability distribution data.
  • H(X) can represent the information entropy value of the probability distribution data of the sample.
  • k may represent the number of categories represented by the probability distribution data.
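  • The calculation in Formula 1 amounts to a standard Shannon entropy; a minimal sketch follows (the use of the natural logarithm is an assumption, since the application does not fix the log base):

      import math

      def information_entropy(prob_dist):
          """H(X) = -sum_{i=1..k} p(x_i) * log p(x_i); zero probabilities are skipped."""
          return -sum(p * math.log(p) for p in prob_dist if p > 0)

      # The four-dimensional example distribution from the text above:
      print(information_entropy([0.2, 0.05, 0.1, 0.65]))    # about 0.98
      print(information_entropy([0.25, 0.25, 0.25, 0.25]))  # about 1.39, maximum uncertainty
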
  • Step S103 When the information entropy value meets the preset condition, add a category label of the sample category determined based on the probability distribution data to the sample to generate a training sample.
  • the information entropy value may be used to represent uncertainty in the probability distribution data.
  • when the information entropy value is less than a specified threshold, the uncertainty of the probability distribution data can be considered to be low.
  • the certainty of the probability distribution data can be considered to be relatively high. Therefore, the sample category corresponding to the largest probability data in the probability distribution data can be used as the label of the sample, which reduces the cost of labeling to a certain extent and improves the efficiency of labeling.
  • the preset condition may indicate that the information entropy value obtained based on probability distribution data is less than a specified information entropy threshold.
  • the information entropy threshold can be determined based on the experience of experts, or can be determined based on the statistical situation of generating information entropy values based on probability distribution data of historical samples.
  • when the information entropy value obtained from the probability distribution data is less than the specified information entropy threshold, it can be considered that the probability distribution data of the sample shows that the uncertainty of the sample belonging to a certain sample category is low and the certainty is relatively high. Therefore, a category label of the sample category determined based on the probability distribution data may be added to the sample to generate a training sample.
  • the preset condition may also be that the information entropy value generated based on probability distribution data is within a specified range.
  • the method of adding a category label of the sample category determined based on the probability distribution data to the sample to generate a training sample may be to use the sample category represented by the largest probability data in the probability distribution data as the category label of the sample.
  • by adding labels representing the sample category to samples, training samples can be formed to further train the machine learning model and improve the performance of the model.
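  • A minimal sketch of Step S103 under the assumption that the preset condition is an information entropy value below a threshold; the category names and the threshold value are illustrative only:

      import math

      CATEGORIES = ["happy", "sad", "angry", "neutral"]    # assumed category names
      ENTROPY_THRESHOLD = 0.5                              # assumed preset value

      def maybe_auto_label(sample, prob_dist):
          """Return (sample, category label) when the preset condition holds, else None."""
          h = -sum(p * math.log(p) for p in prob_dist if p > 0)   # Formula 1
          if h < ENTROPY_THRESHOLD:
              best = max(range(len(prob_dist)), key=prob_dist.__getitem__)
              return sample, CATEGORIES[best]              # automatically labeled training sample
          return None                                      # entropy too high: leave for manual labeling

      print(maybe_auto_label("dialogue text ...", [0.02, 0.9, 0.05, 0.03]))
      # -> ('dialogue text ...', 'sad')
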
  • the method for generating training samples may further include: obtaining initial training samples, wherein the initial training samples include characterization data and category labels representing sample categories; and determining a classification model according to the initial training samples, wherein the classification model is used to generate the probability distribution data of the sample relative to the multiple sample categories.
  • a classification model can be trained with some initial training samples having class labels. Based on the classification model, samples can be labeled to reduce labeling costs to a certain extent and improve labeling efficiency.
  • the classification model obtained based on some of the initial training samples with category labels can also be used to filter samples. Samples that the classification model can already identify accurately do not need to be added to the training set used to improve model performance. Further annotating some samples that cannot be accurately identified by the classification model and then training the classification model can reduce the number of training samples of the model to a certain extent while taking the training effect of the model into account to a certain extent. This can save model training time to a certain extent.
  • the initial training samples may be training samples having corresponding category labels.
  • the training sample may include characterization data and a category label representing the sample category of the sample.
  • the characterization data may be used to represent quantifiable information included in the sample. Specifically, for example, for text-type samples, text information can be used as the representation data. Of course, vectors generated based on the text information can also be used as the representation data. In some embodiments, the characterization data may be characteristics of the sample.
  • the category label may represent the sample category of the initial training sample.
  • the representation data included in the initial training sample may be a three-dimensional tensor formed based on the pixel values of the image.
  • the sample category may be used to represent the category of the image.
  • the image may be a natural scenery image or a portrait image.
  • the category label may be a multidimensional vector. Among them, each dimension corresponds to a sample category.
  • the category label may be a three-dimensional vector, in which the value of the first dimension is 1 and the values of the remaining dimensions are 0.
  • the category label may indicate that the sample belongs to the sample category represented by the first dimension of the vector.
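  • For illustration only (the number of categories and their order are assumptions), a one-hot category label of the kind described above can be built as:

      def one_hot_label(category_index, num_categories):
          # e.g. one_hot_label(0, 3) -> [1, 0, 0]: the sample belongs to the
          # sample category represented by the first dimension of the vector.
          return [1 if i == category_index else 0 for i in range(num_categories)]

      print(one_hot_label(0, 3))   # [1, 0, 0]
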
  • the category labels of the initial training samples may be manually labeled or automatically labeled by a computer.
  • the method of obtaining initial training samples may be to randomly select some samples from a large number of unlabeled samples and hand them over for manual labeling to form the initial training samples.
  • the initial training samples may also be historical training samples stored in the database.
  • the method of obtaining the initial training samples may also be obtained by reading the database.
  • the classification model may be a model for classifying samples according to their representation data.
  • the model may be a machine learning model.
  • the classification model can be used to generate probability distribution data based on the representation data of the input sample. Based on the probability distribution data, the sample category of the sample can be determined.
  • the initial classification model can be trained through the initial training samples, and the classification model can be obtained after the training is completed.
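  • As a hedged sketch of training an initial classification model on the initial training samples (the classifier type, the TF-IDF features and the tiny example data are assumptions made for illustration):

      from sklearn.feature_extraction.text import TfidfVectorizer
      from sklearn.linear_model import LogisticRegression
      from sklearn.pipeline import make_pipeline

      # Hypothetical initial training samples: (characterization data, category label).
      initial_texts = ["thanks a lot", "this is unacceptable", "I miss you so much"]
      initial_labels = ["happy", "angry", "sad"]

      # The fitted pipeline plays the role of the classification model: it maps
      # characterization data to probability distribution data over the sample categories.
      classification_model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
      classification_model.fit(initial_texts, initial_labels)

      print(classification_model.classes_)
      print(classification_model.predict_proba(["thanks, I am glad"]))
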
  • the method for generating training samples may further include: generating, based on the classification model and the characterization data of the initial training samples, probability distribution data of the initial training samples relative to the multiple sample categories; and calculating, according to the probability distribution data corresponding to the initial training samples, the information entropy values corresponding to the initial training samples, for setting the preset condition.
  • the initial training samples can be used to set the preset conditions for labeling the samples.
  • the method of generating probability distribution data of the initial training samples relative to the multiple sample categories based on the classification model and the representation data of the initial training samples may be to input the representation data of the initial training samples into the classification model to obtain the probability distribution data of the initial training samples. Further, the sample category of an initial training sample can be predicted based on the probability distribution data. Furthermore, by comparing the predicted sample category of the initial training sample with the sample category represented by the label of the initial training sample, at least some of the initial training samples that the classification model can accurately predict can be obtained.
  • the method of calculating the information entropy value corresponding to the initial training sample can be based on Formula 1 to calculate the information entropy value of the initial training sample.
  • the preset condition may be set based on the information entropy value of the initial training sample. Further, the preset condition can be set corresponding to the information entropy value of the initial training sample that the classification model can accurately predict.
  • the preset condition may be that the information entropy value of the probability distribution data of the sample is lower than a specified threshold.
  • the information entropy threshold can be determined through initial training samples to avoid, to a certain extent, determining the information entropy threshold through manual experience and affecting accuracy.
  • the information entropy values corresponding to the initial training samples can be calculated to obtain the information entropy threshold.
  • the method for generating the training samples may further include: using the largest information entropy value among the information entropy values corresponding to the initial training samples as the information entropy threshold, for setting the preset condition; or using the average of the information entropy values corresponding to the initial training samples as the information entropy threshold, for setting the preset condition; or using the median of the information entropy values corresponding to the initial training samples as the information entropy threshold, for setting the preset condition.
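  • The three threshold choices listed above could be computed as follows; this is a generic sketch and the entropy values are placeholders rather than data from the application:

      import statistics

      # Hypothetical information entropy values of the accurately predicted initial training samples.
      initial_entropies = [0.12, 0.31, 0.08, 0.22, 0.40]

      threshold_max = max(initial_entropies)                   # most permissive automatic labeling
      threshold_mean = statistics.mean(initial_entropies)      # balanced choice
      threshold_median = statistics.median(initial_entropies)  # robust to outliers

      print(threshold_max, threshold_mean, threshold_median)   # 0.4 0.226 0.22
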
  • different information entropy threshold calculation methods can be selected for different task requirements.
  • the information entropy value with the largest value among the initial training sample information entropy values may be used as the information entropy threshold.
  • the preset condition may be expressed as the information entropy value of the sample is greater than the information entropy threshold.
  • correspondingly, when the information entropy value meets the preset condition, the method of adding a category label of the sample category determined according to the probability distribution data to the sample to generate a training sample may be to send the sample to the client for manual annotation, so as to obtain the corresponding category label and generate a training sample.
  • some training samples that cannot be well recognized by the classification model can be provided for manual annotation to further train the classification model and improve the recognition performance of the classification model.
  • the performance of the classification model can be improved with a smaller number of samples.
  • it also avoids further training on samples that the classification model can already identify well, saving part of the model training time.
  • the preset condition may also be that the information entropy value of the sample is less than the information entropy threshold.
  • the method of adding a category label of the sample category determined according to the probability distribution data to the sample to generate a training sample may also be for the server to automatically generate the category label of the sample according to the sample category corresponding to the probability data with the largest value in the probability distribution data.
  • in some cases, when the information entropy threshold is generated from the maximum information entropy value corresponding to the initial training samples, and the server automatically generates the category label of a sample whose information entropy value is less than that threshold, the relatively large information entropy threshold may reduce the accuracy of sample category labeling to a certain extent. Therefore, the average of the information entropy values of the initial training samples can be used as the information entropy threshold instead.
  • the average of the information entropy values of the initial training samples can also be used as the information entropy threshold.
  • the preset condition may be that the information entropy value of the sample is less than the information entropy threshold.
  • the method of adding a category label of the sample category determined based on the probability distribution data to the sample to generate a training sample may be to use the sample category corresponding to the probability data with the largest value in the probability distribution data as the category label of the sample.
  • the step of determining a classification model based on the initial training samples may include: using the initial training samples as the input of an initial classification model and the category labels of the initial training samples as the target output of the initial classification model, and training the initial classification model to obtain the classification model, so that the classification model tends to completely fit the initial training samples.
  • in the process of training an initial classification model with the initial training samples to obtain the classification model, the initial classification model can be made to tend to completely fit the initial training samples.
  • in the process of generating probability distribution data based on the classification model, since the classification model tends to completely fit the initial training samples, it can be considered that when the characteristics of a sample are relatively close to the characteristics of the initial training samples, the classification model can correctly classify the sample.
  • correspondingly, based on the probability distribution data output by the classification model, category labels can be added to the samples, and the accuracy of the category labels is ensured to a certain extent.
  • although the classification model tends to completely fit the initial training samples, which reduces generalization performance to a certain extent, this improves the accuracy of adding category labels to samples through the probability distribution data of the classification model.
  • the samples labeled by the classification model can further train the classification model to improve the performance of the classification model.
  • samples that cannot be classified correctly can also be manually labeled, which can make up for the problem of a certain decline in generalization performance to a certain extent.
  • specifically, since the classification model tends to completely fit the initial training samples, the information entropy values of the initial training samples obtained according to the classification model may be relatively low.
  • correspondingly, when the information entropy value of a sample meets the condition of being lower than the information entropy threshold, it can also mean that the uncertainty that the sample belongs to the sample category is small, so the accuracy of the label added to it is relatively high.
  • the method for generating training samples may further include: performing clustering processing on the samples to obtain multiple sample subsets corresponding to different clusters; selecting samples from the sample subsets as targets Sample; obtain the manually added category label of the target sample to generate the initial training sample.
  • some samples can be randomly selected and provided to staff for labeling to obtain initial training samples.
  • however, randomly selecting samples may result in a large difference in the proportions of samples across different sample categories. This may result in the model having insufficient recognition ability for sample categories with fewer samples. Therefore, the samples can be clustered to obtain multiple sample subsets corresponding to different clusters, where the samples of each sample subset can be considered to have certain similarities. By selecting samples from the sample subsets as target samples, it is possible to ensure to a certain extent that the numbers of initial training samples of different sample categories are similar, so as to ensure to a certain extent the performance of the model obtained by training on the initial training samples.
  • a method for clustering the samples to obtain multiple sample subsets corresponding to different clusters may be to perform a clustering operation on the samples through a clustering algorithm.
  • the clustering algorithm may be the k-means algorithm or the k-nearest neighbors (KNN) algorithm.
  • the method of selecting samples from the sample subsets as target samples may be to randomly select samples from each sample subset.
  • the method of selecting samples from the sample subset as the target sample may also be to further calculate the distance between the feature vectors represented by the samples in the sample subset, and select several samples that are far apart as the target sample.
  • the method of obtaining the manually added category label of the target sample to generate the initial training sample may be to provide the sample to the user, and determine the label of the corresponding sample based on the user's annotation information to generate the initial training sample.
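  • A sketch of the cluster-then-select step described above; the vectorizer, the k-means clustering, the cluster count and the per-cluster sample count are all assumptions made for illustration:

      import numpy as np
      from sklearn.cluster import KMeans
      from sklearn.feature_extraction.text import TfidfVectorizer

      def select_target_samples(texts, n_clusters=3, per_cluster=2, seed=0):
          """Cluster unlabeled samples and pick a few from each cluster as target samples."""
          features = TfidfVectorizer().fit_transform(texts)
          labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(features)
          rng = np.random.default_rng(seed)
          targets = []
          for c in range(n_clusters):
              members = np.where(labels == c)[0]
              picked = rng.choice(members, size=min(per_cluster, len(members)), replace=False)
              targets.extend(int(i) for i in picked)
          return targets   # indices of samples to hand to annotators for manual category labels

      # target_indices = select_target_samples(unlabeled_texts)   # unlabeled_texts: list of strings
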
  • the preset condition includes that the information entropy value of the sample is less than the information entropy threshold; accordingly, the step of adding a category label of the sample category determined according to the probability distribution data to the sample to generate a training sample includes: adding the corresponding category label to the sample according to the sample category with the maximum probability value among the probabilities, represented by the probability distribution data, that the sample belongs to the different sample categories, to generate the training sample.
  • the preset condition includes that the information entropy value of the sample is less than an information entropy threshold.
  • when the information entropy value of the probability distribution data of the sample generated by the classification model is less than the information entropy threshold, it can be considered that the uncertainty of the probability distribution data is low and, correspondingly, the certainty is high.
  • the classification model can identify the sample category of the corresponding sample with high certainty. Therefore, a category label can be added to the sample according to the sample category corresponding to the maximum probability value in the probability distribution data to generate a training sample. The cost of manual annotation can be reduced to a certain extent.
  • moreover, by adjusting the size of the information entropy threshold, the confidence level of the sample category determined according to the classification model can be adjusted accordingly.
  • the probability distribution data of the sample with respect to the multiple sample categories is generated using a classification model; accordingly, the method of generating training samples may also include: when the information entropy value does not meet the preset condition, obtaining the manually added category label of the sample, for adjusting the classification model.
  • a large number of samples that can be used to train the classification model may be generated on the business system.
  • Samples generated by the business system can be used to improve the performance of the classification model.
  • the probability distribution data of the sample can be generated through a classification model. Based on the probability distribution data, the corresponding information entropy can be determined.
  • when the information entropy value does not meet the preset condition, for example when it is greater than the information entropy threshold, it can be considered that the classification model cannot recognize the category of the sample well. In this case, the training sample generation system can obtain the manually added category label of the sample to form a training sample for adjusting the classification model.
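  • To illustrate the routing just described (automatic labels when the entropy is below the threshold, manual labels otherwise, followed by adjusting the model), a hedged sketch; request_manual_label and retrain are hypothetical stand-ins for the client interaction and the model update:

      import math

      def process_business_samples(model, samples, threshold):
          """Split new samples into auto-labeled and manually labeled training samples, then retrain."""
          auto_labeled, manual_queue = [], []
          for sample in samples:
              probs = model.predict_proba([sample])[0]
              h = -sum(p * math.log(p) for p in probs if p > 0)   # Formula 1
              if h < threshold:                   # preset condition met: label automatically
                  auto_labeled.append((sample, model.classes_[probs.argmax()]))
              else:                               # condition not met: ask a human annotator
                  manual_queue.append(sample)
          manual_labeled = [(s, request_manual_label(s)) for s in manual_queue]   # hypothetical
          retrain(model, auto_labeled + manual_labeled)                           # hypothetical
          return auto_labeled, manual_labeled
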
  • multiple preset conditions are provided corresponding to different tasks.
  • the method of generating training samples may also include: matching corresponding preset conditions according to task identifiers corresponding to different tasks.
  • different preset conditions can be provided for different tasks. Specifically, for example, for some application scenarios that have high requirements for annotation accuracy, a smaller information entropy threshold can be set to ensure the accuracy of automatic annotation to a certain extent. For some application scenarios that do not require high annotation accuracy, a relatively large information entropy threshold can be set to obtain more annotation samples. Therefore, corresponding task identifiers can be set for different preset conditions. During the operation of the training sample generation method, corresponding preset conditions can be matched according to different task identifiers to reduce the workload of the staff to a certain extent.
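  • One possible, purely illustrative way to keep per-task preset conditions and match them by task identifier; the identifiers and threshold values are invented:

      # Hypothetical mapping from task identifier to the corresponding preset condition.
      PRESET_CONDITIONS = {
          "intent_high_precision": {"entropy_threshold": 0.2},   # strict: fewer but safer auto labels
          "emotion_bulk_labeling": {"entropy_threshold": 0.8},   # loose: more auto-labeled samples
      }

      def match_preset_condition(task_id):
          try:
              return PRESET_CONDITIONS[task_id]
          except KeyError:
              raise ValueError(f"no preset condition registered for task {task_id!r}")

      print(match_preset_condition("emotion_bulk_labeling"))   # {'entropy_threshold': 0.8}
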
  • the present application provides a method for training a classification model.
  • the method may include: training the classification model using training samples obtained by the training sample generation method in any of the above embodiments.
  • the training sample generating device may include a generation module, a determination module and an adding module.
  • a generation module configured to generate probability distribution data of samples relative to multiple sample categories; wherein the probability distribution data represents the probability that a sample belongs to different sample categories.
  • a determination module configured to determine the information entropy value of the sample according to the probability distribution data.
  • an adding module configured to, when the information entropy value meets the preset condition, add a category label of the sample category determined according to the probability distribution data to the sample, to generate a training sample.
  • This application also provides a computer device, including a memory and a processor.
  • the memory stores a computer program.
  • when the processor executes the computer program, the method for generating training samples in any of the above embodiments is implemented.
  • This application also provides a computer-readable storage medium on which a computer program is stored.
  • when the computer program is executed by a computer, the computer executes the method for generating training samples in any of the above embodiments.
  • the present application also provides a computer program product containing instructions, which when executed by a computer causes the computer to perform the method for generating training samples in any of the above embodiments.
  • the size of the serial numbers of the processes does not indicate the order of execution.
  • the execution order of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation process of the present application.
  • the processor of the present application may be an integrated circuit chip with signal processing capabilities.
  • each step of the above method implementation may be completed through an integrated logic circuit of hardware in the processor or instructions in the form of software.
  • the above-mentioned processor can be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic devices, discrete gate or transistor logic devices, or discrete hardware components.
  • a general-purpose processor may be a microprocessor or the processor may be any conventional processor, etc.
  • the steps of the method disclosed in this application can be directly implemented by a hardware decoding processor, or executed by a combination of hardware and software modules in the decoding processor.
  • the software module can be located in random access memory, flash memory, read-only memory, programmable read-only memory or electrically erasable programmable memory, registers and other mature storage media in this field.
  • the storage medium is located in the memory, and the processor reads the information in the memory and completes the steps of the above method in combination with its hardware.
  • the memory may be volatile memory or non-volatile memory, or may include both volatile and non-volatile memory.
  • non-volatile memory can be read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM) or flash memory.
  • Volatile memory may be random access memory (RAM). It should be noted that the memory of the systems and methods described herein is intended to include, but is not limited to, these and any other suitable types of memory.
  • the disclosed systems, devices and methods can be implemented in other ways.
  • the device implementation described above is only illustrative.
  • the division of the units is only a logical function division. In actual implementation, there may be other division methods.
  • multiple units or components may be combined or can be integrated into another system, or some features can be ignored, or not implemented.
  • the mutual coupling or direct coupling or communication connection shown or discussed may be implemented through some interfaces; the indirect coupling or communication connection between devices or units may be in electrical, mechanical or other forms.
  • the units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units, that is, they may be located in one place, or they may be distributed to multiple network units. Some or all of the units can be selected according to actual needs to achieve the purpose of this embodiment.
  • each functional unit in each embodiment of this specification may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit.
  • the functions are implemented in the form of software functional units and sold or used as independent products, they can be stored in a computer-readable storage medium.
  • the technical solution of this specification, in essence, or the part that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product; the computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, a server, a network device, etc.) to execute all or part of the steps of the methods described in the embodiments of this specification.
  • the aforementioned storage media include: USB flash drives, removable hard disks, read-only memory (ROM), random access memory (RAM), magnetic disks, optical disks and other media that can store program code.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

This application provides a training sample generation method, apparatus, device and storage medium. The method includes: generating probability distribution data of a sample relative to multiple sample categories, where the probability distribution data represents the probabilities that the sample belongs to the different sample categories; determining an information entropy value of the sample according to the probability distribution data; and, when the information entropy value meets a preset condition, adding to the sample a category label of the sample category determined according to the probability distribution data, so as to generate a training sample. The information entropy corresponding to a sample is determined from the probability distribution data of the sample. When the information entropy of the sample meets the preset condition, labeling information is automatically added to the sample, which improves the efficiency of sample labeling to a certain extent.

Description

训练样本的生成方法、装置、设备和存储介质 技术领域
本申请涉及计算机数据处理技术领域,具体涉及一种训练样本的生成方法、装置、设备和存储介质。
发明背景
目前,一些机器学习模型的训练需要大量的具有标签的训练样本。通常的,训练样本的标签需要通过人工进行标注。
然而,人工标注存在训练样本标注效率低的技术问题。
发明内容
本申请提供了一种训练样本的生成方法、装置、设备和存储介质,以在一定程度上提高样本标注的效率。
本申请提供的一种训练样本的生成方法,包括:生成样本相对于多个样本类别的概率分布数据;其中,所述概率分布数据表示样本属于不同所述样本类别的概率;根据所述概率分布数据确定所述样本的信息熵值;在所述信息熵值符合预设条件的情况下,为所述样本添加根据所述概率分布数据确定的样本类别的类别标签,以生成训练样本。
本申请提供一种训练样本的生成装置,包括:生成模块,用于生成样本相对于多个样本类别的概率分布数据;其中,所述概率分布数据表示样本属于不同所述样本类别的概率;确定模块,用于根据所述概率分布数据确定所述样本的信息熵值;添加模型,用于在所述信息熵值符合预设条件的情况下,为所述样本添加根据所述概率分布数据确定的样本类别的类别标签,以生成训练样本。
本申请提供一种计算机设备,包括存储器和处理器,所述存储器存储有计算机程序,所述处理器执行所述计算机程序时实现上述的训练样本的生成方法。
本申请提供一种计算机可读存储介质,其上存储有计算机程序,所述计算机程序被处理器执行时实现上述的训练样本的生成方法。
本申请通过样本的概率分布数据确定样本对应的信息熵。在样本的信息熵满足预设条件的情况下,自动为所述样本添加标注信息,在一定程度上提高了样本标注的效率。
附图简要说明
图1为本申请提供的一个场景示例中不同端交互的示意图。
图2为本申请提供的一个场景示例中不同端交互的示意图。
图3为本申请提供的训练样本的生成方法的流程示意图。
图4为本申请提供的训练样本的生成装置的示意图。
图5为本申请提供的计算机设备的示意图。
实施本申请的方式
概述
在相关技术中,大量的无标签的样本需要通过人工进行标注,以形成具有标签的训练样本,用于训练机器学习模型。具体的,每个样本需要通过显示器展示给标注人员。标注人员在查看所述样本后,依照个人经验对所述样本标注。具体的,可以标注所述样本的样本类别。这需要耗费较多的人力成本和时间成本,使得样本的标注效率较低。
因此,有必要提供一种训练样本的生成方法,可以通过少量具有标注的样本,训练分类模型。并且根据具有标注的样本和分类模型得到信息熵阈值。对于没有标签的样本,可以通过分类模型生成所述样本的概率分布数据,进一步计算得到信息熵值。在所述样本的信息熵值小于信息熵阈值的情况下,将所述分类模型预测得到样本的样本类别,作为所述训练样本的标签,解决样本效率较低的技术问题。
请参阅图1,本申请提供一种训练样本的生成系统的应用场景示例。所述系统可以包括至少一个服务器和客户端。所述服务器可以运行训练样本的生成方法。所述客户端可以用于接收用户对部分样本的标注信息。
请参阅图2,服务器可以通过爬虫技术获取到大批量的文本数据。通过对所述文本数据预处理后可以得到大量无标签的样本。接着服务器可以通过聚类方法,将所述样本划分为对应不同的样本类别的样本子集。对应每个样本子集,服务器可以分别随机抽取少量的样本,得到初始样本。
然后,服务器可以将所述初始样本发送给客户端。客户端在接收到所述初始样本后可以提供给用户,并进一步接收用户对所述初始样本的标注信息,形成初始训练样本。客户端随后可以将初始训练样本发送给服务器。其中,所述初始训练样本可以包括多个样本类别。
服务器在接收到初始训练样本后,可以通过初始训练样本对机器学习模型进行训练。其中,所述机器学习模型可以用于预测样本的样本类别。训练过程中,可以不指定验证集,并且在初始训练样本的输出损失不再下降的情况下,可以停 止迭代。
随后服务器可以通过由初始训练样本训练完成的机器学习模型,去对初始训练样本进行预测,以确定能准确预测的初始训练样本。具体的,服务器可以通过机器学习模型生成所述初始训练样本的概率分布数据。所述概率分布数据用于表示所述初始训练样本属于不同样本类别的概率。并且,基于归一化后的概率分布数据中最大的概率值,确定预测得到的样本类别。将预测得到的样本类别与标签比对,可以确定预测准确的初始训练样本。
进一步的,针对预测准确的初始训练样本,可以计算所述初始训练样本的概率分布数据的信息熵,并且可以选择所述信息熵的平均值,作为信息熵阈值。
对于无标签的样本,可以通过所述机器学习模型计算所述样本属于不同样本类别的概率分布数据。并且,在所述样本的概率分布数据的信息熵值小于所述信息熵阈值的情况下,可以将样本的概率分布数据中,对应最大概率的样本类别作为所述样本的标签,以生成训练样本。在所述样本的概率分布数据的信息熵值大于所述信息熵阈值的情况下,可以将所述样本发送给客户端以接收人工的标注信息,以生成训练样本。
最后,服务器可以通过人工标注的训练样本以及计算机自动标注的训练样本重新训练所述机器学习模型。
请参阅图1,本申请提供一种训练样本的生成系统。所述训练样本的生成系统可以包括客户端和服务器。
所述客户端可以用于为用户展示样本信息,并且可以用于接收用户对所述训练样本的标注信息。所述客户端可以是具有网络访问能力的电子设备。具体的,例如,客户端可以是台式电脑、平板电脑、笔记本电脑、智能手机、数字助理、智能可穿戴设备、导购终端、电视机、智能音箱、麦克风等。其中,智能可穿戴设备包括但不限于智能手环、智能手表、智能眼镜、智能头盔、智能项链等。或者,客户端也可以为能够运行于所述电子设备中的软件。
所述服务器可以用于执行训练样本的生成方法。服务器可以是具有一定运算处理能力的电子设备。其可以具有网络通信模块、处理器和存储器等。当然,所述服务器也可以是指运行于所述电子设备中的软体。所述服务器还可以为分布式服务器,可以是具有多个处理器、存储器、网络通信模块等协同运作的系统。或者,服务器还可以为若干服务器形成的服务器集群。或者,随着科学技术的发展,服务器还可以是能够实现说明书实施方式相应功能的新的技术手段。例如,可以是基于量子计算实现的新形态的“服务器”。
请参阅图3,本申请提供一种训练样本的生成方法。所述训练样本的生成方法可以应用于服务器。所述训练样本的生成方法可以包括以下步骤。
步骤S101:生成样本相对于多个样本类别的概率分布数据;其中,所述概率 分布数据表示样本属于不同所述样本类别的概率。
在一些实施方式中,样本相对于多个样本类别的概率分布数据,可以用于确定样本属于不同样本类别的置信程度。进一步地,在置信程度高的情况下,可以为所述样本添加标签,以在一定程度上节约人工标注的成本。
在本实施方式中,所述样本可以表示能用于生成训练样本的数据。具体的,所述样本可以仅包括表征数据,而不具有表示所述样本的样本类别的标签。所述表征数据可以是所述样本的特征。例如,所述样本可以是对话文本。相应的,所述表征数据可以是所述对话文本的文字内容。当然,所述表征数据也可以是基于所述文字内容得到的特征向量。
在本实施方式中,所述样本类别可以用于对所述样本进行分类。所述样本类别可以通过人工定义。例如,对于文本数据,可以为所述文本数据标记不同的情感类别。当然,所述样本类别也可以通过服务器自动生成。例如,通过聚类算法可以根据所述样本的特征,对所述样本划分多个类别。
在本实施方式中,所述概率分布数据可以用于表示所述样本分别属于不同样本类别的概率。其中,所述概率分布数据包括的多个概率数据可以分别对应不同的样本类别。所述概率分布数据可以是归一化后的数据,也可以是未归一化的数据。具体的,例如,基于一个样本可以生成相应的概率分布数据。所述概率分布数据可以通过一个四维向量表示:(0.2,0.05,0.1,0.65)。其中,四维向量的每一个维度可以对应一个样本类别。其中,四维向量的每个维度的数值可以用于表示所述样本属于该维度表示的样本类别的概率。其中,第四个维度的概率值可以为65%,那么可以表示所述样本属于第四个维度所表示的样本类别的概率有65%。
在本实施方式中,生成样本相对于多个样本类别的概率分布数据的方法,可以是基于机器学习模型对所述样本的表征数据预测得到。具体的,例如,所述样本可以是文本数据。所述机器学习模型可以用于预测所述文本数据的情感信息。所述机器学习模型可以输出一个包括三个概率数据的概率分布数据。其中,每个概率数据可以对应有不同的情感类别。例如,可以分别对应有“开心”“难过”以及“生气”。当然,生成样本相对于多个样本类别的概率分布数据的方法,也可以基于预设规则确定。具体的,例如,服务器可以预设有根据表示不同情感类别的词汇形成的字典库。统计文本数据中的属于不同情感类别的词汇,并且根据文本数据中属于不同情感类别的词汇的数量,可以生成所述概率分布数据。
步骤S102:根据所述概率分布数据确定所述样本的信息熵值。
在一些情况下,概率分布数据可以用于表征样本属于不同类别的概率。直接通过概率分布数据来量化样本属于某个样本类别的确定性并不显而易见。因此,可以通过概率分布数据的信息熵值来确定样本属于某一个样本类别的确定性。
在本实施方式中,根据所述概率分布数据确定所述样本的信息熵值,可以是 通过信息熵的计算方法得到。具体的,根据所述概率分布数据确定所述样本的信息熵值可以通过公式1计算得到。
H(X) = -∑_{i=1}^{k} p(x_i)·log p(x_i)    （公式1）
其中,p(x i)可以表示概率分布数据中第i个维度的概率数据。H(X)可以表示样本的概率分布数据的信息熵值。k可以表示所述概率分布数据所表示的类别的数量。
步骤S103:在所述信息熵值符合预设条件的情况下,为所述样本添加根据所述概率分布数据确定的样本类别的类别标签,以生成训练样本。
在一些情况下,所述信息熵值可以用于表示所述概率分布数据的不确定性。当所述信息熵值小于指定阈值的情况下,可以认为所述概率分布数据的不确定性较低。相反的,所述信息熵值小于指定阈值的情况下,所述概率分布数据的确定性可以认为较高。因此,可以将所述概率分布数据中最大的概率数据对应的样本类别,作为所述样本的标签,在一定程度上减少了标注的成本,提高了标注的效率。
在本实施方式中,所述预设条件可以表示基于概率分布数据得到的信息熵值小于指定的信息熵阈值。其中,所述信息熵阈值可以根据专家的经验确定,也可以根据基于历史样本的概率分布数据生成信息熵值的统计情况来确定。在概率分布数据得到的信息熵值小于指定的信息熵阈值条件的情况下,可以认为样本的概率分布数据所表现得样本属于某一个样本类别的不确定性较低,确定性相对较高,因此可以为所述样本添加根据所述概率分布数据确定的样本类别的类别标签,以生成训练样本。在一些实施方式中,所述预设条件也可以为基于概率分布数据生成的信息熵值在一个指定范围内。
在本实施方式中,为所述样本添加根据所述概率分布数据确定的样本类别的类别标签,以生成训练样本的方法,可以是将概率分布数据中最大的概率数据表示的样本类别,作为所述样本的标签。通过为所述样本添加相应的表示样本类别的标签,可以形成训练样本,以进一步训练机器学习模型,提高模型的性能。
在一些实施方式中,所述训练样本的生成方法,还可以包括:获取初始训练样本;其中,所述初始训练样本包括表征数据和表示样本类别的类别标签;根据所述初始训练样本,确定分类模型;其中,所述分类模型用于生成所述样本相对于所述多个样本类别的概率分布数据。
在一些情况下,通过部分具有类别标签的初始训练样本,可以训练一个分类模型。基于所述分类模型,可以为样本进行标注,以在一定程度上减少标注成本,提高标注效率。另外,通过基于部分具有类别标签的初始训练样本得到的分类模型,也可以用于筛选样本。对于分类模型能够较为准确地识别的样本,可以不用于加入为提高模型性能的训练集合。将一些分类模型还不能准确识别的样本,进 一步标注后训练分类模型,可以在一定程度上降低模型的训练样本数量并在一定程度上兼顾模型的训练效果。从而可以在一定程度上节约模型的训练时长。
所述初始训练样本可以具有相应的类别标签的训练样本。具体的,所述训练样本可以包括表征数据以及表示样本的样本类别的类别标签。
所述表征数据可以用于表示样本所包括的可以量化的信息。具体的,例如,对于文本类型的样本,文本信息可以作为所述表征数据。当然,基于所述文本信息生成的向量也可以作为所述表征数据。在一些实施方式中,所述表征数据可以是所述样本的特征。
所述类别标签可以表示所述初始训练样本的样本类别。例如,所述初始训练样本包括的表征数据可以是基于图像的像素值形成的三维张量。所述样本类别可以用于表示所述图像的类别。例如,所述图像可以是自然风景图像,也可以是人像图像。所述类别标签可以是一个多维向量。其中,每一个维度对应一个样本类别。例如,所述类别标签可以是一个三维度的向量,其中第一个维度的取值为1,其余维度的取值为0。所述类别标签可以表示样本属于向量的第一个维度所表示的样本类别。另外,所述初始训练样本的类别标签可以是人工标注,也可以是计算机自动标注。
所述获取初始训练样本的方法,可以是在大量无标签的样本中随机选择部分样本交由人工进行标注,形成所述初始训练样本。当然,所述初始训练样本也可以是数据库存储有的历史训练样本。相应的,所述获取初始训练样本的方法也可以是通过读取数据库得到。
所述分类模型可以是用于根据样本的表征数据对样本分类的模型。例如,所述模型可以是机器学习模型。所述分类模型可以用于根据输入的样本的表征数据,生成概率分布数据。基于所述概率分布数据,可以确定所述样本的样本类别。
相应的,根据所述初始训练样本,确定分类模型的方法,可以通过所述初始训练样本训练初始分类模型,训练完成后可以得到所述分类模型。
在一些实施方式中,所述训练样本的生成方法,还可以包括:基于所述分类模型和所述初始训练样本的表征数据,生成所述初始训练样本相对于所述多个样本类别的概率分布数据;根据所述初始训练样本对应的概率分布数据,计算所述初始训练样本对应的信息熵值,用于设定所述预设条件。
在一些情况下,通过所述初始训练样本可以用于设定所述预设条件,以用于为所述样本进行标注。
所述基于所述分类模型和所述初始训练样本的表征数据,生成所述初始训练样本相对于所述多个样本类别的概率分布数据的方法,可以将所述初始训练样本的表征数据输入分类模型,得到所述初始训练样本的概率分布数据。进一步地,基于所述概率分布数据可以预测所述初始训练样本的样本类别。并且,通过将预 测得到的所述初始训练样本的样本类别和初始训练样本的标签表示的样本类别比较,可以得到所述分类模型能够准确预测的至少部分的初始训练样本。
根据所述初始训练样本对应的概率分布数据,计算所述初始训练样本对应的信息熵值的方法,可以通过公式1计算所述初始训练样本的信息熵值。基于所述初始训练样本的信息熵值,可以设定所述预设条件。进一步的,对应所述分类模型能够准确预测的初始训练样本的信息熵值,可以设定所述预设条件。
所述预设条件可以是样本的概率分布数据的信息熵值低于指定阈值。所述信息熵阈值可以通过初始训练样本进行确定,以在一定程度上避免通过人工经验确定信息熵阈值而影响准确性。相应的,可以将初始训练样本分别对应的信息熵值进行计算,得到所述信息熵阈值。
在一些实施方式中,所述训练样本的生成方法还可以包括:将所述初始训练样本对应的信息熵值中取值最大的信息熵值,作为信息熵阈值,以用于设定所述预设条件;或者,将所述初始训练样本对应的信息熵值的平均值,作为信息熵阈值,以用于设定所述预设条件;或者,将所述初始训练样本对应的信息熵值的中位数,作为信息熵阈值,以用于设定所述预设条件。
在一些情况下,针对不同的任务需求,可以选择不同的信息熵阈值的计算方法。
在一些情况下,可以将所述初始训练样本信息熵值中取值最大的信息熵值作为信息熵阈值。相应的,所述预设条件可以表示为样本的信息熵值大于所述信息熵阈值。样本的信息熵值大于所述信息熵阈值的情况下,可以表示分类模型对于样本并不能较为准确地进行分类。相应的,在所述信息熵值符合预设条件的情况下,为所述样本添加根据所述概率分布数据确定的样本类别的类别标签,以生成训练样本的方法,可以是将所述样本发送给客户端,由人工进行标注,得到对应的类别标签,以生成训练样本。通过使用最大信息熵阈值,可以将分类模型不能较好识别的部分训练样本提供给人工进行标注,以用于进一步训练所述分类模型,提高所述分类模型的识别性能。并且,将所述分类模型无法较好识别的样本来进一步训练分类模型,可以通过较少量的样本提高所述分类模型的性能。同时,在一定程度上也避免了通过分类模型能够较好识别的样本去进一步训练,节约了部分模型训练的时间。
当然,在一些实施方式中,预设条件也可以是样本的信息熵值小于所述信息熵阈值。相应的,在所述信息熵值符合预设条件的情况下,为所述样本添加根据所述概率分布数据确定的样本类别的类别标签,以生成训练样本的方法,也可以是根据所述概率分布数据中取值最大的概率数据对应的样本类别,由服务器自动生成所述样本的类别标签。
在一些情况下,通过所述初始训练样本对应的信息熵值的最大值生成信息熵 阈值,并且在所述样本的信息熵值小于所述信息熵阈值的情况下由服务器自动生成所述样本的类别标签的方法,可能会因为信息熵阈值较大而在一定程度上降低样本类别标注的准确性。因此,可以通过采用初始训练样本的信息熵值的平均值,作为信息熵阈值。
在一些情况下,也可以通过采用初始训练样本的信息熵值的平均值,作为信息熵阈值。相应的,预设条件可以为所述样本的信息熵值小于所述信息熵阈值。在所述信息熵值符合预设条件的情况下,为所述样本添加根据所述概率分布数据确定的样本类别的类别标签,以生成训练样本的方法,可以是将所述概率分布数据中取值最大的概率数据对应的样本类别,作为所述样本的类别标签。
在一些实施方式中,根据所述初始训练样本,确定分类模型的步骤,可以包括:以所述初始训练样本,作为初始分类模型的输入,以所述初始训练样本的类别标签,作为所述初始分类模型的目标输出,训练所述初始分类模型,得到所述分类模型,使得所述分类模型趋于完全拟合所述初始训练样本。
在一些实施方式中,在通过所述初始训练样本训练初始分类模型,得到所述分类模型的过程中,可以使得初始分类模型趋于完全拟合所述初始训练样本。在基于所述分类模型去生成概率分布数据的过程中,由于分类模型趋于完全拟合所述初始训练样本,因此可以认为在样本的特征与所述初始训练样本的特征较为接近的情况下,所述分类模型可以将所述样本正确分类。相应的,基于分类模型输出的概率分布数据,可以为所述训练样本添加类别标签,并在一定程度上确保了类别标签的准确性。虽然分类模型趋于完全拟合所述初始训练样本,在一定程度上降低了泛化性能,但是提高了通过分类模型的概率分布数据为所述样本添加类别标签的准确性。通过所述分类模型标注的样本可以进一步对所述分类模型训练以提高分类模型的性能。另外,对于未能正确分类的样本,也可以通过人工进行标注,在一定程度上弥补了泛化性能在一定程度上下降的问题。具体的,由于所述分类模型趋于完全拟合所述初始训练样本,因此根据所述分类模型得到的初始训练样本的信息熵值可能相对低。相应的,在样本的信息熵值符合低于信息熵阈值的条件,也可以表示所述样本属于所述样本类别的不确定性较小,因此为其添加的标签的准确性也相对较高。
在一些实施方式中,所述训练样本的生成方法还可以包括:对所述样本进行聚类处理,得到多个对应不同聚类簇的样本子集;在所述样本子集中选择样本,作为目标样本;获取所述目标样本被人工添加的类别标签,以生成所述初始训练样本。
在一些实施方式中,可以随机选择部分样本提供给工作人员进行标注,以得到初始训练样本。但是通过随机选择部分样本的方法,可能会导致不同样本类别的样本的占比差异较大。从而可能导致模型对部分样本类别较少的样本的识别能 力不足。因此,可以对所述样本进行聚类处理,得到多个对应不同聚类簇的样本子集。其中,每个样本子集的样本可以认为具有一定的相似性。通过在所述样本子集中选择样本,作为目标样本,可以在一定程度上保证不同样本类别的初始训练样本的数量相近,以在一定程度上保证基于所述初始训练样本训练得到的模型性能。
对所述样本进行聚类处理,得到多个对应不同聚类簇的样本子集的方法,可以是通过聚类算法对所述样本进行聚类操作。具体的,例如,所述聚类算法可以是K均值算法(k-means),也可以是最邻近节点算法(KNN)。
在所述样本子集中选择样本,作为目标样本的方法,可以是随机在样本子集中选择样本。当然,在所述样本子集中选择样本,作为目标样本的方法也可以是进一步计算样本子集中样本所表示的特征向量之间的距离,选择距离较远的若干个样本,作为目标样本。
获取所述目标样本被人工添加的类别标签,以生成所述初始训练样本的方法,可以是将样本提供给用户,并根据用户的标注信息,确定相应样本的标签,以生成所述初始训练样本。
在一些实施方式中,所述预设条件包括所述样本的信息熵值小于信息熵阈值;相应的,为所述样本添加根据所述概率分布数据确定的样本类别的类别标签,以生成训练样本的步骤,包括:根据所述概率分布数据表示的所述样本属于不同所述样本类别的概率中,最大概率值对应的样本类别,为所述样本添加相应的类别标签,以生成所述训练样本。
在一些实施方式中,所述预设条件包括所述样本的信息熵值小于信息熵阈值。通过分类模型生成的样本的概率分布数据的信息熵值小于信息熵阈值的情况,可以认为所述概率分布数据的不确定性较低,相应的,确定性高。可以认为所述分类模型能够识别出相应样本的样本类别且确定性较高。因此,可以根据所述概率分布数据中,最大概率值对应的样本类别来为所述样本添加类别标签,以生成训练样本。在一定程度可以减少人工标注的成本。并且,通过调整所述信息熵阈值的大小,也可以相应调整根据分类模型确定的样本类别的置信程度。
在一些实施方式中,所述样本相对于多个样本类别的概率分布数据使用分类模型生成;相应的,所述训练样本的生成方法还可以包括:在所述信息熵值不符合预设条件的情况下,获取所述样本被人工添加的类别标签,以用于调整所述分类模型。
在一些实施方式中,业务系统上可能生成大量的能用于训练所述分类模型的样本。业务系统生成的样本可以用于提高所述分类模型的性能。但是,通过大量样本去训练所述分类模型,可能会需要较长的时间。因此,可以在业务系统生成的样本中,选择分类模型识别能力较差的样本来训练所述分类模型,以在一定程 度上降低模型训练的时长。具体的,可以通过分类模型来生成所述样本的概率分布数据。基于所述概率分布数据,可以确定相应的信息熵。在所述信息熵值不符合预设条件的情况下,可以包括所述信息熵值大于信息熵阈值的情况,可以认为分类模型并不能较好地识别出所述样本的类别。此时,训练样本的生成系统可以获取所述样本被人工添加的类别标签,以形成训练样本,以用于调整所述分类模型。
在一些实施方式中,对应不同任务提供有多种预设条件,所述训练样本的生成方法还可以包括:依照对应不同任务的任务标识,匹配相对应的预设条件。
在一些情况下,对应不同的任务可以提供有不同的预设条件。具体的,例如,对于一些对标注准确率具有较高需求的应用场景,可以设定较小的信息熵阈值,以在一定程度上保证自动标注的准确性。而对于一些对标注准确率要求不高的应用场景,可以设定相对较大的信息熵阈值,以获取较多的标注样本。因此,对应不同的预设条件,可以设置有相对应的任务标识。在训练样本的生成方法运行的过程中,可以根据不同的任务标识匹配对应的预设条件,以在一定程度上降低工作人员的工作量。
本申请提供了一种分类模型的训练方法,所述方法可以包括:使用上述任意一实施方式中的训练样本的生成方法得到的训练样本,训练分类模型。
请参阅图4,本申请还提供一种训练样本的生成装置。所述训练样本的生成装置可以包括生成模块、确定模块和添加模型。
生成模块,用于生成样本相对于多个样本类别的概率分布数据;其中,所述概率分布数据表示样本属于不同所述样本类别的概率。
确定模块,用于根据所述概率分布数据确定所述样本的信息熵值。
添加模型,用于在所述信息熵值符合预设条件的情况下,为所述样本添加根据所述概率分布数据确定的样本类别的类别标签,以生成训练样本。
请参阅图5,本申请还提供一种计算机设备,包括存储器和处理器,所述存储器存储有计算机程序,所述处理器执行所述计算机程序时实现上述任一实施方式中的训练样本的生成方法。
本申请还提供一种计算机可读存储介质,其上存储有计算机程序,该计算机程序被计算机执行时,该计算机执行上述任一实施方式中的训练样本的生成方法。
本申请还提供一种包含指令的计算机程序产品,该指令被计算机执行时使得计算机执行上述任一实施方式中的训练样本的生成方法。
可以理解,本文中的具体的例子只是为了帮助本领域技术人员更好地理解本申请,而非限制本发明的范围。
可以理解,在本申请的各种实施方式中,各过程的序号的大小并不意味着执行顺序的先后,各过程的执行顺序应以其功能和内在逻辑确定,而不应对本申请 的实施过程构成任何限定。
可以理解,本申请描述的各种实施方式,既可以单独实施,也可以组合实施,本申请对此并不限定。
除非另有说明,本申请所使用的所有技术和科学术语与本说明书的技术领域的技术人员通常理解的含义相同。本申请所使用的术语只是为了描述具体的实施方式的目的,不是旨在限制本说明书的范围。本说明书所使用的术语“和/或”包括一个或多个相关的所列项的任意的和所有的组合。在本申请和所附权利要求书中所使用的单数形式的“一种”、“上述”和“该”也旨在包括多数形式,除非上下文清楚地表示其他含义。
可以理解,本申请的处理器可以是一种集成电路芯片,具有信号的处理能力。在实现过程中,上述方法实施方式的各步骤可以通过处理器中的硬件的集成逻辑电路或者软件形式的指令完成。上述的处理器可以是通用处理器、数字信号处理器(Digital SignalProcessor,DSP)、专用集成电路(Application Specific Integrated Circuit,ASIC)、现成可编程门阵列(Field Programmable Gate Array,FPGA)或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件。可以实现或者执行本申请中的公开的各方法、步骤及逻辑框图。通用处理器可以是微处理器或者该处理器也可以是任何常规的处理器等。结合本申请所公开的方法的步骤可以直接体现为硬件译码处理器执行完成,或者用译码处理器中的硬件及软件模块组合执行完成。软件模块可以位于随机存储器,闪存、只读存储器,可编程只读存储器或者电可擦写可编程存储器、寄存器等本领域成熟的存储介质中。该存储介质位于存储器,处理器读取存储器中的信息,结合其硬件完成上述方法的步骤。
可以理解,本申请中的存储器可以是易失性存储器或非易失性存储器,或可包括易失性和非易失性存储器两者。其中,非易失性存储器可以是只读存储器(ROM)、可编程只读存储器(programmable ROM,PROM)、可擦除可编程只读存储器(erasable PROM,EPROM)、电可擦除可编程只读存储器(EEPROM)或闪存。易失性存储器可以是随机存取存储器(RAM)。应注意,本文描述的系统和方法的存储器旨在包括但不限于这些和任意其它适合类型的存储器。
本领域普通技术人员可以意识到,结合本文中所公开的实施方式描述的各示例的单元及算法步骤,能够以电子硬件、或者计算机软件和电子硬件的结合来实现。这些功能究竟以硬件还是软件方式来执行,取决于技术方案的特定应用和设计约束条件。专业技术人员可以对每个特定的应用来使用不同方法来实现所描述的功能,但是这种实现不应认为超出本说明书的范围。
所属领域的技术人员可以清楚地了解到,为描述的方便和简洁,上述描述的系统、装置和单元的具体工作过程,可以参考前述方法实施方式中的对应过程,在此不再赘述。
在本说明书所提供的几个实施方式中,应所述理解到,所揭露的系统、装置和方法,可以通过其它的方式实现。例如,以上所描述的装置实施方式仅仅是示意性的,例如,所述单元的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,例如多个单元或组件可以结合或者可以集成到另一个系统,或一些特征可以忽略,或不执行。另一点,所显示或讨论的相互之间的耦合或直接耦合或通信连接可以是通过一些接口,装置或单元的间接耦合或通信连接,可以是电性,机械或其它的形式。
所述作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部单元来实现本实施方式方案的目的。
另外,在本说明书各个实施方式中的各功能单元可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以两个或两个以上单元集成在一个单元中。
所述功能如果以软件功能单元的形式实现并作为独立的产品销售或使用时,可以存储在一个计算机可读取存储介质中。基于这样的理解,本说明书的技术方案本质上或者说对现有技术做出贡献的部分或者所述技术方案的部分可以以软件产品的形式体现出来,所述计算机软件产品存储在一个存储介质中,包括若干指令用以使得一台计算机设备(可以是个人计算机,服务器,或者网络设备等)执行本说明书各个实施方式所述方法的全部或部分步骤。而前述的存储介质包括:U盘、移动硬盘、只读存储器(ROM)、随机存取存储器(RAM)、磁碟或者光盘等各种可以存储程序代码的介质。
以上所述,仅为本说明书的具体实施方式,但本发明的保护范围并不局限于此,任何熟悉本技术领域的技术人员在本说明书揭露的技术范围内,可轻易想到变化或替换,都应涵盖在本说明书的保护范围之内。因此,本发明的保护范围应所述以权利要求的保护范围为准。

Claims (18)

  1. A method for generating training samples, comprising:
    generating probability distribution data of a sample relative to multiple sample categories; wherein the probability distribution data represents the probabilities that the sample belongs to the different sample categories;
    determining an information entropy value of the sample according to the probability distribution data;
    when the information entropy value meets a preset condition, adding to the sample a category label of a sample category determined according to the probability distribution data, to generate a training sample.
  2. The method according to claim 1, wherein the method further comprises:
    obtaining initial training samples; wherein the initial training samples include characterization data and category labels representing sample categories;
    determining a classification model according to the initial training samples; wherein the classification model is used to generate the probability distribution data of the sample relative to the multiple sample categories.
  3. The method according to claim 2, wherein the method further comprises:
    generating, based on the classification model and the characterization data of the initial training samples, probability distribution data of the initial training samples relative to the multiple sample categories;
    calculating, according to the probability distribution data corresponding to the initial training samples, information entropy values corresponding to the initial training samples, for setting the preset condition.
  4. The method according to claim 3, wherein the method further comprises:
    using the largest information entropy value among the information entropy values corresponding to the initial training samples as an information entropy threshold, for setting the preset condition.
  5. The method according to claim 3, wherein the method further comprises:
    using the average of the information entropy values corresponding to the initial training samples as an information entropy threshold, for setting the preset condition.
  6. The method according to claim 3, wherein the method further comprises:
    using the median of the information entropy values corresponding to the initial training samples as an information entropy threshold, for setting the preset condition.
  7. The method according to any one of claims 2 to 6, wherein determining a classification model according to the initial training samples comprises:
    using the initial training samples as the input of an initial classification model and the category labels of the initial training samples as the target output of the initial classification model, and training the initial classification model to obtain the classification model, such that the classification model tends to completely fit the initial training samples.
  8. The method according to any one of claims 2 to 7, wherein the method further comprises:
    clustering the samples to obtain multiple sample subsets corresponding to different clusters;
    selecting samples from the sample subsets as target samples;
    obtaining manually added category labels of the target samples to generate the initial training samples.
  9. The method according to any one of claims 2 to 7, wherein the characterization data includes features of the sample.
  10. The method according to any one of claims 1 to 9, wherein the preset condition includes that the information entropy value of the sample is less than an information entropy threshold;
    adding to the sample a category label of the sample category determined according to the probability distribution data, to generate a training sample, comprises:
    adding a corresponding category label to the sample according to the sample category corresponding to the maximum probability value among the probabilities, represented by the probability distribution data, that the sample belongs to the different sample categories, to generate the training sample.
  11. The method according to claim 1, wherein the probability distribution data of the sample relative to the multiple sample categories is generated using a classification model; the method further comprises:
    when the information entropy value does not meet the preset condition, obtaining a manually added category label of the sample, for adjusting the classification model.
  12. The method according to any one of claims 1 to 11, wherein multiple preset conditions are provided corresponding to different tasks, and the method further comprises:
    matching the corresponding preset condition according to task identifiers corresponding to the different tasks.
  13. The method according to any one of claims 1 to 12, wherein the sample includes dialogue text.
  14. The method according to any one of claims 1 to 13, wherein the probability distribution data is represented by a four-dimensional vector, each dimension of the four-dimensional vector corresponds to one sample category, and the value of each dimension of the four-dimensional vector represents the probability that the sample belongs to the sample category represented by that dimension.
  15. A method for training a classification model, comprising:
    training a classification model using training samples obtained by the method for generating training samples according to any one of claims 1-14.
  16. A device for generating training samples, comprising:
    a generation module, configured to generate probability distribution data of a sample relative to multiple sample categories; wherein the probability distribution data represents the probabilities that the sample belongs to the different sample categories;
    a determination module, configured to determine an information entropy value of the sample according to the probability distribution data;
    an adding module, configured to, when the information entropy value meets a preset condition, add to the sample a category label of a sample category determined according to the probability distribution data, to generate a training sample.
  17. A computer device, comprising a memory and a processor, the memory storing a computer program, wherein when the processor executes the computer program, the method according to any one of claims 1 to 15 is implemented.
  18. A computer-readable storage medium on which a computer program is stored, wherein when the computer program is executed by a processor, the method according to any one of claims 1 to 15 is implemented.
PCT/CN2022/143938 2022-07-29 2022-12-30 训练样本的生成方法、装置、设备和存储介质 WO2024021526A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210910055.0A CN117520836A (zh) 2022-07-29 2022-07-29 训练样本的生成方法、装置、设备和存储介质
CN202210910055.0 2022-07-29

Publications (1)

Publication Number Publication Date
WO2024021526A1 true WO2024021526A1 (zh) 2024-02-01

Family

ID=89705181

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/143938 WO2024021526A1 (zh) 2022-07-29 2022-12-30 训练样本的生成方法、装置、设备和存储介质

Country Status (2)

Country Link
CN (1) CN117520836A (zh)
WO (1) WO2024021526A1 (zh)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020096099A1 (ko) * 2018-11-09 2020-05-14 주식회사 루닛 기계 학습 방법 및 장치
CN109886925A (zh) * 2019-01-19 2019-06-14 天津大学 一种主动学习与深度学习相结合的铝材表面缺陷检测方法
CN110110080A (zh) * 2019-03-29 2019-08-09 平安科技(深圳)有限公司 文本分类模型训练方法、装置、计算机设备及存储介质
CN111291902A (zh) * 2020-04-24 2020-06-16 支付宝(杭州)信息技术有限公司 后门样本的检测方法、装置和电子设备
CN113569115A (zh) * 2021-02-19 2021-10-29 腾讯科技(深圳)有限公司 数据分类方法、装置、设备及计算机可读存储介质

Also Published As

Publication number Publication date
CN117520836A (zh) 2024-02-06

Similar Documents

Publication Publication Date Title
US11537884B2 (en) Machine learning model training method and device, and expression image classification method and device
WO2020125445A1 (zh) 分类模型训练方法、分类方法、设备及介质
WO2022126971A1 (zh) 基于密度的文本聚类方法、装置、设备及存储介质
CN110069709B (zh) 意图识别方法、装置、计算机可读介质及电子设备
WO2022077646A1 (zh) 一种用于图像处理的学生模型的训练方法及装置
CN111931592B (zh) 对象识别方法、装置及存储介质
WO2021051598A1 (zh) 文本情感分析模型训练方法、装置、设备及可读存储介质
JP7403605B2 (ja) マルチターゲット画像テキストマッチングモデルのトレーニング方法、画像テキスト検索方法と装置
US9275306B2 (en) Devices, systems, and methods for learning a discriminant image representation
WO2020147409A1 (zh) 一种文本分类方法、装置、计算机设备及存储介质
WO2023272852A1 (zh) 通过决策树模型对用户进行分类的方法、装置、设备和存储介质
CN111950279B (zh) 实体关系的处理方法、装置、设备及计算机可读存储介质
CN111898550B (zh) 建立表情识别模型方法、装置、计算机设备及存储介质
CN111325156A (zh) 人脸识别方法、装置、设备和存储介质
CN111783039B (zh) 风险确定方法、装置、计算机系统和存储介质
CN112668482B (zh) 人脸识别训练方法、装置、计算机设备及存储介质
CN114444619B (zh) 样本生成方法、训练方法、数据处理方法以及电子设备
CN110569289A (zh) 基于大数据的列数据处理方法、设备及介质
CN116343233B (zh) 文本识别方法和文本识别模型的训练方法、装置
WO2020135054A1 (zh) 视频推荐方法、装置、设备及存储介质
WO2024021526A1 (zh) 训练样本的生成方法、装置、设备和存储介质
Zamzami et al. An accurate evaluation of msd log-likelihood and its application in human action recognition
US11593740B1 (en) Computing system for automated evaluation of process workflows
Song et al. Iterative 3D shape classification by online metric learning
CN114443864A (zh) 跨模态数据的匹配方法、装置及计算机程序产品

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22952955

Country of ref document: EP

Kind code of ref document: A1