CN117725420A

CN117725420A - Data set generation method and device, readable medium and electronic equipment

Info

Publication number: CN117725420A
Application number: CN202311843238.6A
Authority: CN
Inventors: 裴俊宇
Original assignee: Beijing Zitiao Network Technology Co Ltd
Current assignee: Beijing Zitiao Network Technology Co Ltd
Priority date: 2023-12-28
Filing date: 2023-12-28
Publication date: 2024-03-19

Abstract

The disclosure relates to a data set generation method, a data set generation device, a readable medium and electronic equipment. The method comprises: randomly sampling a plurality of first samples from a full sample set to obtain a first sample set; obtaining a plurality of second samples with labeling information to obtain a second sample set; determining first distribution information, which characterizes how the scores output by a target model for the first sample set are distributed across score intervals; determining second distribution information, which characterizes how the scores output by the target model for the second sample set are distributed across the score intervals; determining a sampling proportion according to the first distribution information and the second distribution information; and sampling from the second sample set according to the sampling proportion and the first distribution information to obtain a target data set, so that the distribution across the score intervals of the scores output by the target model for the target data set matches the first distribution information.

Description

Data set generation method and device, readable medium and electronic equipment

Technical Field

The present disclosure relates to the field of computer technologies, and in particular, to a data set generating method, apparatus, readable medium, and electronic device.

Background

In machine learning scenarios, the data set from which a model learns often suffers from sample imbalance. For example, because data are naturally distributed unevenly across categories, a data set of 1000 samples may contain only 10 samples of a certain category; alternatively, samples of one or more categories may have been collected selectively to achieve a certain learning effect. With such a data set, the learner tends to favor the categories with more samples, which results in unfair learning, and the resulting model may perform poorly in practical applications.

Disclosure of Invention

This section is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This section is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.

In a first aspect, the present disclosure provides a data set generation method, the method comprising:

randomly sampling a plurality of first samples from a full sample set to obtain a first sample set;

obtaining a plurality of second samples with labeling information to obtain a second sample set;

determining first distribution information, wherein the first distribution information characterizes how the scores output by the target model for the first sample set are distributed across the score intervals;

determining second distribution information, wherein the second distribution information characterizes how the scores output by the target model for the second sample set are distributed across the score intervals;

determining a sampling proportion according to the first distribution information and the second distribution information;

and sampling from the second sample set according to the sampling proportion and the first distribution information to obtain a target data set, so that the distribution across the score intervals of the scores output by the target model for the target data set matches the first distribution information.

In a second aspect, the present disclosure provides a data set generating apparatus, the apparatus comprising:

the first sampling module is used for randomly sampling a plurality of first samples from the full sample set to obtain a first sample set;

the acquisition module is used for acquiring a plurality of second samples with marking information to obtain a second sample set;

the first determining module is used for determining first distribution information, wherein the first distribution information characterizes how the scores output by the target model for the first sample set are distributed across the score intervals;

the second determining module is used for determining second distribution information, and the second distribution information is used for representing the distribution condition of the output scores of the target model after the second sample set is processed in each score interval;

the third determining module is used for determining a sampling proportion according to the first distribution information and the second distribution information;

and the second sampling module is used for sampling from the second sample set according to the sampling proportion and the first distribution information to obtain a target data set, so that the distribution across the score intervals of the scores output by the target model for the target data set matches the first distribution information.

In a third aspect, the present disclosure provides a computer readable medium having stored thereon a computer program which when executed by a processing device performs the steps of the method of the first aspect of the present disclosure.

In a fourth aspect, the present disclosure provides an electronic device comprising:

a storage device having at least one computer program stored thereon;

at least one processing device for executing the at least one computer program in the storage device to carry out the steps of the method of the first aspect of the present disclosure.

According to the above technical solution, a first sample set is obtained by randomly sampling the full sample set, and a second sample set with labeling information is obtained. First distribution information and second distribution information are determined, which characterize how the scores output by the target model for the first sample set and the second sample set, respectively, are distributed across the score intervals. A sampling proportion is determined according to the first distribution information and the second distribution information, and the second sample set is sampled according to the sampling proportion to obtain a target data set whose score distribution conforms to the first distribution information. Because the first sample set is randomly sampled from the full sample set, it reflects the sample distribution of the full sample set, and the first distribution information reflects the score distribution obtained when the target model processes the first sample set. The sampling proportion is determined from the difference between the second distribution information and the first distribution information, so the target data set sampled from the labeled second sample set at that proportion has a score distribution under the target model that is consistent with the first distribution information; in effect, the target data set has a sample distribution similar to that of the first sample set. The target data set therefore reflects the sample distribution of the full sample set, and the target model can be tested more accurately on it, which effectively avoids the problem of a model performing well on the test set but poorly in actual application scenarios.

Additional features and advantages of the present disclosure will be set forth in the detailed description which follows.

Drawings

The above and other features, advantages, and aspects of embodiments of the present disclosure will become more apparent by reference to the following detailed description when taken in conjunction with the accompanying drawings. The same or similar reference numbers will be used throughout the drawings to refer to the same or like elements. It should be understood that the figures are schematic and that elements and components are not necessarily drawn to scale. In the drawings:

FIG. 1 is a flow chart of a data set generation method provided in accordance with one embodiment of the present disclosure;

FIG. 2 is a block diagram of a data set generating device provided in accordance with one embodiment of the present disclosure;

FIG. 3 is a schematic structural diagram of an electronic device suitable for implementing embodiments of the present disclosure.

Detailed Description

Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the accompanying drawings, it is to be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the present disclosure are for illustration purposes only and are not intended to limit the scope of the present disclosure.

It should be understood that the various steps recited in the method embodiments of the present disclosure may be performed in a different order and/or performed in parallel. Furthermore, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the present disclosure is not limited in this respect.

The term "including" and variations thereof as used herein are open-ended, i.e., "including, but not limited to". The term "based on" means "based at least in part on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments". Related definitions of other terms will be given in the description below.

It should be noted that the terms "first," "second," and the like in this disclosure are merely used to distinguish between different devices, modules, or units and are not used to define an order or interdependence of functions performed by the devices, modules, or units.

It should be noted that the modifiers "a", "one" and "a plurality of" in this disclosure are illustrative rather than limiting, and those of ordinary skill in the art will appreciate that they should be understood as "one or more" unless the context clearly indicates otherwise.

The names of messages or information interacted between the various devices in the embodiments of the present disclosure are for illustrative purposes only and are not intended to limit the scope of such messages or information.

All actions of obtaining signals, information or data in this disclosure are performed in compliance with the applicable data protection laws and policies of the country concerned and with authorization from the owner of the corresponding device.

It will be appreciated that, before the technical solutions disclosed in the embodiments of the present disclosure are used, the user should be informed of, and should authorize, the type, scope of use, usage scenarios, etc. of the personal information involved in the present disclosure, in an appropriate manner and in accordance with relevant laws and regulations.

For example, in response to receiving an active request from a user, prompt information is sent to the user to explicitly indicate that the requested operation will require the user's personal information to be obtained and used. The user can then decide, according to the prompt information, whether to provide personal information to the software or hardware that performs the operations of the technical solution of the present disclosure, such as an electronic device, an application program, a server or a storage medium.

As an optional but non-limiting implementation, the prompt information may be sent to the user in response to the active request by means of, for example, a pop-up window, in which the prompt information may be presented as text. The pop-up window may also carry a selection control that lets the user choose "agree" or "disagree" to providing personal information to the electronic device.

It will be appreciated that the above-described notification and user authorization process is merely illustrative and not limiting of the implementations of the present disclosure, and that other ways of satisfying relevant legal regulations may be applied to the implementations of the present disclosure.

Meanwhile, it can be understood that the data (including but not limited to the data itself, the acquisition or the use of the data) related to the technical scheme should conform to the requirements of the corresponding laws and regulations and related regulations.

As described in the background, data sets used for model training in machine learning scenarios often suffer from sample imbalance. When training models, particularly complex models, the labeled sample set available for model testing is often biased: the training set is often deliberately biased to obtain good training results, and the test set is often a simple split of the training set, so it keeps the same sample distribution ratio (e.g., the ratio of positive to negative samples) as the training set. A model obtained from such a training set therefore often still performs well on the test set, but performs poorly in practical application scenarios, i.e., when applied to the unbiased full data set. As a result, model testing based on such a test set does not accurately reflect the performance of the model.

In order to solve the technical problems, the present disclosure provides a data set generating method, a device, a readable medium and an electronic apparatus.

Unbiased estimation refers to inference without systematic error when sample statistics are used to estimate population parameters. If the mathematical expectation of an estimator equals the true value of the estimated parameter, the estimator is said to be an unbiased estimate of that parameter; unbiasedness is one criterion for judging the quality of an estimator. From this definition it follows that: in a narrow sense, an unbiased data set is a data set drawn from the samples to be evaluated by a random strategy such as random sampling or stratified sampling; in a broad sense, an unbiased data set is a data set whose samples are consistent with, or differ only slightly from, the samples to be evaluated in all features.

Based on the broad definition of an unbiased data set, it can be concluded that an unbiased data set need not be obtained only by random sampling from the large sample (i.e., the full sample); the purpose of an unbiased data set is also achieved as long as the distribution of the generated data set is consistent with that of the large sample. Based on this conclusion, the main idea of the present disclosure is to resample from all existing labeled sample sets and combine the results into a set whose distribution is consistent with that of the large sample.
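In symbols, writing $\hat{\theta}$ for the estimator and $\theta$ for the parameter being estimated (notation introduced here for illustration only; it does not appear elsewhere in this disclosure), the unbiasedness condition stated above reads:

```latex
\mathbb{E}\left[\hat{\theta}\right] = \theta
```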

Based on this, the present disclosure will be described in detail.

Fig. 1 is a flowchart of a data set generation method provided according to one embodiment of the present disclosure. As shown in fig. 1, the method may include steps 11 to 16.

In step 11, a plurality of first samples are randomly sampled from a full sample set, resulting in a first sample set.

In step 12, a plurality of second samples with labeling information are obtained, resulting in a second sample set.

It is noted that the object of the present disclosure is to determine a target data set that evaluates the performance of a target model more accurately. All samples referred to in the present disclosure are therefore samples corresponding to the target model: the full sample set is the full set of samples corresponding to the target model, and the second samples are labeled samples corresponding to the target model.

The first sample set is obtained by random sampling from the full sample set and thus corresponds to an unbiased sample set. The second sample set may be the labeled portion of the full sample set. The second sample set may be obtained by merging the historical training sets and historical test sets of the target model.

It should be noted that, the steps 11 and 12 are not strictly executed in sequence, and may be executed sequentially or simultaneously, which is not limited in this disclosure.
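Steps 11 and 12 can be sketched as follows. This is a minimal illustration rather than the disclosed implementation: the function names, the `(sample_id, features, label)` tuple layout, and the choice to drop duplicate sample ids when merging historical sets are all assumptions made for the example.

```python
import random


def draw_first_sample_set(full_samples, k, seed=None):
    """Step 11: uniform random sampling without replacement from the full sample set."""
    rng = random.Random(seed)
    return rng.sample(list(full_samples), k)


def build_second_sample_set(historical_train, historical_test):
    """Step 12: merge labeled samples from historical training and test sets.

    Assumption: each sample is a (sample_id, features, label) tuple, and a
    sample id that appears more than once is kept only the first time.
    """
    merged = {}
    for sample_id, features, label in list(historical_train) + list(historical_test):
        merged.setdefault(sample_id, (sample_id, features, label))
    return list(merged.values())


first_set = draw_first_sample_set(range(1000), k=100, seed=0)
second_set = build_second_sample_set(
    [(1, "a", 0), (2, "b", 1)],
    [(2, "b", 1), (3, "c", 0)],
)
```

The two functions do not depend on each other, which matches the note that steps 11 and 12 may run in either order or at the same time.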

In step 13, first distribution information is determined.

The first distribution information is used for representing the distribution condition of the output scores of the target model after the first sample set is processed in each score interval.

The score output by a model is a composite result of the model's evaluation of many different features, so the score output for a sample reflects the overall state of its features to a certain extent. Determining the distribution of the model's output scores therefore avoids evaluating each feature separately, which is simple to implement and efficient.

The score intervals are at least two intervals obtained by dividing the overall score range [0, 1]. For example, the score intervals may be [0, 0.5) and [0.5, 1], which corresponds to a division into negative and positive samples. The score intervals can be divided flexibly according to actual requirements; the more score intervals there are, the better the final unbiased effect.
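One way to represent the score intervals is as a sorted list of edges over [0, 1]. The sketch below assumes equal-width intervals that are half-open on the right, with the last interval closed so that a score of exactly 1 is covered; the disclosure does not prescribe either choice, and `make_edges` and `interval_index` are hypothetical helper names.

```python
import bisect


def make_edges(num_intervals):
    """Equal-width interval edges over [0, 1], e.g. 2 -> [0.0, 0.5, 1.0]."""
    if num_intervals < 2:
        raise ValueError("at least two score intervals are required")
    return [i / num_intervals for i in range(num_intervals + 1)]


def interval_index(score, edges):
    """Index of the interval containing score.

    Intervals are [lo, hi) except the last one, which is [lo, hi].
    """
    if not edges[0] <= score <= edges[-1]:
        raise ValueError(f"score {score} is outside [{edges[0]}, {edges[-1]}]")
    # bisect_right counts the edges <= score; clamp so that the top edge
    # falls into the last interval instead of past it.
    return min(bisect.bisect_right(edges, score) - 1, len(edges) - 2)


edges = make_edges(2)
negative_bin = interval_index(0.49, edges)  # falls in [0, 0.5)
positive_bin = interval_index(0.5, edges)   # falls in [0.5, 1]
```

Unequal intervals would work the same way: any strictly increasing edge list starting at 0 and ending at 1 can be passed to `interval_index`.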

In one possible embodiment, step 13 may comprise the steps of:

for each first sample in the first sample set, inputting the first sample into the target model to obtain a first score output by the target model;

and determining, according to the score interval in which each first score falls, the first number of first samples corresponding to each score interval, so as to obtain the first distribution information.

For each first sample in the first set of samples, the first sample may be input into the target model, respectively, and the target model may output a score corresponding to the first sample, i.e., a first score. In this way, a respective first score for each first sample in the first set of samples may be obtained.

Based on the first score of each first sample and the score intervals, the score interval in which each first score falls can be determined, and hence the number of samples in each score interval, i.e., the first number of first samples corresponding to each score interval. This gives the first distribution information.

Based on the above, the obtained first distribution information can reflect the score distribution condition of the unbiased data set after the target model processing, and if the score distribution condition of the data set after the target model processing can be matched with the first distribution information, the data set can be considered to be unbiased.
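The counting described for step 13 can be sketched as a per-interval histogram. The example assumes equal-width, right-open intervals over [0, 1] and a caller-supplied `score_fn` standing in for the target model; both are illustrative assumptions, and `score_distribution` is a hypothetical name.

```python
from collections import Counter


def score_distribution(samples, score_fn, num_intervals=2):
    """Count how many samples fall in each equal-width score interval of [0, 1].

    score_fn stands in for the target model: it maps a sample to a score.
    Returns a list whose i-th entry is the count for the i-th interval.
    """
    counts = Counter()
    for sample in samples:
        score = score_fn(sample)
        # Intervals are [lo, hi); a score of exactly 1.0 goes in the last one.
        idx = min(int(score * num_intervals), num_intervals - 1)
        counts[idx] += 1
    return [counts[i] for i in range(num_intervals)]


# Toy usage: the "samples" are already scores, so score_fn is the identity.
first_counts = score_distribution([0.1, 0.2, 0.6, 0.9, 1.0], lambda s: s)
```

The resulting list of counts plays the role of the first distribution information.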

In step 14, second distribution information is determined.

The second distribution information is used for representing the distribution condition of the output scores of the target model after the second sample set is processed in each score interval. The score interval here is the same as that in the first distribution information.

In one possible embodiment, step 14 may comprise the steps of:

for each second sample in the second sample set, inputting the second sample into the target model to obtain a second score output by the target model;

and determining, according to the score interval in which each second score falls, the second number of second samples corresponding to each score interval, so as to obtain the second distribution information.

For each second sample in the second sample set, the second sample may be input into the target model, respectively, and the target model may output a score corresponding to the second sample, i.e., a second score. In this way, a respective second score for each second sample in the second set of samples may be obtained.

Based on the second score of each second sample and the score intervals, the score interval in which each second score falls can be determined, and hence the number of samples in each score interval, i.e., the second number of second samples corresponding to each score interval. This gives the second distribution information.

Based on the above, the obtained second distribution information can reflect the score distribution condition of the second sample set after the target model processing, and further, the difference between the score distribution condition and the first distribution information can be determined in the subsequent processing, so that the distribution of the samples can be adjusted in a targeted manner, and the purpose of finally generating an unbiased data set is achieved.
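Because the second samples are later drawn per score interval, it can be convenient to keep the samples themselves grouped by interval rather than only their counts. The sketch below does that under the same illustrative assumptions as before (equal-width, right-open intervals; `score_fn` stands in for the target model); `bucket_by_interval` is a hypothetical name.

```python
def bucket_by_interval(samples, score_fn, num_intervals=2):
    """Group samples by the score interval their model score falls in.

    Returns a list of lists; the i-th list holds the samples of interval i.
    """
    buckets = [[] for _ in range(num_intervals)]
    for sample in samples:
        # Intervals are [lo, hi); a score of exactly 1.0 goes in the last one.
        idx = min(int(score_fn(sample) * num_intervals), num_intervals - 1)
        buckets[idx].append(sample)
    return buckets


# Toy usage: the "samples" are already scores, so score_fn is the identity.
second_buckets = bucket_by_interval([0.1, 0.7, 0.5, 1.0, 0.49], lambda s: s)
# The second distribution information is the per-interval count.
second_counts = [len(bucket) for bucket in second_buckets]
```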

It should be noted that, the steps 13 and 14 are not strictly executed in sequence, and may be executed sequentially or simultaneously, which is not limited in this disclosure.

In step 15, a sampling proportion is determined based on the first distribution information and the second distribution information.

As described above, the first distribution information may include a first number of first samples corresponding to each score interval, and the second distribution information may include a second number of second samples corresponding to each score interval.

The above-mentioned process of determining the sampling proportion is to determine how to sample according to the difference condition of the second distribution information relative to the first distribution information, so that the sampled data set can conform to the first distribution information to form an unbiased data set.

In one possible embodiment, step 15 may comprise the steps of:

for each score interval, determining the ratio of the second number corresponding to the score interval to the first number corresponding to the score interval;

and determining the sampling proportion according to the minimum of the ratios corresponding to the score intervals.

In one possible embodiment, the minimum of the ratios corresponding to the score intervals may be used directly as the sampling proportion.

For example, suppose the score intervals are [0, 0.5) and [0.5, 1], i.e., the score intervals corresponding to negative samples and positive samples of a binary classification model. If the first number for [0, 0.5) is 90 and the second number is 5, and the first number for [0.5, 1] is 10 and the second number is 5, then the ratio for [0, 0.5) is 5/90 and the ratio for [0.5, 1] is 5/10. The minimum is 5/90, so the sampling proportion can be determined to be 5/90, i.e., 1/18.

Taking the minimum of the ratios corresponding to the score intervals as the sampling proportion ensures that the required number of samples can be drawn in every score interval.
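The minimum-ratio rule can be checked with a few lines of exact arithmetic. This sketch uses `fractions.Fraction` to avoid floating-point rounding; skipping intervals whose first number is zero is an assumption of the example (the disclosure does not discuss that case), and `sampling_proportion` is a hypothetical name.

```python
from fractions import Fraction


def sampling_proportion(first_counts, second_counts):
    """Minimum over score intervals of (second number / first number)."""
    if len(first_counts) != len(second_counts):
        raise ValueError("both distributions must use the same score intervals")
    # Assumption: intervals with no first samples impose no constraint.
    ratios = [Fraction(s, f) for f, s in zip(first_counts, second_counts) if f > 0]
    if not ratios:
        raise ValueError("the first distribution has no samples")
    return min(ratios)


# The worked example above: first numbers 90 and 10, second numbers 5 and 5.
proportion = sampling_proportion([90, 10], [5, 5])
```

For the numbers in the example above, `proportion` equals `Fraction(1, 18)`, matching the 1/18 derived in the text.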

In another possible embodiment, determining the sampling proportion according to the minimum of the ratios corresponding to the score intervals may include the following steps:

acquiring at least one reference proportion, wherein the reference proportion is determined according to third distribution information and the second distribution information, the third distribution information characterizes how the scores output by the target model for a third sample set are distributed across the score intervals, and the third sample set is obtained by randomly sampling a plurality of samples from the full sample set;

the sampling ratio is determined based on the minimum value and at least one reference ratio.

The third distribution information is determined in the same manner as the first distribution information. Each reference proportion is determined from one piece of third distribution information and the second distribution information, in the same manner as the sampling proportion is determined from the first distribution information and the second distribution information, which is not repeated here.

The third sample set is obtained by randomly sampling a plurality of samples from the full sample set, and can likewise serve as an unbiased sample set. This approach amounts to drawing different unbiased sample sets as unbiased references for the second distribution information, which helps make the final target data set less biased.

For example, based on the minimum value and the at least one reference ratio, a mean or median of the minimum value and the at least one reference ratio may be determined as the sampling ratio.

In this way, multiple unbiased sample sets are drawn and used jointly to determine the sampling proportion, which helps form a less biased target data set.
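Combining the minimum ratio with reference proportions by mean or median can be sketched with the standard `statistics` module. The reference values used below are made up purely for illustration, and `combine_proportions` is a hypothetical name.

```python
from fractions import Fraction
from statistics import mean, median


def combine_proportions(min_ratio, reference_proportions, how="mean"):
    """Combine the minimum ratio with one or more reference proportions."""
    if not reference_proportions:
        raise ValueError("at least one reference proportion is required")
    values = [min_ratio, *reference_proportions]
    if how == "mean":
        return mean(values)
    if how == "median":
        return median(values)
    raise ValueError(f"unknown combination rule: {how!r}")


# Hypothetical numbers: one minimum ratio plus two reference proportions, each
# computed from a different random draw (a "third sample set") of the full set.
refs = [Fraction(1, 12), Fraction(1, 9)]
by_mean = combine_proportions(Fraction(1, 18), refs, how="mean")
by_median = combine_proportions(Fraction(1, 18), refs, how="median")
```

One point to keep in mind with this variant: a combined proportion larger than the minimum ratio can ask for more samples in some interval than the second sample set holds there, so the per-interval draw may need to be capped at the number available.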

In step 16, a target data set is obtained by sampling from the second sample set according to the sampling proportion and the first distribution information, so that the distribution across the score intervals of the scores output by the target model for the target data set matches the first distribution information.

In one possible embodiment, step 16 may comprise the steps of:

for each score interval, determining the product of the first number corresponding to the score interval and the sampling proportion as a target number;

for each score interval, sampling the target number of second samples from the second number of second samples corresponding to the score interval as the target samples corresponding to the score interval;

and generating the target data set according to the target samples corresponding to each score interval.

Multiplying the first number of each score interval by the sampling proportion yields target numbers that preserve the proportions of the first sample set across the score intervals.

For example, suppose the score intervals are [0, 0.5) and [0.5, 1], the first number for [0, 0.5) is 60 and the second number is 5, and the first number for [0.5, 1] is 40 and the second number is 5. The sampling proportion is then 5/60, i.e., 1/12. The target number for [0, 0.5) is (1/12) × 60 = 5, and the target number for [0.5, 1] is (1/12) × 40 = 10/3. Sampling is therefore performed in [0, 0.5) with a target number of 5 and in [0.5, 1] with a target number of 10/3 (rounded to 3 or 4 as needed).

Then, for each score interval, the target number of samples is drawn from the second number of second samples in that interval to serve as the target samples of the interval. The target samples of all score intervals are then combined by taking their union, which produces the target data set.
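The reasoning behind the target number can be written out explicitly. Writing $a_i$ for the first number and $b_i$ for the second number of score interval $i$ (symbols introduced here for illustration only), and assuming every $a_i > 0$ and that the minimum ratio is used directly as the sampling proportion $r$:

```latex
r = \min_{j} \frac{b_j}{a_j},
\qquad
t_i = r \, a_i \;\le\; \frac{b_i}{a_i} \, a_i = b_i,
\qquad
\frac{t_i}{\sum_j t_j} = \frac{a_i}{\sum_j a_j}
```

So each target number $t_i$ never exceeds the number of second samples available in its interval, and, before any rounding, the target numbers keep the same proportions across intervals as the first sample set.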
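Step 16 can be sketched as a per-interval draw followed by concatenation. The example rounds fractional target numbers down and caps each draw at the number of samples available; both are choices made for this sketch (the text only says a fractional target may be rounded as needed), and the function names and toy sample labels are hypothetical.

```python
import math
import random
from fractions import Fraction


def target_counts(first_counts, proportion):
    """Target number per score interval: first number times sampling proportion."""
    # Assumption: fractional targets are rounded down.
    return [math.floor(f * proportion) for f in first_counts]


def sample_target_data_set(second_buckets, first_counts, proportion, seed=None):
    """Draw the target number of second samples from each interval and merge them."""
    rng = random.Random(seed)
    target = []
    for bucket, n in zip(second_buckets, target_counts(first_counts, proportion)):
        # Cap at the bucket size so the draw cannot fail.
        target.extend(rng.sample(bucket, min(n, len(bucket))))
    return target


# The worked example above: first numbers 60 and 40, five second samples per interval.
buckets = [["n1", "n2", "n3", "n4", "n5"], ["p1", "p2", "p3", "p4", "p5"]]
data_set = sample_target_data_set(buckets, [60, 40], Fraction(1, 12), seed=0)
```

With these inputs the target numbers are 5 and 3, so the resulting data set holds eight samples in a 5:3 split, close to the 60:40 split of the first sample set.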

Optionally, after generating the target data set, the method provided by the present disclosure may further include the steps of:

the target data set is used as one of test sets of the target model, and the test sets are used for model testing aiming at the target model.

That is, the target data set generated by the scheme of the present disclosure is used as one of the test sets of the target model for testing and evaluating the target model. Alternatively, other test sets may be generated in the manner provided by the present disclosure.

Fig. 2 is a block diagram of a data set generating apparatus provided according to one embodiment of the present disclosure. As shown in fig. 2, the apparatus 20 includes:

a first sampling module 21, configured to randomly sample a plurality of first samples from a full sample set, to obtain a first sample set;

an obtaining module 22, configured to obtain a plurality of second samples with labeling information, to obtain a second sample set;

the first determining module 23 is configured to determine first distribution information, where the first distribution information is used to characterize a distribution condition of output scores of the target model after processing the first sample set in each score interval;

a second determining module 24, configured to determine second distribution information, where the second distribution information is used to characterize a distribution situation of output scores after the target model processes the second sample set in each score interval;

a third determining module 25, configured to determine a sampling proportion according to the first distribution information and the second distribution information;

and the second sampling module 26 is configured to sample from the second sample set according to the sampling proportion and the first distribution information to obtain a target data set, so that the distribution across the score intervals of the scores output by the target model for the target data set matches the first distribution information.

Optionally, the first determining module 23 includes:

a first processing sub-module, configured to input, for each of the first samples in the first sample set, the first sample to the target model, to obtain a first score output by the target model;

the first determining submodule is used for determining the first number of the first samples corresponding to each score interval according to the score interval in which each first score falls, so as to obtain the first distribution information.

Optionally, the second determining module 24 includes:

a second processing sub-module, configured to input, for each of the second samples in the second sample set, the second sample to the target model, to obtain a second score output by the target model;

and the second determining submodule is used for determining the second number of the second samples corresponding to each score interval according to the score interval in which each second score is located so as to obtain the second distribution information.
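The counting performed by the first and second determining modules can be sketched as follows. This is a minimal Python illustration, not the disclosed implementation: the name `interval_counts`, the callable `score_fn` standing in for the target model, and the half-open interval convention defined by `edges` are all assumptions.

```python
from bisect import bisect_right
from typing import Callable, Iterable, List, Sequence, TypeVar

T = TypeVar("T")


def interval_counts(samples: Iterable[T],
                    score_fn: Callable[[T], float],
                    edges: Sequence[float]) -> List[int]:
    """Count samples per score interval [edges[i], edges[i + 1]).

    `score_fn` is a hypothetical stand-in for the target model. `edges`
    must be sorted in ascending order. Scores outside the range are
    clamped to the nearest interval, so a score equal to the last edge
    is counted in the last interval (an assumed convention).
    """
    if len(edges) < 2:
        raise ValueError("at least two edges are needed to form an interval")
    counts = [0] * (len(edges) - 1)
    for sample in samples:
        score = score_fn(sample)
        # Index of the interval whose left edge is the last one <= score.
        idx = bisect_right(edges, score) - 1
        idx = min(max(idx, 0), len(counts) - 1)
        counts[idx] += 1
    return counts
```

Calling the same function once on the first sample set and once on the second sample set, with identical `edges`, yields the first numbers and the second numbers for each score interval.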

Optionally, the first distribution information includes a first number of first samples corresponding to each of the score intervals, and the second distribution information includes a second number of second samples corresponding to each of the score intervals;

The third determining module 25 includes:

a third determining sub-module, configured to determine, for each score interval, a ratio between the second number corresponding to the score interval and the first number corresponding to the score interval;

and a fourth determining sub-module, configured to determine the sampling proportion according to the minimum value among the ratios corresponding to the score intervals.
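A minimal Python sketch of the ratio and minimum computation described for the third and fourth determining sub-modules is given below. The function name `sampling_proportion` is hypothetical, and skipping score intervals whose first number is zero is an assumption, since the text does not say how such intervals are handled.

```python
from typing import Sequence


def sampling_proportion(first_counts: Sequence[int],
                        second_counts: Sequence[int]) -> float:
    """Return the minimum, over score intervals, of second / first.

    Intervals with a first number of zero are skipped (assumption): they
    require no target samples and so place no limit on the proportion.
    """
    if len(first_counts) != len(second_counts):
        raise ValueError("both distributions must use the same score intervals")
    ratios = [s / f for f, s in zip(first_counts, second_counts) if f > 0]
    if not ratios:
        raise ValueError("the first distribution contains no samples")
    return min(ratios)
```

Taking the minimum ratio means that, for every score interval, the first number multiplied by the proportion does not exceed the second number, so each interval holds enough labeled second samples to draw from.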

Optionally, the fourth determining sub-module includes:

the acquisition sub-module is used for acquiring at least one reference proportion, the reference proportion is determined according to third distribution information and the second distribution information, the third distribution information is used for representing the distribution condition of output scores of the target model after processing a third sample set in each score interval, and the third sample set is obtained by randomly sampling a plurality of samples from the full sample set;

and a fifth determining submodule, configured to determine the sampling proportion according to the minimum value and the at least one reference proportion.

Optionally, the fifth determining submodule is configured to determine a mean or median of the minimum value and the at least one reference proportion as the sampling proportion.
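The combination performed by the fifth determining sub-module can be illustrated with the Python standard library. In this sketch the function name `combine_proportions` and the `how` switch are assumptions; the reference proportions are taken as already computed from other randomly sampled sets.

```python
from statistics import mean, median
from typing import Sequence


def combine_proportions(minimum: float,
                        reference_proportions: Sequence[float],
                        how: str = "mean") -> float:
    """Combine the minimum ratio with at least one reference proportion."""
    if not reference_proportions:
        raise ValueError("at least one reference proportion is required")
    values = [minimum, *reference_proportions]
    if how == "mean":
        return mean(values)
    if how == "median":
        return median(values)
    raise ValueError("how must be 'mean' or 'median'")
```

Note that the combined value may be larger than the minimum ratio of the current first sample set, in which case some score interval may hold fewer second samples than the resulting target number; how that case is handled is not stated here.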

The second sampling module 26 includes:

a sixth determining sub-module, configured to determine, for each score interval, the product of the first number corresponding to the score interval and the sampling proportion as a target number;

a sampling sub-module, configured to sample, for each score interval, the target number of second samples from the second number of second samples corresponding to the score interval, to serve as the target samples corresponding to the score interval;

and the generation sub-module is used for generating the target data set according to the target samples corresponding to each score interval.
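The three sub-modules of the second sampling module can be sketched together as per-interval sampling without replacement. This Python example is illustrative only: the name `sample_target_set`, the `seed` parameter, rounding the target number down to an integer, and capping it at the number of available second samples are all assumptions not stated in the text.

```python
import random
from typing import List, Optional, Sequence, TypeVar

T = TypeVar("T")


def sample_target_set(second_by_interval: Sequence[Sequence[T]],
                      first_counts: Sequence[int],
                      proportion: float,
                      seed: Optional[int] = None) -> List[T]:
    """Draw first_count * proportion second samples from each interval.

    `second_by_interval[i]` holds the second samples whose scores fall in
    score interval i.
    """
    if len(second_by_interval) != len(first_counts):
        raise ValueError("one bucket of second samples is needed per interval")
    rng = random.Random(seed)
    target: List[T] = []
    for bucket, first_count in zip(second_by_interval, first_counts):
        # Target number for this interval, rounded down (assumption).
        target_number = int(first_count * proportion)
        # Never request more samples than the interval holds (assumption).
        target_number = min(target_number, len(bucket))
        target.extend(rng.sample(list(bucket), target_number))
    return target
```

When the sampling proportion equals the minimum ratio, the cap never takes effect; it matters only if a larger proportion, such as a mean over several reference proportions, is used.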

Optionally, the apparatus 20 further comprises:

a fourth determining module, configured to take the target data set as one of the test sets of the target model, where the test sets are used to perform model testing on the target model.

The specific manner in which each module of the apparatus in the above embodiments performs its operations has been described in detail in connection with the method embodiments, and will not be repeated here.

Referring now to fig. 3, a schematic diagram of an electronic device 600 suitable for use in implementing embodiments of the present disclosure is shown. The terminal devices in the embodiments of the present disclosure may include, but are not limited to, mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players), in-vehicle terminals (e.g., in-vehicle navigation terminals), and the like, and stationary terminals such as digital TVs, desktop computers, and the like. The electronic device shown in fig. 3 is merely an example and should not be construed to limit the functionality and scope of use of the disclosed embodiments.

As shown in fig. 3, the electronic device 600 may include a processing means (e.g., a central processing unit, a graphics processor, etc.) 601, which may perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 602 or a program loaded from a storage means 608 into a Random Access Memory (RAM) 603. In the RAM 603, various programs and data required for the operation of the electronic apparatus 600 are also stored. The processing device 601, the ROM 602, and the RAM 603 are connected to each other through a bus 604. An input/output (I/O) interface 605 is also connected to bus 604.

In general, the following devices may be connected to the I/O interface 605: an input device 606 including, for example, a touch screen, touchpad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, and the like; an output device 607 including, for example, a Liquid Crystal Display (LCD), a speaker, a vibrator, and the like; a storage device 608 including, for example, magnetic tape, hard disk, and the like; and a communication device 609. The communication device 609 may allow the electronic device 600 to communicate with other devices wirelessly or by wire to exchange data. While fig. 3 shows an electronic device 600 having various devices, it is to be understood that not all of the illustrated devices are required to be implemented or provided. More or fewer devices may be implemented or provided instead.

In particular, according to embodiments of the present disclosure, the processes described above with reference to flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a non-transitory computer readable medium, the computer program comprising program code for performing the method shown in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network via communication means 609, or from storage means 608, or from ROM 602. The above-described functions defined in the methods of the embodiments of the present disclosure are performed when the computer program is executed by the processing device 601.

It should be noted that the computer readable medium described in the present disclosure may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this disclosure, a computer-readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present disclosure, however, the computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with the computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. 
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, fiber optic cables, RF (radio frequency), and the like, or any suitable combination of the foregoing.

In some implementations, clients and servers may communicate using any currently known or future-developed network protocol, such as HTTP (HyperText Transfer Protocol), and may be interconnected with digital data communication of any form or medium (e.g., a communication network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), an internetwork (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future-developed network.

The computer readable medium may be contained in the electronic device; or may exist alone without being incorporated into the electronic device.

The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: randomly sample a plurality of first samples from a full sample set to obtain a first sample set; obtain a plurality of second samples with labeling information to obtain a second sample set; determine first distribution information, where the first distribution information characterizes the distribution, across the score intervals, of the scores output by the target model after processing the first sample set; determine second distribution information, where the second distribution information characterizes the distribution, across the score intervals, of the scores output by the target model after processing the second sample set; determine a sampling proportion according to the first distribution information and the second distribution information; and sample from the second sample set according to the sampling proportion and the first distribution information to obtain a target data set, so that the distribution, across the score intervals, of the scores output by the target model after processing the target data set matches the first distribution information.

Computer program code for carrying out operations of the present disclosure may be written in one or more programming languages, including, but not limited to, object oriented programming languages such as Java, Smalltalk, and C++, and conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).

The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

The modules described in the embodiments of the present disclosure may be implemented in software or in hardware. In some cases, the name of a module does not limit the module itself; for example, the first sampling module may also be described as "a module that randomly samples a plurality of first samples from a full sample set to obtain a first sample set".

The functions described above herein may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: a Field Programmable Gate Array (FPGA), an Application Specific Integrated Circuit (ASIC), an Application Specific Standard Product (ASSP), a system on a chip (SOC), a Complex Programmable Logic Device (CPLD), and the like.

In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

According to one or more embodiments of the present disclosure, there is provided a data set generation method including:

According to one or more embodiments of the present disclosure, there is provided a data set generating method, the determining first distribution information including:

for each first sample in the first sample set, inputting the first sample into the target model to obtain a first score output by the target model;

and determining the first number of the first samples corresponding to each score interval according to the score interval in which each first score is located, so as to obtain the first distribution information.

According to one or more embodiments of the present disclosure, there is provided a data set generating method, the determining second distribution information including:

for each second sample in the second sample set, inputting the second sample into the target model to obtain a second score output by the target model;

and determining the second number of the second samples corresponding to each score interval according to the score interval in which each second score is located, so as to obtain the second distribution information.

According to one or more embodiments of the present disclosure, there is provided a data set generating method, wherein the first distribution information includes a first number of first samples corresponding to each of the score intervals, and the second distribution information includes a second number of second samples corresponding to each of the score intervals;

The determining the sampling proportion according to the first distribution information and the second distribution information includes:

According to one or more embodiments of the present disclosure, there is provided a data set generating method, wherein the determining the sampling proportion according to the minimum value among the ratios corresponding to the score intervals includes:

acquiring at least one reference proportion, wherein the reference proportion is determined according to third distribution information and the second distribution information, the third distribution information characterizes the distribution, across the score intervals, of the scores output by the target model after processing a third sample set, and the third sample set is obtained by randomly sampling a plurality of samples from the full sample set;

and determining the sampling proportion according to the minimum value and the at least one reference proportion.

According to one or more embodiments of the present disclosure, there is provided a data set generating method, the determining the sampling proportion according to the minimum value and the at least one reference proportion, including:

determining the mean or median of the minimum value and the at least one reference proportion as the sampling proportion.

the sampling from the second sample set according to the sampling proportion and the first distribution information to obtain a target data set includes:

and generating the target data set according to the target samples corresponding to each score interval.

According to one or more embodiments of the present disclosure, there is provided a data set generating method, the method further comprising:

According to one or more embodiments of the present disclosure, there is provided a data set generating apparatus, the apparatus comprising:

According to one or more embodiments of the present disclosure, there is provided a data set generating apparatus, the first determining module including:

According to one or more embodiments of the present disclosure, there is provided a data set generating apparatus, the second determining module including:

According to one or more embodiments of the present disclosure, there is provided a data set generating apparatus, wherein the first distribution information includes a first number of first samples corresponding to each of the score intervals, and the second distribution information includes a second number of second samples corresponding to each of the score intervals;

the third determining module includes:

According to one or more embodiments of the present disclosure, there is provided a data set generating apparatus, the fourth determination submodule including:

According to one or more embodiments of the present disclosure, there is provided a data set generating apparatus, the fifth determining submodule is configured to determine a mean or median of the minimum value and the at least one reference proportion as the sampling proportion.

the second sampling module includes:

According to one or more embodiments of the present disclosure, there is provided a data set generating apparatus, the apparatus further comprising:

According to one or more embodiments of the present disclosure, there is provided a computer readable medium having stored thereon a computer program which, when executed by a processing device, implements the steps of the data set generation method provided by any embodiment of the present disclosure.

According to one or more embodiments of the present disclosure, there is provided an electronic device including:

a storage device having at least one computer program stored thereon;

at least one processing means for executing the at least one computer program in the storage means to implement the steps of the data set generation method provided by any embodiment of the present disclosure.

The foregoing description is only of the preferred embodiments of the present disclosure and an explanation of the technical principles employed. It will be appreciated by persons skilled in the art that the scope of the disclosure is not limited to the specific combinations of features described above, but also covers other embodiments formed by any combination of the features described above or their equivalents without departing from the spirit of the disclosure. For example, it covers embodiments formed by substituting the above features with technical features having similar functions disclosed in (but not limited to) the present disclosure.

Moreover, although operations are depicted in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order. In certain circumstances, multitasking and parallel processing may be advantageous. Likewise, while several specific implementation details are included in the above discussion, these should not be construed as limiting the scope of the present disclosure. Certain features that are described in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination.

Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are example forms of implementing the claims.

Claims

1. A method of generating a data set, the method comprising:

2. The method of claim 1, wherein the determining the first distribution information comprises:

3. The method of claim 1, wherein the determining the second distribution information comprises:

4. The method of claim 1, wherein the first distribution information comprises a first number of first samples corresponding to each of the score intervals, and the second distribution information comprises a second number of second samples corresponding to each of the score intervals;

5. The method of claim 4, wherein the determining the sampling proportion according to the minimum value among the ratios corresponding to the score intervals comprises:

6. The method of claim 5, wherein the determining the sampling proportion according to the minimum value and the at least one reference proportion comprises:

7. The method of claim 1, wherein the first distribution information comprises a first number of first samples corresponding to each of the score intervals, and the second distribution information comprises a second number of second samples corresponding to each of the score intervals;

8. The method according to any one of claims 1-7, further comprising:

9. A data set generating apparatus, the apparatus comprising:

10. A computer readable medium on which a computer program is stored, characterized in that the program, when being executed by a processing device, carries out the steps of the method according to any one of claims 1-8.

11. An electronic device, comprising:

a storage device having at least one computer program stored thereon;

at least one processing means for executing said at least one computer program in said storage means to carry out the steps of the method according to any one of claims 1-8.