CN112085080A - Sample equalization method, device, equipment and storage medium - Google Patents

Publication number
CN112085080A
Authority
CN
China
Prior art keywords
target
samples
sample set
labels
label
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010899784.1A
Other languages
Chinese (zh)
Other versions
CN112085080B (en)
Inventor
杨晨
杨天行
彭彬
张一麟
宋勋超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202010899784.1A priority Critical patent/CN112085080B/en
Publication of CN112085080A publication Critical patent/CN112085080A/en
Application granted granted Critical
Publication of CN112085080B publication Critical patent/CN112085080B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Signal Processing For Digital Recording And Reproducing (AREA)

Abstract

The application discloses a sample equalization method, a sample equalization device, sample equalization equipment and a storage medium, relates to the technical field of data processing, and particularly relates to deep learning, artificial intelligence and intelligent search technologies. The specific implementation scheme is as follows: determining a target label to be balanced from at least two labels associated with a sample set to be balanced according to the number of samples corresponding to the labels in the sample set to be balanced, and taking the sample corresponding to the target label in the sample set to be balanced as a target sample; increasing the target samples to enable the number of samples corresponding to the target labels in the sample set to be equalized to reach the number of target samples, and obtaining a new sample set; and if the number of samples corresponding to the other labels except the target label in the new sample set is smaller than the number of the target samples, adding the samples corresponding to the other labels except the target label in the new sample set. According to the technology of the application, the balance degree of the number of the samples corresponding to each label in the sample set is improved.

Description

Sample equalization method, device, equipment and storage medium
Technical Field
The application relates to the technical field of data processing, in particular to deep learning, artificial intelligence and intelligent search technologies. In particular, the application provides a sample equalization method, a sample equalization device, sample equalization equipment and a storage medium.
Background
When a multi-label classification model is trained, the data samples labeled by a customer generally cannot meet the sample balance requirement of model training, owing to the particular nature of the multi-label classification task. A model trained on unbalanced samples often misclassifies data into the label categories that account for a high proportion of the samples, which causes classification errors.
Disclosure of Invention
The disclosure provides a sample equalization method, apparatus, device and storage medium.
According to an aspect of the present disclosure, there is provided a sample equalization method, including:
determining a target label to be equalized from at least two labels associated with a sample set to be equalized according to the number of samples corresponding to the labels in the sample set to be equalized, and taking the sample corresponding to the target label in the sample set to be equalized as a target sample, wherein the sample set to be equalized comprises at least two samples, and each sample has at least one label;
increasing the target samples to enable the number of samples corresponding to the target labels in the sample set to be equalized to reach the number of target samples, and obtaining a new sample set;
if the number of samples corresponding to the other labels in the new sample set except the target label is smaller than the number of the target samples, adding the samples corresponding to the other labels in the new sample set except the target label, so that the number of samples corresponding to the other labels in the sample set to be equalized reaches the number of the target samples.
According to another aspect of the present disclosure, there is provided a sample equalizing apparatus including:
the label determining module is used for determining a target label to be balanced from at least two labels associated with a sample set to be balanced according to the number of samples corresponding to the labels in the sample set to be balanced, and taking the sample corresponding to the target label in the sample set to be balanced as a target sample, wherein the sample set to be balanced comprises at least two samples, and each sample has at least one label;
a first adding module, configured to add the target samples, so that the number of samples corresponding to the target tag in the sample set to be equalized reaches a target sample number, and a new sample set is obtained;
a second adding module, configured to add, if the number of samples corresponding to the other tags in the new sample set except the target tag is smaller than the number of target samples, samples corresponding to the other tags in the new sample set except the target tag, so that the number of samples corresponding to the other tags in the sample set to be equalized reaches the number of target samples.
According to still another aspect of the present disclosure, there is provided an electronic apparatus, including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of the embodiments of the present application.
According to yet another aspect of the present disclosure, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any one of the embodiments of the present application.
According to the technology of the application, the balance degree of the number of the samples corresponding to each label in the sample set is improved.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not intended to limit the present application. Wherein:
fig. 1 is a flowchart of a sample equalization method provided in an embodiment of the present application;
fig. 2 is a flowchart of another sample equalization method provided in the embodiments of the present application;
fig. 3 is a flowchart of another sample equalization method provided in the embodiment of the present application;
fig. 4 is a flowchart of another sample equalization method provided in the embodiment of the present application;
fig. 5 is a flowchart of another sample equalization method provided in the embodiment of the present application;
fig. 6 is a schematic diagram of a sample equalization apparatus according to an embodiment of the present application;
fig. 7 is a block diagram of an electronic device of a sample equalization method according to an embodiment of the application.
Detailed Description
The following description of the exemplary embodiments of the present application, taken in conjunction with the accompanying drawings, includes various details of the embodiments of the application for the understanding of the same, which are to be considered exemplary only. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present application. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
Fig. 1 is a flowchart of a sample equalization method according to an embodiment of the present disclosure. The embodiment is applicable to balancing the training samples of a multi-label classification model. The method may be performed by a sample equalization apparatus, which may be implemented in software and/or hardware. Referring to fig. 1, a sample equalization method provided in an embodiment of the present application includes:
s110, according to the number of samples corresponding to the labels in a sample set to be equalized, determining a target label to be equalized from at least two labels associated with the sample set to be equalized, and taking the sample corresponding to the target label in the sample set to be equalized as a target sample, wherein the sample set to be equalized comprises at least two samples, and each sample has at least one label.
The sample set to be equalized is a sample set that requires equalization, that is, a sample set in which the distribution of samples across the different labels is unbalanced.
The at least two labels associated with the set of samples to be equalized refer to the labels that the samples in the set of samples to be equalized have.
Illustratively, taking the text to be classified as a legal document as an example, the labels may be a traffic incident and a non-traffic incident, or may be a traffic incident, a property dispute, an illegal financing, and the like.
The target label refers to the label whose samples are to be equalized.
In one embodiment, the determination of the target label may include:
taking, according to the number of samples corresponding to each label in the sample set to be equalized, the label with the smallest number of samples as the target label.
Optionally, in another embodiment, the determination of the target label may also include:
taking, according to the number of samples corresponding to each label in the sample set to be equalized, each label whose number of samples is less than a set number threshold as a target label.
The target sample is a sample corresponding to the target label in the sample set to be equalized, that is, a sample of the target label marked in the sample set to be equalized.
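As an illustration of S110, the following is a minimal Python sketch of the two ways of choosing a target label described above. The function names (`count_labels`, `pick_target_labels`, `target_samples`), the `(text, labels)` tuple representation, and the choice to return every tied label are assumptions made for this sketch and are not taken from the application.

```python
from collections import Counter


def count_labels(samples):
    """Count samples per label; a multi-label sample counts once for each of its labels."""
    counts = Counter()
    for _text, labels in samples:
        counts.update(set(labels))
    return counts


def pick_target_labels(samples, threshold=None):
    """Return the rarest label(s), or every label below `threshold` when one is given."""
    counts = count_labels(samples)
    if threshold is None:
        fewest = min(counts.values())
        # Assumption: ties are all returned; the caller decides which to process first.
        return sorted(label for label, n in counts.items() if n == fewest)
    return sorted(label for label, n in counts.items() if n < threshold)


def target_samples(samples, label):
    """Samples in the set that are marked with the given target label."""
    return [s for s in samples if label in s[1]]


# Illustrative data: four multi-label samples.
samples = [
    ("sentence1", ("a", "b", "c")),
    ("sentence2", ("a", "c")),
    ("sentence3", ("a", "d")),
    ("sentence4", ("a", "e")),
]

print(pick_target_labels(samples))               # labels with the fewest samples
print(pick_target_labels(samples, threshold=3))  # labels with fewer than 3 samples
print(target_samples(samples, "e"))              # target samples for label "e"
```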
S120, adding the target samples so that the number of samples corresponding to the target label in the sample set to be equalized reaches the target sample number, and obtaining a new sample set.
The target sample number refers to the number of samples that the target label should have after equalization.
Optionally, determining the target sample number comprises:
determining the target sample number according to the maximum value or the average value of the numbers of samples corresponding to the labels in the sample set to be equalized.
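A short sketch of this choice follows. The function name `target_sample_number` and the `mode` parameter are hypothetical, and rounding the average up to a whole number is an assumption of the sketch; the application does not say how a fractional average is handled.

```python
import math


def target_sample_number(label_counts, mode="max"):
    """Derive the target sample number from the per-label sample counts."""
    values = list(label_counts.values())
    if not values:
        raise ValueError("label_counts must not be empty")
    if mode == "max":
        return max(values)
    if mode == "mean":
        # Assumption: round the average up so the result is a whole number of samples.
        return math.ceil(sum(values) / len(values))
    raise ValueError(f"unknown mode: {mode!r}")


label_counts = {"a": 4, "b": 1, "c": 2, "d": 1, "e": 1}
print(target_sample_number(label_counts, mode="max"))
print(target_sample_number(label_counts, mode="mean"))
```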
In one embodiment, adding the target samples may include:
inputting a target sample into a pre-trained model, and adding the similar samples that the model outputs for the target sample to the sample set to be equalized; or
copying a target sample, and adding the copied sample to the sample set to be equalized.
The new sample set is the sample set to be equalized after the target samples have been added so that the number of samples of the target label reaches the target sample number.
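The copying variant of S120 might be sketched as below. The name `add_by_copying` and the round-robin order of copying are assumptions of this sketch. The model-based variant is not shown, because the application does not specify the pre-trained model.

```python
from itertools import cycle


def add_by_copying(samples, label, target_number):
    """Copy samples carrying `label` until that label has `target_number` samples."""
    pool = [s for s in samples if label in s[1]]
    if not pool:
        raise ValueError(f"no sample carries label {label!r}")
    missing = max(0, target_number - len(pool))
    new_set = list(samples)          # leave the input list untouched
    source = cycle(pool)             # assumption: copy the target samples in turn
    for _ in range(missing):
        new_set.append(next(source))
    return new_set


samples = [
    ("sentence1", ("a", "b", "c")),
    ("sentence2", ("a", "c")),
    ("sentence3", ("a", "d")),
    ("sentence4", ("a", "e")),
]

new_set = add_by_copying(samples, "e", target_number=4)
print(len(new_set), sum(1 for s in new_set if "e" in s[1]))
```

Because each copied sample keeps all of its labels, copying a multi-label target sample also raises the counts of its other labels.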
S130, if the number of the samples corresponding to the other labels in the new sample set except the target label is smaller than the number of the target samples, adding the samples corresponding to the other labels in the new sample set except the target label, and enabling the number of the samples corresponding to the other labels in the sample set to be equalized to reach the number of the target samples.
Wherein, the other labels refer to the labels in the new sample set except the target label.
The samples corresponding to the other labels except the target label in the new sample set refer to samples marked with other labels in the new sample set.
According to the technical scheme, if the number of the samples corresponding to the other labels except the target label in the new sample set is smaller than the number of the target samples, the samples corresponding to the other labels except the target label in the new sample set are added, so that the balance of the other labels except the target label is realized, the balance of the samples of the multi-label sample set is realized, and the balance degree of the multi-label sample set is improved.
Fig. 2 is a flowchart of another sample equalization method provided in the embodiment of the present application. This scheme extends the scheme described above. Referring to fig. 2, a sample equalization method provided in the embodiment of the present application includes:
s210, according to the number of samples corresponding to the labels in a sample set to be equalized, determining a target label to be equalized from at least two labels associated with the sample set to be equalized, and taking the sample corresponding to the target label in the sample set to be equalized as a target sample, wherein the sample set to be equalized comprises at least two samples, and each sample has at least one label.
S220, adding the target samples, and enabling the number of the samples corresponding to the target labels in the sample set to be balanced to reach the number of the target samples to obtain a new sample set.
S230, counting, based on the added target samples, the number of samples of the labels other than the target label in the new sample set.
When the target sample is a multi-label sample, the target sample has other labels in addition to the target label.
For example, the target sample is a first sentence and the target label is a property dispute. The other labels that the first sentence has may be illegal financing, etc.
S240, if the number of the samples corresponding to the other labels in the new sample set except the target label is smaller than the number of the target samples, adding the samples corresponding to the other labels in the new sample set except the target label, so that the number of the samples corresponding to the other labels in the sample set to be equalized reaches the number of the target samples.
In an embodiment, if the number of samples corresponding to the other labels in the new sample set except the target label is smaller than the target sample number, adding the samples corresponding to the other labels in the new sample set except the target label to make the number of samples corresponding to the other labels in the sample set to be equalized reach the target sample number includes:
if the number of samples corresponding to other labels in the new sample set except the target label is smaller than the number of the target samples, determining a label to be equalized from the other labels and a sample to be equalized corresponding to the label to be equalized in the new sample set according to the number of samples corresponding to the other labels and the number of the target samples;
and adding the samples to be equalized to enable the number of the samples of the labels to be equalized in the new sample set to reach the target number of the samples.
The label to be equalized refers to a label, among the other labels, whose number of samples is to be raised to the target sample number.
The sample to be equalized refers to a sample having a label to be equalized in the new sample set.
In one embodiment, determining a label to be equalized from the other labels according to the number of samples corresponding to the other labels and the number of target samples includes:
respectively comparing the number of samples of other labels except the target label in the new sample set with the number of target samples;
and if the number of the samples of the other labels is less than the target number of the samples, taking the other labels as the labels to be equalized.
According to the scheme, the number of samples corresponding to other labels except the target label in the new sample set is counted based on the added target samples; and triggering sample equalization processing on other labels according to the statistical result so as to realize sample equalization on other labels except the target label in the sample set.
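A minimal sketch of S230 and S240 follows, assuming samples are added by copying. The function names and the example sentences are hypothetical; only the label names are borrowed from the examples in the text. The sketch recounts before handling each label, since copies added earlier may already have raised a label's count.

```python
from collections import Counter
from itertools import cycle


def count_labels(samples):
    """Count samples per label; a multi-label sample counts once for each of its labels."""
    counts = Counter()
    for _text, labels in samples:
        counts.update(set(labels))
    return counts


def equalize_other_labels(new_set, target_label, target_number):
    """Raise every label other than `target_label` to `target_number` by copying samples."""
    result = list(new_set)
    other_labels = sorted(set(count_labels(result)) - {target_label})
    for label in other_labels:
        # Recount on every pass: copies added for an earlier label may
        # already have raised this label's count.
        current = count_labels(result)[label]
        if current >= target_number:
            continue
        # Distinct samples carrying this label (the samples to be equalized).
        pool = list(dict.fromkeys(s for s in result if label in s[1]))
        source = cycle(pool)
        result.extend(next(source) for _ in range(target_number - current))
    return result


# Hypothetical new sample set after the target label "property dispute"
# was raised to 3 by copying a two-label sample.
first = ("first sentence", ("property dispute", "illegal financing"))
second = ("second sentence", ("traffic incident",))
new_set = [first, first, first, second]

balanced = equalize_other_labels(new_set, "property dispute", target_number=3)
print(dict(count_labels(balanced)))
```

In this example "illegal financing" needs no extra samples, because copying the first sentence for the target label already raised it to the target sample number; only "traffic incident" is topped up.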
Fig. 3 is a flowchart of another sample equalization method provided in the embodiment of the present application. On the basis of the above scheme, this scheme specifically refines the step of "adding the target samples so that the number of samples corresponding to the target label in the sample set to be equalized reaches the target sample number, and obtaining a new sample set". Referring to fig. 3, a sample equalization method provided in the embodiment of the present application includes:
s310, according to the number of samples corresponding to the labels in a sample set to be equalized, determining a target label to be equalized from at least two labels associated with the sample set to be equalized, and taking the sample corresponding to the target label in the sample set to be equalized as a target sample, wherein the sample set to be equalized comprises at least two samples, and each sample has at least one label.
S320, determining the difference between the number of samples of the target label in the sample set to be equalized and the target sample number.
S330, adding the target samples according to the determined difference, so that the number of samples of the target label reaches the target sample number, and obtaining a new sample set.
S340, if the number of the samples corresponding to the other labels in the new sample set except the target label is smaller than the number of the target samples, adding the samples corresponding to the other labels in the new sample set except the target label, so that the number of the samples corresponding to the other labels in the sample set to be equalized reaches the number of the target samples.
According to the scheme, target samples are added according to the difference between the number of samples of the target label in the sample set to be equalized and the target sample number, so that the number of samples of the target label in the sample set to be equalized is raised to the target sample number.
Fig. 4 is a flowchart of another sample equalization method provided in the embodiment of the present application. On the basis of the above scheme, this scheme specifically refines the step of "adding the target samples so that the number of samples corresponding to the target label in the sample set to be equalized reaches the target sample number, and obtaining a new sample set". Referring to fig. 4, a sample equalization method provided in the embodiment of the present application includes:
s410, according to the number of samples corresponding to the labels in a sample set to be equalized, determining a target label to be equalized from at least two labels associated with the sample set to be equalized, and taking the sample corresponding to the target label in the sample set to be equalized as a target sample, wherein the sample set to be equalized comprises at least two samples, and each sample has at least one label.
S420, if there are at least two kinds of target samples and the target sample number is at least two, adding at least two kinds of target samples according to the target sample number, so that the number of samples of the target label in the sample set to be equalized reaches the target sample number, and obtaining a new sample set.
The at least two kinds of target samples may be two kinds, three kinds, or all kinds of the target samples.
Exemplarily, if the samples of the target tag in the sample set to be equalized are the first sample and the second sample, the first sample and the second sample are respectively copied, and the copied first sample and second sample are added to the sample set to be equalized, so that the number of the first sample and the number of the second sample in the sample set to be equalized are equal, and the richness of the sample set to be equalized is further improved.
In an embodiment, adding at least two kinds of target samples according to the target sample number, so that the number of samples of the target label in the sample set to be equalized reaches the target sample number and a new sample set is obtained, includes:
determining, according to the target sample number, the number to add for each kind of target sample among the at least two kinds, wherein the difference between the numbers to add for the different kinds is smaller than a set difference threshold;
and adding the at least two kinds of target samples according to the determined number to add for each kind, to obtain a new sample set.
The set difference threshold may be determined according to actual needs, which is not limited in this embodiment.
Illustratively, if the target sample number is 8, there are 2 kinds of target samples, and the sample set to be equalized contains 2 target samples before the equalization processing, then the number to add for each kind of target sample is determined to be 3, calculated as: (8 - 2) ÷ 2 = 3.
S430, if the number of the samples corresponding to the other labels in the new sample set except the target label is smaller than the number of the target samples, adding the samples corresponding to the other labels in the new sample set except the target label, so that the number of the samples corresponding to the other labels in the sample set to be equalized reaches the number of the target samples.
According to the scheme, if there are at least two kinds of target samples and the target sample number is at least two, at least two kinds of target samples are added according to the target sample number, which improves the richness of the samples in the new sample set.
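The per-kind calculation in the (8 - 2) ÷ 2 example can be sketched as follows. The function name `split_increase` is hypothetical, and spreading any remainder so that the per-kind numbers differ by at most one is an assumption of the sketch; the application only requires that the difference stay below a set difference threshold.

```python
def split_increase(target_number, current_number, kinds):
    """Split the samples still missing for the target label evenly across `kinds` kinds."""
    if kinds < 1:
        raise ValueError("kinds must be at least 1")
    missing = max(0, target_number - current_number)
    base, extra = divmod(missing, kinds)
    # Assumption: the first `extra` kinds receive one more copy, so the
    # per-kind numbers never differ by more than one.
    return [base + 1 if i < extra else base for i in range(kinds)]


print(split_increase(target_number=8, current_number=2, kinds=2))
print(split_increase(target_number=7, current_number=2, kinds=2))
```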
Fig. 5 is a flowchart of another sample equalization method provided in the embodiment of the present application. This embodiment extends the scheme of the above embodiments. Referring to fig. 5, a sample equalization method provided in the embodiment of the present application includes:
s510, according to the number of samples corresponding to the labels in a sample set to be equalized, determining a target label to be equalized from at least two labels associated with the sample set to be equalized, and taking the sample corresponding to the target label in the sample set to be equalized as a target sample, wherein the sample set to be equalized comprises at least two samples, and each sample has at least one label.
S520, adding the target samples, and enabling the number of the samples corresponding to the target labels in the sample set to be balanced to reach the number of the target samples to obtain a new sample set.
S530, if the number of the samples corresponding to the other labels in the new sample set except the target label is smaller than the number of the target samples, adding the samples corresponding to the other labels in the new sample set except the target label, so that the number of the samples corresponding to the other labels in the sample set to be equalized reaches the number of the target samples.
S540, determining the high-frequency label of the new sample set.
The high-frequency label is a label whose samples appear in the sample set more frequently than those of the other labels.
In one embodiment, the determining the high frequency label of the new sample set comprises:
determining the difference value between the number of samples of each label in the new sample set and the number of the target samples;
determining the high frequency tag from the at least two tags associated with the sample set according to the determined difference.
For example, the label corresponding to the sample number with the largest difference is taken as the high-frequency label.
Optionally, the determining the high-frequency label of the new sample set includes:
determining the number difference of samples among different labels in the new sample set;
determining the high frequency tag from the at least two tags associated with the sample set according to the determined difference.
S550, determining high-frequency samples from the new sample set, a high-frequency sample being a sample that has exactly one label, that label being the high-frequency label.
Wherein, the high-frequency sample is a sample marked only with the high-frequency label.
S560, reducing the number of the high-frequency samples, so as to reduce the difference between the number of samples of the high-frequency label and the number of samples of the labels other than the high-frequency label in the new sample set.
In one embodiment, a set number of the high-frequency samples may be deleted.
To improve the accuracy of deletion, the high-frequency samples may instead be deleted according to the difference between the number of samples of the high-frequency label and the target sample number.
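Steps S540 to S560 might be sketched as below. The function names are hypothetical. The sketch takes the label with the largest excess over the target sample number as the high-frequency label, and deletes at most that excess, touching only samples whose sole label is the high-frequency label so that no other label's count changes. The example data is illustrative.

```python
from collections import Counter


def count_labels(samples):
    """Count samples per label; a multi-label sample counts once for each of its labels."""
    counts = Counter()
    for _text, labels in samples:
        counts.update(set(labels))
    return counts


def reduce_high_frequency(samples, target_number):
    """Delete single-label samples of the most over-represented label."""
    counts = count_labels(samples)
    # High-frequency label: largest difference between its count and the target number.
    high_label = max(counts, key=lambda label: counts[label] - target_number)
    excess = counts[high_label] - target_number
    kept, removed = [], 0
    for sample in samples:
        only_high_label = tuple(sample[1]) == (high_label,)
        if only_high_label and removed < excess:
            removed += 1      # drop this high-frequency sample
            continue
        kept.append(sample)
    return high_label, kept


# Illustrative set: label "a" is heavily over-represented, and two samples carry only "a".
samples = (
    [("sentence1", ("a", "b", "c"))] * 4
    + [("sentence2", ("a", "c"))]
    + [("sentence3", ("a", "d"))] * 4
    + [("sentence4", ("a", "e"))] * 4
    + [("sentence5", ("a",))] * 2
)

high_label, reduced = reduce_high_frequency(samples, target_number=4)
print(high_label, dict(count_labels(reduced)))
```

Only the two single-label samples can be removed here, so the count of the high-frequency label drops without reaching the target sample number; removing multi-label samples would disturb the other labels.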
Taking the case in which the samples in the sample set to be equalized are multi-label samples as an example, the details are as follows:
sentence1 a,b,c
sentence2 a,c
sentence3 a,d
sentence4 a,e
wherein sentence1, sentence2, sentence3 and sentence4 are different samples, and a, b, c, d and e are different labels of the samples. Here, the target sample number is set to 4, and the number of samples corresponding to each label before the equalization processing is: a:4, b:1, c:2, d:1, e:1. Sample equalization is performed with the sample equalization method provided by the embodiment of the application, and the specific process is as follows:
taking a label with the minimum number of samples as the first label to be equalized, here e as an example, copying sentence4, and adding 3 copies of sentence4 to the sample set to obtain a first sample set;
counting the first sample set again to obtain the first statistical result: a:7, b:1, c:2, d:1, e:4;
determining a second label to be equalized according to the first statistical result, here b as an example, copying sentence1, and adding 3 copies of sentence1 to the first sample set to obtain a second sample set;
counting the second sample set again to obtain the second statistical result: a:10, b:4, c:5, d:1, e:4;
determining, according to the second statistical result, that the third label to be equalized is d, copying sentence3, and adding 3 copies of sentence3 to the second sample set to obtain a third sample set;
counting the third sample set again to obtain the third statistical result: a:13, b:4, c:5, d:4, e:4.
After equalization, the number of samples of the high-frequency label is too high. When a large number of samples are analyzed, if samples of the high-frequency label can be removed without changing the number of samples of the other labels, the number of samples corresponding to the high-frequency label may be reduced appropriately. For example, if there is a sample such as "sentence5 a", it may be deleted as appropriate to keep the samples corresponding to each label balanced.
To ensure the richness of the samples, note that the two samples corresponding to c are sentence1 and sentence2. When the number of samples of c is increased, both samples are copied in turn, rather than only one of them.
According to the scheme, after the proportion of the low-frequency samples is raised, the number of high-frequency samples, which grew as a side effect of adding low-frequency samples, is reduced, thereby further improving the balance of the samples of each label in the sample set.
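The statistics in the sentence1 to sentence4 walkthrough can be checked with a short script. This is a sketch under stated assumptions: samples are added by copying, the labels are processed in the order e, b, d used in the walkthrough, and the helper name `count_labels` is hypothetical.

```python
from collections import Counter


def count_labels(samples):
    """Per-label sample counts, sorted by label for stable printing."""
    counts = Counter()
    for _text, labels in samples:
        counts.update(set(labels))
    return dict(sorted(counts.items()))


samples = [
    ("sentence1", ("a", "b", "c")),
    ("sentence2", ("a", "c")),
    ("sentence3", ("a", "d")),
    ("sentence4", ("a", "e")),
]
TARGET_SAMPLE_NUMBER = 4

current = list(samples)
history = []
for label in ("e", "b", "d"):       # order taken from the walkthrough
    pool = [s for s in current if label in s[1]]
    missing = TARGET_SAMPLE_NUMBER - len(pool)
    # Copy the samples of this label in turn until the label reaches the target.
    current += [pool[i % len(pool)] for i in range(missing)]
    history.append(count_labels(current))

for round_number, stats in enumerate(history, start=1):
    print(round_number, stats)
```

Each round adds three copies of a single sentence, and the three recorded results correspond to the first, second and third statistical results; label c needs no round of its own because copying sentence1 for b already takes it past the target sample number.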
Fig. 6 is a schematic diagram of a sample equalization apparatus according to an embodiment of the present application. Referring to fig. 6, a sample equalization apparatus 600 provided in the embodiment of the present application includes: a tag determination module 601, a first addition module 602, and a second addition module 603.
The label determining module 601 is configured to determine a target label to be equalized from at least two labels associated with a sample set to be equalized according to a number of samples corresponding to labels in the sample set to be equalized, and use a sample corresponding to the target label in the sample set to be equalized as a target sample, where the sample set to be equalized includes at least two samples, and each sample has at least one label;
a first adding module 602, configured to add the target sample, so that the number of samples corresponding to the target tag in the sample set to be equalized reaches a target sample number, and a new sample set is obtained;
a second adding module 603, configured to add, if the number of samples corresponding to the other tags in the new sample set except the target tag is smaller than the number of target samples, samples corresponding to the other tags in the new sample set except the target tag, so that the number of samples corresponding to the other tags in the sample set to be equalized reaches the number of target samples.
According to the technical scheme, if the number of the samples corresponding to the other labels except the target label in the new sample set is smaller than the number of the target samples, the samples corresponding to the other labels except the target label in the new sample set are added, so that the balance of the other labels except the target label is realized, the balance of the samples of the multi-label sample set is realized, and the balance degree of the multi-label sample set is improved.
Further, the apparatus further comprises:
a quantity counting module, configured to count, based on the added target samples, the number of samples of the labels other than the target label in the new sample set, before the samples corresponding to those labels are added (that is, before the step in which, if the number of samples corresponding to the other labels in the new sample set is smaller than the target sample number, samples corresponding to the other labels are added so that their number reaches the target sample number).
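One possible reading of counting "based on the added target samples" is an incremental update: instead of rescanning the whole new sample set, the label counts of the original set are increased by the labels carried by the added samples. The following Python sketch illustrates that reading only; the function name update_counts and the (text, labels) sample layout are assumptions, not part of the application.

```python
from collections import Counter


def update_counts(counts, added_samples):
    """Return new per-label counts after `added_samples` have been appended.

    `counts` holds the per-label sample counts before the addition and is
    left unmodified. Each added sample is a (text, labels) pair.
    """
    updated = Counter(counts)
    for _, labels in added_samples:
        updated.update(labels)
    return updated


before = Counter({"x": 4, "y": 1, "z": 1})
# Three copies of a sample labelled both "x" and "y" were added for target label "y".
added = [("c", frozenset({"x", "y"}))] * 3
after = update_counts(before, added)
# after: x=7, y=4, z=1, so only label "z" still needs samples for a target of 4
```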
Further, the second adding module includes:
a label determining unit, configured to, if the number of samples corresponding to the labels other than the target label in the new sample set is smaller than the target sample number, determine, according to the number of samples corresponding to the other labels and the target sample number, a label to be equalized from the other labels and the samples to be equalized that correspond to the label to be equalized in the new sample set;
a sample adding unit, configured to add the samples to be equalized, so that the number of samples of the label to be equalized in the new sample set reaches the target sample number.
Further, the first adding module includes:
a sample difference unit, configured to determine the difference between the number of samples of the target label in the sample set to be equalized and the target sample number;
a sample increasing unit, configured to add target samples according to the determined difference, so that the number of samples of the target label reaches the target sample number, thereby obtaining a new sample set.
Further, the first adding module includes:
a sample adding unit, configured to, if there are at least two types of target samples and the target sample number is at least two, add the at least two types of target samples according to the target sample number, so that the number of samples of the target label in the sample set to be equalized reaches the target sample number, thereby obtaining a new sample set.
Further, the sample adding unit is specifically configured to:
determine, according to the target sample number, the number by which each type of target sample among the at least two types is to be increased, wherein the difference between the increase numbers of the different types is smaller than a set difference threshold;
and add the at least two types of target samples according to the determined increase numbers, to obtain a new sample set.
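A simple way to keep the increase numbers of the different sample types close to one another is an even split with the remainder spread over the first few types. The following Python sketch shows this under stated assumptions: the even split is only one allocation rule that satisfies the threshold condition, and the name split_increase and the default threshold of 2 are hypothetical.

```python
def split_increase(total_increase, num_types, diff_threshold=2):
    """Split `total_increase` added samples across `num_types` sample types.

    The split is as even as possible, so any two increase numbers differ by
    at most 1. A ValueError is raised if that spread is not smaller than
    `diff_threshold`, i.e. if the threshold condition cannot be met this way.
    """
    if num_types < 1 or total_increase < 0:
        raise ValueError("need at least one type and a non-negative increase")
    base, remainder = divmod(total_increase, num_types)
    # The first `remainder` types receive one extra sample each.
    allocation = [base + 1 if i < remainder else base for i in range(num_types)]
    spread = max(allocation) - min(allocation)
    if spread >= diff_threshold:
        raise ValueError("even split does not satisfy the difference threshold")
    return allocation


# The target label needs 10 more samples and three distinct samples carry it.
print(split_increase(10, 3))  # [4, 3, 3]
```

Spreading the increase over several distinct samples, rather than copying one sample many times, keeps the added data from being dominated by a single sample.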
Further, the apparatus further comprises:
a high-frequency label determining module, configured to determine a high-frequency label of the new sample set after the samples corresponding to the labels other than the target label have been added (that is, after the step in which, if the number of samples corresponding to the other labels in the new sample set is smaller than the target sample number, samples corresponding to the other labels are added so that their number reaches the target sample number);
a high-frequency sample determining module, configured to determine, from the new sample set, high-frequency samples that are marked with the high-frequency label and have exactly one label;
a sample number reducing module, configured to reduce the number of the high-frequency samples, so as to reduce the difference between the number of samples of the high-frequency label and the number of samples of the labels other than the high-frequency label in the new sample set.
Further, the high frequency tag determination module includes:
a difference determining unit, configured to determine a difference between the number of samples corresponding to each label in the new sample set and the target sample number;
a high-frequency label determining unit, configured to determine the high-frequency label from the at least two labels corresponding to the new sample set according to the determined differences.
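The high-frequency label determination and the subsequent reduction can be sketched in Python as follows. The sketch rests on assumptions not stated in the application: the high-frequency label is taken to be the label whose sample count exceeds the target sample number by the largest margin, and reduction is done by dropping single-label samples of that label until the excess is gone or no such samples remain. The function names are hypothetical.

```python
from collections import Counter


def label_counts(samples):
    """Count how many (text, labels) samples carry each label."""
    counts = Counter()
    for _, labels in samples:
        counts.update(labels)
    return counts


def find_high_frequency_label(samples, target_count):
    """Return the label with the largest positive excess over target_count,
    or None if no label exceeds it (ties resolved alphabetically)."""
    counts = label_counts(samples)
    excess = {label: n - target_count for label, n in counts.items()}
    if not excess:
        return None
    label = max(sorted(excess), key=excess.get)
    return label if excess[label] > 0 else None


def reduce_high_frequency(samples, target_count):
    """Drop samples whose only label is the high-frequency label.

    Only single-label samples are removed, so no other label loses samples.
    """
    label = find_high_frequency_label(samples, target_count)
    if label is None:
        return list(samples)
    to_remove = label_counts(samples)[label] - target_count
    kept = []
    for sample in samples:
        _, labels = sample
        if to_remove > 0 and labels == frozenset({label}):
            to_remove -= 1
            continue
        kept.append(sample)
    return kept


new_set = (
    [("a", frozenset({"x"})), ("b", frozenset({"x"})), ("e", frozenset({"x"}))]
    + [("c", frozenset({"x", "y"}))] * 4
    + [("d", frozenset({"z"}))] * 4
)
balanced = reduce_high_frequency(new_set, target_count=4)
# Label counts go from x=7, y=4, z=4 to x=4, y=4, z=4.
```

Restricting removal to samples with exactly one label is what makes the reduction safe: removing a multi-label sample would lower the counts of its other labels as well.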
According to an embodiment of the present application, an electronic device and a readable storage medium are also provided.
Fig. 7 is a block diagram of an electronic device according to an embodiment of the present application. The electronic device is intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are meant to be examples only, and are not meant to limit the implementations of the present application described and/or claimed herein.
As shown in fig. 7, the electronic device includes: one or more processors 701, a memory 702, and interfaces for connecting the various components, including a high-speed interface and a low-speed interface. The various components are interconnected using different buses and may be mounted on a common motherboard or in other manners as desired. The processor may process instructions for execution within the electronic device, including instructions stored in or on the memory to display graphical information of a GUI on an external input/output apparatus (such as a display device coupled to the interface). In other embodiments, multiple processors and/or multiple buses may be used, along with multiple memories, as desired. Also, multiple electronic devices may be connected, with each device providing portions of the necessary operations (e.g., as a server array, a group of blade servers, or a multi-processor system). In fig. 7, one processor 701 is taken as an example.
The memory 702 is the non-transitory computer readable storage medium provided herein. The memory stores instructions executable by at least one processor to cause the at least one processor to perform the sample equalization method provided herein. The non-transitory computer readable storage medium of the present application stores computer instructions for causing a computer to perform the sample equalization method provided herein.
The memory 702, as a non-transitory computer readable storage medium, may be used to store non-transitory software programs, non-transitory computer executable programs, and modules, such as the program instructions/modules corresponding to the sample equalization method in the embodiments of the present application (e.g., the label determining module 601, the first adding module 602, and the second adding module 603 shown in fig. 6). By running the non-transitory software programs, instructions, and modules stored in the memory 702, the processor 701 executes the various functional applications and data processing of the server, that is, implements the sample equalization method in the above-described method embodiments.
The memory 702 may include a program storage area and a data storage area, wherein the program storage area may store an operating system and an application program required for at least one function, and the data storage area may store data created through use of the electronic device for sample equalization, and the like. Further, the memory 702 may include high speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory 702 may optionally include memory located remotely from the processor 701, which may be connected to the electronic device for sample equalization via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The electronic device for the sample equalization method may further include: an input device 703 and an output device 704. The processor 701, the memory 702, the input device 703 and the output device 704 may be connected by a bus or in other ways; fig. 7 takes connection by a bus as an example.
The input device 703 may receive input numeric or character information and generate key signal inputs related to user settings and function control of the electronic device for sample equalization. Examples of the input device include a touch screen, keypad, mouse, track pad, touch pad, pointing stick, one or more mouse buttons, track ball, joystick, and other input devices. The output device 704 may include a display device, auxiliary lighting devices (e.g., LEDs), tactile feedback devices (e.g., vibrating motors), and the like. The display device may include, but is not limited to, a Liquid Crystal Display (LCD), a Light Emitting Diode (LED) display, and a plasma display. In some implementations, the display device can be a touch screen.
Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, ASICs (application specific integrated circuits), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, and which receives data and instructions from, and transmits data and instructions to, a storage system, at least one input device, and at least one output device.
These computer programs (also known as programs, software applications, or code) include machine instructions for a programmable processor, and may be implemented using high-level procedural and/or object-oriented programming languages, and/or assembly/machine languages. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (e.g., magnetic discs, optical disks, memory, Programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), the internet, and blockchain networks.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server can be a cloud server, also called a cloud computing server or a cloud host, and is a host product in a cloud computing service system, so that the defects of high management difficulty and weak service expansibility in the traditional physical host and VPS service are overcome.
According to the technical scheme of the embodiments of the present application, the degree of balance in the number of samples corresponding to each label in the sample set is improved.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present application may be executed in parallel, sequentially, or in different orders, and the present invention is not limited thereto as long as the desired results of the technical solutions disclosed in the present application can be achieved.
The above-described embodiments should not be construed as limiting the scope of the present application. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (18)

1. A method of sample equalization, comprising:
determining a target label to be equalized from at least two labels associated with a sample set to be equalized according to the number of samples corresponding to the labels in the sample set to be equalized, and taking the sample corresponding to the target label in the sample set to be equalized as a target sample, wherein the sample set to be equalized comprises at least two samples, and each sample has at least one label;
adding target samples, so that the number of samples corresponding to the target label in the sample set to be equalized reaches a target sample number, thereby obtaining a new sample set;
if the number of samples corresponding to the labels other than the target label in the new sample set is smaller than the target sample number, adding samples corresponding to those labels in the new sample set, so that the number of samples corresponding to the other labels in the sample set to be equalized reaches the target sample number.
2. The method according to claim 1, wherein, before the step of adding the samples corresponding to the labels other than the target label in the new sample set if the number of samples corresponding to those labels is smaller than the target sample number, so that the number of samples corresponding to the other labels in the sample set to be equalized reaches the target sample number, the method further comprises:
counting, based on the added target samples, the number of samples of the labels other than the target label in the new sample set.
3. The method according to claim 1 or 2, wherein the adding, if the number of samples corresponding to the labels other than the target label in the new sample set is smaller than the target sample number, of samples corresponding to those labels in the new sample set, so that the number of samples corresponding to the other labels in the sample set to be equalized reaches the target sample number, comprises:
if the number of samples corresponding to the labels other than the target label in the new sample set is smaller than the target sample number, determining, according to the number of samples corresponding to the other labels and the target sample number, a label to be equalized from the other labels and the samples to be equalized that correspond to the label to be equalized in the new sample set;
and adding the samples to be equalized, so that the number of samples of the label to be equalized in the new sample set reaches the target sample number.
4. The method according to claim 1 or 2, wherein the adding of target samples, so that the number of samples corresponding to the target label in the sample set to be equalized reaches a target sample number, thereby obtaining a new sample set, comprises:
determining the difference between the number of samples of the target label in the sample set to be equalized and the target sample number;
and adding target samples according to the determined difference, so that the number of samples of the target label reaches the target sample number, thereby obtaining a new sample set.
5. The method according to claim 1 or 2, wherein the adding of target samples, so that the number of samples corresponding to the target label in the sample set to be equalized reaches a target sample number, thereby obtaining a new sample set, comprises:
if there are at least two types of target samples and the target sample number is at least two, adding the at least two types of target samples according to the target sample number, so that the number of samples of the target label in the sample set to be equalized reaches the target sample number, thereby obtaining a new sample set.
6. The method of claim 5, wherein the adding of the at least two types of target samples according to the target sample number, so that the number of samples of the target label in the sample set to be equalized reaches the target sample number, thereby obtaining a new sample set, comprises:
determining, according to the target sample number, the number by which each type of target sample among the at least two types is to be increased, wherein the difference between the increase numbers of the different types is smaller than a set difference threshold;
and adding the at least two types of target samples according to the determined increase numbers, to obtain a new sample set.
7. The method according to claim 1 or 2, wherein, after the step of adding the samples corresponding to the labels other than the target label in the new sample set if the number of samples corresponding to those labels is smaller than the target sample number, so that the number of samples corresponding to the other labels in the sample set to be equalized reaches the target sample number, the method further comprises:
determining a high-frequency label of the new sample set;
determining, from the new sample set, high-frequency samples that are marked with the high-frequency label and have exactly one label;
and reducing the number of the high-frequency samples, so as to reduce the difference between the number of samples of the high-frequency label and the number of samples of the labels other than the high-frequency label in the new sample set.
8. The method of claim 7, wherein the determining the high frequency label of the new sample set comprises:
determining the difference value between the number of samples corresponding to each label in the new sample set and the number of the target samples;
and determining the high-frequency label from at least two labels corresponding to the new sample set according to the determined difference value.
9. An apparatus for sample equalization, comprising:
a label determining module, configured to determine a target label to be equalized from at least two labels associated with a sample set to be equalized according to the number of samples corresponding to the labels in the sample set to be equalized, and to take the samples corresponding to the target label in the sample set to be equalized as target samples, wherein the sample set to be equalized comprises at least two samples, and each sample has at least one label;
a first adding module, configured to add target samples, so that the number of samples corresponding to the target label in the sample set to be equalized reaches a target sample number, thereby obtaining a new sample set;
a second adding module, configured to, if the number of samples corresponding to the labels other than the target label in the new sample set is smaller than the target sample number, add samples corresponding to those labels in the new sample set, so that the number of samples corresponding to the other labels in the sample set to be equalized reaches the target sample number.
10. The apparatus of claim 9, the apparatus further comprising:
a quantity counting module, configured to count, based on the added target samples, the number of samples of the labels other than the target label in the new sample set, before the samples corresponding to those labels are added (that is, before the step in which, if the number of samples corresponding to the other labels in the new sample set is smaller than the target sample number, samples corresponding to the other labels are added so that their number reaches the target sample number).
11. The apparatus of claim 9 or 10, wherein the second adding module comprises:
a label determining unit, configured to, if the number of samples corresponding to the labels other than the target label in the new sample set is smaller than the target sample number, determine, according to the number of samples corresponding to the other labels and the target sample number, a label to be equalized from the other labels and the samples to be equalized that correspond to the label to be equalized in the new sample set;
a sample adding unit, configured to add the samples to be equalized, so that the number of samples of the label to be equalized in the new sample set reaches the target sample number.
12. The apparatus of claim 9 or 10, wherein the first adding module comprises:
a sample difference unit, configured to determine the difference between the number of samples of the target label in the sample set to be equalized and the target sample number;
a sample increasing unit, configured to add target samples according to the determined difference, so that the number of samples of the target label reaches the target sample number, thereby obtaining a new sample set.
13. The apparatus of claim 9 or 10, wherein the first adding module comprises:
a sample adding unit, configured to, if there are at least two types of target samples and the target sample number is at least two, add the at least two types of target samples according to the target sample number, so that the number of samples of the target label in the sample set to be equalized reaches the target sample number, thereby obtaining a new sample set.
14. The apparatus according to claim 13, wherein the sample adding unit is specifically configured to:
determine, according to the target sample number, the number by which each type of target sample among the at least two types is to be increased, wherein the difference between the increase numbers of the different types is smaller than a set difference threshold;
and add the at least two types of target samples according to the determined increase numbers, to obtain a new sample set.
15. The apparatus of claim 9 or 10, further comprising:
a high-frequency label determining module, configured to determine a high-frequency label of the new sample set after the samples corresponding to the labels other than the target label have been added (that is, after the step in which, if the number of samples corresponding to the other labels in the new sample set is smaller than the target sample number, samples corresponding to the other labels are added so that their number reaches the target sample number);
a high-frequency sample determining module, configured to determine, from the new sample set, high-frequency samples that are marked with the high-frequency label and have exactly one label;
a sample number reducing module, configured to reduce the number of the high-frequency samples, so as to reduce the difference between the number of samples of the high-frequency label and the number of samples of the labels other than the high-frequency label in the new sample set.
16. The apparatus of claim 15, wherein the high frequency tag determination module comprises:
a difference determining unit, configured to determine a difference between the number of samples corresponding to each label in the new sample set and the target sample number;
a high-frequency label determining unit, configured to determine the high-frequency label from the at least two labels corresponding to the new sample set according to the determined differences.
17. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-8.
18. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any one of claims 1-8.
CN202010899784.1A 2020-08-31 2020-08-31 Sample equalization method, device, equipment and storage medium Active CN112085080B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010899784.1A CN112085080B (en) 2020-08-31 2020-08-31 Sample equalization method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN112085080A true CN112085080A (en) 2020-12-15
CN112085080B CN112085080B (en) 2024-03-08

Family

ID=73731635

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010899784.1A Active CN112085080B (en) 2020-08-31 2020-08-31 Sample equalization method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112085080B (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106874279A (en) * 2015-12-11 2017-06-20 腾讯科技(深圳)有限公司 Generate the method and device of applicating category label
CN108628971A (en) * 2018-04-24 2018-10-09 深圳前海微众银行股份有限公司 File classification method, text classifier and the storage medium of imbalanced data sets
WO2019169704A1 (en) * 2018-03-08 2019-09-12 平安科技(深圳)有限公司 Data classification method, apparatus, device and computer readable storage medium
CN110852379A (en) * 2019-11-11 2020-02-28 北京百度网讯科技有限公司 Training sample generation method and device and electronic equipment
CN111061581A (en) * 2018-10-16 2020-04-24 阿里巴巴集团控股有限公司 Fault detection method, device and equipment
CN111079811A (en) * 2019-12-06 2020-04-28 西安电子科技大学 Sampling method for multi-label classified data imbalance problem
KR20200054121A (en) * 2019-11-29 2020-05-19 주식회사 루닛 Method for machine learning and apparatus for the same
CN111198906A (en) * 2019-12-20 2020-05-26 天阳宏业科技股份有限公司 Data processing method, device and system and storage medium
CN111400499A (en) * 2020-03-24 2020-07-10 网易(杭州)网络有限公司 Training method of document classification model, document classification method, device and equipment

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
R. A. SKEFFINGTON 等: "Using high-frequency water quality data to assess sampling strategies for the EU Water Framework Directive", HYDROLOGY AND EARTH SYSTEM SCIENCES *
李思豪;陈福才;黄瑞阳;: "一种多标签随机均衡采样算法", 计算机应用研究, no. 10 *

Also Published As

Publication number Publication date
CN112085080B (en) 2024-03-08

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant