WO2020220220A1 - Classification model training method and device, and computer-readable medium - Google Patents

Classification model training method and device, and computer-readable medium

Info

Publication number
WO2020220220A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
data set
training
training data
samples
Prior art date
Application number
PCT/CN2019/085054
Other languages
French (fr)
Chinese (zh)
Inventor
周林飞
吴超华
施内加斯·丹尼尔
田鹏伟
李聪超
吴文超
Original Assignee
西门子(中国)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 西门子(中国)有限公司
Priority to CN201980095070.0A (patent CN113692589A)
Priority to PCT/CN2019/085054 (patent WO2020220220A1)
Publication of WO2020220220A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition

Definitions

  • the present invention relates to the field of data processing, in particular to a classification model training method, device and computer readable medium.
  • The value of the equipment's operating data remains within a normal range during normal operation. If a value of the operating data exceeds the normal range, the cause may be an equipment failure. Therefore, the historical operating data of the equipment during normal operation is used to train the classification model; the operating data of the equipment can then be input into the classification model in real time, and the classification model determines whether a failure has occurred during the operation of the equipment.
  • the refining equipment includes pipelines for transferring liquids.
  • The working conditions of the pipelines can be determined by monitoring the liquid flow rate, pressure and temperature in the pipelines. For example, the flow rate data will decrease when the pipeline is blocked, and the pressure data will decrease when the pipeline leaks.
  • When the historical operating data of the equipment is used as training data to train the classification model, the equipment may have a variety of different operating modes, and the training data may be historical operating data of the equipment in multiple operating modes. Because the equipment generates different operating data in different operating modes, the training data may be unbalanced.
  • the training data consists of a first data set and a second data set, where the first data set includes historical operating data when the device is running in the first operating mode, and the second data set includes historical operating data when the device is running in the second operating mode.
  • the training data is not balanced.
  • The conditions used to judge data imbalance can be determined according to the actual situation. For example, when the ratio of the number of samples in the first data set to the number of samples in the second data set is less than a preset threshold, the training data is determined to be unbalanced.
  • The training data obtained is used directly to train the classification model. Because the training data may be unbalanced, one possible result is that, when the trained classification model is used to monitor the device in real time, operating data falling within the numerical range corresponding to the first data set leads the model to conclude erroneously that the equipment is abnormal. A large number of false alarms is then generated when the classification model is used to determine the operating status of the equipment, and the classification accuracy is low.
  • the classification model training method, device and computer readable medium provided by the present invention can improve the classification accuracy of the trained classification model.
  • an embodiment of the present invention provides a classification model training method, including:
  • the balance processing instruction includes at least one data set identifier, each data set identifier in the at least one data set identifier is used to identify a first data set in the first training data that causes the first training data to be unbalanced, and the first data sets identified by different data set identifiers are different;
  • according to the balance processing instruction, the first training data is equalized for the first data set identified by each data set identifier to obtain the second training data, wherein each second data set in the second training data corresponding to a data set identifier will not cause the second training data to be unbalanced;
  • the second training data is used to train a classification model corresponding to the target device.
  • After the first training data is acquired, it is determined whether the first training data is balanced. If the first training data is not balanced, an interaction request is sent to the user, requesting the user to determine the way to process the first training data. After the user's equalization processing instruction in response to the interaction request is received, the first training data is equalized according to the equalization processing instruction to obtain the second training data, so that the historical operating data of the target device during normal operation in the second training data will not cause the second training data to be unbalanced, and the second training data is then used to train the classification model.
  • The unbalanced first training data is converted into the second training data so that the historical operating data of the target device during normal operation in the second training data will not cause the second training data to be unbalanced. The classification model trained with the second training data therefore will not erroneously report the target device as abnormal based on operating data generated while the target device runs normally, which ensures that the trained classification model has a higher classification accuracy.
  • The fourth data set, which includes the smallest number of samples, and the fifth data set, which includes the largest number of samples, are determined from the third data sets. It is then determined whether the ratio of the number of samples in the fourth data set to the number of samples in the fifth data set is less than the preset proportion threshold. If the ratio is less than the proportion threshold, the first training data is determined to be unbalanced; if the ratio is greater than or equal to the proportion threshold, the first training data is determined to be balanced.
  • Each sample included in the first training data is clustered into one of one or more third data sets, that is, samples whose values lie in the same numerical range are placed in the same third data set. If the clustering process yields only one third data set, the values of all samples in the first training data lie within the same numerical range, that is, the first training data is balanced. If the clustering process yields multiple third data sets, the ratio of the number of samples in the third data set with the fewest samples to the number of samples in the third data set with the most samples is calculated. A ratio below the preset proportion threshold indicates that the numbers of samples in the different third data sets differ considerably and the sample distribution in the first training data is uneven, so the first training data is determined to be unbalanced; otherwise it is determined to be balanced. Determining the balance in two stages, first by the number of third data sets and then by the number of samples in them, ensures that the balance of the first training data can be judged accurately.
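The two-stage balance judgment described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the method claimed in the source: the text does not name a clustering algorithm, so a simple gap-based split of sorted one-dimensional values stands in for it, and the names `cluster_by_gap`, `is_balanced`, `max_gap` and `ratio_threshold` are hypothetical.

```python
from typing import List, Sequence


def cluster_by_gap(values: Sequence[float], max_gap: float) -> List[List[float]]:
    """Group 1-D sample values into clusters (the "third data sets").

    Assumption: a new cluster starts whenever two neighbouring sorted values
    differ by more than max_gap. The source does not specify the algorithm.
    """
    clusters: List[List[float]] = []
    for value in sorted(values):
        if clusters and value - clusters[-1][-1] <= max_gap:
            clusters[-1].append(value)
        else:
            clusters.append([value])
    return clusters


def is_balanced(values: Sequence[float], max_gap: float, ratio_threshold: float) -> bool:
    """Two-stage balance judgment on the first training data."""
    clusters = cluster_by_gap(values, max_gap)
    # Stage 1: a single cluster means all samples share one numerical range.
    if len(clusters) <= 1:
        return True
    # Stage 2: compare the smallest ("fourth") and largest ("fifth") data sets.
    sizes = [len(cluster) for cluster in clusters]
    return min(sizes) / max(sizes) >= ratio_threshold
```

With 100 samples in one numerical range and 3 in another, the ratio is 0.03, so the data would be judged unbalanced for any proportion threshold above that value.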
  • For each data set identifier, a sixth data set corresponding to the data set identifier is determined. The values of the samples in the sixth data set lie within the same numerical range as the values of the samples in the first data set identified by the data set identifier, and the ratio of the total number of samples in the first data set identified by the data set identifier and the corresponding sixth data set to the number of samples in the fifth data set must be greater than or equal to the proportion threshold.
  • The first data set is a data set that the user has determined causes the first training data to be unbalanced and whose samples are historical operating data from normal operation of the target device. Because the first training data is unbalanced as a result of the small number of samples in the first data set, a corresponding sixth data set is determined for each first data set such that the values of the samples in the sixth data set and in the first data set lie in the same numerical range. Combining the samples in the sixth data set with the first training data essentially expands the first data set, so that the historical operating data of the target device during normal operation in the second training data will not cause the second training data to be unbalanced.
  • At least one sample may be collected from the first data set identified by the data set identifier, and the combination of the collected samples is then used as the sixth data set corresponding to the data set identifier.
  • The values of the samples in the determined sixth data set are required to lie within the same numerical range as the values of the samples in the first data set identified by the data set identifier. Collecting samples directly from the first data set identified by the data set identifier to form the corresponding sixth data set makes it more convenient to equalize the first training data: no other historical operating data needs to be found, which improves the efficiency of classification model training.
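As a rough illustration of building a sixth data set by collecting samples from the first data set, the sketch below computes the smallest number of extra samples that satisfies the ratio condition and draws them with replacement. Sampling with replacement, the seedable `rng` parameter, and the function names are assumptions for illustration; the source only states that samples are collected from the first data set.

```python
import math
import random
from typing import List, Optional, Sequence


def samples_needed(first_size: int, largest_size: int, ratio_threshold: float) -> int:
    """Smallest sixth-data-set size n such that
    (first_size + n) / largest_size >= ratio_threshold."""
    return max(0, math.ceil(ratio_threshold * largest_size - first_size))


def build_sixth_data_set(
    first_data_set: Sequence[float],
    largest_size: int,
    ratio_threshold: float,
    rng: Optional[random.Random] = None,
) -> List[float]:
    """Collect samples from the first data set to form the sixth data set.

    Every collected sample comes from the first data set, so its value lies in
    the same numerical range. Drawing with replacement is an assumption.
    """
    rng = rng or random.Random()
    count = samples_needed(len(first_data_set), largest_size, ratio_threshold)
    if count > 0 and not first_data_set:
        raise ValueError("cannot collect samples from an empty first data set")
    return [rng.choice(first_data_set) for _ in range(count)]
```

For a first data set of 3 samples, a fifth data set of 100 samples and a proportion threshold of 0.5, the sketch collects 47 samples, so that the combined 50 samples meet the threshold.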
  • According to the data read address included in the data reselection instruction, the third training data is read from the corresponding storage space, and the third training data is then used as the first training data to restart the judgment of whether the first training data is balanced.
  • the first training data obtained before is discarded, and the first training data is reacquired according to the data reselection instruction.
  • First training data with a more balanced sample distribution can be reselected to train the classification model, which meets the needs of different users and improves the applicability of the classification model training method.
  • the present invention also provides a classification model training device, including:
  • a data acquisition module for acquiring first training data, where the first training data includes historical operating data of the target device;
  • a data judgment module for judging whether the first training data obtained by the data acquisition module is balanced;
  • a request sending module for sending an interaction request to the user if, according to the judgment result of the data judgment module, the first training data is unbalanced, where the interaction request is used to request the user to determine the way to process the first training data;
  • an instruction receiving module for receiving a balance processing instruction from the user in response to the interaction request sent by the request sending module, wherein the balance processing instruction includes at least one data set identifier, each data set identifier in the at least one data set identifier is used to identify a first data set in the first training data that causes the first training data to be unbalanced, and the first data sets identified by different data set identifiers are different;
  • a data processing module configured to perform, according to the equalization processing instruction received by the instruction receiving module, equalization processing on the first training data for the first data set identified by each data set identifier to obtain the second training data, wherein each second data set in the second training data corresponding to a data set identifier will not cause the second training data to be unbalanced;
  • a model training module is used to train the classification model corresponding to the target device using the second training data obtained by the data processing module.
  • The data judgment module determines whether the first training data is balanced. After the data judgment module judges that the first training data is unbalanced, the request sending module sends an interaction request to the user. The instruction receiving module receives the user's equalization processing instruction in response to the interaction request sent by the request sending module. For the first data set identified by each data set identifier in the equalization processing instruction, the data processing module performs equalization processing on the first training data, so that the second data set corresponding to the data set identifier does not cause the resulting second training data to be unbalanced. The model training module uses the second training data obtained by the data processing module to train the classification model corresponding to the target device.
  • the model training module trains the classification model
  • When the first training data is unbalanced because it contains only a small number of historical operating data samples corresponding to normal operation of the target device, the first training data is converted into the second training data, so that the historical operating data of the target device during normal operation will not cause the second training data to be unbalanced. The second training data is then used to train the classification model corresponding to the target device, which ensures that the trained classification model will not determine that the target device is abnormal based on operating data generated while the target device runs normally, and thereby improves the classification accuracy of the trained classification model.
  • the data judgment module includes:
  • a clustering processing unit is used to perform clustering processing on the first training data to obtain at least one third data set, where each third data set includes at least one sample, and the value of each sample in the same third data set is located in Within the same numerical range, the values of samples in different third data sets are in different numerical ranges;
  • a first judging unit configured to determine that the first training data is balanced when the clustering processing unit performs clustering processing on the first training data and obtains only one third data set;
  • a second judging unit configured to, when the clustering processing unit performs clustering processing on the first training data and obtains at least two third data sets, determine from the at least two third data sets the fourth data set with the smallest number of samples and the fifth data set with the largest number of samples; if the ratio of the number of samples in the fourth data set to the number of samples in the fifth data set is less than the preset proportion threshold, determine that the first training data is unbalanced; and if that ratio is greater than or equal to the proportion threshold, determine that the first training data is balanced.
  • the clustering processing unit may cluster the first training data into one or more third data sets, so that the values of samples in each third data set have the same numerical range.
  • The first judgment unit and the second judgment unit perform subsequent processing based on the number of third data sets obtained by the clustering processing unit. If the clustering processing unit obtains only one third data set, the first judgment unit determines that the first training data is balanced.
  • The second judging unit first determines, from the third data sets, the fourth data set including the smallest number of samples and the fifth data set including the largest number of samples, and then determines whether the ratio of the number of samples in the fourth data set to the number of samples in the fifth data set is less than the preset proportion threshold. If so, the first training data is determined to be unbalanced; otherwise, the first training data is determined to be balanced.
  • The first judgment unit and the second judgment unit determine whether the first training data is balanced in two stages, based on the result of the clustering processing performed by the clustering processing unit. The judgment combines the number of third data sets with the number of samples in the third data sets, which ensures the accuracy of the balance judgment on the first training data.
  • the data processing module includes:
  • a data collection unit for determining, for each data set identifier included in the equalization processing instruction, a sixth data set corresponding to the data set identifier, wherein the values of the samples in the sixth data set and the values of the samples in the first data set identified by the data set identifier are in the same numerical range, and the ratio of the total number of samples in the first data set identified by the data set identifier and the sixth data set to the number of samples in the fifth data set is greater than or equal to the proportion threshold;
  • a data combination unit is used to combine samples in each sixth data set determined by the data collection unit with the first training data to obtain second training data.
  • The data collection unit may determine a sixth data set corresponding to the data set identifier, such that the values of the samples in the sixth data set are within the same numerical range as the values of the samples in the first data set identified by the data set identifier, and such that the ratio of the total number of samples in the sixth data set and the first data set identified by the data set identifier to the number of samples in the fifth data set is greater than or equal to the proportion threshold.
  • the data combination unit may combine the samples in each sixth data set determined by the data collection unit with the first training data to obtain the combined second training data.
  • Based on the value range of the samples in the first data set identified by the data set identifier, the data collection unit determines a sixth data set that includes at least one sample, in order to expand the first data set identified by the data set identifier. After the data combination unit combines the samples in each sixth data set with the first training data, the second data set corresponding to the data set identifier in the second training data consists of the samples of the first data set identified by the data set identifier together with the samples of the corresponding sixth data set.
  • The data collection unit may collect samples from the first data set identified by the data set identifier, and then use the set of collected samples as the sixth data set corresponding to the data set identifier.
  • The data collection unit collects samples directly from the first data set identified by the data set identifier and uses the set of collected samples as the sixth data set corresponding to the data set identifier. Because the samples of the first data set and the sixth data set that correspond to the same data set identifier are combined to form the second data set corresponding to that data set identifier in the second training data, the values of the samples in the sixth data set are guaranteed to lie in the same numerical range as the values of the samples in the first data set, which ensures the effectiveness and accuracy of equalizing the first training data.
  • the classification model training device may further include a data reselection module.
  • The data reselection module may read the third training data from the corresponding storage space according to the data read address included in the data reselection instruction, and use the third training data as the first training data to trigger the data judgment module to verify the balance of the new first training data.
  • The data reselection module can reselect the training data according to the user's instructions, meeting the user's need to discard the previous first training data and reselect the training data used to train the classification model. This satisfies different usage needs and helps improve the applicability of classification model training.
  • an embodiment of the present invention also provides a classification model training device, including: at least one memory and at least one processor;
  • At least one memory for storing machine-readable programs
  • At least one processor is configured to invoke a machine-readable program to execute the foregoing first aspect or the method provided in any possible implementation manner of the first aspect.
  • A machine-readable program is stored in the memory, and the processor can execute the method provided by the first aspect or any possible implementation of the first aspect by invoking the machine-readable program stored in the memory. After the first training data is obtained, it is determined whether the first training data is balanced. If the first training data is not balanced, an interaction request is sent to the user, requesting the user to determine the way to process the first training data. After the user's equalization processing instruction in response to the interaction request is received, the first training data is equalized according to the equalization processing instruction to obtain the second training data, so that the historical operating data of the target device during normal operation in the second training data will not cause the second training data to be unbalanced, and the second training data is then used to train the classification model.
  • The unbalanced first training data is converted into the second training data so that the historical operating data of the target device during normal operation in the second training data will not cause the second training data to be unbalanced. The classification model trained with the second training data therefore will not erroneously report the target device as abnormal based on operating data generated while the target device runs normally, which ensures that the trained classification model has a higher classification accuracy.
  • an embodiment of the present invention also provides a computer-readable medium on which computer instructions are stored.
  • When the computer instructions are executed by a processor, the processor executes the method provided by the first aspect or any possible implementation of the first aspect.
  • Computer instructions are stored on the machine-readable medium, and when the computer instructions are executed by the processor, the processor executes the classification model training method provided by the first aspect or any possible implementation of the first aspect. After the first training data is obtained, it is judged whether the first training data is balanced. If the first training data is not balanced, an interaction request is sent to the user, requesting the user to determine the way to process the first training data. After the user's equalization processing instruction in response to the interaction request is received, the first training data is equalized according to the equalization processing instruction to obtain the second training data, so that the historical operating data of the target device during normal operation in the second training data will not cause the second training data to be unbalanced, and the second training data is then used to train the classification model.
  • The unbalanced first training data is converted into the second training data so that the historical operating data of the target device during normal operation in the second training data will not cause the second training data to be unbalanced. The classification model trained with the second training data therefore will not erroneously report the target device as abnormal based on operating data generated while the target device runs normally, which ensures that the trained classification model has a higher classification accuracy.
  • FIG. 1 is a flowchart of a classification model training method provided by an embodiment of the present invention;
  • FIG. 2 is a flowchart of a first training data balance judgment method according to an embodiment of the present invention;
  • FIG. 3 is a flowchart of a method for equalizing the first training data according to an embodiment of the present invention;
  • FIG. 4 is a flowchart of a sixth data set determination method according to an embodiment of the present invention;
  • FIG. 5 is a flowchart of a method for reselecting training data according to an embodiment of the present invention;
  • FIG. 6 is a flowchart of another classification model training method provided by an embodiment of the present invention;
  • FIG. 7 is a schematic diagram of a classification model training device provided by an embodiment of the present invention;
  • FIG. 8 is a schematic diagram of another classification model training device provided by an embodiment of the present invention;
  • FIG. 9 is a schematic diagram of another classification model training device provided by an embodiment of the present invention;
  • FIG. 10 is a schematic diagram of a classification model training device including a data reselection module provided by an embodiment of the present invention;
  • FIG. 11 is a schematic diagram of still another classification model training device provided by an embodiment of the present invention.
  • 503: Use the third training data as the first training data, and perform step 102
  • 602: Perform clustering processing on the first training data to obtain at least one third data set
  • 606: Determine whether the ratio of the number of samples in the fourth data set to the number of samples in the fifth data set is less than the proportion threshold
  • Data acquisition module; 702: Data judgment module; 703: Request sending module
  • Instruction receiving module; 705: Data processing module; 706: Model training module
  • Data reselection module; 7021: Clustering processing unit; 7022: First judging unit
  • Memory; 1102: Processor
  • the historical operating data of the device is directly used as the training data to train the classification model.
  • The trained classification model will treat the samples in the first data set, which includes only a small number of samples, as operating data generated when the device is abnormal.
  • When normal operating data has values that fall within the value range corresponding to the first data set, the classification model concludes from that data that the equipment is abnormal. The classification model therefore generates a large number of false alarms, and its classification accuracy is low.
  • After the first training data used to train the classification model is acquired, it is first judged whether the first training data is balanced. When the first training data is determined to be unbalanced, an interaction request is sent to the user, and the user determines the method of processing the first training data. Afterwards, if the user's equalization processing instruction in response to the interaction request is received, the first training data is equalized into the second training data according to the equalization processing instruction, so that the historical operating data of the target device during normal operation will not cause the second training data to be unbalanced, and the second training data is then used to train the classification model corresponding to the target device.
  • The second training data is obtained by equalizing the first training data, so that the historical operating data from normal operation of the device will not cause the second training data to be unbalanced, thereby reducing the false alarm probability of the classification model trained with the second training data and improving its classification accuracy.
  • an embodiment of the present invention provides a classification model training method, which may include the following steps:
  • Step 101: Acquire first training data, where the first training data includes historical operating data of the target device;
  • Step 102: Determine whether the first training data is balanced;
  • Step 103: If the first training data is not balanced, send an interaction request to the user, where the interaction request is used to request the user to determine a way to process the first training data;
  • Step 104: Receive a balance processing instruction from the user in response to the interaction request, where the balance processing instruction includes at least one data set identifier, each data set identifier in the at least one data set identifier is used to identify a first data set in the first training data that causes the first training data to be unbalanced, and the first data sets identified by different data set identifiers are different;
  • Step 105: According to the equalization processing instruction, perform equalization processing on the first training data for the first data set identified by each data set identifier to obtain second training data, where each second data set in the second training data corresponding to a data set identifier will not cause the second training data to be unbalanced;
  • Step 106: Use the second training data to train a classification model corresponding to the target device.
  • In the classification model training method provided by the embodiment of the present invention, after the first training data including the historical operating data of the target device is obtained, it is judged whether the first training data is balanced. When the first training data is determined to be unbalanced, an interaction request is sent to the user, and the user determines the way to process the first training data. After the equalization processing instruction from the user in response to the interaction request is received, equalization processing is performed on the first training data for the first data set identified by each data set identifier included in the equalization processing instruction. In the second training data obtained by the equalization processing, the second data set corresponding to each data set identifier will not cause the second training data to be unbalanced, and the obtained second training data is then used to train the classification model corresponding to the target device.
  • It can be seen that, when the first training data is determined to be unbalanced, the user determines whether the first training data needs to be equalized and which first data sets the equalization targets. When the user decides to equalize the first training data, the second training data is obtained by equalizing the first training data, so that the historical operating data of the target device during normal operation in the second training data will not cause the second training data to be unbalanced. The classification model trained with the second training data therefore will not make misjudgments caused by unbalanced training data, which ensures that the trained classification model has a high classification accuracy.
  • Historical operating data including a certain number of samples can be selected from the historical operating data of the target device as the first training data. For example, the historical operating data of the target device in one continuous period of time can be used as the first training data, and the historical operating data of the target device in multiple discontinuous time periods may also be used as the first training data.
  • Because the operating environments of different devices are not completely the same, when a classification model needs to be trained for a device in order to monitor whether the device is abnormal, the historical operating data of that device must be used as training data to train the classification model, so as to ensure that the trained classification model has a high classification accuracy.
  • When it is determined that the first training data is unbalanced and an interaction request is sent to the user, the interaction request may include multiple alternatives, so that the user can select from them the way in which the first training data is to be processed. For example, the interaction request may include the following three alternatives: training a classification model based on the existing training data, reselecting training data, and performing equalization processing on the training data. The following describes the processing after the user selects each of these three alternatives:
  • If the user chooses to train a classification model based on the existing training data, the first training data is directly used to train the classification model corresponding to the target device. In this case, the first training data is unbalanced because at least one first data set in the first training data includes a small number of samples, and every sample in such a first data set is historical operating data recorded when the target device was operating abnormally. When the trained classification model is used to analyze the operating data of the target device in real time, if the operating data falls into the value range corresponding to any such first data set, the classification model concludes that the target device is abnormal.
  • If the user chooses to reselect training data, the training data is re-read according to the data storage address provided by the user, and the re-read training data is used as the first training data, starting again from step 102. In this case, the user confirms that the first training data selected for training the classification model is wrong, the training data is reselected according to the user's instruction, and it is then determined whether the reselected training data is balanced.
  • If the user chooses to perform equalization processing on the training data, each data set that causes the first training data to be unbalanced is shown to the user, the user selects the first data sets from the displayed data sets, and an equalization processing instruction including the data set identifier of each selected first data set is generated. Judging whether the first training data is balanced identifies at least one data set that causes the first training data to be unbalanced, but the small number of samples in such a data set may be historical operating data recorded either when the target device was operating abnormally or when it was operating normally. The user therefore needs to distinguish the two cases, and a data set whose small number of samples is historical operating data of the target device during normal operation is determined as a first data set. Equalization processing is then performed on the first training data for each first data set. In the resulting second training data, the historical operating data of the target device during normal operation does not cause the second training data to be unbalanced, while the historical operating data of the target device during abnormal operation may still cause the second training data to be unbalanced. A classification model trained with the second training data can therefore identify more accurately whether the target device is abnormal.
  • The above three alternatives can be displayed to the user in the form of a prompt box, after which either further interaction with the user is carried out according to the alternative selected by the user, or the first training data is processed directly according to the selected alternative.
  • When determining whether the first training data is balanced in step 102, if the judgment result is that the first training data is balanced, the first training data is directly used to train the classification model corresponding to the target device.
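  • The handling of the balanced case and of the three alternatives can be sketched as a small dispatch function. This is a minimal illustration rather than the claimed implementation: the names `Alternative` and `next_step` and the returned description strings are hypothetical and do not appear in the original text.

```python
from enum import Enum
from typing import Optional


class Alternative(Enum):
    """The three alternatives offered in the interaction request (hypothetical names)."""
    TRAIN_ON_EXISTING = 1
    RESELECT = 2
    EQUALIZE = 3


def next_step(balanced: bool, choice: Optional[Alternative] = None) -> str:
    """Describe what happens after the balance judgment of step 102."""
    if balanced:
        # Balanced data is used for training directly; the user is not asked.
        return "train with first training data"
    if choice is Alternative.TRAIN_ON_EXISTING:
        return "train with first training data"
    if choice is Alternative.RESELECT:
        return "re-read training data and repeat step 102"
    if choice is Alternative.EQUALIZE:
        return "equalize to obtain second training data, then train"
    # No instruction from the user: nothing is trained.
    return "end"


print(next_step(True))
print(next_step(False, Alternative.EQUALIZE))
```

  • Keeping the decision in one function makes it explicit that equalization only happens on an explicit user choice, which is the point of the interaction request.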
  • When step 106 uses the second training data to train the classification model corresponding to the target device, the second training data can be used as input to train the classification model through various types of machine learning algorithms. The specific machine learning algorithm can be an artificial neural network algorithm, a deep learning algorithm, a kernel-based algorithm, an ensemble algorithm, a genetic algorithm, etc.
  • When determining whether the first training data is balanced, the first training data may be clustered into one or more data sets, and whether the first training data is balanced is then determined according to the number of data sets obtained by clustering and the ratio of the numbers of samples in those data sets.
  • Specifically, determining whether the first training data is balanced can be achieved through the following steps:
  • Step 201 Perform clustering processing on the first training data to obtain at least one third data set, where each third data set includes at least one sample, and the values of each sample in the same third data set are within the same numerical range, The values of the samples in different third data sets are in different numerical ranges;
  • Step 202 Determine whether clustering the first training data yields only one third data set; if yes, go to step 203, otherwise go to step 204;
  • Step 203 Determine that the first training data is balanced, and end the current process;
  • Step 204 Determine, from at least two third data sets, a fourth data set that includes the smallest number of samples and a fifth data set that includes the largest number of samples;
  • Step 205 Determine whether the ratio of the number of samples in the fourth data set to the number of samples in the fifth data set is less than the preset proportion threshold; if yes, go to step 206, otherwise go to step 203;
  • Step 206 Determine that the first training data is not balanced.
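  • A minimal sketch of the two-stage judgment in steps 202 to 206, assuming the clustering of step 201 has already produced the number of samples in each third data set. The function name `is_balanced`, the default threshold of 0.8, and the sample counts are illustrative assumptions, not part of the claimed method.

```python
from typing import Sequence


def is_balanced(cluster_sizes: Sequence[int], proportion_threshold: float = 0.8) -> bool:
    """Two-stage balance judgment following steps 202 to 206.

    cluster_sizes holds the number of samples in each third data set.
    """
    if not cluster_sizes or min(cluster_sizes) < 1:
        raise ValueError("each third data set must contain at least one sample")
    # Steps 202 and 203: a single third data set means the data is balanced.
    if len(cluster_sizes) == 1:
        return True
    # Step 204: the fourth (smallest) and fifth (largest) data sets.
    fourth = min(cluster_sizes)
    fifth = max(cluster_sizes)
    # Steps 205 and 206: unbalanced when the ratio is below the threshold.
    return fourth / fifth >= proportion_threshold


print(is_balanced([11000]))             # True
print(is_balanced([8000, 1000, 2000]))  # False, since 1000 / 8000 = 0.125
```

  • Only the smallest and largest data sets are compared, because that pair has the lowest ratio of all pairs; if it passes the threshold, every other pair does as well.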
  • By performing clustering processing on the first training data, the first training data can be clustered into at least one third data set, such that each third data set includes at least one sample, the values of the samples in the same third data set are in the same numerical range, and the values of the samples in different third data sets are in different numerical ranges.
  • The first training data is composed of historical operating data of the target device. Because the values of the operating data are in the same numerical range whenever the target device runs normally in the same operating mode, clustering the first training data yields one or more third data sets, and the values of the samples in the same third data set are within the same numerical range. For example, the target device has two operating modes: in one operating mode the value range of the operating data is 50 to 80, and in the other the value range of the operating data is 120 to 150. Clustering the first training data yields three third data sets, namely third data set 1, third data set 2, and third data set 3, where the value of each sample in third data set 1 is in the range of 50 to 80, the value of each sample in third data set 2 is in the range of 120 to 150, and the value of each sample in third data set 3 is in the range of 200 to 240.
  • After at least one third data set is obtained by clustering the first training data, a preliminary judgment on whether the first training data is balanced is first made according to the number of third data sets. If the clustering yields only one third data set, that is, the values of all samples included in the first training data are within the same numerical range, the first training data is determined to be balanced. If the clustering yields at least two third data sets, whether the first training data is balanced must be further determined according to the number of samples in each third data set.
  • Specifically, the fourth data set with the smallest number of samples and the fifth data set with the largest number of samples are determined from the third data sets, and the numbers of samples in the two are compared. If the ratio of the number of samples in the fourth data set to the number of samples in the fifth data set is less than the preset proportion threshold, the first training data is determined to be unbalanced; if the ratio is greater than or equal to the proportion threshold, the first training data is determined to be balanced.
  • The fourth data set is the third data set with the smallest number of samples, and the fifth data set is the third data set with the largest number of samples, so the difference in the number of samples between the fourth data set and the fifth data set is the largest. If the ratio of the number of samples in the fourth data set to the number of samples in the fifth data set is less than the preset proportion threshold, at least one third data set in the first training data has far fewer samples than another third data set, that is, the first training data is unbalanced.
  • The proportion threshold can be flexibly set within the range of 0.5 to 1.0; for example, the proportion threshold can be 0.6, 0.7, 0.85, or 0.9.
  • When performing clustering processing on the first training data in step 201, the first training data can be clustered through a K-means clustering algorithm, a spectral clustering algorithm, a Gaussian mixture model clustering algorithm, etc.
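  • As one possible illustration of step 201, the sketch below clusters scalar operating data with a plain K-means (Lloyd's algorithm) written without third-party libraries. The function name `kmeans_1d`, the order-statistic initialization, the requirement that the number of clusters `k` is given in advance, and the sample values are all assumptions made for this example; the text does not specify how the number of clusters is chosen.

```python
from typing import List


def kmeans_1d(values: List[float], k: int, max_iter: int = 100) -> List[List[float]]:
    """Cluster scalar samples into at most k groups with Lloyd's algorithm."""
    if not 1 <= k <= len(values):
        raise ValueError("k must be between 1 and the number of samples")
    ordered = sorted(values)
    n = len(ordered)
    # Deterministic initialization: k evenly spaced order statistics.
    centroids = [float(ordered[(2 * i + 1) * n // (2 * k)]) for i in range(k)]
    clusters: List[List[float]] = []
    for _ in range(max_iter):
        # Assignment step: each sample goes to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for value in ordered:
            nearest = min(range(k), key=lambda j: abs(value - centroids[j]))
            clusters[nearest].append(value)
        # Update step: a centroid with no samples keeps its old position.
        updated = [sum(c) / len(c) if c else centroids[j] for j, c in enumerate(clusters)]
        if updated == centroids:
            break
        centroids = updated
    # Drop empty groups so that every returned data set has at least one sample.
    return [c for c in clusters if c]


samples = [60, 125, 52, 220, 78, 140, 65, 205, 58, 238, 71]
third_data_sets = kmeans_1d(samples, k=3)
print([len(data_set) for data_set in third_data_sets])  # [6, 2, 3]
```

  • The resulting group sizes can then be passed to the balance judgment of steps 202 to 206.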
  • One or more third data sets are obtained by clustering the first training data, and whether the first training data is balanced is determined in two stages, according to the number of third data sets and the number of samples included in each third data set. This ensures that the balance of the first training data can be accurately judged, thereby ensuring that the classification model trained by using the first training data or the second training data has high classification accuracy.
  • In addition to clustering, whether the first training data is balanced can be determined by performing histogram curve fitting on the first training data, or by estimating the distribution of the first training data.
  • The first training data can be equalized in the following manner:
  • Step 301 For each data set identifier included in the equalization processing instruction, determine a sixth data set corresponding to the data set identifier, where the values of the samples in the sixth data set and the values of the samples in the first data set identified by the data set identifier are in the same numerical range, and the ratio of the total number of samples in the first data set identified by the data set identifier and the sixth data set to the number of samples in the fifth data set is greater than or equal to the proportion threshold;
  • Step 302 Combine the determined samples in each sixth data set corresponding to each data set identifier with the first training data to obtain second training data.
  • Each data set identifier included in the equalization processing instruction identifies a corresponding first data set, and the first data set is selected by the user from the third data sets. The first training data is unbalanced because the first data set includes a small number of samples. Therefore, a corresponding sixth data set can be determined for the data set identifier, where the sixth data set includes at least one sample, the values of the samples in the sixth data set are in the same value range as the values of the samples in the first data set, and the ratio of the total number of samples in the sixth data set and the first data set to the number of samples in the fifth data set is greater than or equal to the proportion threshold. After the sixth data set corresponding to each data set identifier is determined, the samples in each sixth data set are combined with the first training data to obtain the second training data. In the second training data, for each first data set that caused the first training data to be unbalanced, the ratio of the total number of samples in the first data set and the corresponding sixth data set to the number of samples in the fifth data set is greater than or equal to the proportion threshold, so when the first data set and the corresponding sixth data set are viewed as a whole, they do not cause the second training data to be unbalanced.
  • For example, the equalization processing instruction includes data set identifier 1, which identifies first data set 1 in the first training data, and first data set 1 is third data set 2 in the foregoing embodiment. The first training data is unbalanced because first data set 1 includes a small number of samples compared with third data set 1. Therefore, a sixth data set is determined for data set identifier 1: the values of the samples in the sixth data set are all in the range of 120 to 150 (the same value range as the samples in first data set 1), and the ratio of the total number of samples in the sixth data set and first data set 1 to the number of samples in the fifth data set (here, third data set 1) is greater than or equal to the proportion threshold. Because the samples in first data set 1 and the sixth data set have the same value range, they are samples of the same kind, and first data set 1 and the sixth data set taken together do not cause the second training data to be unbalanced.
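  • The size condition on the sixth data set can be turned into a small calculation: the smallest number of added samples n satisfies (n1 + n) / n5 >= threshold, where n1 and n5 are the numbers of samples in the first and fifth data sets. The sketch below is a hedged illustration; the function name `sixth_data_set_size` and the counts in the demonstration are assumptions, and the threshold is passed as a string so that it can be represented exactly as a fraction.

```python
import math
from fractions import Fraction


def sixth_data_set_size(first_count: int, fifth_count: int, proportion_threshold: str) -> int:
    """Smallest n with (first_count + n) / fifth_count >= proportion_threshold."""
    threshold = Fraction(proportion_threshold)
    if not 0 < threshold <= 1:
        raise ValueError("the proportion threshold must be in (0, 1]")
    # Total number of samples the first and sixth data sets need together.
    needed_total = math.ceil(threshold * fifth_count)
    # A result of 0 means the first data set already satisfies the condition.
    return max(0, needed_total - first_count)


# Illustrative counts: 1000 samples in the first data set, 8000 in the fifth.
print(sixth_data_set_size(1000, 8000, "0.8"))  # 5400
```

  • Exact fractions avoid rounding surprises when the product of the threshold and the sample count is computed in binary floating point.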
  • When determining the sixth data set corresponding to a data set identifier, samples may be collected from the first data set identified by the data set identifier, and the collected samples are used as the sixth data set corresponding to the data set identifier. The method for determining the sixth data set corresponding to the data set identifier is shown in FIG. 4, and may specifically include the following steps:
  • Step 401 Collect at least one sample from the first data set identified by the data set identifier
  • Step 402 Use a sample set including at least one collected sample as a sixth data set corresponding to the data set identifier.
  • To obtain samples in the same numerical range as the samples in the first data set identified by the data set identifier, samples are collected from that first data set, and a data set including each collected sample is used as the sixth data set corresponding to the data set identifier. Because the samples in the sixth data set are collected from the first data set, part or all of the samples in the first data set are in effect copied one or more times to obtain the sixth data set, which both ensures that the samples in the sixth data set have the same value range as the samples in the corresponding first data set and makes the sixth data set convenient to determine.
  • When collecting samples from the first data set, all samples in the first data set may be copied one or more times, or samples may be collected from the first data set at random by a random sampling method.
  • Besides collecting samples from the first data set, samples can also be collected from the historical operating data of the target device, or new samples can be generated directly based on the samples in the first data set.
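  • Steps 401 and 402, followed by the combination of step 302, can be sketched with random sampling with replacement, which is one of the options mentioned above. The function names, the `seed` parameter, and the sample values are hypothetical and only serve the illustration.

```python
import random
from typing import List, Optional


def build_sixth_data_set(first_data_set: List[float], size: int,
                         seed: Optional[int] = None) -> List[float]:
    """Steps 401 and 402: draw `size` samples, with replacement, from the first data set."""
    if not first_data_set:
        raise ValueError("the first data set must contain at least one sample")
    if size < 1:
        raise ValueError("the sixth data set must contain at least one sample")
    rng = random.Random(seed)
    return rng.choices(first_data_set, k=size)


def combine(first_training_data: List[float], sixth_data_set: List[float]) -> List[float]:
    """Step 302: append the sixth data set to the first training data."""
    return list(first_training_data) + list(sixth_data_set)


first_training_data = [52.0, 60.0, 65.0, 71.0, 78.0, 125.0, 140.0]
first_data_set = [125.0, 140.0]  # the under-represented 120 to 150 range
sixth_data_set = build_sixth_data_set(first_data_set, size=2, seed=0)
second_training_data = combine(first_training_data, sixth_data_set)
print(len(second_training_data))  # 9
```

  • Because every drawn sample is a copy of an existing one, the sixth data set necessarily stays in the value range of the first data set.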
  • When the first training data is determined to be unbalanced, the training data can also be reselected according to the user's instruction to solve the problem of the first training data being unbalanced, which may include the following steps:
  • Step 501 Receive a data reselection instruction from a user in response to an interaction request, where the data reselection instruction includes a data read address;
  • Step 502 According to the data reselection instruction, read the third training data from the storage space corresponding to the data read address;
  • Step 503 Use the third training data as the first training data, and execute the step of determining whether the first training data is balanced.
  • After the data reselection instruction from the user in response to the interaction request is received, the third training data is read from the corresponding storage space according to the data read address included in the data reselection instruction, the third training data that is read is used as the first training data, and step 102 is restarted.
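  • Steps 501 to 503 can be sketched as follows, under the assumption that the data read address is a path to a text file holding one numeric sample per line. The storage format, the dictionary layout of the instruction, and the function names are not specified in the text and are hypothetical.

```python
import os
import tempfile
from typing import Dict, List


def read_training_data(data_read_address: str) -> List[float]:
    """Step 502: read one numeric sample per line, skipping blank lines."""
    with open(data_read_address, encoding="utf-8") as handle:
        return [float(line) for line in handle if line.strip()]


def reselect(data_reselection_instruction: Dict[str, str]) -> List[float]:
    """Steps 501 to 503: the returned data replaces the first training data."""
    address = data_reselection_instruction["data_read_address"]
    third_training_data = read_training_data(address)
    # The caller then repeats the balance judgment on this data.
    return third_training_data


# A temporary file stands in for the storage space in this demonstration.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False,
                                 encoding="utf-8") as tmp:
    tmp.write("52.0\n61.5\n\n78.0\n")
try:
    first_training_data = reselect({"data_read_address": tmp.name})
finally:
    os.remove(tmp.name)
print(first_training_data)  # [52.0, 61.5, 78.0]
```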
  • Taking motor A as the target device as an example, the method may include the following steps:
  • Step 601 Obtain first training data.
  • A certain amount of historical operating data of motor A is obtained as the first training data, specifically the current data of motor A. For example, the current data of motor A running in the past 3 months is acquired as the first training data.
  • Step 602 Perform clustering processing on the first training data to obtain at least one third data set.
  • After the first training data is obtained, it is clustered into at least one third data set, such that each third data set includes at least one sample, the values of the samples in the same third data set are within the same numerical range, and the values of the samples in different third data sets are within different numerical ranges. That is, each third data set corresponds to a numerical range, and the numerical ranges corresponding to different third data sets do not overlap. In addition, each sample corresponds to a piece of current data.
  • For example, clustering the first training data yields three third data sets: third data set 1 includes 8000 samples, third data set 2 includes 1000 samples, and third data set 3 includes 2000 samples; the value of each sample in third data set 1 is in the range of 50 to 80, the value of each sample in third data set 2 is in the range of 120 to 150, and the value of each sample in third data set 3 is in the range of 200 to 240.
  • Step 603 Determine whether only one third data set is obtained, if yes, go to step 604, otherwise go to step 605.
  • If only one third data set is obtained by clustering, the first training data is balanced, and step 604 is performed accordingly. If at least two third data sets are obtained, it is necessary to further determine whether the first training data is balanced, and step 605 is performed accordingly.
  • Step 604 Use the first training data to train the classification model, and end the current process.
  • If clustering the first training data yields only one third data set, the first training data is balanced, and the first training data is directly used to train the classification model corresponding to motor A.
  • Step 605 Determine the fourth data set and the fifth data set from each third data set.
  • When at least two third data sets are obtained, the fourth data set, which includes the smallest number of samples, and the fifth data set, which includes the largest number of samples, are determined from the third data sets. For example, third data set 1 is determined as the fifth data set, and third data set 2 is determined as the fourth data set.
  • Step 606 Determine whether the ratio of the number of samples in the fourth data set to the fifth data set is less than the proportion threshold, if yes, go to step 607, otherwise go to step 604.
  • The ratio of the number of samples in the fourth data set to the number of samples in the fifth data set is compared with the preset proportion threshold. If the ratio is less than the proportion threshold, the first training data is determined to be unbalanced, and step 607 is performed accordingly. If the ratio is greater than or equal to the proportion threshold, the first training data is determined to be balanced, and step 604 is executed accordingly.
  • Step 607 Send an interaction request to the user.
  • When the first training data is determined to be unbalanced, an interaction request is sent to the user, requesting the user to determine a method for processing the first training data.
  • Step 608 Determine whether a model training instruction from the user in response to the interaction request is received, if yes, execute step 604, otherwise execute step 609.
  • If a model training instruction from the user is received, step 604 is executed accordingly to use the first training data to train the classification model corresponding to motor A.
  • Step 609 Determine whether a data reselection instruction from the user in response to the interaction request is received, if yes, go to step 610, otherwise go to step 611.
  • If a data reselection instruction from the user is received, step 610 is performed accordingly.
  • Step 610 Re-acquire the first training data according to the data reselection instruction, and execute step 602.
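  • After the data reselection instruction is received, the first training data is re-acquired according to the data read address included in the instruction, and step 602 is executed. For example, the data read address included in the data reselection instruction is the current data storage address of motor A; the current data of motor A is read from that storage address as the new first training data, and step 602 is restarted.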
  • Step 611 Determine whether an equalization processing instruction from the user in response to the interaction request is received; if yes, execute step 612, otherwise execute step 615.
  • If an equalization processing instruction from the user is received, step 612 is performed accordingly. Otherwise, the user has given no corresponding instruction, and the current process ends.
  • Step 612 Determine the sixth data set corresponding to each data set identifier included in the equalization processing instruction.
  • After the equalization processing instruction is received, each data set identifier included in the instruction is acquired. Each data set identifier identifies one first data set, and different data set identifiers identify different first data sets, where a first data set is a data set, selected by the user from the third data sets, that causes the first training data to be unbalanced.
  • For each acquired data set identifier, a sixth data set corresponding to the data set identifier is determined, where the sixth data set includes at least one sample, the values of the samples in the sixth data set and the values of the samples in the first data set identified by the data set identifier are in the same numerical range, and the ratio of the total number of samples in the sixth data set and the first data set identified by the data set identifier to the number of samples in the fifth data set is greater than or equal to the proportion threshold.
  • For each data set identifier, samples may be collected from the first data set identified by the data set identifier, and the collected samples may be combined into the sixth data set corresponding to the data set identifier. The number of samples in the sixth data set satisfies the following condition: the ratio of the total number of samples in the sixth data set corresponding to the data set identifier and the first data set identified by the data set identifier to the number of samples in the fifth data set is equal to the proportion threshold.
  • For example, the equalization processing instruction includes data set identifier 1, data set identifier 1 identifies third data set 2, and the preset proportion threshold is 0.8. Then 5400 samples are collected from third data set 2, so that (1000 + 5400) / 8000 = 0.8, and the collected 5400 samples are taken as the sixth data set corresponding to data set identifier 1. It should be noted that, through user interaction, the user has determined that the samples in third data set 3 are current data recorded when motor A was running abnormally, so there is no need to perform equalization processing on the first training data for third data set 3.
  • Step 613 Combine the samples in each sixth data set with the first training data to obtain second training data.
  • The samples in each sixth data set are combined with the first training data to obtain the second training data. For example, the 11000 samples included in the first training data are combined with the 5400 samples included in the sixth data set corresponding to data set identifier 1, to obtain second training data including 16400 samples.
  • Step 614 Use the second training data to train the classification model.
  • The obtained second training data is used to train the classification model corresponding to motor A.
  • Step 615 End the current process.
  • The order of step 608, step 609, and step 611 is not fixed: for example, step 609 and step 611 can be executed before step 608, step 611 can be executed before step 609, and so on.
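  • The decision flow of steps 603 to 615 can be traced on sample counts alone, using the motor A figures from the example above (8000, 1000, and 2000 samples, proportion threshold 0.8). This is a sketch under stated assumptions: the function name, the encoding of the user's instruction as a `(kind, identifiers)` tuple, and the use of `None` for "no training in this pass" are hypothetical.

```python
import math
from fractions import Fraction
from typing import Dict, List, Optional, Tuple

Instruction = Tuple[str, List[int]]  # (kind, data set identifiers), hypothetical encoding


def samples_used_for_training(third_data_sets: Dict[int, int],
                              proportion_threshold: Fraction,
                              instruction: Optional[Instruction] = None) -> Optional[int]:
    """Follow steps 603 to 615 on sample counts only.

    third_data_sets maps a data set identifier to its number of samples.
    Returns the number of samples used for training, or None when this pass
    ends without training (no instruction, or data must be re-acquired).
    """
    total = sum(third_data_sets.values())
    fourth = min(third_data_sets.values())  # step 605
    fifth = max(third_data_sets.values())
    # Steps 603 and 606: one data set, or ratio at or above the threshold.
    if len(third_data_sets) == 1 or Fraction(fourth, fifth) >= proportion_threshold:
        return total  # step 604
    if instruction is None:
        return None  # step 615
    kind, identifiers = instruction
    if kind == "train":  # step 608
        return total
    if kind == "reselect":  # steps 609 and 610: the caller restarts from step 602
        return None
    if kind == "equalize":  # steps 611 to 613
        target = math.ceil(proportion_threshold * fifth)
        added = sum(max(0, target - third_data_sets[i]) for i in identifiers)
        return total + added
    raise ValueError(f"unknown instruction kind: {kind}")


motor_a = {1: 8000, 2: 1000, 3: 2000}
print(samples_used_for_training(motor_a, Fraction(4, 5), ("equalize", [2])))  # 16400
```

  • Only third data set 2 is named in the instruction, so third data set 3, which the user judged to be abnormal-operation data, is left as it is.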
  • an embodiment of the present invention provides a classification model training device, including:
  • A request sending module 703, used to send an interaction request to the user when the judgment result of the data judgment module 702 is that the first training data is not balanced, where the interaction request is used to request the user to determine how to process the first training data;
  • An instruction receiving module 704, used to receive an equalization processing instruction from the user in response to the interaction request sent by the request sending module 703, where the equalization processing instruction includes at least one data set identifier, each data set identifier in the at least one data set identifier is used to identify a first data set in the first training data that causes the first training data to be unbalanced, and the first data sets identified by different data set identifiers are different;
  • A data processing module 705, configured to perform, according to the equalization processing instruction received by the instruction receiving module 704, equalization processing on the first training data for the first data set identified by each data set identifier to obtain the second training data, where the second data set corresponding to each data set identifier in the second training data does not cause the second training data to be unbalanced;
  • A model training module 706, configured to use the second training data obtained by the data processing module 705 to train a classification model corresponding to the target device.
  • The data acquisition module 701 can be used to perform step 101 in the above method embodiment, the data judgment module 702 can be used to perform step 102, the request sending module 703 can be used to perform step 103, the instruction receiving module 704 can be used to perform step 104, the data processing module 705 can be used to perform step 105, and the model training module 706 can be used to perform step 106.
  • the data judgment module 702 includes:
  • a clustering processing unit 7021 is configured to perform clustering processing on the first training data to obtain at least one third data set, where each third data set includes at least one sample, and the value of each sample in the same third data set In the same numerical range, the values of samples in different third data sets are in different numerical ranges;
  • A first judgment unit 7022, configured to determine that the first training data is balanced when the clustering processing unit 7021 obtains only one third data set by clustering the first training data;
  • A second judgment unit 7023, configured to: when the clustering processing unit 7021 obtains at least two third data sets by clustering the first training data, determine from the at least two third data sets the fourth data set including the smallest number of samples and the fifth data set including the largest number of samples; if the ratio of the number of samples in the fourth data set to the number of samples in the fifth data set is less than the preset proportion threshold, determine that the first training data is not balanced; and if the ratio is greater than or equal to the proportion threshold, determine that the first training data is balanced.
  • The clustering processing unit 7021 can be used to perform step 201 in the above method embodiment, the first judgment unit 7022 can be used to perform step 203 in the above method embodiment, and the second judgment unit 7023 can be used to perform steps 204 to 206 in the above method embodiment.
  • the data processing module 705 includes:
  • A data collection unit 7051, used to determine, for each data set identifier included in the equalization processing instruction, a sixth data set corresponding to the data set identifier, where the values of the samples in the sixth data set and the values of the samples in the first data set identified by the data set identifier are in the same value range, and the ratio of the total number of samples in the first data set identified by the data set identifier and the sixth data set to the number of samples in the fifth data set is greater than or equal to the proportion threshold;
  • a data combination unit 7052 is used to combine the samples in each sixth data set determined by the data collection unit 7051 with the first training data to obtain second training data.
  • the data collection unit 7051 can be used to perform step 301 in the above method embodiment, and the data combination unit 7052 can be used to perform step 302 in the above method embodiment.
  • The data collection unit 7051 is configured to, for each data set identifier included in the equalization processing instruction, collect at least one sample from the first data set identified by the data set identifier, and use a sample set including the collected at least one sample as the sixth data set corresponding to the data set identifier.
  • the data collection unit 7051 may be used to execute step 401 and step 402 in the foregoing method embodiment.
  • the classification model training device may further include: a data reselection module 707;
  • the instruction receiving module 704 is further configured to receive a data reselection instruction sent by the user in response to the interaction request sent by the request sending module 703, where the data reselection instruction includes a data read address;
  • The data reselection module 707 is configured to read the third training data from the storage space corresponding to the data read address according to the data reselection instruction received by the instruction receiving module 704, use the third training data as the first training data, and then trigger the data judgment module 702 to judge whether the first training data is balanced.
  • the instruction receiving module 704 can be used to execute step 501 in the above method embodiment, and the data reselection module 707 can be used to execute step 502 and step 503 in the above method embodiment.
  • an embodiment of the present invention provides a classification model training device, including:
  • At least one processor 1102 coupled with the at least one memory 1101, when executing the executable instructions, is configured to:
  • the equalization processing instruction includes at least one data set identifier, each of the at least one data set identifier is used to identify a first data set in the first training data that causes the first training data to be unbalanced, and the first data sets identified by different data set identifiers are different;
  • the first training data is equalized for the first data set identified by each data set identifier to obtain second training data, wherein each second data set in the second training data that corresponds to one of the data set identifiers will not cause the second training data to be unbalanced;
  • when executing the executable instructions, the at least one processor 1102 is further configured to:
  • a sixth data set corresponding to the data set identifier is determined, wherein the values of the samples in the sixth data set are in the same numerical range as the values of the samples in the first data set identified by the data set identifier, and the ratio of the total number of samples in the first data set identified by the data set identifier and the sixth data set to the number of samples in the fifth data set is greater than or equal to the proportion threshold;
  • when executing the executable instructions, the at least one processor 1102 is further configured to:
  • the sample set including the at least one collected sample is used as the sixth data set corresponding to the data set identifier.
  • when executing the executable instructions, the at least one processor 1102 is further configured to:
  • the third training data is used as the first training data, and the determination of whether the first training data is balanced is performed again.
  • the present invention also provides a computer-readable medium that stores instructions for making a computer execute the classification model training method as described herein.
  • a system or device equipped with a storage medium may be provided, the software program code that realizes the function of any one of the above embodiments is stored on the storage medium, and the computer (or the CPU or MPU) of the system or device reads and executes the program code stored in the storage medium.
  • the program code itself read from the storage medium can realize the function of any one of the above embodiments, so the program code and the storage medium storing the program code constitute a part of the present invention.
  • Examples of storage media used to provide program code include floppy disks, hard disks, magneto-optical disks, optical disks (such as CD-ROM, CD-R, CD-RW, DVD-ROM, DVD-RAM, DVD-RW, DVD+RW), magnetic tape, non-volatile memory cards and ROM.
  • the program code can be downloaded from the server computer via a communication network.
  • the program code read from the storage medium is written to the memory provided in an expansion board inserted into the computer or to the memory provided in an expansion unit connected to the computer, and the instructions of the program code then cause the CPU installed on the expansion board or the expansion unit to perform part or all of the actual operations, so as to realize the function of any one of the foregoing embodiments.
  • the system structure described in the foregoing embodiments may be a physical structure or a logical structure. That is, some modules may be implemented by the same physical entity, some modules may each be implemented by multiple physical entities, and some modules may be implemented jointly by components in multiple independent devices.
  • the hardware unit can be implemented mechanically or electrically.
  • a hardware unit may include a permanent dedicated circuit or logic (such as a dedicated processor, FPGA or ASIC) to complete the corresponding operation.
  • the hardware unit may also include programmable logic or circuits (such as general-purpose processors or other programmable processors), which may be temporarily set by software to complete corresponding operations.
  • the specific implementation mode mechanical method, or dedicated permanent circuit, or temporarily set circuit

Abstract

A classification model training method and device, and a computer-readable medium. The classification model training method comprises: acquiring first training data (101); determining whether the first training data is balanced (102); if the first training data is unbalanced, sending an interaction request to a user (103); receiving an equalization processing instruction sent by the user in response to the interaction request, wherein the equalization processing instruction comprises at least one data set identifier, and each data set identifier identifies a first data set in the first training data that causes the first training data to be unbalanced (104); performing equalization processing on the first training data, according to the equalization processing instruction, for the first data set identified by each data set identifier, to obtain second training data (105); and using the second training data to train a classification model corresponding to a target device (106). The method improves the classification accuracy of the trained classification model.

Description

Classification model training method, device and computer-readable medium
Technical field
The present invention relates to the field of data processing, and in particular to a classification model training method, a classification model training device and a computer-readable medium.
Background art
During normal operation of a device, the values of its operating data stay within a normal range. If the values of the operating data exceed the normal range, the device may have failed. A classification model can therefore be trained on the historical operating data recorded while the device was operating normally, and the operating data of the device can then be fed into the classification model in real time so that the model judges whether a fault occurs during operation. For example, refining equipment includes pipelines for transporting liquid, and the working condition of a pipeline can be determined by monitoring information such as the flow rate, pressure and temperature of the liquid in the pipeline: the flow rate decreases when the pipeline is blocked, and the pressure decreases when the pipeline leaks.
When the historical operating data of a device is used as training data for a classification model, the device may have several different operating modes, and the training data may contain historical operating data from more than one of them. Because the device produces different operating data in different operating modes, the training data may be unbalanced. For example, suppose the training data consists of a first data set and a second data set, where the first data set contains the historical operating data of the device in a first operating mode and the second data set contains the historical operating data of the device in a second operating mode. If the number of samples in the first data set is much smaller than the number in the second data set, the training data is unbalanced. The criterion used to judge imbalance can be chosen according to the actual situation; for example, the training data is determined to be unbalanced when the ratio of the number of samples in the first data set to the number of samples in the second data set is less than a preset threshold.
At present, the acquired training data is used directly to train the classification model. Because the training data may be unbalanced, a classification model trained on it may, when analyzing the operating data of the device in real time, wrongly conclude that the device is abnormal on the basis of operating data that falls within the numerical range corresponding to the aforementioned first data set. Using the classification model to judge the operating status of the device then produces a large number of false alarms, and the classification accuracy of the model is low.
Summary of the invention
In view of this, the classification model training method, device and computer-readable medium provided by the present invention can improve the classification accuracy of the trained classification model.
In a first aspect, an embodiment of the present invention provides a classification model training method, including:
acquiring first training data, where the first training data includes historical operating data of a target device;
judging whether the first training data is balanced;
if the first training data is unbalanced, sending an interaction request to a user, where the interaction request is used to request the user to determine how the first training data is to be processed;
receiving an equalization processing instruction sent by the user in response to the interaction request, where the equalization processing instruction includes at least one data set identifier, each of the at least one data set identifier identifies one first data set in the first training data that causes the first training data to be unbalanced, and different data set identifiers identify different first data sets;
performing, according to the equalization processing instruction, equalization processing on the first training data for the first data set identified by each data set identifier, to obtain second training data, where the second data sets in the second training data that correspond to the data set identifiers do not cause the second training data to be unbalanced;
training a classification model corresponding to the target device using the second training data.
After the first training data is acquired, it is judged whether the first training data is balanced. If it is unbalanced, an interaction request is sent to the user, asking the user to determine how the first training data is to be processed. When an equalization processing instruction is received from the user in response to the interaction request, the first training data is equalized according to that instruction to obtain second training data, in which the historical operating data recorded during normal operation of the target device does not cause the second training data to be unbalanced, and the second training data is then used to train the classification model. Converting the unbalanced first training data into the second training data in this way means that a classification model trained on the second training data will not wrongly report the target device as abnormal on the basis of operating data produced during normal operation, which ensures that the trained classification model has a high classification accuracy.
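The flow described above can be sketched in Python as follows. This is only an illustration under stated assumptions: samples are represented as plain floats, every stage (`is_balanced`, `ask_user`, `equalize`, `train`) is a hypothetical callable supplied by the caller, and training directly on balanced data without user interaction is an assumption, since the passage only describes the unbalanced branch.

```python
from typing import Callable, List, Sequence, TypeVar

Model = TypeVar("Model")


def train_with_balance_check(
    first_training_data: List[float],
    is_balanced: Callable[[Sequence[float]], bool],
    ask_user: Callable[[Sequence[float]], List[str]],
    equalize: Callable[[List[float], List[str]], List[float]],
    train: Callable[[Sequence[float]], Model],
) -> Model:
    """Hypothetical driver: check balance, ask the user, equalize, train."""
    # Judge whether the first training data is balanced.
    if is_balanced(first_training_data):
        # Assumption: balanced data is used for training as-is.
        return train(first_training_data)

    # Send the interaction request and receive the equalization processing
    # instruction, modeled here as a list of data set identifiers.
    data_set_identifiers = ask_user(first_training_data)

    # Equalize the first training data for each identified first data set.
    second_training_data = equalize(first_training_data, data_set_identifiers)

    # Train the classification model on the second training data.
    return train(second_training_data)
```

Passing the stages in as callables keeps the sketch independent of any particular clustering method, user interface or classifier, none of which the passage fixes.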
Optionally, when judging whether the first training data is balanced, clustering is first performed on the first training data to obtain at least one third data set, where each third data set includes at least one sample, the values of the samples in the same third data set lie in the same numerical range, and the values of the samples in different third data sets lie in different numerical ranges. It is then determined whether the clustering produced only one third data set. If so, the first training data is determined to be balanced. If at least two third data sets were obtained, the fourth data set, which contains the fewest samples, and the fifth data set, which contains the most samples, are identified among the third data sets, and the ratio of the number of samples in the fourth data set to the number of samples in the fifth data set is compared with a preset proportion threshold: if the ratio is less than the proportion threshold, the first training data is determined to be unbalanced; if the ratio is greater than or equal to the proportion threshold, the first training data is determined to be balanced.
Clustering the first training data groups its samples into one or more third data sets, so that samples whose values lie in the same numerical range fall into the same third data set. If the clustering yields only one third data set, the values of all samples in the first training data lie in the same numerical range, that is, the first training data is balanced. If the clustering yields several third data sets, the ratio of the number of samples in the smallest third data set to the number of samples in the largest third data set is calculated. If this ratio is less than the preset proportion threshold, the sample counts of the third data sets differ widely and the samples in the first training data are unevenly distributed, so the first training data is determined to be unbalanced; otherwise it is determined to be balanced. Judging balance in two stages, first by the number of third data sets and then by the number of samples in them, ensures that the balance of the first training data is judged accurately.
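A minimal Python sketch of this two-stage judgment is given below. The text does not name a clustering algorithm, so the gap-based grouping of one-dimensional values in `cluster_by_gap`, its `max_gap` parameter, and both function names are assumptions made purely for illustration.

```python
from typing import List, Sequence


def cluster_by_gap(values: Sequence[float], max_gap: float) -> List[List[float]]:
    """Illustrative 1-D clustering: a gap larger than max_gap between
    neighbouring sorted values starts a new cluster (third data set)."""
    clusters: List[List[float]] = []
    for value in sorted(values):
        if clusters and value - clusters[-1][-1] <= max_gap:
            clusters[-1].append(value)
        else:
            clusters.append([value])
    return clusters


def is_balanced(
    values: Sequence[float], max_gap: float, proportion_threshold: float
) -> bool:
    third_data_sets = cluster_by_gap(values, max_gap)

    # Stage 1: a single third data set means the data is balanced.
    if len(third_data_sets) <= 1:
        return True

    # Stage 2: compare the smallest (fourth) and largest (fifth) data sets.
    sizes = [len(data_set) for data_set in third_data_sets]
    fourth_size, fifth_size = min(sizes), max(sizes)
    return fourth_size / fifth_size >= proportion_threshold
```

With the values `[1.0, 1.1, 1.2, 1.3, 5.0]` and `max_gap=0.5`, the sketch forms two clusters of sizes 4 and 1, so the ratio is 0.25 and the data counts as unbalanced for any proportion threshold above that value.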
Optionally, when the first training data is equalized to obtain the second training data, a sixth data set is determined for each data set identifier included in the equalization processing instruction, such that the values of the samples in the sixth data set lie in the same numerical range as the values of the samples in the first data set identified by that data set identifier, and such that the ratio of the total number of samples in the first data set identified by that data set identifier and its corresponding sixth data set to the number of samples in the fifth data set is greater than or equal to the proportion threshold. After the sixth data set corresponding to each data set identifier has been determined, the samples in the sixth data sets are combined with the first training data to obtain the second training data.
A first data set is a data set that the user has determined to cause the imbalance of the first training data and whose samples are historical operating data recorded during normal operation of the target device. The first training data is unbalanced because the first data set contains few samples. For this reason a corresponding sixth data set is determined for each first data set, with sample values in the same numerical range as those of the corresponding first data set. Combining the samples of the sixth data set with the first training data therefore amounts to expanding the first data set, so that the historical operating data recorded during normal operation of the target device no longer causes the second training data to be unbalanced.
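The size condition on the sixth data set can be made concrete. If the first data set has n1 samples, the fifth data set has n5 samples and the proportion threshold is t, then (n1 + n6) / n5 >= t requires n6 >= t * n5 - n1. The Python sketch below computes the smallest such n6; the function name and the floating-point guard are illustrative assumptions, not part of the described method.

```python
import math


def min_sixth_data_set_size(
    first_count: int, fifth_count: int, proportion_threshold: float
) -> int:
    """Smallest number of extra samples n6 such that
    (first_count + n6) / fifth_count >= proportion_threshold."""
    if fifth_count <= 0:
        raise ValueError("the fifth data set must contain at least one sample")

    needed = max(math.ceil(proportion_threshold * fifth_count - first_count), 0)

    # Guard against floating-point rounding in the product above.
    while (first_count + needed) / fifth_count < proportion_threshold:
        needed += 1
    return needed
```

For example, with 50 samples in the first data set, 1000 in the fifth data set and a threshold of 0.2, at least 150 additional samples are needed, since (50 + 150) / 1000 = 0.2.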
Optionally, when the sixth data set corresponding to a data set identifier is determined, at least one sample may be collected from the first data set identified by that data set identifier, and the collected samples together are used as the sixth data set corresponding to the data set identifier.
The sixth data set corresponding to a data set identifier must have sample values in the same numerical range as the samples of the first data set identified by that data set identifier. Samples can therefore be collected directly from that first data set to form the sixth data set. This makes equalizing the first training data more convenient and, because no other historical operating data has to be found, improves the efficiency of classification model training.
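One way to collect such samples is to draw from the first data set with replacement, as in the Python sketch below. The text only says that at least one sample is collected from the first data set; sampling with replacement, the function name and the `seed` parameter are assumptions chosen for illustration.

```python
import random
from typing import List, Optional, Sequence


def build_sixth_data_set(
    first_data_set: Sequence[float], count: int, seed: Optional[int] = None
) -> List[float]:
    """Draw `count` samples from the first data set, with replacement."""
    if not first_data_set:
        raise ValueError("the first data set must contain at least one sample")
    if count < 1:
        raise ValueError("at least one sample must be collected")

    rng = random.Random(seed)  # local generator, so results are reproducible
    return rng.choices(list(first_data_set), k=count)
```

Because every drawn sample already belongs to the first data set, the values of the resulting sixth data set necessarily stay within the numerical range of the first data set.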
Optionally, after the first training data has been determined to be unbalanced and the interaction request has been sent to the user, if a data reselection instruction is received from the user in response to the interaction request, third training data is read from the storage space indicated by the data read address included in the data reselection instruction, and the third training data is then taken as the first training data and the judgment of whether the first training data is balanced starts again.
After the first training data has been determined to be unbalanced, if the user sends a data reselection instruction indicating that the training data should be reselected, the previously acquired first training data is discarded and the first training data is acquired again according to the data reselection instruction. In this way, when the imbalance of the acquired first training data would make the trained classification model inaccurate, first training data with a more balanced sample distribution can be selected to train the classification model. This meets the needs of different users and widens the applicability of the classification model training method.
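The reselection branch can be sketched as a loop that keeps re-reading data until the data is balanced or the user asks for equalization instead. In this Python sketch the dictionary layout of the instruction (`"type"`, `"read_address"`), the callables and the `max_rounds` limit are all hypothetical; the text specifies only that a data reselection instruction carries a data read address.

```python
from typing import Callable, Dict, List, Optional, Sequence, Tuple

Instruction = Dict[str, object]  # hypothetical representation of a user instruction


def resolve_training_data(
    first_training_data: List[float],
    is_balanced: Callable[[Sequence[float]], bool],
    request_instruction: Callable[[Sequence[float]], Instruction],
    read_training_data: Callable[[str], List[float]],
    max_rounds: int = 5,
) -> Tuple[List[float], Optional[Instruction]]:
    """Return the data to use plus the pending equalization instruction,
    or None as the instruction if the data is already balanced."""
    for _ in range(max_rounds):
        if is_balanced(first_training_data):
            return first_training_data, None

        instruction = request_instruction(first_training_data)
        if instruction.get("type") != "reselect":
            # Anything else is treated as an equalization processing instruction.
            return first_training_data, instruction

        # Discard the old data; the third training data becomes the first.
        first_training_data = read_training_data(str(instruction["read_address"]))

    raise RuntimeError("no usable training data after max_rounds reselections")
```

The round limit is a defensive choice for the sketch only; the described method places no bound on how often the user may reselect.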
In a second aspect, an embodiment of the present invention further provides a classification model training device, including:
a data acquisition module, configured to acquire first training data, where the first training data includes historical operating data of a target device;
a data judgment module, configured to judge whether the first training data acquired by the data acquisition module is balanced;
a request sending module, configured to send an interaction request to a user if, according to the judgment result of the data judgment module, the first training data is unbalanced, where the interaction request is used to request the user to determine how the first training data is to be processed;
an instruction receiving module, configured to receive an equalization processing instruction sent by the user in response to the interaction request sent by the request sending module, where the equalization processing instruction includes at least one data set identifier, each of the at least one data set identifier identifies one first data set in the first training data that causes the first training data to be unbalanced, and different data set identifiers identify different first data sets;
a data processing module, configured to perform, according to the equalization processing instruction received by the instruction receiving module, equalization processing on the first training data for the first data set identified by each data set identifier, to obtain second training data, where the second data sets in the second training data that correspond to the data set identifiers do not cause the second training data to be unbalanced;
a model training module, configured to train a classification model corresponding to the target device using the second training data obtained by the data processing module.
After the data acquisition module acquires the first training data, the data judgment module determines whether the first training data is balanced. According to the judgment result, the request sending module sends an interaction request to the user once the data judgment module has judged the first training data to be unbalanced. When the instruction receiving module receives the equalization processing instruction sent by the user in response to that interaction request, the data processing module equalizes the first training data for the first data set identified by each data set identifier in the instruction, obtaining second training data in which the second data sets corresponding to the data set identifiers do not cause imbalance. The model training module then uses the second training data obtained by the data processing module to train the classification model corresponding to the target device. Thus, before the model training module trains the classification model, if the first training data is unbalanced because it contains few samples of historical operating data recorded during normal operation of the target device, the first training data is converted into second training data in which that historical operating data no longer causes imbalance. Training the classification model on the second training data ensures that the trained model will not judge the target device to be abnormal on the basis of operating data produced during normal operation, which improves the classification accuracy of the trained classification model.
Optionally, the data judgment module includes:
a clustering processing unit, configured to perform clustering on the first training data to obtain at least one third data set, where each third data set includes at least one sample, the values of the samples in the same third data set lie in the same numerical range, and the values of the samples in different third data sets lie in different numerical ranges;
a first judgment unit, configured to determine that the first training data is balanced when the clustering performed by the clustering processing unit on the first training data yields one third data set;
a second judgment unit, configured to, when the clustering performed by the clustering processing unit on the first training data yields at least two third data sets, identify among the at least two third data sets the fourth data set, which contains the fewest samples, and the fifth data set, which contains the most samples, determine that the first training data is unbalanced if the ratio of the number of samples in the fourth data set to the number of samples in the fifth data set is less than a preset proportion threshold, and determine that the first training data is balanced if the ratio is greater than or equal to the proportion threshold.
The clustering processing unit can cluster the first training data into one or more third data sets, such that the sample values in each third data set share the same numerical range, and the first judgment unit and the second judgment unit carry out the subsequent processing according to the number of third data sets obtained. If the clustering processing unit obtains only one third data set, the first judgment unit determines that the first training data is balanced. If the clustering processing unit obtains at least two third data sets, the second judgment unit first identifies the fourth data set, which contains the fewest samples, and the fifth data set, which contains the most samples, and then judges whether the ratio of the number of samples in the fourth data set to the number of samples in the fifth data set is less than the preset proportion threshold; if so, it determines that the first training data is unbalanced, and otherwise that it is balanced. Based on the clustering result, the first and second judgment units thus determine in two stages whether the first training data is balanced, taking into account both the number of third data sets and the number of samples in them, which ensures that the balance of the first training data is judged accurately.
Optionally, the data processing module includes:
a data collection unit, configured to determine, for each data set identifier included in the equalization processing instruction, a sixth data set corresponding to that data set identifier, where the values of the samples in the sixth data set lie in the same numerical range as the values of the samples in the first data set identified by the data set identifier, and the ratio of the total number of samples in the first data set identified by the data set identifier and the sixth data set to the number of samples in the fifth data set is greater than or equal to the proportion threshold;
a data combination unit, configured to combine the samples in the sixth data sets determined by the data collection unit with the first training data to obtain the second training data.
For each data set identifier included in the equalization processing instruction, the data collection unit can determine a corresponding sixth data set such that the values of its samples lie in the same numerical range as the values of the samples in the first data set identified by that data set identifier, and such that the ratio of the total number of samples in the sixth data set and that first data set to the number of samples in the fifth data set is greater than or equal to the proportion threshold. The data combination unit can then combine the samples in the sixth data sets determined by the data collection unit with the first training data to obtain the second training data.
针对每一个数据集标识,数据采集单元基于该数据集标识所标识第一数据集中样本取值所在的数值范围,确定包括有至少一个样本的第六数据集来扩充该数据集标识所标识的第一数据集,在数据组合单元将各个第六数据集中的样本与第一训练数据组合后,第二训练数据集中与该数据集标识相对应的第二数据集中的样本为该数据集标识所对应第六数据集与该数据集标识所标识第一数据集中样本的总和,使得与各个数据集标识所对应的第二数据集不会导致第二训练数据不均衡,实现通过均衡化处理将不均衡的第一训练数据转换为第二训练数据。For each data set identifier, the data collection unit determines a sixth data set that includes at least one sample based on the value range of the sample in the first data set identified by the data set identifier to expand the first data set identified by the data set identifier. A data set, after the data combination unit combines the samples in each sixth data set with the first training data, the samples in the second data set corresponding to the data set identifier in the second training data set correspond to the data set identifier The sixth data set and the sum of the samples in the first data set identified by the data set identifier, so that the second data set corresponding to each data set identifier will not cause the second training data to be unbalanced, so that the unevenness will be achieved through equalization processing Convert the first training data to the second training data.
Optionally, for each data set identifier, the data collection unit may collect samples from the first data set identified by that data set identifier and use the set of collected samples as the sixth data set corresponding to that data set identifier.
The data collection unit collects samples directly from the first data set identified by the data set identifier and uses the set of collected samples as the sixth data set corresponding to that identifier. Because the samples of the first data set and the sixth data set that correspond to the same data set identifier together form the second data set corresponding to that identifier in the second training data, this guarantees that the values of the samples in the sixth data set lie in the same numerical range as the values of the samples in the corresponding first data set, which ensures that the equalization processing of the first training data is effective and accurate.
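The sizing condition on the sixth data set can be illustrated with a minimal Python sketch. It assumes one-dimensional numeric samples and sampling with replacement from the first data set; the function names `build_sixth_data_set` and `combine`, the `seed` parameter, and the choice of sampling with replacement are illustrative assumptions, not details given in this description.

```python
import math
import random


def build_sixth_data_set(first_set, fifth_set_size, proportion_threshold, seed=0):
    """Collect samples from the first data set (with replacement, an assumption)
    so that (len(first_set) + len(sixth_set)) / fifth_set_size is greater than
    or equal to proportion_threshold."""
    if not first_set:
        raise ValueError("the first data set must contain at least one sample")
    # Smallest combined size that satisfies the proportion threshold.
    required_total = math.ceil(proportion_threshold * fifth_set_size)
    # The sixth data set contains at least one sample.
    needed = max(required_total - len(first_set), 1)
    rng = random.Random(seed)
    # Every collected sample comes from the first data set, so its value
    # necessarily lies in the same numerical range.
    return [rng.choice(first_set) for _ in range(needed)]


def combine(first_training_data, sixth_data_sets):
    """Append the samples of every sixth data set to the first training data."""
    second_training_data = list(first_training_data)
    for sixth_set in sixth_data_sets:
        second_training_data.extend(sixth_set)
    return second_training_data
```

With a first data set of 10 samples, a fifth data set of 100 samples and a proportion threshold of 0.5, the sketch collects 40 samples, so that the expanded data set holds 50 samples.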
Optionally, the classification model training device may further include a data reselection module. After the instruction receiving module receives a data reselection instruction from the user in response to the interaction request, the data reselection module may read third training data from the corresponding storage space according to the data read address included in the data reselection instruction, take the third training data as the first training data, and then trigger the data judgment module to verify whether the new first training data is balanced.
The data reselection module can reselect the training data according to the user's instruction, which meets the user's need to discard the previous first training data and select new training data for training the classification model. This accommodates the needs of different users and helps improve the applicability of the classification model training device.
In a third aspect, an embodiment of the present invention further provides a classification model training device, including at least one memory and at least one processor;
the at least one memory is configured to store a machine-readable program;
the at least one processor is configured to invoke the machine-readable program to execute the method provided by the first aspect or any possible implementation of the first aspect.
A machine-readable program is stored in the memory, and by invoking that program the processor can execute the method provided by the first aspect or any possible implementation of the first aspect. After the first training data is acquired, it is judged whether the first training data is balanced. If the first training data is unbalanced, an interaction request is sent to the user, asking the user to determine how the first training data is to be processed. When an equalization processing instruction from the user in response to the interaction request is received, the first training data is equalized according to that instruction to obtain second training data, such that the historical operating data from normal operation of the target device does not cause the second training data to be unbalanced, and the second training data is then used to train the classification model. Because the unbalanced first training data is converted into second training data in which historical operating data from normal operation of the target device does not cause imbalance, the classification model trained with the second training data will not wrongly report that the target device is abnormal on the basis of operating data from normal operation, which ensures that the trained classification model has a high classification accuracy.
In a fourth aspect, an embodiment of the present invention further provides a computer-readable medium on which computer instructions are stored. When executed by a processor, the computer instructions cause the processor to execute the method provided by the first aspect or any possible implementation of the first aspect.
Computer instructions are stored on the computer-readable medium, and when the computer instructions are executed by a processor, the processor executes the classification model training method provided by the first aspect or any possible implementation of the first aspect. After the first training data is acquired, it is judged whether the first training data is balanced. If the first training data is unbalanced, an interaction request is sent to the user, asking the user to determine how the first training data is to be processed. When an equalization processing instruction from the user in response to the interaction request is received, the first training data is equalized according to that instruction to obtain second training data, such that the historical operating data from normal operation of the target device does not cause the second training data to be unbalanced, and the second training data is then used to train the classification model. Because the unbalanced first training data is converted into second training data in which historical operating data from normal operation of the target device does not cause imbalance, the classification model trained with the second training data will not wrongly report that the target device is abnormal on the basis of operating data from normal operation, which ensures that the trained classification model has a high classification accuracy.
Description of the drawings
Other features, characteristics, advantages and benefits of the present invention will become more apparent from the following detailed description taken in conjunction with the accompanying drawings.
Fig. 1 is a flowchart of a classification model training method provided by an embodiment of the present invention;
Fig. 2 is a flowchart of a method for judging whether the first training data is balanced, provided by an embodiment of the present invention;
Fig. 3 is a flowchart of a method for equalizing the first training data, provided by an embodiment of the present invention;
Fig. 4 is a flowchart of a method for determining a sixth data set, provided by an embodiment of the present invention;
Fig. 5 is a flowchart of a method for reselecting training data, provided by an embodiment of the present invention;
Fig. 6 is a flowchart of another classification model training method provided by an embodiment of the present invention;
Fig. 7 is a schematic diagram of a classification model training device provided by an embodiment of the present invention;
Fig. 8 is a schematic diagram of another classification model training device provided by an embodiment of the present invention;
Fig. 9 is a schematic diagram of yet another classification model training device provided by an embodiment of the present invention;
Fig. 10 is a schematic diagram of a classification model training device including a data reselection module, provided by an embodiment of the present invention;
Fig. 11 is a schematic diagram of still another classification model training device provided by an embodiment of the present invention.
List of reference signs:
101: Acquire the first training data
102: Judge whether the first training data is balanced
103: Send an interaction request to the user when the first training data is unbalanced
104: Receive an equalization processing instruction from the user in response to the interaction request
105: Equalize the first training data according to the equalization processing instruction to obtain the second training data
106: Train the classification model with the second training data
201: Cluster the first training data to obtain at least one third data set
202: Judge whether the clustering yields only one third data set
203: Determine that the first training data is balanced
204: Determine the fourth data set and the fifth data set from the third data sets
205: Judge whether the ratio of the number of samples in the fourth data set to the number of samples in the fifth data set is less than the proportion threshold
206: Determine that the first training data is unbalanced
301: Determine the sixth data set corresponding to each data set identifier
302: Combine the samples in each sixth data set with the first training data to obtain the second training data
401: Collect at least one sample from the first data set identified by the data set identifier
402: Use the set of collected samples as the sixth data set corresponding to the data set identifier
501: Receive a data reselection instruction from the user in response to the interaction request
502: Read the third training data according to the data reselection instruction
503: Use the third training data as the first training data, and perform step 102
601: Acquire the first training data
602: Cluster the first training data to obtain at least one third data set
603: Judge whether only one third data set is obtained
604: Train the classification model with the first training data
605: Determine the fourth data set and the fifth data set from the third data sets
606: Judge whether the ratio of the number of samples in the fourth data set to the number of samples in the fifth data set is less than the proportion threshold
607: Send an interaction request to the user
608: Judge whether a model training instruction from the user in response to the interaction request is received
609: Judge whether a data reselection instruction from the user in response to the interaction request is received
610: Re-acquire the first training data according to the data reselection instruction
611: Judge whether an equalization processing instruction from the user in response to the interaction request is received
612: Determine the sixth data set corresponding to each data set identifier in the equalization processing instruction
613: Combine the samples in each sixth data set with the first training data to obtain the second training data
614: Train the classification model with the second training data
615: End the current process
701: Data acquisition module
702: Data judgment module
703: Request sending module
704: Instruction receiving module
705: Data processing module
706: Model training module
707: Data reselection module
7021: Clustering processing unit
7022: First judgment unit
7023: Second judgment unit
7051: Data collection unit
7052: Data combination unit
1101: Memory
1102: Processor
Detailed description
As described above, when the historical operating data of a device is used directly as training data to train a classification model and the training data is unbalanced, then, if all of the training data is historical operating data from normal operation of the device, the trained classification model will treat the samples of any first data set that contains only a small number of samples as operating data generated while the device was abnormal. When such a classification model is used to analyze the operating data of the device in real time, it will conclude that the device is abnormal for normal operating data whose values fall within the numerical range corresponding to a first data set. The classification model therefore generates a large number of false alarms, and its classification accuracy is low.
In the embodiments of the present invention, after the first training data used to train the classification model is acquired, it is first judged whether the first training data is balanced. When the first training data is determined to be unbalanced, an interaction request is sent to the user so that the user determines how the first training data is to be processed. If an equalization processing instruction from the user in response to the interaction request is then received, the first training data is equalized into second training data according to that instruction, so that the historical operating data from normal operation of the target device does not cause the second training data to be unbalanced, and the second training data is used to train the classification model corresponding to the target device. Thus, when historical operating data from normal operation of the device causes the first training data to be unbalanced, equalizing the first training data yields second training data that such data no longer unbalances, which reduces the false alarm probability of the classification model trained with the second training data and improves its classification accuracy.
The classification model training method and device provided by the embodiments of the present invention are described in detail below with reference to the accompanying drawings.
As shown in Fig. 1, an embodiment of the present invention provides a classification model training method, which may include the following steps:
Step 101: Acquire first training data, wherein the first training data includes historical operating data of a target device;
Step 102: Judge whether the first training data is balanced;
Step 103: If the first training data is unbalanced, send an interaction request to the user, wherein the interaction request asks the user to determine how the first training data is to be processed;
Step 104: Receive an equalization processing instruction from the user in response to the interaction request, wherein the equalization processing instruction includes at least one data set identifier, each of the at least one data set identifier identifies one first data set in the first training data that causes the first training data to be unbalanced, and different data set identifiers identify different first data sets;
Step 105: According to the equalization processing instruction, equalize the first training data with respect to the first data set identified by each data set identifier to obtain second training data, wherein the second data sets in the second training data that correspond to the data set identifiers do not cause the second training data to be unbalanced;
Step 106: Train a classification model corresponding to the target device with the second training data.
In the classification model training method provided by the embodiment of the present invention, after first training data that includes historical operating data of the target device is acquired, it is judged whether the first training data is balanced. When the first training data is determined to be unbalanced, an interaction request is sent to the user so that the user determines how the first training data is to be processed. When an equalization processing instruction from the user in response to the interaction request is received, the first training data is equalized with respect to the first data set identified by each data set identifier included in the instruction. In the second training data obtained by the equalization processing, the second data sets corresponding to the data set identifiers do not cause the second training data to be unbalanced, and the second training data is then used to train the classification model corresponding to the target device. Thus, once the first training data is determined to be unbalanced, the user decides whether the first training data needs to be equalized and which first data sets the equalization targets. When the user decides to equalize the first training data, the equalization yields second training data in which historical operating data from normal operation of the target device does not cause imbalance, so the classification model trained with the second training data does not make misjudgments caused by unbalanced training data, which ensures that the trained classification model has a high classification accuracy.
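The control flow of steps 101 to 106 can be outlined in a short Python sketch. The function `run_training_flow` and its four callable parameters are hypothetical placeholders introduced only for illustration; the balance judgment, the user interaction, the equalization and the training algorithm are all supplied by the caller, and only the equalization branch of the interaction is shown.

```python
def run_training_flow(first_training_data, is_balanced, request_equalization,
                      equalize, train):
    """Sketch of steps 102 to 106 for the equalization branch.

    is_balanced(data) -> bool              hypothetical balance judgment (step 102)
    request_equalization() -> list of ids  hypothetical user interaction (steps 103-104)
    equalize(data, ids) -> data            hypothetical equalization (step 105)
    train(data) -> model                   hypothetical training routine (step 106)
    """
    # Step 102: balanced data is used for training as it is.
    if is_balanced(first_training_data):
        return train(first_training_data)
    # Steps 103-104: ask the user and receive the data set identifiers.
    data_set_ids = request_equalization()
    # Step 105: equalize with respect to each identified first data set.
    second_training_data = equalize(first_training_data, data_set_ids)
    # Step 106: train on the second training data.
    return train(second_training_data)
```

Passing the individual stages as callables keeps the sketch independent of any particular clustering method, user interface or learning algorithm, none of which is fixed by the method itself.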
In the embodiment of the present invention, when the first training data is acquired in step 101, historical operating data containing a certain number of samples may be selected from the historical operating data of the target device as the first training data. Specifically, the historical operating data of the target device within one continuous time period may be used as the first training data, or the historical operating data of the target device within multiple non-contiguous time periods may be used as the first training data. In addition, because different devices do not operate in exactly the same environment, when a classification model is to be trained for a particular device in order to monitor whether that device becomes abnormal, the historical operating data of that same device needs to be used as the training data so that the trained classification model has a high classification accuracy.
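As an illustration of selecting the first training data from one or more time periods, the following Python sketch filters timestamped records. The record layout (a list of `(timestamp, value)` pairs), the half-open interval convention and the function name `select_training_data` are assumptions made for this example.

```python
from datetime import datetime


def select_training_data(records, periods):
    """Keep the values whose timestamp falls in any of the given periods.

    records: list of (timestamp, value) pairs (assumed layout)
    periods: list of (start, end) pairs, start inclusive and end exclusive
    """
    return [value for timestamp, value in records
            if any(start <= timestamp < end for start, end in periods)]


records = [
    (datetime(2019, 1, 1, 8), 61.0),
    (datetime(2019, 1, 2, 8), 64.5),
    (datetime(2019, 2, 1, 8), 131.0),
    (datetime(2019, 3, 1, 8), 70.2),
]
# Two non-contiguous time periods: early January and all of March.
periods = [
    (datetime(2019, 1, 1), datetime(2019, 1, 3)),
    (datetime(2019, 3, 1), datetime(2019, 4, 1)),
]
first_training_data = select_training_data(records, periods)
```

A single continuous time period is the special case in which `periods` contains one interval.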
In the embodiment of the present invention, when the first training data is determined to be unbalanced and an interaction request is sent to the user, the interaction request may include multiple options so that the user can conveniently choose how the first training data is to be processed. The interaction request may include the following three options: training the classification model based on the existing training data, reselecting the training data, and equalizing the training data. The handling of each of these three options is described below.
When the user chooses to train the classification model based on the existing training data, the first training data is used directly to train the classification model corresponding to the target device. In this case, at least one first data set containing a small number of samples causes the first training data to be unbalanced, and the samples in each such first data set are historical operating data from abnormal operation of the target device. The first training data is therefore used directly to train the classification model, and when the trained classification model analyzes the operating data of the target device in real time, it concludes that the target device is abnormal whenever operating data falls within the numerical range corresponding to any first data set.
When the user chooses to reselect the training data, training data is read again according to the data storage address provided by the user, and the newly read training data is used as the first training data, starting again from step 102. In this case, the user has confirmed that the first training data selected for training the classification model was wrong, so the training data is reselected according to the user's instruction and it is then judged whether the reselected training data is balanced.
When the user chooses to equalize the training data, the data sets that cause the first training data to be unbalanced are displayed to the user, the user selects the first data sets from the displayed data sets, and the user can then generate an equalization processing instruction that includes the data set identifiers identifying those first data sets. In this case, judging whether the first training data is balanced identifies at least one data set that causes the first training data to be unbalanced, but the small number of samples in such a data set may be historical operating data from abnormal operation of the target device or historical operating data from normal operation of the target device. The user therefore has to distinguish between the two: a data set whose small number of samples are historical operating data from normal operation of the target device is determined to be a first data set, and the first training data can then be equalized with respect to that first data set. In the second training data obtained in this way, historical operating data from normal operation of the target device does not cause the second training data to be unbalanced, whereas historical operating data from abnormal operation of the target device may still do so, so the classification model trained with the second training data can identify more accurately whether the target device is abnormal.
In addition, when the interaction request is sent to the user, the three options may be presented in a prompt box, after which either further interaction with the user takes place according to the selected option or the first training data is processed directly according to the selected option.
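The handling of the three options can be sketched as a small dispatch function in Python. The enumeration `Choice`, the function `handle_choice` and the `load_data` and `equalize` callables are hypothetical names used only to make the branching explicit; how the prompt box is rendered and how data is read from the storage address are left out.

```python
from enum import Enum


class Choice(Enum):
    TRAIN_ON_EXISTING = 1   # train on the first training data as it is
    RESELECT = 2            # read new training data and judge its balance again
    EQUALIZE = 3            # equalize the first training data


def handle_choice(choice, first_training_data, load_data=None, equalize=None):
    """Return (training_data, needs_balance_check) for the selected option.

    load_data() and equalize(data) are caller-supplied, hypothetical callables.
    """
    if choice is Choice.TRAIN_ON_EXISTING:
        return first_training_data, False
    if choice is Choice.RESELECT:
        # The reselected data becomes the new first training data, and the
        # flow starts again from the balance judgment of step 102.
        return load_data(), True
    if choice is Choice.EQUALIZE:
        return equalize(first_training_data), False
    raise ValueError(f"unknown choice: {choice!r}")
```

The second element of the returned pair tells the caller whether the balance judgment has to be repeated, which is the case only after reselection.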
In the embodiment of the present invention, when step 102 judges whether the first training data is balanced and the result is that the first training data is balanced, the first training data is used directly to train the classification model corresponding to the target device.
In the embodiment of the present invention, when step 106 trains the classification model corresponding to the target device with the second training data, the second training data may be used as the input to any of various types of machine learning algorithms. The algorithm may be, for example, an artificial neural network algorithm, a deep learning algorithm, a kernel-based algorithm, an ensemble algorithm or a genetic algorithm.
Optionally, on the basis of the classification model training method shown in Fig. 1, when step 102 judges whether the first training data is balanced, the first training data may be clustered into one or more data sets, and whether the first training data is balanced is then determined from the number of data sets obtained by clustering and the ratio between the numbers of samples in those data sets. Specifically, as shown in Fig. 2, determining whether the first training data is balanced can be implemented through the following steps:
Step 201: Cluster the first training data to obtain at least one third data set, wherein each third data set includes at least one sample, the values of the samples in the same third data set lie within the same numerical range, and the values of the samples in different third data sets lie within different numerical ranges;
Step 202: Judge whether clustering the first training data yields only one third data set; if so, perform step 203, otherwise perform step 204;
Step 203: Determine that the first training data is balanced, and end the current process;
Step 204: From the at least two third data sets, determine the fourth data set, which contains the fewest samples, and the fifth data set, which contains the most samples;
Step 205: Judge whether the ratio of the number of samples in the fourth data set to the number of samples in the fifth data set is less than a preset proportion threshold; if so, perform step 206, otherwise perform step 203;
Step 206: Determine that the first training data is unbalanced.
In the embodiment of the present invention, clustering the first training data groups it into at least one third data set, such that each third data set includes at least one sample, the values of the samples in the same third data set lie within the same numerical range, and the values of the samples in different third data sets lie within different numerical ranges. The first training data consists of historical operating data of the target device. Because the operating data of the target device takes values within one numerical range while the device operates normally in a given operating mode, clustering the first training data yields one or more third data sets, and the values of the samples in the same third data set lie within the same numerical range. For example, suppose the target device has two operating modes, the operating data ranges from 50 to 80 when the target device operates normally in the first operating mode, and the operating data ranges from 120 to 150 when the target device operates normally in the second operating mode. Clustering the first training data yields three third data sets, namely third data set 1, third data set 2 and third data set 3, where the values of the samples in third data set 1 all lie in the range 50 to 80, the values of the samples in third data set 2 all lie in the range 120 to 150, and the values of the samples in third data set 3 all lie in the range 200 to 240.
In the embodiment of the present invention, after at least one third data set is obtained by clustering the first training data, a preliminary judgment on whether the first training data is balanced is first made from the number of third data sets. If clustering yields only one third data set, that is, the values of all samples in the first training data lie in the same numerical range, the first training data can be determined to be balanced. If clustering yields at least two third data sets, the number of samples in each third data set must be examined to further determine whether the first training data is balanced.
In the embodiment of the present invention, after at least two third data sets are obtained by clustering the first training data, the fourth data set, which contains the fewest samples, and the fifth data set, which contains the most samples, are determined from among them, and their sample counts are compared. If the ratio of the number of samples in the fourth data set to the number of samples in the fifth data set is less than the preset proportion threshold, the first training data is determined to be unbalanced; if the ratio is greater than or equal to the proportion threshold, the first training data is determined to be balanced. Because the fourth data set is the smallest of all the third data sets and the fifth data set is the largest, the gap in sample count between these two is the largest among all pairs of third data sets. A ratio below the preset proportion threshold therefore indicates that at least one third data set in the first training data differs greatly in sample count from the others, that is, the first training data is unbalanced.
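The two-stage check described above can be sketched in a few lines of Python. This is a minimal illustration, not the patented implementation: the function name `is_balanced` and its parameters are hypothetical, and the input is assumed to be the list of sample counts of the third data sets produced by clustering.

```python
from typing import Sequence


def is_balanced(cluster_sizes: Sequence[int], ratio_threshold: float) -> bool:
    """Two-stage balance check over the sample counts of the third data sets.

    `cluster_sizes` and `ratio_threshold` are illustrative names (assumptions).
    """
    if not cluster_sizes:
        raise ValueError("clustering must yield at least one data set")
    # Stage 1: a single cluster means all samples share one numerical range.
    if len(cluster_sizes) == 1:
        return True
    # Stage 2: compare the smallest (fourth) and largest (fifth) data sets.
    smallest = min(cluster_sizes)
    largest = max(cluster_sizes)
    return smallest / largest >= ratio_threshold
```

With three clusters of 8000, 1000 and 2000 samples and a threshold of 0.8, the ratio is 1000 / 8000 = 0.125, so the function reports the data as unbalanced.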
In the embodiment of the present invention, the proportion threshold can be set flexibly within the range 0.5 to 1.0 according to the actual business scenario. The larger the proportion threshold, the more sensitive the balance detection on the first training data. The proportion threshold may, for example, be 0.6, 0.7, 0.85 or 0.9.
In the embodiment of the present invention, when step 201 clusters the first training data, the clustering may be performed with the K-means clustering algorithm, a spectral clustering algorithm, a Gaussian mixture model clustering algorithm, or the like.
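As an illustration of the clustering step, the following is a minimal pure-Python sketch of K-means (Lloyd's algorithm) for scalar samples. It rests on several assumptions not stated in the source: the samples are one-dimensional, the number of clusters `k` is supplied by the caller, and the initial centers are spread evenly over the sorted samples to keep the result deterministic. The function name `kmeans_1d` is hypothetical.

```python
from typing import List, Sequence


def kmeans_1d(values: Sequence[float], k: int, iterations: int = 50) -> List[List[float]]:
    """Cluster scalar samples into at most k groups with Lloyd's algorithm."""
    ordered = sorted(values)
    if k < 1 or k > len(ordered):
        raise ValueError("k must be between 1 and the number of samples")
    # Deterministic start (an assumption): spread centers over the sorted samples.
    if k == 1:
        centers = [ordered[0]]
    else:
        centers = [ordered[i * (len(ordered) - 1) // (k - 1)] for i in range(k)]
    clusters: List[List[float]] = []
    for _ in range(iterations):
        # Assignment step: each sample joins its nearest center.
        clusters = [[] for _ in centers]
        for v in ordered:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Update step: move each center to the mean of its cluster.
        new_centers = [sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)]
        if new_centers == centers:
            break
        centers = new_centers
    # Drop empty clusters so every returned data set has at least one sample.
    return [c for c in clusters if c]
```

Each returned list plays the role of one third data set. In practice a library implementation of K-means, spectral clustering or a Gaussian mixture model would likely replace this sketch.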
One or more third data sets are obtained by clustering the first training data, and whether the first training data is balanced is determined in two stages: first from the number of third data sets, then from the number of samples in each third data set. This allows the balance of the first training data to be judged accurately, which in turn helps the classification model trained with the first training data or the second training data achieve high classification accuracy.
It should be noted that, on the basis of the classification model training method shown in FIG. 1, step 102 is not limited to the clustering approach of the above embodiment when determining whether the first training data is balanced. Other approaches may also be used, such as fitting a curve to a histogram of the first training data, or estimating the distribution of the first training data.
Optionally, on the basis of the balance determination method shown in FIG. 2, when the first training data is determined to be unbalanced and is equalized according to an equalization processing instruction from the user, the imbalance can be resolved by adding samples to the first training data. Specifically, as shown in FIG. 3, the first training data can be equalized as follows:
Step 301: For each data set identifier included in the equalization processing instruction, determine a sixth data set corresponding to that identifier, where the samples in the sixth data set lie in the same numerical range as the samples in the first data set identified by that identifier, and the ratio of the total number of samples in that first data set and the sixth data set to the number of samples in the fifth data set is greater than or equal to the proportion threshold;
Step 302: Combine the samples in each sixth data set determined for the respective data set identifiers with the first training data to obtain second training data.
In the embodiment of the present invention, each data set identifier included in the equalization processing instruction identifies a corresponding first data set, which is a data set selected by the user from among the third data sets. Because that first data set contains few samples, the first training data is unbalanced. A corresponding sixth data set can therefore be determined for the data set identifier, where the sixth data set contains at least one sample, the values of its samples lie in the same numerical range as those of the first data set, and the ratio of the total number of samples in the sixth data set and the first data set to the number of samples in the fifth data set is greater than or equal to the proportion threshold. The samples in the sixth data sets corresponding to the respective data set identifiers are combined with the first training data to obtain the second training data. In the second training data, for each first data set that caused the first training data to be unbalanced, the ratio of the total number of samples in that first data set and its corresponding sixth data set to the number of samples in the fifth data set is greater than or equal to the proportion threshold, so the first data set and its sixth data set, viewed as a whole, do not cause the second training data to be unbalanced.
For example, continuing the example of the foregoing embodiment, the equalization processing instruction includes data set identifier 1, which identifies first data set 1 obtained by processing the first training data; first data set 1 is third data set 2 of the foregoing embodiment. Because first data set 1 contains fewer samples than third data set 1, the first training data is unbalanced. A sixth data set is therefore determined for data set identifier 1, such that all of its samples lie in the range 120 to 150 (the same range as the samples in first data set 1), and the ratio of the total number of samples in the sixth data set and first data set 1 to the number of samples in the fifth data set (here, third data set 1) is greater than or equal to the proportion threshold. Because the samples in first data set 1 and the sixth data set share the same numerical range, they are samples of the same kind, and first data set 1 combined with the sixth data set does not cause the second training data to be unbalanced.
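The size condition on the sixth data set can be turned into a small calculation. The sketch below computes the smallest number of added samples that satisfies the condition; the function name `required_extra_samples` and its parameter names are hypothetical, and the rounding via `math.ceil` is an assumption about how a fractional target would be handled.

```python
import math


def required_extra_samples(minority_count: int, majority_count: int, ratio_threshold: float) -> int:
    """Smallest sixth-data-set size so that
    (minority_count + extra) / majority_count >= ratio_threshold.

    minority_count: samples in the first data set (assumed name)
    majority_count: samples in the fifth data set (assumed name)
    """
    target_total = math.ceil(ratio_threshold * majority_count)
    # No samples are needed when the first data set is already large enough.
    return max(0, target_total - minority_count)
```

For a first data set of 1000 samples, a fifth data set of 8000 samples and a threshold of 0.8, the target total is 6400, so 5400 samples would be added.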
Optionally, on the basis of the method for equalizing the first training data shown in FIG. 3, when step 301 determines the sixth data set for each data set identifier, samples may be drawn from the first data set identified by that identifier to form the corresponding sixth data set. For each data set identifier, the method for determining the corresponding sixth data set is shown in FIG. 4 and may include the following steps:
Step 401: Collect at least one sample from the first data set identified by the data set identifier;
Step 402: Take the sample set containing the at least one collected sample as the sixth data set corresponding to the data set identifier.
In the embodiment of the present invention, for each data set identifier, samples lying in the same numerical range as those of the identified first data set can be obtained by drawing them directly from that first data set; the set of collected samples is then taken as the sixth data set corresponding to the identifier. For example, for data set identifier 1 included in the equalization processing instruction, since data set identifier 1 identifies first data set 1, the required number of samples can be drawn from first data set 1, and the set formed by the drawn samples is taken as the sixth data set corresponding to data set identifier 1.
Because the samples in each sixth data set are ultimately combined with the first training data to obtain the second training data, and the samples in the sixth data set are drawn from the first data set, the sixth data set is in essence obtained by copying some or all of the samples in the first data set one or more times. This ensures that the samples in the sixth data set have the same numerical range as those in the corresponding first data set, and also makes the sixth data set easy to determine.
In the embodiment of the present invention, when drawing samples from the first data set, all samples in the first data set may be copied one or more times, or samples may be drawn from the first data set by random sampling.
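The random-sampling variant can be sketched with Python's standard library. The example assumes sampling with replacement, so that more samples can be drawn than the first data set contains; the function name `sample_sixth_data_set` and the optional `seed` parameter are hypothetical and serve only to make the illustration reproducible.

```python
import random
from typing import List, Optional, Sequence


def sample_sixth_data_set(
    first_data_set: Sequence[float], count: int, seed: Optional[int] = None
) -> List[float]:
    """Draw `count` samples from the first data set, reusing samples as needed."""
    if not first_data_set:
        raise ValueError("the first data set must contain at least one sample")
    rng = random.Random(seed)
    # Sampling with replacement (an assumption), so count may exceed
    # len(first_data_set); every drawn value stays in the original value range.
    return rng.choices(list(first_data_set), k=count)
```

Because every drawn value is a copy of an existing sample, the sixth data set necessarily shares the numerical range of the first data set.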
In addition, when determining the sixth data set corresponding to a data set identifier, the sixth data set need not be obtained by drawing samples from the corresponding first data set as shown in FIG. 4. Other approaches may be used: for example, samples may be collected from the historical operating data of the target device, or new samples may be generated directly from the samples in the first data set.
Optionally, on the basis of the classification model training methods provided in the foregoing embodiments, after step 103 sends the interaction request to the user, if the user instructs that the training data be reselected, the training data can be reselected as instructed, thereby resolving the imbalance of the first training data. Specifically, as shown in FIG. 5, the method for reselecting training data may include the following steps:
Step 501: Receive a data reselection instruction sent by the user in response to the interaction request, where the data reselection instruction includes a data read address;
Step 502: According to the data reselection instruction, read third training data from the storage space corresponding to the data read address;
Step 503: Take the third training data as the first training data, and perform the determination of whether the first training data is balanced.
After the interaction request is sent to the user, if the user sends a data reselection instruction in response, the third training data is read from the storage space corresponding to the data read address included in the instruction, the third training data is taken as the first training data, and execution restarts from step 102.
After the first training data is determined to be unbalanced, an interaction request is sent to the user, and the user can send a data reselection instruction to choose new training data for the classification model, thereby resolving the imbalance of the previously selected first training data. This gives the user another way to handle data imbalance and can improve the user experience.
The classification model training method provided by the embodiment of the present invention is described in further detail below, taking as an example the training of a classification model that determines from current data whether a motor is operating abnormally. As shown in FIG. 6, the method may include the following steps:
Step 601: Obtain first training data.
In the embodiment of the present invention, when training the classification model used to analyze whether motor A is abnormal, a certain amount of data is taken from the historical operating data of motor A as the first training data; specifically, the current data of motor A is obtained. For example, the current data of motor A during operation over the past three months is obtained as the first training data.
Step 602: Cluster the first training data to obtain at least one third data set.
In the embodiment of the present invention, after the first training data is obtained, it is clustered into at least one third data set, such that each third data set contains at least one sample, the values of the samples within the same third data set lie in the same numerical range, and the values of samples in different third data sets lie in different numerical ranges. That is, each third data set corresponds to one numerical range, and the numerical ranges of different third data sets do not overlap. Each sample corresponds to one current data record.
For example, clustering the first training data yields three third data sets: third data set 1, third data set 2 and third data set 3. Third data set 1 contains 8000 samples, third data set 2 contains 1000 samples, and third data set 3 contains 2000 samples. Every sample in third data set 1 lies in the range 50 to 80, every sample in third data set 2 lies in the range 120 to 150, and every sample in third data set 3 lies in the range 200 to 240.
Step 603: Determine whether only one third data set was obtained; if so, perform step 604, otherwise perform step 605.
In the embodiment of the present invention, after the first training data is clustered, it is determined whether the clustering yielded only one third data set. If so, the first training data is balanced and step 604 is performed; if at least two third data sets were obtained, whether the first training data is balanced must be determined further, and step 605 is performed.
Step 604: Train the classification model with the first training data, and end the current process.
In the embodiment of the present invention, when clustering the first training data yields only one third data set, the first training data is balanced, and the classification model corresponding to motor A is trained directly with the first training data.
Step 605: Determine the fourth data set and the fifth data set from among the third data sets.
In the embodiment of the present invention, when at least two third data sets are obtained, the fourth data set, which contains the fewest samples, and the fifth data set, which contains the most samples, are determined from among the third data sets.
For example, because third data set 1 contains more samples than third data set 3, and third data set 3 contains more samples than third data set 2, third data set 1 is determined to be the fifth data set and third data set 2 is determined to be the fourth data set.
Step 606: Determine whether the ratio of the number of samples in the fourth data set to the number of samples in the fifth data set is less than the proportion threshold; if so, perform step 607, otherwise perform step 604.
In the embodiment of the present invention, after the fourth data set and the fifth data set are determined, the ratio of the number of samples in the fourth data set to the number of samples in the fifth data set is compared with the preset proportion threshold. If the ratio is less than the proportion threshold, the first training data is determined to be unbalanced and step 607 is performed; if the ratio is greater than or equal to the proportion threshold, the first training data is determined to be balanced and step 604 is performed.
Step 607: Send an interaction request to the user.
In the embodiment of the present invention, when the first training data is determined to be unbalanced, an interaction request is sent to the user, asking the user to decide how the first training data should be processed.
Step 608: Determine whether a model training instruction has been received from the user in response to the interaction request; if so, perform step 604, otherwise perform step 609.
In the embodiment of the present invention, after the interaction request is sent to the user, if a model training instruction is received in response, where the model training instruction indicates that the classification model is to be trained with the existing first training data, step 604 is performed to train the classification model corresponding to motor A with the first training data.
Step 609: Determine whether a data reselection instruction has been received from the user in response to the interaction request; if so, perform step 610, otherwise perform step 611.
In the embodiment of the present invention, after the interaction request is sent to the user, if a data reselection instruction is received in response, the training data used to train the classification model must be reselected, and step 610 is performed.
Step 610: Obtain the first training data again according to the data reselection instruction, and perform step 602.
In the embodiment of the present invention, after the data reselection instruction is received, the third training data is read from the storage space corresponding to the data read address included in the instruction, the third training data is taken as the first training data, and step 602 is then performed.
For example, the data read address included in the data reselection instruction is the storage address of the current data of motor A; the current data of motor A is read from that address as the new first training data, and execution then restarts from step 602.
Step 611: Determine whether an equalization processing instruction has been received from the user in response to the interaction request; if so, perform step 612, otherwise perform step 615.
In the embodiment of the present invention, after the interaction request is sent to the user, if an equalization processing instruction is received in response, the first training data must be equalized and step 612 is performed; otherwise the user has given no corresponding instruction, and the current process ends.
Step 612: Determine the sixth data set corresponding to each data set identifier included in the equalization processing instruction.
In the embodiment of the present invention, after the equalization processing instruction is received, the at least one data set identifier it includes is obtained. Each data set identifier identifies one first data set, different identifiers identify different first data sets, and each first data set is a data set selected by the user from among the third data sets that cause the first training data to be unbalanced. For each obtained data set identifier, a corresponding sixth data set is determined, where the sixth data set contains at least one sample, the values of its samples lie in the same numerical range as those of the first data set identified by the identifier, and the ratio of the total number of samples in the sixth data set and that first data set to the number of samples in the fifth data set is greater than or equal to the proportion threshold. Specifically, for each data set identifier, samples may be drawn from the first data set identified by the identifier, and the drawn samples are combined into the corresponding sixth data set. Further, for each data set identifier, the number of samples in the corresponding sixth data set satisfies the following condition: the ratio of the total number of samples in the sixth data set and the identified first data set to the number of samples in the fifth data set is equal to the proportion threshold.
For example, the equalization processing instruction includes data set identifier 1, which identifies third data set 2, and the preset proportion threshold is 0.8. In this case 5400 samples are drawn from third data set 2, and the set of these 5400 samples is taken as the sixth data set corresponding to data set identifier 1. It should be noted that, through the interaction, the user has determined that the samples in third data set 3 are current data recorded while motor A was operating abnormally, so the first training data does not need to be equalized with respect to third data set 3.
Step 613: Combine the samples in each sixth data set with the first training data to obtain the second training data.
In the embodiment of the present invention, after the sixth data set corresponding to each data set identifier is obtained, the samples in the obtained sixth data sets are combined with the first training data to obtain the second training data.
For example, the 11000 samples of the first training data are combined with the 5400 samples of the sixth data set corresponding to data set identifier 1, yielding second training data containing 16400 samples.
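The figures in this example can be checked with a short calculation. The symbols below are introduced here only for illustration and do not appear in the source: $n_4$ and $n_5$ denote the sample counts of the fourth and fifth data sets (third data set 2 and third data set 1), $n_6$ the sample count of the sixth data set, and $N_1$, $N_2$ the sizes of the first and second training data.

```latex
\begin{align*}
\text{before equalization:}\quad & \frac{n_4}{n_5} = \frac{1000}{8000} = 0.125 < 0.8 \\
\text{after equalization:}\quad  & \frac{n_4 + n_6}{n_5} = \frac{1000 + 5400}{8000} = 0.8 \\
\text{training data sizes:}\quad & N_1 = 8000 + 1000 + 2000 = 11000, \qquad N_2 = N_1 + n_6 = 11000 + 5400 = 16400
\end{align*}
```

The ratio after equalization equals the proportion threshold exactly, which matches the stricter condition stated for step 612.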
Step 614: Train the classification model with the second training data.
In the embodiment of the present invention, after the second training data is obtained, it is used to train the classification model corresponding to motor A.
Step 615: End the current process.
It should be noted that the steps of the model training method shown in FIG. 6 are separated out in order to explain the model training process more clearly; in an actual implementation there is no strict order among them. For example, step 609 and step 611 may be performed before step 608, and step 611 may be performed before step 609.
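The branching in steps 608 to 615 amounts to a dispatch on the type of instruction received from the user. The sketch below illustrates this under stated assumptions: the `Instruction` enumeration, its member names and the function `next_step` are hypothetical, and the function only returns the number of the step in FIG. 6 that would follow.

```python
from enum import Enum
from typing import Optional


class Instruction(Enum):
    """User responses to the interaction request (names are assumptions)."""
    TRAIN = "model_training"        # step 608 leads to step 604
    RESELECT = "data_reselection"   # step 609 leads to step 610
    EQUALIZE = "equalization"       # step 611 leads to step 612


def next_step(instruction: Optional[Instruction]) -> int:
    """Return the step of FIG. 6 that follows the user's response."""
    if instruction is Instruction.TRAIN:
        return 604
    if instruction is Instruction.RESELECT:
        return 610
    if instruction is Instruction.EQUALIZE:
        return 612
    # No recognized instruction was received: end the current process.
    return 615
```

Because the three instruction types are mutually exclusive, the checks can be made in any order, which is consistent with the remark that steps 608, 609 and 611 have no strict sequence.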
As shown in FIG. 7, an embodiment of the present invention provides a classification model training device, including:
a data acquisition module 701, configured to acquire first training data, where the first training data includes historical operating data of a target device;
a data judgment module 702, configured to judge whether the first training data acquired by the data acquisition module 701 is balanced;
a request sending module 703, configured to send an interaction request to a user if, according to the judgment result of the data judgment module 702, the first training data is unbalanced, where the interaction request asks the user to determine how the first training data is to be processed;
an instruction receiving module 704, configured to receive a balancing instruction issued by the user in response to the interaction request sent by the request sending module 703, where the balancing instruction includes at least one data set identifier, each data set identifier identifies one first data set in the first training data that causes the first training data to be unbalanced, and different data set identifiers identify different first data sets;
a data processing module 705, configured to balance the first training data, according to the balancing instruction received by the instruction receiving module 704, with respect to the first data set identified by each data set identifier, to obtain second training data, where the second data sets in the second training data that correspond to the data set identifiers do not cause the second training data to be unbalanced; and
a model training module 706, configured to train a classification model corresponding to the target device using the second training data obtained by the data processing module 705.
In this embodiment of the present invention, the data acquisition module 701 may be used to perform step 101 of the above method embodiment, the data judgment module 702 to perform step 102, the request sending module 703 to perform step 103, the instruction receiving module 704 to perform step 104, the data processing module 705 to perform step 105, and the model training module 706 to perform step 106.
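The way modules 701 to 706 hand data to one another can be sketched as a single function. This is only an illustrative sketch: the function name run_training_flow, its parameter names, and the use of plain Python callables to stand in for the modules are assumptions and are not part of the disclosure. The branch that trains directly on balanced data is likewise an assumption.

```python
from typing import Callable, List, Sequence

Samples = List[float]


def run_training_flow(
    acquire: Callable[[], Samples],
    is_balanced: Callable[[Samples], bool],
    request_user_choice: Callable[[Samples], Sequence[str]],
    balance: Callable[[Samples, Sequence[str]], Samples],
    train: Callable[[Samples], object],
) -> object:
    """Chain hypothetical stand-ins for modules 701 to 706."""
    first_training_data = acquire()  # module 701, step 101
    if is_balanced(first_training_data):  # module 702, step 102
        # Assumption: balanced data is passed to training unchanged.
        return train(first_training_data)
    # Modules 703 and 704 (steps 103 and 104): ask the user, then receive
    # the data set identifiers carried by the balancing instruction.
    data_set_ids = request_user_choice(first_training_data)
    # Module 705, step 105: balance with respect to each identified data set.
    second_training_data = balance(first_training_data, data_set_ids)
    return train(second_training_data)  # module 706, step 106
```

Passing the modules in as callables keeps the sketch independent of any particular clustering, user interface, or model library.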
Optionally, on the basis of the classification model training device shown in FIG. 7, as shown in FIG. 8, the data judgment module 702 includes:
a clustering processing unit 7021, configured to cluster the first training data to obtain at least one third data set, where each third data set includes at least one sample, the values of the samples in the same third data set lie within the same numerical range, and the values of samples in different third data sets lie within different numerical ranges;
a first judgment unit 7022, configured to determine that the first training data is balanced when the clustering performed by the clustering processing unit 7021 on the first training data yields exactly one third data set; and
a second judgment unit 7023, configured to, when the clustering performed by the clustering processing unit 7021 on the first training data yields at least two third data sets, determine from the at least two third data sets a fourth data set containing the fewest samples and a fifth data set containing the most samples, determine that the first training data is unbalanced if the ratio of the number of samples in the fourth data set to the number of samples in the fifth data set is less than a preset ratio threshold, and determine that the first training data is balanced if that ratio is greater than or equal to the ratio threshold.
In this embodiment of the present invention, the clustering processing unit 7021 may be used to perform step 201 of the above method embodiment, the first judgment unit 7022 to perform step 203, and the second judgment unit 7023 to perform steps 204 to 206.
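The balance test applied by units 7022 and 7023 can be sketched as follows. The sketch assumes that clustering has already assigned each sample a cluster label (one label per third data set); the function name is_balanced and its parameters are hypothetical, and the clustering algorithm itself is left open, as it is in the text above.

```python
from collections import Counter
from typing import Hashable, Sequence


def is_balanced(cluster_labels: Sequence[Hashable], ratio_threshold: float) -> bool:
    """Apply the fewest-to-most sample ratio test to clustered training data."""
    counts = Counter(cluster_labels)
    if not counts:
        raise ValueError("the first training data contains no samples")
    if len(counts) == 1:
        # Exactly one third data set: balanced (first judgment unit 7022).
        return True
    fourth_count = min(counts.values())  # data set with the fewest samples
    fifth_count = max(counts.values())   # data set with the most samples
    # Second judgment unit 7023: balanced only if the ratio reaches the threshold.
    return fourth_count / fifth_count >= ratio_threshold
```

For example, nine samples in one cluster and one sample in another give a ratio of 1/9, which counts as unbalanced under a ratio threshold of 0.5.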
Optionally, on the basis of the classification model training device shown in FIG. 8, as shown in FIG. 9, the data processing module 705 includes:
a data collection unit 7051, configured to determine, for each data set identifier included in the balancing instruction, a sixth data set corresponding to that data set identifier, where the values of the samples in the sixth data set lie within the same numerical range as the values of the samples in the first data set identified by that data set identifier, and the ratio of the total number of samples in that first data set and the sixth data set to the number of samples in the fifth data set is greater than or equal to the ratio threshold; and
a data combination unit 7052, configured to combine the samples in each sixth data set determined by the data collection unit 7051 with the first training data to obtain the second training data.
In this embodiment of the present invention, the data collection unit 7051 may be used to perform step 301 of the above method embodiment, and the data combination unit 7052 to perform step 302.
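The size condition on the sixth data set amounts to simple arithmetic: if the first data set has n1 samples and the fifth data set has n5 samples, the sixth data set needs at least ceil(threshold × n5) − n1 samples. A minimal sketch follows, in which the function name extra_samples_needed and its parameters are hypothetical.

```python
import math


def extra_samples_needed(first_count: int, fifth_count: int, ratio_threshold: float) -> int:
    """Smallest sixth-data-set size with (first + sixth) / fifth >= threshold."""
    # Floating-point rounding may push the result one sample higher than the
    # exact minimum; the ratio condition is still satisfied in that case.
    required_total = math.ceil(ratio_threshold * fifth_count)
    return max(0, required_total - first_count)
```

With 10 samples in the first data set, 100 in the fifth data set, and a ratio threshold of 0.5, at least 40 additional samples are required, since (10 + 40) / 100 = 0.5.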
Optionally, on the basis of the classification model training device shown in FIG. 9,
the data collection unit 7051 is configured to collect, for each data set identifier included in the balancing instruction, at least one sample from the first data set identified by that data set identifier, and to use the sample set containing the collected sample or samples as the sixth data set corresponding to that data set identifier.
In this embodiment of the present invention, the data collection unit 7051 may be used to perform steps 401 and 402 of the above method embodiment.
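One way to collect the samples for a sixth data set is to draw them from the identified first data set itself. The sketch below uses random sampling with replacement, which is an assumption: the text only requires that at least one sample be collected from the first data set. The names build_sixth_data_set, extra, and seed are hypothetical.

```python
import random
from typing import List, Optional, Sequence, TypeVar

T = TypeVar("T")


def build_sixth_data_set(
    first_data_set: Sequence[T], extra: int, seed: Optional[int] = None
) -> List[T]:
    """Draw `extra` samples, with replacement, from the first data set."""
    if not first_data_set:
        raise ValueError("the first data set contains no samples to collect")
    rng = random.Random(seed)  # a local generator keeps the draw reproducible
    return [rng.choice(first_data_set) for _ in range(extra)]
```

Because every drawn sample already belongs to the first data set, the sixth data set automatically lies within the same numerical range, and appending it to the first training data yields the second training data.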
Optionally, on the basis of the classification model training device shown in any one of FIG. 7 to FIG. 9, as shown in FIG. 10, the classification model training device may further include a data reselection module 707;
the instruction receiving module 704 is further configured to receive a data reselection instruction issued by the user in response to the interaction request sent by the request sending module 703, where the data reselection instruction includes a data read address; and
the data reselection module 707 is configured to read, according to the data reselection instruction received by the instruction receiving module 704, third training data from the storage space corresponding to the data read address, to use the third training data as the first training data, and then to trigger the data judgment module 702 to judge whether the first training data is balanced.
In this embodiment of the present invention, the instruction receiving module 704 may be used to perform step 501 of the above method embodiment, and the data reselection module 707 to perform steps 502 and 503.
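The reselection path can be sketched as follows. The class DataReselectInstruction, the function handle_reselect, and the use of a dictionary to model the storage space addressed by the data read address are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class DataReselectInstruction:
    """Hypothetical carrier for the data read address chosen by the user."""
    read_address: str


def handle_reselect(
    instruction: DataReselectInstruction,
    storage: Dict[str, List[float]],
    is_balanced: Callable[[List[float]], bool],
) -> Tuple[List[float], bool]:
    """Read the third training data and re-run the balance judgment on it."""
    third_training_data = storage[instruction.read_address]
    # The third training data takes the place of the first training data.
    first_training_data = third_training_data
    return first_training_data, is_balanced(first_training_data)
```

If the reselected data is again judged unbalanced, the caller can issue a new interaction request, so the user may either reselect once more or supply a balancing instruction.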
As shown in FIG. 11, an embodiment of the present invention provides a classification model training device, including:
at least one memory 1101, configured to store executable instructions; and
at least one processor 1102, coupled to the at least one memory 1101 and configured, when executing the executable instructions, to:
acquire first training data, where the first training data includes historical operating data of a target device;
judge whether the first training data is balanced;
if the first training data is unbalanced, send an interaction request to a user, where the interaction request asks the user to determine how the first training data is to be processed;
receive a balancing instruction issued by the user in response to the interaction request, where the balancing instruction includes at least one data set identifier, each of the at least one data set identifier identifies one first data set in the first training data that causes the first training data to be unbalanced, and different data set identifiers identify different first data sets;
balance the first training data, according to the balancing instruction, with respect to the first data set identified by each data set identifier, to obtain second training data, where the second data sets in the second training data that correspond to the data set identifiers do not cause the second training data to be unbalanced; and
train a classification model corresponding to the target device using the second training data.
Optionally, on the basis of the classification model training device shown in FIG. 11, the at least one processor 1102 is further configured, when executing the executable instructions, to:
cluster the first training data to obtain at least one third data set, where each third data set includes at least one sample, the values of the samples in the same third data set lie within the same numerical range, and the values of samples in different third data sets lie within different numerical ranges;
if clustering the first training data yields exactly one third data set, determine that the first training data is balanced;
if clustering the first training data yields at least two third data sets, then
determine from the at least two third data sets a fourth data set containing the fewest samples and a fifth data set containing the most samples, and
if the ratio of the number of samples in the fourth data set to the number of samples in the fifth data set is less than a preset ratio threshold, determine that the first training data is unbalanced, and
if the ratio of the number of samples in the fourth data set to the number of samples in the fifth data set is greater than or equal to the ratio threshold, determine that the first training data is balanced.
Optionally, on the basis of the classification model training device shown in FIG. 11, the at least one processor 1102 is further configured, when executing the executable instructions, to:
determine, for each data set identifier included in the balancing instruction, a sixth data set corresponding to that data set identifier, where the values of the samples in the sixth data set lie within the same numerical range as the values of the samples in the first data set identified by that data set identifier, and the ratio of the total number of samples in that first data set and the sixth data set to the number of samples in the fifth data set is greater than or equal to the ratio threshold; and
combine the samples in each sixth data set corresponding to each data set identifier with the first training data to obtain the second training data.
Optionally, on the basis of the classification model training device shown in FIG. 11, the at least one processor 1102 is further configured, when executing the executable instructions, to:
collect at least one sample from the first data set identified by the data set identifier; and
use the sample set containing the collected at least one sample as the sixth data set corresponding to the data set identifier.
Optionally, on the basis of the classification model training device shown in FIG. 11, the at least one processor 1102 is further configured, when executing the executable instructions, to:
receive a data reselection instruction issued by the user in response to the interaction request, where the data reselection instruction includes a data read address;
read, according to the data reselection instruction, third training data from the storage space corresponding to the data read address; and
use the third training data as the first training data, and perform the judging of whether the first training data is balanced.
The present invention also provides a computer-readable medium storing instructions for causing a computer to execute the classification model training method described herein. Specifically, a system or device equipped with a storage medium may be provided, where the storage medium stores software program code that implements the functions of any of the above embodiments, and a computer (or CPU or MPU) of the system or device reads and executes the program code stored in the storage medium.
In this case, the program code read from the storage medium can itself implement the functions of any of the above embodiments, so the program code and the storage medium storing the program code form part of the present invention.
Examples of storage media for providing the program code include floppy disks, hard disks, magneto-optical disks, optical disks (such as CD-ROM, CD-R, CD-RW, DVD-ROM, DVD-RAM, DVD-RW, and DVD+RW), magnetic tape, non-volatile memory cards, and ROM. Alternatively, the program code may be downloaded from a server computer over a communication network.
In addition, it should be clear that the functions of any of the above embodiments can be implemented not only by executing the program code read by the computer, but also by having an operating system or the like running on the computer perform part or all of the actual operations based on instructions of the program code.
Furthermore, it can be understood that the program code read from the storage medium may be written to a memory provided on an expansion board inserted into the computer, or to a memory provided in an expansion unit connected to the computer, after which a CPU or the like installed on the expansion board or expansion unit performs part or all of the actual operations based on instructions of the program code, thereby implementing the functions of any of the above embodiments.
It should be noted that not all of the steps and modules in the above flows and system structure diagrams are required; some steps or modules may be omitted according to actual needs. The order in which the steps are executed is not fixed and may be adjusted as needed. The system structure described in the above embodiments may be a physical structure or a logical structure; that is, some modules may be implemented by the same physical entity, some modules may be implemented separately by multiple physical entities, or some modules may be implemented jointly by certain components of multiple independent devices.
In the above embodiments, a hardware unit may be implemented mechanically or electrically. For example, a hardware unit may include permanently dedicated circuitry or logic (such as a dedicated processor, FPGA, or ASIC) to perform the corresponding operations. A hardware unit may also include programmable logic or circuitry (such as a general-purpose processor or another programmable processor) that is temporarily configured by software to perform the corresponding operations. The specific implementation (mechanical, dedicated permanent circuitry, or temporarily configured circuitry) may be chosen based on cost and time considerations.
The present invention has been shown and described in detail above through the drawings and preferred embodiments. However, the present invention is not limited to these disclosed embodiments. Based on the above embodiments, those skilled in the art will appreciate that the means described in the different embodiments may be combined to obtain further embodiments of the present invention, and these embodiments also fall within the scope of protection of the present invention.

Claims (12)

  1. A classification model training method, characterized by comprising:
    acquiring first training data, wherein the first training data includes historical operating data of a target device;
    judging whether the first training data is balanced;
    if the first training data is unbalanced, sending an interaction request to a user, wherein the interaction request asks the user to determine how the first training data is to be processed;
    receiving a balancing instruction issued by the user in response to the interaction request, wherein the balancing instruction includes at least one data set identifier, each of the at least one data set identifier identifies one first data set in the first training data that causes the first training data to be unbalanced, and different data set identifiers identify different first data sets;
    balancing the first training data, according to the balancing instruction, with respect to the first data set identified by each data set identifier, to obtain second training data, wherein the second data sets in the second training data that correspond to the data set identifiers do not cause the second training data to be unbalanced; and
    training a classification model corresponding to the target device using the second training data.
  2. The method according to claim 1, characterized in that the judging whether the first training data is balanced comprises:
    clustering the first training data to obtain at least one third data set, wherein each third data set includes at least one sample, the values of the samples in the same third data set lie within the same numerical range, and the values of samples in different third data sets lie within different numerical ranges;
    if clustering the first training data yields exactly one third data set, determining that the first training data is balanced;
    if clustering the first training data yields at least two third data sets, then
    determining from the at least two third data sets a fourth data set containing the fewest samples and a fifth data set containing the most samples, and
    if the ratio of the number of samples in the fourth data set to the number of samples in the fifth data set is less than a preset ratio threshold, determining that the first training data is unbalanced, and
    if the ratio of the number of samples in the fourth data set to the number of samples in the fifth data set is greater than or equal to the ratio threshold, determining that the first training data is balanced.
  3. The method according to claim 2, characterized in that the balancing the first training data, according to the balancing instruction, with respect to the first data set identified by each data set identifier, to obtain second training data, comprises:
    determining, for each data set identifier included in the balancing instruction, a sixth data set corresponding to that data set identifier, wherein the values of the samples in the sixth data set lie within the same numerical range as the values of the samples in the first data set identified by that data set identifier, and the ratio of the total number of samples in that first data set and the sixth data set to the number of samples in the fifth data set is greater than or equal to the ratio threshold; and
    combining the samples in each sixth data set corresponding to each data set identifier with the first training data to obtain the second training data.
  4. The method according to claim 3, characterized in that the determining a sixth data set corresponding to that data set identifier comprises:
    collecting at least one sample from the first data set identified by that data set identifier; and
    using the sample set containing the collected at least one sample as the sixth data set corresponding to that data set identifier.
  5. The method according to any one of claims 1 to 4, characterized by further comprising, after the sending an interaction request to a user:
    receiving a data reselection instruction issued by the user in response to the interaction request, wherein the data reselection instruction includes a data read address;
    reading, according to the data reselection instruction, third training data from the storage space corresponding to the data read address; and
    using the third training data as the first training data, and performing the judging whether the first training data is balanced.
  6. A classification model training device, characterized by comprising:
    a data acquisition module (701), configured to acquire first training data, wherein the first training data includes historical operating data of a target device;
    a data judgment module (702), configured to judge whether the first training data acquired by the data acquisition module (701) is balanced;
    a request sending module (703), configured to send an interaction request to a user if, according to the judgment result of the data judgment module (702), the first training data is unbalanced, wherein the interaction request asks the user to determine how the first training data is to be processed;
    an instruction receiving module (704), configured to receive a balancing instruction issued by the user in response to the interaction request sent by the request sending module (703), wherein the balancing instruction includes at least one data set identifier, each of the at least one data set identifier identifies one first data set in the first training data that causes the first training data to be unbalanced, and different data set identifiers identify different first data sets;
    a data processing module (705), configured to balance the first training data, according to the balancing instruction received by the instruction receiving module (704), with respect to the first data set identified by each data set identifier, to obtain second training data, wherein the second data sets in the second training data that correspond to the data set identifiers do not cause the second training data to be unbalanced; and
    a model training module (706), configured to train a classification model corresponding to the target device using the second training data obtained by the data processing module (705).
  7. The device according to claim 6, characterized in that the data judgment module (702) comprises:
    a clustering processing unit (7021), configured to cluster the first training data to obtain at least one third data set, wherein each third data set includes at least one sample, the values of the samples in the same third data set lie within the same numerical range, and the values of samples in different third data sets lie within different numerical ranges;
    a first judgment unit (7022), configured to determine that the first training data is balanced when the clustering performed by the clustering processing unit (7021) on the first training data yields exactly one third data set; and
    a second judgment unit (7023), configured to, when the clustering performed by the clustering processing unit (7021) on the first training data yields at least two third data sets, determine from the at least two third data sets a fourth data set containing the fewest samples and a fifth data set containing the most samples, determine that the first training data is unbalanced if the ratio of the number of samples in the fourth data set to the number of samples in the fifth data set is less than a preset ratio threshold, and determine that the first training data is balanced if that ratio is greater than or equal to the ratio threshold.
  8. The device according to claim 7, characterized in that the data processing module (705) comprises:
    a data collection unit (7051), configured to determine, for each data set identifier included in the balancing instruction, a sixth data set corresponding to that data set identifier, wherein the values of the samples in the sixth data set lie within the same numerical range as the values of the samples in the first data set identified by that data set identifier, and the ratio of the total number of samples in that first data set and the sixth data set to the number of samples in the fifth data set is greater than or equal to the ratio threshold; and
    a data combination unit (7052), configured to combine the samples in each sixth data set determined by the data collection unit (7051) with the first training data to obtain the second training data.
  9. The device according to claim 8, characterized in that
    the data collection unit (7051) is configured to collect, for each data set identifier included in the balancing instruction, at least one sample from the first data set identified by that data set identifier, and to use the sample set containing the collected at least one sample as the sixth data set corresponding to that data set identifier.
  10. The device according to any one of claims 6 to 9, characterized by further comprising a data reselection module (707), wherein
    the instruction receiving module (704) is further configured to receive a data reselection instruction issued by the user in response to the interaction request sent by the request sending module (703), wherein the data reselection instruction includes a data read address; and
    the data reselection module (707) is configured to read, according to the data reselection instruction received by the instruction receiving module (704), third training data from the storage space corresponding to the data read address, to use the third training data as the first training data, and then to trigger the data judgment module (702) to perform the judging of whether the first training data is balanced.
  11. A classification model training device, characterized by comprising at least one memory (1101) and at least one processor (1102), wherein
    the at least one memory (1101) is configured to store a machine-readable program; and
    the at least one processor (1102) is configured to invoke the machine-readable program to execute the method according to any one of claims 1 to 5.
  12. A computer-readable medium, wherein computer instructions are stored on the computer-readable medium, and the computer instructions, when executed by a processor, cause the processor to execute the method according to any one of claims 1 to 5.
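
The data flow recited in claims 8 to 10 can be illustrated with a minimal Python sketch. It is not the applicant's implementation. The function names, the dictionary that maps data set identifiers to first data sets, the `samples_per_set` parameter, sampling with replacement, and the `load` and `is_balanced` callables are all assumptions made for illustration; the claims only require that at least one sample be collected per identified data set and do not fix a sampling strategy or a balance criterion.

```python
import random
from typing import Callable, Dict, List, Optional, Sequence, Tuple

Sample = Sequence[float]


def collect_sixth_data_sets(
    first_data_sets: Dict[str, List[Sample]],
    data_set_ids: Sequence[str],
    samples_per_set: int = 1,
    rng: Optional[random.Random] = None,
) -> Dict[str, List[Sample]]:
    """For each identifier in the equalization instruction, draw at least
    one sample from the identified first data set (cf. unit 7051)."""
    if samples_per_set < 1:
        raise ValueError("at least one sample must be collected per data set")
    rng = rng or random.Random()
    sixth_data_sets: Dict[str, List[Sample]] = {}
    for ds_id in data_set_ids:
        source = first_data_sets[ds_id]
        # Assumption: sample with replacement, so that a small data set can
        # still supply the requested number of samples.
        sixth_data_sets[ds_id] = [rng.choice(source) for _ in range(samples_per_set)]
    return sixth_data_sets


def combine_with_first_training_data(
    first_training_data: Sequence[Sample],
    sixth_data_sets: Dict[str, List[Sample]],
) -> List[Sample]:
    """Append the collected samples to a copy of the first training data,
    yielding the second training data (cf. unit 7052)."""
    second_training_data = list(first_training_data)
    for samples in sixth_data_sets.values():
        second_training_data.extend(samples)
    return second_training_data


def reselect_training_data(
    read_address: str,
    load: Callable[[str], List[Sample]],
    is_balanced: Callable[[List[Sample]], bool],
) -> Tuple[List[Sample], bool]:
    """Read third training data from the given address, treat it as the new
    first training data, and re-run the balance check (cf. module 707).
    Both callables are hypothetical stand-ins for storage access and for
    the data judging module (702)."""
    first_training_data = load(read_address)
    return first_training_data, is_balanced(first_training_data)


# Usage: data set "B" is under-represented, so the instruction names it.
first_data_sets = {"A": [[0.1], [0.2], [0.3], [0.4]], "B": [[9.0]]}
first_training_data = [s for group in first_data_sets.values() for s in group]
sixth = collect_sixth_data_sets(first_data_sets, ["B"], samples_per_set=3)
second_training_data = combine_with_first_training_data(first_training_data, sixth)
print(len(second_training_data))  # 5 original samples + 3 collected = 8
```

Sampling with replacement is only one possible choice here; a variant that generates synthetic samples, or one that draws without replacement, would fit the claim language equally well.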
PCT/CN2019/085054 2019-04-29 2019-04-29 Classification model training method and device, and computer-readable medium WO2020220220A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201980095070.0A CN113692589A (en) 2019-04-29 2019-04-29 Classification model training method and device and computer readable medium
PCT/CN2019/085054 WO2020220220A1 (en) 2019-04-29 2019-04-29 Classification model training method and device, and computer-readable medium

Publications (1)

Publication Number Publication Date
WO2020220220A1 (en) 2020-11-05

Family

ID=73029312

Country Status (2)

Country Link
CN (1) CN113692589A (en)
WO (1) WO2020220220A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117113929A (en) * 2023-09-08 2023-11-24 中电金信数字科技集团有限公司 Method and device for splitting field data, electronic equipment and storage medium
WO2024021350A1 (en) * 2022-07-28 2024-02-01 广州广电运通金融电子股份有限公司 Image recognition model training method and apparatus, computer device, and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100250473A1 (en) * 2009-03-27 2010-09-30 Porikli Fatih M Active Learning Method for Multi-Class Classifiers
CN103593470A (en) * 2013-11-29 2014-02-19 河南大学 Double-degree integrated unbalanced data stream classification algorithm
CN105760889A (en) * 2016-03-01 2016-07-13 中国科学技术大学 Efficient imbalanced data set classification method
CN107305640A (en) * 2016-04-25 2017-10-31 中国科学院声学研究所 A kind of method of unbalanced data classification
CN108764366A (en) * 2018-06-07 2018-11-06 南京信息职业技术学院 Feature selecting and cluster for lack of balance data integrate two sorting techniques

Also Published As

Publication number Publication date
CN113692589A (en) 2021-11-23

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 19926870; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 EP: PCT application non-entry in European phase (Ref document number: 19926870; Country of ref document: EP; Kind code of ref document: A1)