WO2020220220A1

WO2020220220A1 - Classification model training method and device, and computer-readable medium

Info

Publication number: WO2020220220A1
Application number: PCT/CN2019/085054
Authority: WO
Inventors: 周林飞; 吴超华; 施内加斯·丹尼尔; 田鹏伟; 李聪超; 吴文超
Original assignee: 西门子（中国）有限公司
Priority date: 2019-04-29
Filing date: 2019-04-29
Publication date: 2020-11-05
Also published as: CN113692589A

Abstract

A classification model training method and device, and a computer-readable medium. The classification model training method comprises: acquiring first training data (101); determining whether the first training data is equalized (102); if the first training data is non-equalized, transmitting an interaction request to a user (103); receiving an equalization processing instruction sent by the user in response to the interaction request, wherein the equalization processing instruction comprises at least one data set identifier, and each data set identifier is used to identify a first data set in the first training data that causes the first training data to be unequalized (104); performing equalization processing, according to the equalization processing instruction, on the first training data for the first data set identified by each data set identifier, and acquiring second training data (105); and using the second training data to train a classification model corresponding to a target apparatus (106). The method improves the classification accuracy of a trained classification model.

Description

Classification model training method, device and computer readable medium

Technical field

The present invention relates to the field of data processing, in particular to a classification model training method, device and computer readable medium.

Background art

During normal operation, the values of a piece of equipment's operating data remain within a normal range. If a value of the operating data falls outside the normal range, the cause may be an equipment failure. Therefore, historical operating data recorded while the equipment runs normally can be used to train a classification model; the operating data of the equipment can then be fed into the classification model in real time, and the classification model determines whether a failure has occurred during operation. For example, refining equipment includes pipelines for transferring liquids, and the working condition of a pipeline can be determined by monitoring the liquid flow rate, pressure and temperature in the pipeline: the flow rate decreases when the pipeline is blocked, and the pressure decreases when the pipeline leaks.

When historical operating data of the equipment is used as training data for the classification model, the equipment may have several different operating modes, and the training data may contain historical operating data from more than one of them. Because the operating data generated in different operating modes differs, the training data may be unbalanced. For example, the training data consists of a first data set and a second data set, where the first data set contains historical operating data recorded while the equipment runs in a first operating mode, and the second data set contains historical operating data recorded while the equipment runs in a second operating mode. If the number of samples in the first data set is much smaller than the number in the second data set, the training data is unbalanced. The condition used to judge imbalance can be chosen according to the actual situation; for example, the training data is determined to be unbalanced when the ratio of the number of samples in the first data set to the number of samples in the second data set is less than a preset threshold.

At present, the acquired training data is used directly to train the classification model. Because the training data may be unbalanced, a classification model trained on it may behave as follows when it is used to analyze the operating data of the equipment in real time: operating data that falls within the numerical range of the above-mentioned first data set leads to the erroneous conclusion that the equipment is abnormal. The classification model therefore produces a large number of false alarms when it is used to determine the operating status of the equipment, and its classification accuracy is low.

Summary of the invention

In view of this, the classification model training method, device and computer readable medium provided by the present invention can improve the classification accuracy of the trained classification model.

In the first aspect, an embodiment of the present invention provides a classification model training method, including:

Acquiring first training data, where the first training data includes historical operating data of a target device;

Determining whether the first training data is balanced;

If the first training data is unbalanced, sending an interaction request to a user, where the interaction request asks the user to determine how the first training data is to be processed;

Receiving an equalization processing instruction sent by the user in response to the interaction request, where the equalization processing instruction includes at least one data set identifier, each data set identifier identifies a first data set in the first training data that causes the first training data to be unbalanced, and different data set identifiers identify different first data sets;

Performing equalization processing on the first training data, according to the equalization processing instruction, for the first data set identified by each data set identifier, to obtain second training data, where the second data set corresponding to each data set identifier in the second training data does not cause the second training data to be unbalanced; and

Training a classification model corresponding to the target device using the second training data.

After the first training data is acquired, it is determined whether the first training data is balanced. If it is not, an interaction request is sent to the user, asking the user to determine how the first training data is to be processed. When an equalization processing instruction sent by the user in response to the interaction request is received, the first training data is equalized according to that instruction to obtain second training data, in which the historical operating data recorded during normal operation of the target device no longer causes imbalance. The second training data is then used to train the classification model. Because the unbalanced first training data is converted into second training data in this way, the classification model trained on the second training data does not erroneously report the target device as abnormal on the basis of operating data recorded during normal operation, which ensures that the trained classification model has a higher classification accuracy.
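The flow of steps 101 to 106 can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the embodiment: the function name `train_with_balance_check`, the scalar `Sample` type, and the four callbacks (`is_balanced`, `ask_user`, `equalize`, `train`) are all hypothetical names introduced here. Training directly on balanced data follows step 604 in the list of reference signs.

```python
from typing import Callable, List, Sequence

Sample = float  # assumption: one scalar operating value per sample


def train_with_balance_check(
    first_training_data: List[Sample],
    is_balanced: Callable[[Sequence[Sample]], bool],
    ask_user: Callable[[Sequence[Sample]], List[str]],
    equalize: Callable[[List[Sample], List[str]], List[Sample]],
    train: Callable[[Sequence[Sample]], object],
) -> object:
    """Check balance, ask the user if needed, equalize, then train."""
    # Step 102: determine whether the first training data is balanced.
    if is_balanced(first_training_data):
        # Balanced data is used for training as it is (compare step 604).
        return train(first_training_data)
    # Steps 103-104: send the interaction request and receive the
    # equalization processing instruction, modelled here (hypothetically)
    # as a plain list of data set identifiers.
    data_set_ids = ask_user(first_training_data)
    # Step 105: equalize for each identified first data set.
    second_training_data = equalize(first_training_data, data_set_ids)
    # Step 106: train the classification model on the second training data.
    return train(second_training_data)
```

Passing the individual steps in as callbacks keeps the sketch independent of any particular clustering method, user interface or model type, none of which this passage fixes.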

Optionally, when determining whether the first training data is balanced, clustering is first performed on the first training data to obtain at least one third data set, where each third data set includes at least one sample, the values of the samples within one third data set lie in the same numerical range, and the values of samples in different third data sets lie in different numerical ranges. It is then judged whether the clustering yields only one third data set. If only one third data set is obtained, the first training data is determined to be balanced. If at least two third data sets are obtained, a fourth data set containing the fewest samples and a fifth data set containing the most samples are determined from among the third data sets, and it is judged whether the ratio of the number of samples in the fourth data set to the number of samples in the fifth data set is less than a preset proportion threshold. If the ratio is less than the proportion threshold, the first training data is determined to be unbalanced; if the ratio is greater than or equal to the proportion threshold, the first training data is determined to be balanced.

Clustering groups the samples of the first training data into one or more third data sets, so that samples whose values lie in the same numerical range fall into the same third data set. If the clustering yields only one third data set, the values of all samples in the first training data lie in the same numerical range, that is, the first training data is balanced. If the clustering yields several third data sets, the ratio of the number of samples in the smallest third data set to the number of samples in the largest third data set is calculated. A ratio below the preset proportion threshold indicates that the sample counts of the third data sets differ greatly and the samples in the first training data are unevenly distributed, so the first training data is determined to be unbalanced; otherwise it is determined to be balanced. Judging in two stages, first by the number of third data sets and then by the number of samples they contain, ensures that the balance of the first training data is assessed accurately.
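The two-stage judgment can be illustrated with a short sketch. The text above does not name a clustering algorithm, so `cluster_by_gap` below is only a simple stand-in for one-dimensional values (a new cluster starts wherever neighbouring sorted values differ by more than `max_gap`); both function names and the `max_gap` parameter are assumptions made for this example.

```python
from typing import List, Sequence


def cluster_by_gap(values: Sequence[float], max_gap: float) -> List[List[float]]:
    """Group 1-D values into third data sets.

    Stand-in clustering (an assumption, not prescribed by the text):
    values are sorted, and a gap larger than max_gap starts a new cluster.
    """
    clusters: List[List[float]] = []
    for v in sorted(values):
        if clusters and v - clusters[-1][-1] <= max_gap:
            clusters[-1].append(v)
        else:
            clusters.append([v])
    return clusters


def is_balanced(third_data_sets: Sequence[Sequence[float]],
                proportion_threshold: float) -> bool:
    """Two-stage check: number of clusters first, then the size ratio."""
    # Stage 1: a single third data set means the data is balanced.
    if len(third_data_sets) <= 1:
        return True
    # Stage 2: compare the fourth (fewest samples) and fifth (most samples)
    # data sets against the preset proportion threshold.
    sizes = [len(s) for s in third_data_sets]
    fourth_size, fifth_size = min(sizes), max(sizes)
    return fourth_size / fifth_size >= proportion_threshold
```

For instance, ten readings near 10 and a single reading at 50 form two clusters of sizes 10 and 1; with a proportion threshold of 0.2 the ratio 0.1 falls below the threshold, so the data would be judged unbalanced.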

Optionally, when performing equalization processing on the first training data to obtain the second training data, a sixth data set is determined for each data set identifier included in the equalization processing instruction. The values of the samples in the sixth data set lie in the same numerical range as the values of the samples in the first data set identified by that data set identifier, and the ratio of the total number of samples in that first data set and the sixth data set to the number of samples in the fifth data set is greater than or equal to the proportion threshold. After the sixth data set corresponding to each data set identifier has been determined, the samples in each sixth data set are combined with the first training data to obtain the second training data.

A first data set is a data set that the user has determined to cause the imbalance of the first training data and whose samples are historical operating data recorded while the target device runs normally. The imbalance arises because the first data set contains few samples. A corresponding sixth data set is therefore determined for each first data set, with sample values in the same numerical range as those of the first data set. Combining the samples of the sixth data set with the first training data in effect enlarges the first data set, so that the historical operating data recorded during normal operation of the target device no longer causes the second training data to be unbalanced.
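The size condition on the sixth data set can be written out explicitly. The symbols are introduced here for illustration only: n_1, n_5 and n_6 denote the numbers of samples in the first, fifth and sixth data sets (with n_5 > 0), and T denotes the proportion threshold. The numbers in the worked line are made up and do not come from the source.

```latex
% Condition from the text: the enlarged first data set must reach the threshold
\[
\frac{n_1 + n_6}{n_5} \ge T
\quad\Longrightarrow\quad
n_6 \ge T\,n_5 - n_1 .
\]
% Illustrative values (assumed): n_1 = 50, n_5 = 1000, T = 0.2
\[
n_6 \ge 0.2 \times 1000 - 50 = 150 .
\]
```

With these assumed values, at least 150 additional samples in the same numerical range would be needed for the first data set to stop causing imbalance.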

Optionally, when determining the sixth data set corresponding to a data set identifier, at least one sample may be collected from the first data set identified by that data set identifier, and the collected samples together are used as the sixth data set corresponding to the data set identifier.

The values of the samples in the sixth data set must lie in the same numerical range as the values of the samples in the first data set identified by the data set identifier. Collecting samples directly from that first data set to form the sixth data set makes it more convenient to equalize the first training data: no other historical operating data has to be sought, which improves the efficiency of classification model training.
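One way to collect the sixth data set from the first data set is random sampling with replacement, sketched below. Sampling with replacement, the function names `build_sixth_data_set` and `equalize`, and the choice to draw exactly the minimum number of samples needed are all assumptions of this example; the text only requires that at least one sample be collected and that the ratio condition hold.

```python
import math
import random
from typing import List, Optional, Sequence


def build_sixth_data_set(
    first_data_set: Sequence[float],
    fifth_size: int,
    proportion_threshold: float,
    rng: Optional[random.Random] = None,
) -> List[float]:
    """Draw samples from the first data set so that
    (len(first) + len(sixth)) / fifth_size >= proportion_threshold."""
    if not first_data_set:
        raise ValueError("first_data_set must contain at least one sample")
    rng = rng or random.Random()
    shortfall = math.ceil(proportion_threshold * fifth_size) - len(first_data_set)
    count = max(shortfall, 1)  # the text asks for at least one collected sample
    # Sampling with replacement (assumption): every drawn value already
    # belongs to the first data set, so it lies in the same numerical range.
    return [rng.choice(first_data_set) for _ in range(count)]


def equalize(
    first_training_data: Sequence[float],
    first_data_sets: Sequence[Sequence[float]],
    fifth_size: int,
    proportion_threshold: float,
    rng: Optional[random.Random] = None,
) -> List[float]:
    """Combine each sixth data set with the first training data."""
    second_training_data = list(first_training_data)
    for first_data_set in first_data_sets:
        second_training_data.extend(
            build_sixth_data_set(first_data_set, fifth_size, proportion_threshold, rng)
        )
    return second_training_data
```

Passing a seeded `random.Random` makes the drawn samples reproducible, which can be useful when the same second training data has to be regenerated.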

Optionally, after it is determined that the first training data is unbalanced and the interaction request has been sent to the user, if a data reselection instruction sent by the user in response to the interaction request is received, third training data is read from the storage space corresponding to the data read address included in the data reselection instruction. The third training data is then used as the first training data, and the balance determination is performed again.

After it is determined that the first training data is unbalanced, if the user sends a data reselection instruction to reselect the training data, the previously acquired first training data is discarded and first training data is reacquired according to the data reselection instruction. In this way, when imbalance in the acquired first training data would make the trained classification model inaccurate, first training data with a more balanced sample distribution can be selected instead. This meets the needs of different users and improves the applicability of the classification model training method.
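The reselection path can be sketched as a loop that repeats the balance check after every reselection. The names `reselect_until_settled`, `ask_user_for_address` and `read_from_storage` are hypothetical; in particular, representing "the user did not send a data reselection instruction" by a returned `None` is an assumption of this example.

```python
from typing import Callable, List, Optional, Tuple


def reselect_until_settled(
    first_training_data: List[int],
    is_balanced: Callable[[List[int]], bool],
    ask_user_for_address: Callable[[List[int]], Optional[str]],
    read_from_storage: Callable[[str], List[int]],
) -> Tuple[List[int], bool]:
    """Return the current training data and whether it is balanced."""
    data = first_training_data
    while not is_balanced(data):
        # The interaction request; a returned address stands for a data
        # reselection instruction carrying a data read address.
        address = ask_user_for_address(data)
        if address is None:
            # No reselection: leave the unbalanced data to another branch
            # (for example, equalization processing).
            return data, False
        # The third training data replaces the first training data, and the
        # loop condition repeats the balance determination.
        data = read_from_storage(address)
    return data, True
```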

In the second aspect, the present invention also provides a classification model training device, including:

A data acquisition module for acquiring first training data, where the first training data includes historical operating data of the target device;

A data judgment module for judging whether the first training data obtained by the data obtaining module is balanced;

A request sending module, configured to send an interaction request to the user if, according to the judgment result of the data judgment module, the first training data is unbalanced, where the interaction request asks the user to determine how the first training data is to be processed;

An instruction receiving module, configured to receive an equalization processing instruction sent by the user in response to the interaction request sent by the request sending module, where the equalization processing instruction includes at least one data set identifier, each data set identifier identifies a first data set in the first training data that causes the first training data to be unbalanced, and different data set identifiers identify different first data sets;

A data processing module, configured to perform equalization processing on the first training data, according to the equalization processing instruction received by the instruction receiving module, for the first data set identified by each data set identifier, to obtain second training data, where the second data set corresponding to each data set identifier in the second training data does not cause the second training data to be unbalanced;

A model training module is used to train the classification model corresponding to the target device using the second training data obtained by the data processing module.

After the data acquisition module acquires the first training data, the data judgment module determines whether the first training data is balanced. If the data judgment module determines that the first training data is unbalanced, the request sending module sends an interaction request to the user. When the instruction receiving module receives an equalization processing instruction sent by the user in response to that interaction request, the data processing module performs equalization processing on the first training data for the first data set identified by each data set identifier in the instruction, obtaining second training data in which the second data set corresponding to each data set identifier does not cause imbalance. The model training module then uses the second training data obtained by the data processing module to train the classification model corresponding to the target device. Thus, if the first training data is unbalanced because it contains few samples of historical operating data for some normal operation of the target device, it is converted into second training data before the classification model is trained, so that historical operating data recorded during normal operation no longer causes imbalance. This ensures that the trained classification model does not determine the target device to be abnormal on the basis of operating data recorded during normal operation, which improves the classification accuracy of the trained classification model.

Optionally, the data judgment module includes:

A clustering processing unit, configured to perform clustering on the first training data to obtain at least one third data set, where each third data set includes at least one sample, the values of the samples within one third data set lie in the same numerical range, and the values of samples in different third data sets lie in different numerical ranges;

A first judging unit, configured to determine that the first training data is balanced when the clustering performed by the clustering processing unit on the first training data yields only one third data set;

A second judging unit, configured to, when the clustering performed by the clustering processing unit on the first training data yields at least two third data sets, determine from the at least two third data sets a fourth data set containing the fewest samples and a fifth data set containing the most samples, determine that the first training data is unbalanced if the ratio of the number of samples in the fourth data set to the number of samples in the fifth data set is less than a preset proportion threshold, and determine that the first training data is balanced if that ratio is greater than or equal to the proportion threshold.

The clustering processing unit clusters the first training data into one or more third data sets, so that the values of the samples in each third data set lie in the same numerical range. The first judging unit and the second judging unit then act on the number of third data sets obtained by the clustering processing unit. If only one third data set is obtained, the first judging unit determines that the first training data is balanced. If at least two third data sets are obtained, the second judging unit first determines, from among the third data sets, the fourth data set containing the fewest samples and the fifth data set containing the most samples, and then judges whether the ratio of the number of samples in the fourth data set to the number of samples in the fifth data set is less than the preset proportion threshold; if so, the first training data is determined to be unbalanced, and otherwise it is determined to be balanced. The two judging units thus determine in two stages whether the first training data is balanced, combining the number of third data sets with the number of samples they contain, which ensures the accuracy of the balance judgment.

Optionally, the data processing module includes:

A data collection unit, configured to determine, for each data set identifier included in the equalization processing instruction, a sixth data set corresponding to the data set identifier, where the values of the samples in the sixth data set lie in the same numerical range as the values of the samples in the first data set identified by the data set identifier, and the ratio of the total number of samples in that first data set and the sixth data set to the number of samples in the fifth data set is greater than or equal to the proportion threshold;

A data combination unit is used to combine samples in each sixth data set determined by the data collection unit with the first training data to obtain second training data.

For each data set identifier included in the equalization processing instruction, the data collection unit determines a corresponding sixth data set such that the values of its samples lie in the same numerical range as the values of the samples in the first data set identified by the data set identifier, and such that the ratio of the total number of samples in that first data set and the sixth data set to the number of samples in the fifth data set is greater than or equal to the proportion threshold. The data combination unit combines the samples in each sixth data set determined by the data collection unit with the first training data to obtain the second training data.

For each data set identifier, the data collection unit determines, based on the value range of the samples in the identified first data set, a sixth data set containing at least one sample, in order to enlarge that first data set. After the data combination unit combines the samples of each sixth data set with the first training data, the second data set corresponding to a data set identifier in the second training data consists of the samples of the corresponding sixth data set together with the samples of the first data set identified by that data set identifier. The second data set corresponding to each data set identifier therefore does not cause the second training data to be unbalanced, and the unbalanced first training data is converted into the second training data through equalization processing.

Optionally, for each data set identifier, the data collection unit may collect samples from the first data set identified by the data set identifier and use the set of collected samples as the sixth data set corresponding to the data set identifier.

The data collection unit collects samples directly from the first data set identified by the data set identifier and uses the set of collected samples as the corresponding sixth data set. The samples of the first data set and of the sixth data set corresponding to the same data set identifier are combined to form the second data set corresponding to that data set identifier in the second training data. Because the sixth data set is drawn from the first data set, the values of its samples necessarily lie in the same numerical range as those of the first data set, which ensures that the equalization of the first training data is effective and accurate.

Optionally, the classification model training device may further include a data reselection module. After the instruction receiving module receives a data reselection instruction sent by the user in response to the interaction request, the data reselection module reads third training data from the storage space corresponding to the data read address included in the data reselection instruction, and uses the third training data as the first training data, which triggers the data judgment module to check the balance of the new first training data.

The data reselection module reselects the training data according to the user's instruction, so that the user can discard the previous first training data and choose other training data for training the classification model. This accommodates different usage needs and improves the applicability of classification model training.

In a third aspect, an embodiment of the present invention also provides a classification model training device, including: at least one memory and at least one processor;

At least one memory for storing machine-readable programs;

At least one processor is configured to invoke a machine-readable program to execute the foregoing first aspect or the method provided in any possible implementation manner of the first aspect.

A machine-readable program is stored in the memory, and by invoking it the processor can execute the method provided in the first aspect or in any possible implementation of the first aspect. After the first training data is acquired, it is determined whether the first training data is balanced. If it is not, an interaction request is sent to the user, asking the user to determine how the first training data is to be processed. When an equalization processing instruction sent by the user in response to the interaction request is received, the first training data is equalized according to that instruction to obtain second training data, in which the historical operating data recorded during normal operation of the target device no longer causes imbalance. The second training data is then used to train the classification model. Because the unbalanced first training data is converted into second training data in this way, the classification model trained on the second training data does not erroneously report the target device as abnormal on the basis of operating data recorded during normal operation, which ensures that the trained classification model has a higher classification accuracy.

In a fourth aspect, an embodiment of the present invention also provides a computer-readable medium on which computer instructions are stored. When the computer instructions are executed by a processor, the processor executes the method provided in the first aspect or in any possible implementation of the first aspect.

Computer instructions are stored on the computer-readable medium, and when they are executed by a processor, the processor executes the classification model training method provided in the first aspect or in any possible implementation of the first aspect. After the first training data is acquired, it is determined whether the first training data is balanced. If it is not, an interaction request is sent to the user, asking the user to determine how the first training data is to be processed. When an equalization processing instruction sent by the user in response to the interaction request is received, the first training data is equalized according to that instruction to obtain second training data, in which the historical operating data recorded during normal operation of the target device no longer causes imbalance. The second training data is then used to train the classification model. Because the unbalanced first training data is converted into second training data in this way, the classification model trained on the second training data does not erroneously report the target device as abnormal on the basis of operating data recorded during normal operation, which ensures that the trained classification model has a higher classification accuracy.

Description of the drawings

Other features, characteristics, advantages and benefits of the present invention will become more apparent through the following detailed description in conjunction with the accompanying drawings.

Fig. 1 is a flowchart of a classification model training method provided by an embodiment of the present invention;

Fig. 2 is a flowchart of a method for judging the balance of first training data provided by an embodiment of the present invention;

Fig. 3 is a flowchart of a method for performing equalization processing on first training data provided by an embodiment of the present invention;

Fig. 4 is a flowchart of a method for determining a sixth data set provided by an embodiment of the present invention;

Fig. 5 is a flowchart of a method for reselecting training data provided by an embodiment of the present invention;

Fig. 6 is a flowchart of another classification model training method provided by an embodiment of the present invention;

Fig. 7 is a schematic diagram of a classification model training device provided by an embodiment of the present invention;

Fig. 8 is a schematic diagram of another classification model training device provided by an embodiment of the present invention;

Fig. 9 is a schematic diagram of yet another classification model training device provided by an embodiment of the present invention;

Fig. 10 is a schematic diagram of a classification model training device including a data reselection module provided by an embodiment of the present invention;

Fig. 11 is a schematic diagram of still another classification model training device provided by an embodiment of the present invention.

List of reference signs:

101: Obtain the first training data

102: Determine whether the first training data is balanced

103: Send an interactive request to the user when the first training data is not balanced

104: Receive the equalization processing instruction from the user in response to the interaction request

105: Perform equalization processing on the first training data according to the equalization processing instruction to obtain second training data

106: Use the second training data to train the classification model

201: Perform clustering processing on the first training data to obtain at least one third data set

202: Determine whether the clustering process only obtains a third data set

203: Determine that the first training data is balanced

204: Determine the fourth data set and the fifth data set from each third data set

205: Determine whether the ratio of the number of samples in the fourth data set to the fifth data set is less than the proportion threshold

206: Determine that the first training data is unbalanced

301: Determine the sixth data set corresponding to each data set identifier respectively

302: Combine samples in each sixth data set with the first training data to obtain second training data

401: Collect at least one sample from the first data set identified by the data set identifier

402: Use the sample set including the collected samples as the sixth data set corresponding to the data set identifier

501: Receive a data reselection instruction from the user in response to the interaction request

502: Read the third training data according to the data reselection instruction

503: Use the third training data as the first training data, and perform step 102

601: Obtain the first training data

602: Perform clustering processing on the first training data to obtain at least one third data set

603: Determine whether to obtain only one third data set

604: Use the first training data to train a classification model

605: Determine the fourth data set and the fifth data set from each third data set

606: Determine whether the ratio of the number of samples in the fourth data set to the fifth data set is less than the proportion threshold

607: Send an interaction request to the user

608: Determine whether a model training instruction from the user in response to the interaction request is received

609: Determine whether a data reselection instruction from the user in response to the interaction request is received

610: Re-acquire the first training data according to the data reselection instruction

611: Determine whether an equalization processing instruction from the user in response to the interaction request is received

612: Determine respectively the sixth data set corresponding to each data set identifier in the equalization processing instruction

613: Combine samples in each sixth data set with the first training data to obtain second training data

614: Use the second training data to train the classification model

615: End the current process

701: Data acquisition module

702: Data judgment module

703: Request sending module

704: Instruction receiving module

705: Data processing module

706: Model training module

707: Data reselection module

7021: Clustering processing unit

7022: First judgment unit

7023: Second judgment unit

7051: Data collection unit

7052: Data combination unit

1101: Memory

1102: Processor

Detailed description

As mentioned earlier, the historical operating data of a device is commonly used directly as training data to train the classification model. When the training data is unbalanced, even if all of the training data is historical operating data from normal operation of the device, the trained classification model will treat the samples in a first data set that contains only a small number of samples as operating data generated when the device is abnormal. When such a classification model analyzes the operating data of the device in real time, it will conclude that the device is abnormal whenever normal operating data has a value that falls into the numerical range corresponding to that first data set. The classification model therefore generates a large number of false alarms, and its classification accuracy is low.

In the embodiment of the present invention, after the first training data used to train the classification model is acquired, it is first determined whether the first training data is balanced. When the first training data is determined to be unbalanced, an interaction request is sent to the user, and the user determines how the first training data is to be processed. Afterwards, if an equalization processing instruction from the user in response to the interaction request is received, the first training data is equalized into second training data according to the equalization processing instruction, so that the historical operating data of the target device during normal operation does not cause the second training data to be unbalanced. The second training data is then used to train the classification model corresponding to the target device. It can be seen that when historical operating data from normal operation of the device causes the first training data to be unbalanced, equalizing the first training data yields second training data in which that historical operating data no longer causes imbalance. This reduces the false alarm probability of the classification model trained with the second training data and improves the classification accuracy of the trained classification model.

The classification model training method and device provided by the embodiments of the present invention will be described in detail below with reference to the accompanying drawings.

As shown in Fig. 1, an embodiment of the present invention provides a classification model training method, which may include the following steps:

Step 101: Acquire first training data, where the first training data includes historical operating data of the target device;

Step 102: Determine whether the first training data is balanced;

Step 103: If the first training data is not balanced, send an interaction request to the user, where the interaction request is used to request the user to determine a way to process the first training data;

Step 104: Receive an equalization processing instruction from the user in response to the interaction request, where the equalization processing instruction includes at least one data set identifier, each data set identifier is used to identify a first data set in the first training data that causes the first training data to be unbalanced, and the first data sets identified by different data set identifiers are different;

Step 105: According to the equalization processing instruction, perform equalization processing on the first training data for the first data set identified by each data set identifier to obtain second training data, where, in the second training data, the second data set corresponding to each data set identifier does not cause the second training data to be unbalanced;

Step 106: Use the second training data to train a classification model corresponding to the target device.
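The flow of steps 101 to 106 can be sketched as follows. This is only an illustrative outline: the function name `train_classification_model` and the four callables passed to it (`find_minority_sets`, `request_user_choice`, `equalize`, `train`) are hypothetical placeholders for the balance judgment, the user interaction, the equalization processing and the model training, none of which is prescribed in this form by the description.

```python
from typing import Callable, Dict, List, Sequence


def train_classification_model(
    first_training_data: Sequence[float],
    find_minority_sets: Callable[[Sequence[float]], Dict[str, List[float]]],
    request_user_choice: Callable[[List[str]], List[str]],
    equalize: Callable[[Sequence[float], List[List[float]]], List[float]],
    train: Callable[[Sequence[float]], object],
) -> object:
    """Hypothetical outline of steps 101-106 (all callables are assumptions)."""
    # Step 102: an empty mapping is taken to mean "the data is balanced".
    minority_sets = find_minority_sets(first_training_data)
    if not minority_sets:
        # Balanced data is used directly to train the classification model.
        return train(first_training_data)

    # Steps 103-104: send the interaction request and receive the data set
    # identifiers contained in the equalization processing instruction.
    chosen_identifiers = request_user_choice(sorted(minority_sets))
    first_data_sets = [minority_sets[i] for i in chosen_identifiers]

    # Step 105: equalize the first training data for the chosen data sets.
    second_training_data = equalize(first_training_data, first_data_sets)

    # Step 106: train the classification model with the second training data.
    return train(second_training_data)
```

In this sketch the user interaction is reduced to a single callable that returns the selected identifiers; the other alternatives offered to the user (training on the existing data, reselecting data) are left out for brevity.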

In the classification model training method provided by the embodiment of the present invention, after the first training data including the historical operating data of the target device is obtained, it is determined whether the first training data is balanced. When the first training data is determined to be unbalanced, an interaction request is sent to the user, and the user determines how to process the first training data. After the equalization processing instruction from the user in response to the interaction request is received, the first training data is equalized for the first data set identified by each data set identifier included in the equalization processing instruction. In the second training data obtained by the equalization processing, the second data set corresponding to each data set identifier does not cause the second training data to be unbalanced, and the second training data is then used to train the classification model corresponding to the target device. It can be seen that when the first training data is determined to be unbalanced, the user decides whether the first training data needs to be equalized and for which first data sets. When the user decides to equalize the first training data, the second training data is obtained by equalizing the first training data, so that the historical operating data of the target device during normal operation does not cause the second training data to be unbalanced. The classification model trained with the second training data therefore does not make misjudgments caused by unbalanced training data, which ensures that the trained classification model has high classification accuracy.

In the embodiment of the present invention, when the first training data is obtained in step 101, historical operating data including a certain number of samples can be selected from the historical operating data of the target device as the first training data. Specifically, the historical operating data of the target device over one continuous period of time may be used as the first training data, or the historical operating data of the target device over multiple discontinuous time periods may be used as the first training data. In addition, because the operating environments of different devices are not completely the same, when a classification model is trained for a device in order to monitor whether that device is abnormal, the historical operating data of that device should be used as the training data, so that the trained classification model has high classification accuracy.

In the embodiment of the present invention, when it is determined that the first training data is unbalanced and an interaction request is sent to the user, the interaction request may include multiple alternatives, so that the user can select from them the way in which the first training data is to be processed. The interaction request may include the following three alternatives: training a classification model based on the existing training data, reselecting training data, and performing equalization processing on the training data. The following describes the processing performed after the user selects each of the three alternatives:

When the user chooses to train the classification model based on the existing training data, the first training data is directly used to train the classification model corresponding to the target device. In this case, at least one first data set in the first training data that includes a small number of samples causes the first training data to be unbalanced, and every sample in each such first data set is historical operating data from abnormal operation of the target device, so the first training data is used directly to train the classification model. When the trained classification model analyzes the operating data of the target device in real time, if the operating data falls into the numerical range corresponding to any first data set, the classification model concludes that the target device is abnormal.

When the user chooses to re-select the training data, the training data is re-read according to the data storage address provided by the user, and the re-read training data is used as the first training data to start step 102. In this case, the user confirms that the first training data selected for training the classification model is wrong, reselects the training data according to the user's instruction, and further determines whether the reselected training data is balanced.

When the user chooses to equalize the training data, each data set that causes the first training data to be unbalanced is shown to the user, the user selects the first data sets from the displayed data sets, and an equalization processing instruction including the data set identifier of each selected first data set is generated. In this case, when determining whether the first training data is balanced, at least one data set that causes the first training data to be unbalanced can be determined, but the small number of samples included in such a data set may be historical operating data from abnormal operation of the target device, or may be historical operating data from normal operation of the target device. The user therefore needs to distinguish between the two: a data set whose small number of samples are historical operating data from normal operation of the target device is determined as a first data set, and the first training data is equalized for that first data set. In this way, in the second training data obtained by equalizing the first training data, the historical operating data of the target device during normal operation does not cause the second training data to be unbalanced, while the historical operating data of the target device during abnormal operation may still cause the second training data to be unbalanced. The classification model trained with the second training data can then identify more accurately whether the target device is abnormal.

In addition, when sending an interaction request to the user, the above three alternatives can be displayed to the user in the form of a prompt box. Depending on the alternative selected by the user, either further interaction with the user is carried out, or the first training data is processed directly according to the selected alternative.
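The three alternatives described above can be modelled, for illustration only, as an enumeration with a simple dispatch. The names `Alternative` and `next_action` are hypothetical, and the returned strings merely summarize the processing described for each alternative.

```python
from enum import Enum


class Alternative(Enum):
    """The three alternatives offered in the interaction request."""
    TRAIN_ON_EXISTING_DATA = 1
    RESELECT_TRAINING_DATA = 2
    EQUALIZE_TRAINING_DATA = 3


def next_action(choice: Alternative) -> str:
    """Return a short description of what happens after the user's choice."""
    if choice is Alternative.TRAIN_ON_EXISTING_DATA:
        # The first training data is used directly for training.
        return "train the classification model with the first training data"
    if choice is Alternative.RESELECT_TRAINING_DATA:
        # New data is read and the balance judgment (step 102) is repeated.
        return "read new training data and repeat step 102"
    if choice is Alternative.EQUALIZE_TRAINING_DATA:
        # The user selects first data sets; equalization yields the second
        # training data, which is then used for training.
        return "equalize the first training data and train with the second training data"
    raise ValueError(f"unknown alternative: {choice!r}")
```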

In the embodiment of the present invention, when determining whether the first training data is balanced in step 102, if the judgment result is that the first training data is balanced, the first training data is directly used to train the classification model corresponding to the target device.

In the embodiment of the present invention, when step 106 uses the second training data to train the classification model corresponding to the target device, the second training data can be used as input to train the classification model through various types of machine learning algorithms. The specific machine learning algorithm can be an artificial neural network algorithm, a deep learning algorithm, a kernel-based algorithm, an ensemble algorithm, a genetic algorithm, and so on.

Optionally, based on the classification model training method shown in FIG. 1, when step 102 determines whether the first training data is balanced, the first training data may be clustered into one or more data sets, and whether the first training data is balanced is then determined according to the number of data sets obtained by clustering and the ratio between the numbers of samples in those data sets. Specifically, as shown in FIG. 2, determining whether the first training data is balanced can be achieved through the following steps:

Step 201: Perform clustering processing on the first training data to obtain at least one third data set, where each third data set includes at least one sample, the values of the samples in the same third data set are within the same numerical range, and the values of the samples in different third data sets are in different numerical ranges;

Step 202: Determine whether clustering of the first training data obtains only one third data set; if yes, go to step 203, otherwise go to step 204;

Step 203: Determine that the first training data is balanced, and end the current process;

Step 204: Determine, from at least two third data sets, a fourth data set that includes the smallest number of samples and a fifth data set that includes the largest number of samples;

Step 205: Determine whether the ratio of the number of samples in the fourth data set to the fifth data set is less than the preset proportion threshold, if yes, go to step 206, otherwise go to step 203;

Step 206: Determine that the first training data is not balanced.
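Steps 202 to 206 can be sketched as a small function operating on third data sets that have already been obtained by clustering (step 201). The function name `is_balanced` and the default threshold of 0.6 (one of the example values mentioned in this description) are assumptions made for illustration.

```python
from typing import Sequence


def is_balanced(
    third_data_sets: Sequence[Sequence[float]],
    proportion_threshold: float = 0.6,
) -> bool:
    """Two-stage balance judgment over already clustered data (steps 202-206)."""
    if not third_data_sets or any(len(s) == 0 for s in third_data_sets):
        raise ValueError("each third data set must contain at least one sample")

    # Steps 202-203: a single third data set means the data is balanced.
    if len(third_data_sets) == 1:
        return True

    # Step 204: fourth data set = fewest samples, fifth data set = most samples.
    fourth_data_set = min(third_data_sets, key=len)
    fifth_data_set = max(third_data_sets, key=len)

    # Steps 205-206: unbalanced if the ratio is below the proportion threshold.
    ratio = len(fourth_data_set) / len(fifth_data_set)
    return ratio >= proportion_threshold
```

For instance, two third data sets with 7 and 10 samples give a ratio of 0.7, which is treated as balanced at a threshold of 0.6 and as unbalanced at a threshold of 0.85.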

In the embodiment of the present invention, by performing clustering processing on the first training data, the first training data can be clustered into at least one third data set, so that each third data set includes at least one sample, the values of the samples in the same third data set are in the same numerical range, and the values of the samples in different third data sets are in different numerical ranges. The first training data is composed of historical operating data of the target device. Since the values of the operating data are in the same numerical range when the target device runs normally in the same operating mode, one or more third data sets can be obtained by clustering the first training data, and the values of the samples in the same third data set are within the same numerical range. For example, the target device has two operating modes. When the target device operates normally in the first operating mode, the value range of the operating data is 50 to 80, and when the target device operates normally in the second operating mode, the value range of the operating data is 120 to 150. Clustering the first training data yields three third data sets, namely the third data set 1, the third data set 2, and the third data set 3, where the value of each sample in the third data set 1 is in the range of 50 to 80, the value of each sample in the third data set 2 is in the range of 120 to 150, and the value of each sample in the third data set 3 is in the range of 200 to 240.

In the embodiment of the present invention, after at least one third data set is obtained by performing clustering processing on the first training data, a preliminary judgment is first made on whether the first training data is balanced according to the number of third data sets. If the clustering process obtains only one third data set, that is, the values of all samples included in the first training data are within the same numerical range, the first training data can be determined to be balanced. If the clustering process obtains at least two third data sets, it is necessary to further determine whether the first training data is balanced according to the number of samples in each third data set.

In the embodiment of the present invention, after the first training data is clustered to obtain at least two third data sets, the fourth data set with the smallest number of samples and the fifth data set with the largest number of samples are determined from the obtained third data sets, and the numbers of samples in the fourth data set and the fifth data set are then compared. If the ratio of the number of samples in the fourth data set to the number of samples in the fifth data set is less than the preset proportion threshold, the first training data is determined to be unbalanced; if the ratio is greater than or equal to the proportion threshold, the first training data is determined to be balanced. Since the fourth data set has the smallest number of samples among all the third data sets and the fifth data set has the largest, the difference in the number of samples between the fourth data set and the fifth data set is the largest. If the ratio of the number of samples in the fourth data set to the number of samples in the fifth data set is less than the preset proportion threshold, at least one third data set in the first training data differs greatly in its number of samples from the other third data sets, that is, data imbalance appears in the first training data.

In the embodiment of the present invention, the proportion threshold can be set flexibly within the range of 0.5 to 1.0 according to the actual business scenario. The larger the proportion threshold, the higher the sensitivity of the balance detection performed on the first training data. The proportion threshold can, for example, be 0.6, 0.7, 0.85, or 0.9.

In the embodiment of the present invention, when performing clustering processing on the first training data in step 201, the first training data can be clustered through a K-means clustering algorithm, a spectral clustering algorithm, a Gaussian mixture model clustering algorithm, or the like.
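As a dependency-free illustration of step 201, the sketch below groups scalar samples by splitting the sorted values wherever two neighbouring values are further apart than a chosen gap. This gap-based split is a deliberately simple stand-in and is not the K-means, spectral or Gaussian mixture clustering named above; the function name `cluster_by_gap` and the parameter `max_gap` are assumptions.

```python
from typing import List, Sequence


def cluster_by_gap(values: Sequence[float], max_gap: float) -> List[List[float]]:
    """Split sorted scalar samples into groups separated by more than max_gap.

    A simple stand-in for the clustering of step 201: each returned group
    covers its own numerical range, and the ranges do not overlap.
    """
    ordered = sorted(values)
    if not ordered:
        return []

    third_data_sets: List[List[float]] = [[ordered[0]]]
    for value in ordered[1:]:
        if value - third_data_sets[-1][-1] > max_gap:
            # The gap to the previous sample is too large: start a new set.
            third_data_sets.append([value])
        else:
            third_data_sets[-1].append(value)
    return third_data_sets
```

With samples drawn from the ranges 50 to 80, 120 to 150 and 200 to 240 used in the example of this description, a gap of 30 separates the samples into three groups, one per range. In practice a library implementation of one of the clustering algorithms named above would normally be used instead.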

One or more third data sets are obtained by clustering the first training data, and whether the first training data is balanced is determined in two stages, according to the number of third data sets and the number of samples included in each third data set. This ensures that the balance of the first training data can be judged accurately, thereby ensuring that the classification model trained with the first training data or the second training data has high classification accuracy.

It should be noted that, on the basis of the classification model training method shown in FIG. 1, when determining whether the first training data is balanced in step 102, methods other than the clustering processing of the foregoing embodiment can also be used. For example, whether the first training data is balanced can be determined by performing histogram curve fitting on the first training data, or by estimating the distribution of the first training data.

Optionally, based on the first training data balance judgment method shown in FIG. 2, when it is determined that the first training data is unbalanced and the first training data is equalized according to the equalization processing instruction from the user, the imbalance of the first training data can be resolved by adding samples to the first training data. Specifically, as shown in FIG. 3, the first training data can be equalized in the following manner:

Step 301: For each data set identifier included in the equalization processing instruction, determine a sixth data set corresponding to the data set identifier, where the values of the samples in the sixth data set are in the same numerical range as the values of the samples in the first data set identified by the data set identifier, and the ratio of the total number of samples in the first data set identified by the data set identifier and the sixth data set to the number of samples in the fifth data set is greater than or equal to the proportion threshold;

Step 302: Combine the determined samples in each sixth data set corresponding to each data set identifier with the first training data to obtain second training data.
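The size condition in step 301 can be made concrete. If n1 is the number of samples in the first data set, n5 the number of samples in the fifth data set and t the proportion threshold, the sixth data set needs at least ceil(t * n5) - n1 samples (and none if that value is negative). The sketch below computes this minimum; the function name `required_sixth_set_size` is hypothetical, and exact fractions are used only to avoid floating-point rounding at the boundary.

```python
from fractions import Fraction
from math import ceil


def required_sixth_set_size(
    n_first: int, n_fifth: int, proportion_threshold: float = 0.6
) -> int:
    """Smallest sixth data set size satisfying the ratio condition of step 301.

    Condition: (n_first + n_sixth) / n_fifth >= proportion_threshold.
    """
    # Convert via str so that a value such as 0.6 becomes exactly 3/5.
    threshold = Fraction(str(proportion_threshold))
    needed_total = ceil(threshold * n_fifth)
    return max(0, needed_total - n_first)
```

For a first data set of 1000 samples, a fifth data set of 8000 samples and a threshold of 0.6, the sixth data set needs at least 3800 samples, since (1000 + 3800) / 8000 = 0.6.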

In the embodiment of the present invention, each data set identifier included in the equalization processing instruction identifies a corresponding first data set. The first data set is a data set selected by the user from the third data sets, and the small number of samples in the first data set causes the first training data to be unbalanced. A corresponding sixth data set can therefore be determined for the data set identifier. The sixth data set includes at least one sample, the value of each sample in the sixth data set is in the same numerical range as the values of the samples in the first data set, and the ratio of the total number of samples in the sixth data set and the first data set to the number of samples in the fifth data set is greater than or equal to the proportion threshold. The samples in the sixth data set corresponding to each data set identifier are combined with the first training data to obtain the second training data. In the second training data, for each first data set that caused the first training data to be unbalanced, the ratio of the total number of samples in the first data set and the corresponding sixth data set to the number of samples in the fifth data set is greater than or equal to the proportion threshold, so when the first data set and the corresponding sixth data set are viewed as a whole, they do not cause the second training data to be unbalanced.

For example, following the example in the foregoing embodiment, the equalization processing instruction includes a data set identifier 1. The data set identifier 1 identifies the first data set 1 obtained by processing the first training data, and the first data set 1 is the third data set 2 in the foregoing embodiment. The first training data is unbalanced because the first data set 1 includes a small number of samples compared with the third data set 1. A sixth data set is therefore determined for the data set identifier 1. The values of the samples included in the sixth data set are all in the range of 120 to 150 (the same numerical range as the samples in the first data set 1), and the ratio of the total number of samples in the sixth data set and the first data set 1 to the number of samples in the fifth data set (here, the third data set 1) is greater than or equal to the proportion threshold. In this way, since the samples in the first data set 1 and the sixth data set have the same numerical range, the first data set 1 and the sixth data set are samples of the same kind, and the first data set 1 and the sixth data set taken together will not cause the second training data to be unbalanced.

Optionally, on the basis of the method of equalizing the first training data shown in FIG. 3, when determining the corresponding sixth data set for each data set identifier in step 301, samples may be collected from the first data set identified by the data set identifier to form the sixth data set corresponding to the data set identifier. For each data set identifier, the method for determining the sixth data set corresponding to the data set identifier is shown in FIG. 4, and may specifically include the following steps:

Step 401: Collect at least one sample from the first data set identified by the data set identifier;

Step 402: Use a sample set including at least one collected sample as a sixth data set corresponding to the data set identifier.

In the embodiment of the present invention, for each data set identifier, in order to obtain samples in the same numerical range as the samples in the first data set identified by the data set identifier, samples are collected from the first data set identified by the data set identifier, and a data set including the collected samples is then used as the sixth data set corresponding to the data set identifier. For example, for the data set identifier 1 included in the equalization processing instruction, since the data set identifier 1 identifies the first data set 1, a corresponding number of samples can be collected from the first data set 1, and the data set formed by the collected samples serves as the sixth data set corresponding to the data set identifier 1.

Since the samples in the determined sixth data set are ultimately combined with the first training data to obtain the second training data, and the samples in the sixth data set are collected from the first data set, the sixth data set is in essence obtained by copying some or all of the samples in the first data set one or more times. This ensures that the samples in the sixth data set have the same numerical range as the samples in the corresponding first data set, and also makes the sixth data set convenient to determine.

In the embodiment of the present invention, when collecting samples from the first data set, all samples in the first data set may be copied one or more times, and random sampling methods may also be used to randomly collect samples from the first data set.
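The two collection options just mentioned, copying all samples and random sampling, can be combined in a short sketch: the first data set is copied in full as often as it fits into the requested size, and the remainder is drawn at random. The function name `collect_sixth_data_set` and its parameters are assumptions made for illustration.

```python
import random
from typing import List, Optional, Sequence


def collect_sixth_data_set(
    first_data_set: Sequence[float],
    size: int,
    rng: Optional[random.Random] = None,
) -> List[float]:
    """Build a sixth data set of the given size from copies of the first data set."""
    if not first_data_set:
        raise ValueError("the first data set must contain at least one sample")
    if size < 0:
        raise ValueError("size must not be negative")

    rng = rng or random.Random()
    samples = list(first_data_set)

    # Copy the whole first data set as many times as fits into the target size.
    full_copies, remainder = divmod(size, len(samples))
    sixth_data_set = samples * full_copies

    # Fill the rest by random sampling without replacement.
    sixth_data_set += rng.sample(samples, remainder)
    return sixth_data_set
```

The second training data is then obtained by appending the returned samples to the first training data, as in step 302. Every collected sample is a copy of a sample in the first data set, so it lies in the same numerical range.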

In addition, when determining the sixth data set corresponding to the data set identifier, in addition to collecting samples from the corresponding first data set in the manner shown in FIG. 4, other methods may also be used to obtain the sixth data set. For example, samples can be collected from the historical operating data of the target device, or new samples can be generated directly based on the samples in the first data set.

Optionally, on the basis of the classification model training methods provided in the foregoing embodiments, after step 103 sends an interaction request to the user, if the user instructs that the training data be reselected, the training data can be reselected according to the user's instruction to resolve the imbalance of the first training data. Specifically, as shown in FIG. 5, the method for reselecting training data may include the following steps:

Step 501: Receive a data reselection instruction from a user in response to an interaction request, where the data reselection instruction includes a data read address;

Step 502: According to the data reselection instruction, read the third training data from the storage space corresponding to the data read address;

Step 503: Use the third training data as the first training data, and execute step 102 to determine whether the first training data is balanced.

After the interaction request is sent to the user, if the user sends a data reselection instruction in response to the interaction request, the third training data is read from the corresponding storage space according to the data read address included in the data reselection instruction. The third training data that has been read is then used as the first training data, and step 102 is executed again.
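Steps 501 to 503 can be outlined as follows. In this sketch the storage space is represented by an in-memory dictionary keyed by data read address, and the balance judgment of step 102 is passed in as a callable; both are simplifying assumptions, as are the names `handle_data_reselection`, `storage` and `is_balanced`.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def handle_data_reselection(
    data_read_address: str,
    storage: Dict[str, List[float]],
    is_balanced: Callable[[Sequence[float]], bool],
) -> Tuple[List[float], bool]:
    """Hypothetical handling of a data reselection instruction (steps 501-503)."""
    # Step 502: read the third training data from the addressed storage space.
    third_training_data = storage[data_read_address]

    # Step 503: the third training data becomes the new first training data,
    # and the balance judgment of step 102 is executed again.
    first_training_data = list(third_training_data)
    return first_training_data, is_balanced(first_training_data)
```

If the reselected data is again found to be unbalanced, a new interaction request would be sent to the user, as in step 103.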

After it is determined that the first training data is unbalanced, an interaction request is sent to the user, and the user can send a data reselection instruction to reselect the training data used to train the classification model, thereby resolving the imbalance of the previously selected first training data. This provides the user with another way to deal with data imbalance and can improve the user experience.

The following takes the training of a classification model that determines, based on current data during the operation of a motor, whether the motor is operating abnormally as an example to further describe the classification model training method provided by the embodiment of the present invention. As shown in FIG. 6, the method may include the following steps:

Step 601: Obtain first training data.

In the embodiment of the present invention, when training a classification model for analyzing whether motor A is abnormal, a certain amount of historical operating data is obtained from the historical operating data of motor A as the first training data, specifically the current data of motor A. For example, the current data of motor A from the past 3 months of operation is acquired as the first training data.

Step 602: Perform clustering processing on the first training data to obtain at least one third data set.

In the embodiment of the present invention, after the first training data is obtained, the first training data is clustered into at least one third data set, so that each third data set includes at least one sample, the values of the samples in the same third data set are within the same numerical range, and the values of the samples in different third data sets are within different numerical ranges. That is, each third data set corresponds to a numerical range, and the numerical ranges corresponding to different third data sets do not overlap. In addition, each sample corresponds to a piece of current data.

For example, by clustering the first training data, three third data sets are obtained, namely, the third data set 1, the third data set 2, and the third data set 3. The third data set 1 includes 8000 samples, the third data set 2 includes 1000 samples, and the third data set 3 includes 2000 samples. The value of each sample in the third data set 1 is in the range of 50 to 80, the value of each sample in the third data set 2 is in the range of 120 to 150, and the value of each sample in the third data set 3 is in the range of 200 to 240.

Step 603: Determine whether only one third data set is obtained, if yes, go to step 604, otherwise go to step 605.

In the embodiment of the present invention, after the first training data is clustered to obtain the third data sets, it is determined whether the clustering process obtains only one third data set. If only one third data set is obtained, the first training data is balanced, and step 604 is performed accordingly. If at least two third data sets are obtained, it is necessary to further determine whether the first training data is balanced, and step 605 is performed accordingly.

Step 604: Use the first training data to train the classification model, and end the current process.

In the embodiment of the present invention, when clustering the first training data yields only one third data set, this indicates that the first training data is balanced, and the first training data is used directly to train the classification model corresponding to motor A.

Step 605: Determine the fourth data set and the fifth data set from each third data set.

In the embodiment of the present invention, when at least two third data sets are obtained, the fourth data set, which includes the smallest number of samples, and the fifth data set, which includes the largest number of samples, are determined from the third data sets.

For example, because the number of samples in the third data set 1 is greater than the number of samples in the third data set 3, and the number of samples in the third data set 3 is greater than the number of samples in the third data set 2, the third data set 1 is determined as the fifth data set, and the third data set 2 is determined as the fourth data set.

Step 606: Determine whether the ratio of the number of samples in the fourth data set to the number of samples in the fifth data set is less than the proportion threshold; if yes, go to step 607, otherwise go to step 604.

In the embodiment of the present invention, after the fourth data set and the fifth data set are obtained, the ratio of the number of samples in the fourth data set to the number of samples in the fifth data set is compared with a preset proportion threshold. If the ratio is less than the proportion threshold, it is determined that the first training data is unbalanced, and step 607 is performed accordingly. If the ratio is greater than or equal to the proportion threshold, it is determined that the first training data is balanced, and step 604 is executed accordingly.
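The balance judgment of steps 603 to 606 can be sketched as follows. The function name `is_balanced` is hypothetical; the sample counts are those of the three third data sets in the example, and the threshold of 0.8 is the value used in the example of step 612.

```python
from typing import Sequence


def is_balanced(cluster_sizes: Sequence[int], proportion_threshold: float) -> bool:
    """Return True if the training data is considered balanced.

    `cluster_sizes` holds the number of samples in each third data set.
    """
    if not cluster_sizes:
        raise ValueError("at least one third data set is required")
    if len(cluster_sizes) == 1:
        return True                    # step 603: a single set is balanced
    smallest = min(cluster_sizes)      # size of the fourth data set
    largest = max(cluster_sizes)       # size of the fifth data set
    return smallest / largest >= proportion_threshold   # step 606


sizes = [8000, 1000, 2000]
balanced = is_balanced(sizes, proportion_threshold=0.8)  # 1000 / 8000 = 0.125
```

With these counts the ratio is 0.125, which is below 0.8, so the data is judged unbalanced and the flow would continue with step 607.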

Step 607: Send an interaction request to the user.

In the embodiment of the present invention, when it is determined that the first training data is unbalanced, an interaction request is sent to the user, requesting the user to determine a method for processing the first training data.

Step 608: Determine whether a model training instruction from the user in response to the interaction request is received, if yes, execute step 604, otherwise execute step 609.

In the embodiment of the present invention, after the interaction request is sent to the user, if a model training instruction from the user in response to the interaction request is received, where the model training instruction instructs that the existing first training data be used to train the classification model, step 604 is executed and the first training data is used to train the classification model corresponding to motor A.

Step 609: Determine whether a data reselection instruction from the user in response to the interaction request is received, if yes, go to step 610, otherwise go to step 611.

In the embodiment of the present invention, after sending an interaction request to the user, if a data reselection instruction from the user in response to the interaction request is received, the training data for training the classification model needs to be reselected, and step 610 is performed accordingly.

Step 610: Re-acquire the first training data according to the data reselection instruction, and execute step 602.

In the embodiment of the present invention, after the data reselection instruction is received, third training data is read from the corresponding storage space according to the data read address included in the data reselection instruction, and after the read third training data is taken as the first training data, step 602 is executed.

For example, the data read address included in the data reselection instruction is the current data storage address of motor A, and then the current data of motor A is read from the current data storage address as the new first training data, and then step 602 is restarted.

Step 611: Determine whether an equalization processing instruction from the user in response to the interaction request is received; if yes, execute step 612, otherwise execute step 615.

In the embodiment of the present invention, after the interaction request is sent to the user, if an equalization processing instruction from the user in response to the interaction request is received, the first training data needs to be equalized, and step 612 is performed accordingly. Otherwise, the user has given no corresponding instruction, and the current process ends.
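The branching of steps 608, 609 and 611 amounts to dispatching on the kind of instruction the user returns. The sketch below is illustrative only: the `Instruction` enumeration and the `next_step` function are hypothetical names, and the returned integers are simply the step numbers of Figure 6.

```python
from enum import Enum, auto
from typing import Optional


class Instruction(Enum):
    """Hypothetical labels for the three user responses described above."""
    MODEL_TRAINING = auto()     # train with the existing first training data
    DATA_RESELECTION = auto()   # read new training data from a given address
    EQUALIZATION = auto()       # equalize the identified first data sets


def next_step(instruction: Optional[Instruction]) -> int:
    """Map the user's response to the step of Figure 6 that follows it."""
    if instruction is Instruction.MODEL_TRAINING:
        return 604
    if instruction is Instruction.DATA_RESELECTION:
        return 610
    if instruction is Instruction.EQUALIZATION:
        return 612
    return 615                  # no instruction received: end the process


step_after_equalization_request = next_step(Instruction.EQUALIZATION)
```

Because the three checks are mutually exclusive, their order can be changed without altering the result.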

Step 612: Determine the sixth data set corresponding to each data set identifier included in the equalization processing instruction.

In the embodiment of the present invention, after the equalization processing instruction is received, the at least one data set identifier included in the equalization processing instruction is acquired. Each data set identifier identifies one first data set, and different data set identifiers identify different first data sets, where a first data set is a data set that the user selects from the third data sets as causing the first training data to be unbalanced. For each acquired data set identifier, a sixth data set corresponding to the data set identifier is determined, where the sixth data set includes at least one sample, the values of the samples in the sixth data set lie in the same numerical range as the values of the samples in the first data set identified by the data set identifier, and the ratio of the total number of samples in the sixth data set and the first data set identified by the data set identifier to the number of samples in the fifth data set is greater than or equal to the proportion threshold. Specifically, for each data set identifier, samples may be collected from the first data set identified by the data set identifier, and the collected samples may be combined into the sixth data set corresponding to the data set identifier. Further, for each data set identifier, the number of samples in the corresponding sixth data set satisfies the following condition: the ratio of the total number of samples in the sixth data set corresponding to the data set identifier and the first data set identified by the data set identifier to the number of samples in the fifth data set is equal to the proportion threshold.

For example, if the equalization processing instruction includes data set identifier 1, data set identifier 1 identifies third data set 2, and the preset proportion threshold is 0.8, then 5400 samples are collected from third data set 2, and the collected 5400 samples are taken as the sixth data set corresponding to data set identifier 1, so that the ratio (1000 + 5400) / 8000 equals the threshold of 0.8. It should be noted that, through user interaction, the user determines that the samples in third data set 3 are current data from when motor A was running abnormally, so there is no need to perform equalization processing on the first training data for third data set 3.
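The size of the sixth data set in this example follows from the equality condition of step 612. A minimal sketch of that arithmetic, with the hypothetical function name `samples_needed`, is:

```python
import math


def samples_needed(first_set_size: int, fifth_set_size: int,
                   proportion_threshold: float) -> int:
    """Number of samples the sixth data set must contain.

    Solves (first_set_size + n) / fifth_set_size >= proportion_threshold
    for the smallest non-negative integer n.
    """
    # round() guards against floating-point noise before taking the ceiling.
    target = math.ceil(round(proportion_threshold * fifth_set_size, 9))
    return max(0, target - first_set_size)


# Third data set 2 (1000 samples) against third data set 1 (8000 samples).
sixth_set_size = samples_needed(1000, 8000, proportion_threshold=0.8)
```

For the figures in the example this gives 0.8 × 8000 − 1000 = 5400 samples.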

Step 613: Combine the samples in each sixth data set with the first training data to obtain second training data.

In the embodiment of the present invention, after obtaining the sixth data set corresponding to each data set identifier, the obtained samples in each sixth data set are combined with the first training data to obtain the second training data.

For example, the 11000 samples included in the first training data are combined with the 5400 samples included in the sixth data set corresponding to data set identifier 1, to obtain second training data including 16400 samples.
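Steps 612 and 613 together can be sketched as below. The document says only that samples are collected from the identified first data set; since 5400 samples are needed from a set of 1000, this sketch assumes sampling with replacement. The function name `build_second_training_data`, the fixed seed, and the constant toy values standing in for current readings are all assumptions.

```python
import random
from typing import List


def build_second_training_data(first_training_data: List[float],
                               first_data_set: List[float],
                               n_extra: int,
                               seed: int = 0) -> List[float]:
    """Collect a sixth data set and combine it with the first training data."""
    rng = random.Random(seed)
    # Assumption: sampling with replacement from the identified first data set.
    sixth_data_set = rng.choices(first_data_set, k=n_extra)
    return first_training_data + sixth_data_set


# Toy stand-ins for the three third data sets of the example.
set_1 = [65.0] * 8000    # values in the range 50 to 80
set_2 = [135.0] * 1000   # values in the range 120 to 150
set_3 = [220.0] * 2000   # values in the range 200 to 240
first = set_1 + set_2 + set_3

second = build_second_training_data(first, set_2, n_extra=5400)
in_range_2 = sum(1 for v in second if 120 <= v <= 150)
```

After combination, the set in the 120 to 150 range holds 6400 of the 16400 samples, so its ratio to the 8000 samples of the fifth data set reaches the 0.8 threshold.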

Step 614: Use the second training data to train the classification model.

In the embodiment of the present invention, after the second training data is obtained, the obtained second training data is used to train the classification model corresponding to the motor A.

Step 615: End the current process.

It should be noted that, in the model training method shown in FIG. 6, the steps are split up in order to explain the model training process more clearly. In an actual implementation there is no absolute sequence between the steps; for example, step 609 and step 611 can be executed before step 608, step 611 can be executed before step 609, and so on.

As shown in FIG. 7, an embodiment of the present invention provides a classification model training device, including:

A data acquisition module 701, configured to acquire first training data, where the first training data includes historical operating data of the target device;

A data judgment module 702, configured to judge whether the first training data acquired by the data acquisition module 701 is balanced;

A request sending module 703, configured to send an interaction request to the user if, according to the judgment result of the data judgment module 702, the first training data is unbalanced, where the interaction request is used to request the user to determine how to process the first training data;

An instruction receiving module 704, configured to receive an equalization processing instruction from the user in response to the interaction request sent by the request sending module 703, wherein the equalization processing instruction includes at least one data set identifier, each of the at least one data set identifier is used to identify a first data set in the first training data that causes the first training data to be unbalanced, and the first data sets identified by different data set identifiers are different;

A data processing module 705, configured to perform equalization processing on the first training data for the first data set identified by each data set identifier, according to the equalization processing instruction received by the instruction receiving module 704, to obtain second training data, wherein each second data set corresponding to each data set identifier in the second training data does not cause the second training data to be unbalanced;

A model training module 706, configured to use the second training data obtained by the data processing module 705 to train a classification model corresponding to the target device.

In the embodiment of the present invention, the data acquisition module 701 can be used to perform step 101 in the above method embodiment, the data judgment module 702 can be used to perform step 102, the request sending module 703 can be used to perform step 103, the instruction receiving module 704 can be used to perform step 104, the data processing module 705 can be used to perform step 105, and the model training module 706 can be used to perform step 106.
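The cooperation of modules 701 to 706 can be pictured as a small pipeline in which each module is a replaceable callable. The sketch below is one possible arrangement rather than the device itself: the class name `TrainingDevice`, its field names, and the choice to train directly on balanced data are assumptions, and the lambdas are placeholders for real implementations.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence

Samples = List[float]


@dataclass
class TrainingDevice:
    acquire: Callable[[], Samples]                          # module 701
    is_balanced: Callable[[Samples], bool]                  # module 702
    ask_user: Callable[[], Optional[Sequence[str]]]         # modules 703 and 704
    equalize: Callable[[Samples, Sequence[str]], Samples]   # module 705
    train: Callable[[Samples], object]                      # module 706

    def run(self) -> Optional[object]:
        first = self.acquire()
        if self.is_balanced(first):
            return self.train(first)       # assumption: train directly if balanced
        identifiers = self.ask_user()      # data set identifiers chosen by the user
        if not identifiers:
            return None                    # no equalization instruction received
        second = self.equalize(first, identifiers)
        return self.train(second)


# Placeholder modules; the "model" returned here is just the sample count.
device = TrainingDevice(
    acquire=lambda: [1.0, 1.0, 1.0, 9.0],
    is_balanced=lambda data: False,
    ask_user=lambda: ["data set identifier 1"],
    equalize=lambda data, ids: data + [9.0, 9.0],
    train=lambda data: len(data),
)
model = device.run()
```

Keeping each module behind a plain callable mirrors the statement that the modules may be realized by the same or by different physical entities.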

Optionally, based on the classification model training device shown in FIG. 7, as shown in FIG. 8, the data judgment module 702 includes:

A clustering processing unit 7021, configured to perform clustering processing on the first training data to obtain at least one third data set, where each third data set includes at least one sample, the values of the samples in the same third data set are in the same numerical range, and the values of the samples in different third data sets are in different numerical ranges;

A first judgment unit 7022, configured to determine that the first training data is balanced when the clustering processing unit 7021 performs clustering processing on the first training data and obtains one third data set;

A second judgment unit 7023, configured to: when the clustering processing unit 7021 performs clustering processing on the first training data and obtains at least two third data sets, determine from the at least two third data sets a fourth data set including the smallest number of samples and a fifth data set including the largest number of samples; if the ratio of the number of samples in the fourth data set to the number of samples in the fifth data set is less than the preset proportion threshold, determine that the first training data is unbalanced; and if that ratio is greater than or equal to the proportion threshold, determine that the first training data is balanced.

In the embodiment of the present invention, the clustering processing unit 7021 can be used to perform step 201 in the above method embodiment, the first judgment unit 7022 can be used to perform step 203 in the above method embodiment, and the second judgment unit 7023 can be used to perform steps 204 to 206 in the above method embodiment.

Optionally, based on the classification model training device shown in FIG. 8, as shown in FIG. 9, the data processing module 705 includes:

A data collection unit 7051, configured to determine, for each data set identifier included in the equalization processing instruction, a sixth data set corresponding to the data set identifier, wherein the values of the samples in the sixth data set are in the same numerical range as the values of the samples in the first data set identified by the data set identifier, and the ratio of the total number of samples in the first data set identified by the data set identifier and the sixth data set to the number of samples in the fifth data set is greater than or equal to the proportion threshold;

A data combination unit 7052 is used to combine the samples in each sixth data set determined by the data collection unit 7051 with the first training data to obtain second training data.

In the embodiment of the present invention, the data collection unit 7051 can be used to perform step 301 in the above method embodiment, and the data combination unit 7052 can be used to perform step 302 in the above method embodiment.

Optionally, on the basis of the classification model training device shown in FIG. 9,

The data collection unit 7051 is configured to collect, for each data set identifier included in the equalization processing instruction, at least one sample from the first data set identified by the data set identifier, and to take the sample set containing the collected at least one sample as the sixth data set corresponding to the data set identifier.

In the embodiment of the present invention, the data collection unit 7051 may be used to execute step 401 and step 402 in the foregoing method embodiment.

Optionally, based on the classification model training device shown in any one of the drawings in FIG. 7 to FIG. 9, as shown in FIG. 10, the classification model training device may further include: a data reselection module 707;

The instruction receiving module 704 is further configured to receive a data reselection instruction sent by the user in response to the interaction request sent by the request sending module 703, where the data reselection instruction includes a data read address;

The data reselection module 707 is configured to read third training data from the storage space corresponding to the data read address according to the data reselection instruction received by the instruction receiving module 704, and, after taking the third training data as the first training data, to trigger the data judgment module 702 to judge whether the first training data is balanced.

In the embodiment of the present invention, the instruction receiving module 704 can be used to execute step 501 in the above method embodiment, and the data reselection module 707 can be used to execute step 502 and step 503 in the above method embodiment.

As shown in FIG. 11, an embodiment of the present invention provides a classification model training device, including:

At least one memory 1101, configured to store executable instructions;

At least one processor 1102, coupled with the at least one memory 1101, when executing the executable instructions, is configured to:

Judging whether the first training data is balanced;

If the first training data is unbalanced, sending an interaction request to the user, where the interaction request is used to request the user to determine a way to process the first training data;

Receiving an equalization processing instruction from the user in response to the interaction request, wherein the equalization processing instruction includes at least one data set identifier, each of the at least one data set identifier is used to identify a first data set in the first training data that causes the first training data to be unbalanced, and the first data sets identified by different data set identifiers are different;

Performing, according to the equalization processing instruction, equalization processing on the first training data for the first data set identified by each data set identifier to obtain second training data, wherein each second data set corresponding to each data set identifier in the second training data does not cause the second training data to be unbalanced;

Using the second training data to train a classification model corresponding to the target device.

Optionally, based on the classification model training device shown in FIG. 11, when executing the executable instructions, the at least one processor 1102 is further configured to perform the following:

Performing clustering processing on the first training data to obtain at least one third data set, where each third data set includes at least one sample, the values of the samples in the same third data set are in the same numerical range, and the values of the samples in different third data sets are in different numerical ranges;

If clustering the first training data yields one third data set, determining that the first training data is balanced;

If at least two third data sets are obtained by performing clustering processing on the first training data, then

Determining, from the at least two of the third data sets, a fourth data set including the smallest number of samples and a fifth data set including the largest number of samples, and

If the ratio of the number of samples in the fourth data set to the number of samples in the fifth data set is less than a preset proportion threshold, determining that the first training data is not balanced, and

If the ratio of the number of samples in the fourth data set to the number of samples in the fifth data set is greater than or equal to the proportion threshold, determining that the first training data is balanced;

For each data set identifier included in the equalization processing instruction, determining a sixth data set corresponding to the data set identifier, wherein the values of the samples in the sixth data set are in the same numerical range as the values of the samples in the first data set identified by the data set identifier, and the ratio of the total number of samples in the first data set identified by the data set identifier and the sixth data set to the number of samples in the fifth data set is greater than or equal to the proportion threshold;

Combining samples in each of the sixth data sets corresponding to each of the data set identifiers with the first training data to obtain the second training data.

Collecting at least one sample from the first data set identified by the data set identifier;

The sample set including the at least one collected sample is used as the sixth data set corresponding to the data set identifier.

Receiving a data reselection instruction from the user in response to the interaction request, wherein the data reselection instruction includes a data read address;

Reading third training data from the storage space corresponding to the data reading address according to the data reselection instruction;

The third training data is used as the first training data, and the judgment is performed to determine whether the first training data is balanced.

The present invention also provides a computer-readable medium that stores instructions for causing a computer to execute the classification model training method described herein. Specifically, a system or device equipped with a storage medium may be provided, where software program code realizing the function of any one of the above embodiments is stored on the storage medium, and the computer (or the CPU or MPU) of the system or device reads and executes the program code stored in the storage medium.

In this case, the program code itself read from the storage medium can realize the function of any one of the above embodiments, so the program code and the storage medium storing the program code constitute a part of the present invention.

Examples of storage media used to provide program code include floppy disks, hard disks, magneto-optical disks, optical disks (such as CD-ROM, CD-R, CD-RW, DVD-ROM, DVD-RAM, DVD-RW, DVD+RW), magnetic tape, non-volatile memory cards and ROM. Alternatively, the program code can be downloaded from a server computer via a communication network.

In addition, it should be clear that the function of any one of the above embodiments can be realized not only by executing the program code read by the computer, but also by having the operating system running on the computer complete some or all of the actual operations through instructions based on the program code.

In addition, it can be understood that the program code read from the storage medium may be written to a memory provided in an expansion board inserted into the computer or to a memory provided in an expansion unit connected to the computer, and instructions based on the program code then cause the CPU installed on the expansion board or the expansion unit to perform part or all of the actual operations, so as to realize the function of any one of the foregoing embodiments.

It should be noted that not all steps and modules in the above processes and system structure diagrams are necessary, and some steps or modules can be omitted according to actual needs. The order of execution of the steps is not fixed and can be adjusted as needed. The system structure described in the foregoing embodiments may be a physical structure or a logical structure. That is, some modules may be implemented by the same physical entity, some modules may be implemented by multiple physical entities, or some modules may be implemented jointly by certain components in multiple independent devices.

In the above embodiments, the hardware unit can be implemented mechanically or electrically. For example, a hardware unit may include a permanent dedicated circuit or logic (such as a dedicated processor, FPGA or ASIC) to complete the corresponding operation. The hardware unit may also include programmable logic or circuits (such as general-purpose processors or other programmable processors), which may be temporarily set by software to complete corresponding operations. The specific implementation mode (mechanical method, or dedicated permanent circuit, or temporarily set circuit) can be determined based on cost and time considerations.

The present invention has been shown and described in detail above through the drawings and preferred embodiments. However, the present invention is not limited to these disclosed embodiments. Based on the above embodiments, those skilled in the art will appreciate that the code review methods in the different embodiments above can be combined to obtain further embodiments of the present invention, and these embodiments also fall within the protection scope of the present invention.

Claims

1. A classification model training method, characterized by comprising:

Acquiring first training data, where the first training data includes historical operating data of the target device;

Judging whether the first training data is balanced;

If the first training data is unbalanced, sending an interaction request to the user, where the interaction request is used to request the user to determine a way to process the first training data;

Receiving an equalization processing instruction from the user in response to the interaction request, wherein the equalization processing instruction includes at least one data set identifier, each of the at least one data set identifier is used to identify a first data set in the first training data that causes the first training data to be unbalanced, and the first data sets identified by different data set identifiers are different;

Performing, according to the equalization processing instruction, equalization processing on the first training data for the first data set identified by each data set identifier to obtain second training data, wherein each second data set corresponding to each data set identifier in the second training data does not cause the second training data to be unbalanced;

Using the second training data to train a classification model corresponding to the target device.
2. The method according to claim 1, wherein the judging whether the first training data is balanced comprises:

Performing clustering processing on the first training data to obtain at least one third data set, where each third data set includes at least one sample, the values of the samples in the same third data set are in the same numerical range, and the values of the samples in different third data sets are in different numerical ranges;

If clustering the first training data yields one third data set, determining that the first training data is balanced;

If at least two third data sets are obtained by performing clustering processing on the first training data, then

Determining, from the at least two of the third data sets, a fourth data set including the smallest number of samples and a fifth data set including the largest number of samples, and

If the ratio of the number of samples in the fourth data set to the number of samples in the fifth data set is less than a preset proportion threshold, determining that the first training data is not balanced, and

If the ratio of the number of samples in the fourth data set to the number of samples in the fifth data set is greater than or equal to the proportion threshold, determining that the first training data is balanced.
3. The method according to claim 2, wherein the performing, according to the equalization processing instruction, equalization processing on the first training data for the first data set identified by each data set identifier to obtain second training data comprises:

For each data set identifier included in the equalization processing instruction, determining a sixth data set corresponding to the data set identifier, wherein the values of the samples in the sixth data set are in the same numerical range as the values of the samples in the first data set identified by the data set identifier, and the ratio of the total number of samples in the first data set identified by the data set identifier and the sixth data set to the number of samples in the fifth data set is greater than or equal to the proportion threshold;

Combining samples in each of the sixth data sets corresponding to each of the data set identifiers with the first training data to obtain the second training data.
4. The method according to claim 3, wherein the determining a sixth data set corresponding to the data set identifier comprises:

Collecting at least one sample from the first data set identified by the data set identifier;

The sample set including the at least one collected sample is used as the sixth data set corresponding to the data set identifier.
5. The method according to any one of claims 1 to 4, characterized in that, after the sending of the interaction request to the user, the method further comprises:

Receiving a data reselection instruction from the user in response to the interaction request, wherein the data reselection instruction includes a data read address;

Reading third training data from the storage space corresponding to the data reading address according to the data reselection instruction;

The third training data is used as the first training data, and the judgment is performed to determine whether the first training data is balanced.
6. A classification model training device, characterized by comprising:

A data acquisition module (701) for acquiring first training data, where the first training data includes historical operating data of the target device;

A data judgment module (702), configured to judge whether the first training data acquired by the data acquisition module (701) is balanced;

A request sending module (703), configured to send an interaction request to the user if, according to the judgment result of the data judgment module (702), the first training data is unbalanced, wherein the interaction request is used to request the user to determine the manner of processing the first training data;

An instruction receiving module (704), configured to receive an equalization processing instruction from the user in response to the interaction request sent by the request sending module (703), wherein the equalization processing instruction includes at least one data set identifier, each of the at least one data set identifier is used to identify a first data set in the first training data that causes the first training data to be unbalanced, and the first data sets identified by different data set identifiers are different;

A data processing module (705), configured to perform, according to the equalization processing instruction received by the instruction receiving module (704), equalization processing on the first training data for the first data set identified by each data set identifier to obtain second training data, wherein each second data set corresponding to each data set identifier in the second training data does not cause the second training data to be unbalanced;

A model training module (706) is used to train a classification model corresponding to the target device using the second training data acquired by the data processing module (705).
7. The device according to claim 6, characterized in that the data judgment module (702) comprises:

A clustering processing unit (7021), configured to perform clustering processing on the first training data to obtain at least one third data set, wherein each third data set includes at least one sample, the values of the samples in the same third data set are in the same numerical range, and the values of the samples in different third data sets are in different numerical ranges;

A first judgment unit (7022), configured to determine that the first training data is balanced when the clustering processing unit (7021) performs clustering processing on the first training data and obtains one third data set;

A second judgment unit (7023), configured to: when the clustering processing unit (7021) performs clustering processing on the first training data and obtains at least two third data sets, determine from the at least two third data sets a fourth data set including the smallest number of samples and a fifth data set including the largest number of samples; if the ratio of the number of samples in the fourth data set to the number of samples in the fifth data set is less than a preset proportion threshold, determine that the first training data is unbalanced; and if that ratio is greater than or equal to the proportion threshold, determine that the first training data is balanced.
8. The device according to claim 7, wherein the data processing module (705) comprises:

A data collection unit (7051), configured to determine, for each data set identifier included in the equalization processing instruction, a sixth data set corresponding to the data set identifier, wherein the values of the samples in the sixth data set are in the same numerical range as the values of the samples in the first data set identified by the data set identifier, and the ratio of the total number of samples in the first data set identified by the data set identifier and the sixth data set to the number of samples in the fifth data set is greater than or equal to the proportion threshold;

A data combination unit (7052), configured to combine samples in each of the sixth data sets determined by the data collection unit (7051) with the first training data to obtain the second training data.
9. The device according to claim 8, wherein:

The data collection unit (7051) is configured to collect, for each data set identifier included in the equalization processing instruction, at least one sample from the first data set identified by the data set identifier, and to take the sample set including the at least one collected sample as the sixth data set corresponding to the data set identifier.
10. The device according to any one of claims 6 to 9, further comprising: a data reselection module (707);

The instruction receiving module (704) is further configured to receive a data reselection instruction from the user in response to the interaction request sent by the request sending module (703), wherein the data reselection instruction includes a data read address;

The data reselection module (707) is configured to read third training data from the storage space corresponding to the data read address according to the data reselection instruction received by the instruction receiving module (704), take the third training data as the first training data, and then trigger the data judgment module (702) to perform the judging of whether the first training data is balanced.
11. A classification model training device, characterized by comprising: at least one memory (1101) and at least one processor (1102);

The at least one memory (1101), used for storing machine-readable programs;

The at least one processor (1102) is configured to invoke the machine-readable program to execute the method according to any one of claims 1 to 5.
12. A computer-readable medium, wherein computer instructions are stored on the computer-readable medium, and when the computer instructions are executed by a processor, the processor executes the method according to any one of claims 1 to 5.