CN115859099A

CN115859099A - Sample generation method and device, electronic equipment and storage medium

Info

Publication number: CN115859099A
Application number: CN202211469478.XA
Authority: CN
Inventors: 刘琦; 杨博; 张天文; 郑忠斌; 陈彩莲; 何大清; 陈璐; 芦清
Original assignee: Chint Group R & D Center Shanghai Co ltd; Zhejiang Zhengtai Zhiwei Energy Service Co ltd; Shanghai Jiaotong University
Current assignee: Chint Group R & D Center Shanghai Co ltd; Zhejiang Zhengtai Zhiwei Energy Service Co ltd; Shanghai Jiaotong University
Priority date: 2022-11-22
Filing date: 2022-11-22
Publication date: 2023-03-28

Abstract

The invention discloses a sample generation method, a sample generation device, electronic equipment and a storage medium. The method comprises: collecting sample operation data of at least one photovoltaic module, the sample operation data comprising label sample data and unlabeled sample data; training an initial model according to the label sample data to obtain an initial fault diagnosis model; inputting the unlabeled sample data into the initial fault diagnosis model for fault detection to obtain a prediction result corresponding to the unlabeled sample data; screening the unlabeled sample data according to that prediction result to obtain target unlabeled data; and obtaining a sample training set based on the target unlabeled data and the label sample data. By combining a small-sample technique, the invention addresses the problem of training set data quality in actual operation and maintenance of photovoltaic power stations: the training set data quality of the photovoltaic power station is greatly improved, and the sample expansion improves the accuracy of the subsequent diagnosis model.

Description

Sample generation method and device, electronic equipment and storage medium

Technical Field

The invention relates to the technical field of fault diagnosis, in particular to a sample generation method, a sample generation device, electronic equipment and a storage medium.

Background

Growing problems such as environmental pollution, energy shortage and sustainable development are attracting wide attention. Fossil fuels consumed by power stations are one of the main causes of carbon emissions in most countries of the world. According to studies on greenhouse gas emissions (mainly carbon dioxide), more than 40% of carbon emissions are generated by the combustion of fossil fuels during power generation. As a clean renewable energy source, solar energy is considered a low-carbon development direction with a wide market prospect, and photovoltaic power generation is one of the main approaches to utilizing it. Photovoltaic modules are the core components of photovoltaic systems and mostly operate under harsh outdoor conditions, so photovoltaic systems face a number of common potential faults in everyday operation. Because different fault types affect the power generation efficiency, operational safety and economic benefit of a photovoltaic system to different degrees, diagnosing the fault type quickly and accurately after a photovoltaic module fails is crucial to maintaining the reliability of the photovoltaic system, keeping power generation continuous, and reducing the economic loss of power generation.

In the daily operation of a current photovoltaic power station, the electrical data monitored by the system is recorded at all times and comprises both normal samples and fault samples. To save operation and maintenance cost, only a small part of the samples can be labeled by the corresponding technicians and experts, and most of the remaining samples are unlabeled. This can severely affect the per-round local models and the final global accuracy obtained by existing federated learning methods. Therefore, when the sample library of an actual photovoltaic power station contains only a small number of label samples (the rest being unlabeled data), existing photovoltaic fault diagnosis methods cannot completely and accurately extract the features of different faults and distinguish between them, and the accuracy of the final diagnosis model is greatly reduced.

Disclosure of Invention

The embodiment of the invention provides a sample generation method and device, electronic equipment and a storage medium, and aims to solve the problem that the samples available for existing photovoltaic fault diagnosis are limited.

In one aspect, an embodiment of the present invention provides a sample generation method, where the method includes:

collecting sample operation data of at least one photovoltaic module, wherein the sample operation data comprises labeled sample data with a label and unlabeled sample data without the label;

training an initial model according to the label sample data to obtain an initial fault diagnosis model;

inputting the unlabeled sample data into the initial fault diagnosis model for fault detection to obtain a prediction result corresponding to the unlabeled sample data;

screening the unlabeled sample data according to the prediction result corresponding to the unlabeled sample data to obtain target unlabeled data;

and obtaining a sample training set based on the target unlabeled data and the label sample data.

In another aspect, an embodiment of the present invention provides a sample generation apparatus, where the apparatus includes:

the collecting module is used for collecting sample operation data of at least one photovoltaic module, wherein the sample operation data comprises label sample data with a label and unlabeled sample data without a label;

the training module is used for training the initial model according to the label sample data to obtain an initial fault diagnosis model;

the prediction module is used for inputting the unlabeled sample data into the initial fault diagnosis model for fault detection to obtain a prediction result corresponding to the unlabeled sample data;

the screening module is used for screening the unlabeled sample data according to the prediction result corresponding to the unlabeled sample data to obtain target unlabeled data;

and the sample module is used for obtaining a sample training set based on the target unlabeled data and the label sample data.

In another aspect, an embodiment of the present invention provides an electronic device, where the electronic device includes:

at least one processor; and

a memory communicatively coupled to the at least one processor; wherein

the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the sample generation method.

In another aspect, an embodiment of the present invention provides a storage medium storing a plurality of instructions for causing a computer to execute the sample generation method.

The method comprises: collecting sample operation data of at least one photovoltaic module, wherein the sample operation data comprises label sample data with a label and unlabeled sample data without a label; training the initial model according to the label sample data to obtain an initial fault diagnosis model; inputting the unlabeled sample data into the initial fault diagnosis model for fault detection to obtain a prediction result corresponding to the unlabeled sample data; screening the unlabeled sample data according to that prediction result to obtain target unlabeled data; and obtaining a sample training set based on the target unlabeled data and the label sample data. By combining a small-sample technique, the invention addresses the problem of training set data quality in actual operation and maintenance of photovoltaic power stations: the training set data quality of the photovoltaic power station is greatly improved, and the sample expansion improves the accuracy of the subsequent diagnosis model.

Drawings

In order to illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings in the following description show only some embodiments of the present invention, and those skilled in the art can obtain other drawings based on them without creative effort.

FIG. 1 is a schematic flow chart of a sample generation method provided by an embodiment of the invention;

FIG. 2 is a schematic diagram of an operational data collection scenario for a photovoltaic module provided by an embodiment of the present invention;

FIG. 3 is a schematic structural diagram of an initial model provided by an embodiment of the present invention;

FIG. 4 is a flow chart illustrating a method for obtaining a joint diagnosis model based on federated learning according to an embodiment of the present invention;

FIG. 5 is a schematic diagram of a sample generation apparatus according to an embodiment of the present invention;

FIG. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

As described in the background, existing artificial-intelligence-based methods in the field of photovoltaic fault diagnosis fall mainly into three categories. The first category improves fault diagnosis accuracy with a series of deep neural network methods, under the ideal condition that the types and quantity of fault samples are sufficient. The second category studies fault diagnosis when the number of fault label samples is small or the sample types are limited; it mainly expands the number of samples or models jointly, based on methods such as semi-supervised learning and federated learning, so that the remaining unlabeled data is fully used or the generalization capability of the model is improved, which raises the fault diagnosis precision of the model. The third category addresses algorithm execution efficiency: it reduces the model training cost of building a photovoltaic fault diagnosis model with methods such as transfer learning, and thereby improves modeling efficiency.

The first and third categories of fault diagnosis methods both assume ideal training data conditions. However, because manually collecting and labeling the fault data of photovoltaic modules consumes a lot of manpower and material resources, it is often difficult to establish a complete photovoltaic fault database in the operation and maintenance system of an actual single photovoltaic power station. Therefore, for those skilled in the art, the second category of fault diagnosis methods is more suitable for practical application.

Existing photovoltaic fault diagnosis methods based on semi-supervised learning, federated learning and the like consider only separately the problem of sample expansion when the number of initial training samples is insufficient and the problem of improving model generalization when the sample types are limited. These methods have significant disadvantages and cannot meet the requirements of practical applications. In particular, existing semi-supervised learning methods ignore the problems of limited label sample types and unbalanced samples in the initial training set. Limited label sample types means that the fault database of a photovoltaic power station contains only a small part of the fault types. Because photovoltaic power stations differ in geographical position, age of system equipment, meteorological conditions and other aspects, a single photovoltaic power station can collect samples of only some fault types; under this low-quality data condition the fault types the model can identify are limited, and the generalization and accuracy of the model are greatly reduced. Label sample imbalance means that the number of samples differs significantly between fault types. For those skilled in the art, this situation is more consistent with the actual operating conditions of a photovoltaic power station. The station operates normally most of the time, so normal-state samples far outnumber fault-state samples. Because of the environment and the system, common faults occur at different frequencies, so the number of samples of different fault types differs greatly. This can cause existing models to over-fit the normal state and under-fit the fault-state samples, greatly reducing the accuracy of the model. In addition, existing methods do not consider the distribution difference between the labeled data set and the unlabeled data set; when the two distributions differ greatly (or are even opposite), the accuracy of the fault diagnosis model obtained by existing methods is greatly reduced.

Based on the above, in order to solve the problem that existing photovoltaic fault diagnosis samples are limited, the embodiment of the invention provides a sample generation method that addresses both the insufficient number of label samples in the sample library of a photovoltaic power station and the obvious distribution difference between the labeled data set and the unlabeled data set.

FIG. 1 is a schematic flow chart of a sample generation method provided by an embodiment of the present invention. The sample generation method is applied to an electronic device, and the electronic device is deployed in a photovoltaic power station.

It should be noted that, in the embodiment of the present invention, at least one photovoltaic power station exists, each photovoltaic power station is deployed with an electronic device, and the electronic device deployed by each photovoltaic power station can acquire a sample training set of the photovoltaic power station according to steps 101 to 105. In some embodiments of the present invention, the electronic device may be a computer, an industrial computer, or the like. Specifically, the sample generation method shown in fig. 1 includes steps 101 to 105:

Step 101, collecting sample operation data of at least one photovoltaic module.

The sample operation data comprises labeled sample data with a label and unlabeled sample data without a label. The label represents a real result of the sample operation data, wherein the real result comprises whether the sample operation data has a fault or not and a corresponding fault type when the fault exists. The fault types include, but are not limited to, aging, short circuit, open circuit, occlusion, etc.

In some embodiments of the invention, the sample operation data may be historical operation data of the current photovoltaic power station over a past period of time, the historical operation data including the volt-ampere characteristic curve of the photovoltaic module over that period and the corresponding temperature and irradiance. In some embodiments of the present invention, the past period of time is not particularly limited; for example, it may be the past week, the past month, or the past year.

In some embodiments of the invention, the electronic device deployed in each photovoltaic power station can acquire the operation data of each photovoltaic module through a volt-ampere tester and an environment tester equipped with a Bluetooth communication function, store the operation data in its database, and later access the database to retrieve the historical operation data for a past period of time as the sample operation data. For example, FIG. 2 is a schematic diagram of an operation data acquisition scenario of photovoltaic modules provided in an embodiment of the present invention: each photovoltaic power station includes a plurality of photovoltaic modules, and the volt-ampere characteristic curve of the photovoltaic array and the corresponding temperature and irradiance are acquired using the volt-ampere tester and the environment tester, respectively.

In some embodiments of the present invention, in order to reduce the operation and maintenance cost of the photovoltaic power station, only part of the acquired operation data may be marked with a real result during acquisition, and the rest remains unlabeled. The sample operation data therefore includes two types of operation data: label sample data with a label and unlabeled sample data without a label.

Step 102, training the initial model according to the label sample data to obtain an initial fault diagnosis model.

In some embodiments of the invention, the initial model may be a machine learning model, for example a model based on logistic regression, a decision tree, a support vector machine, k-nearest neighbors, naive Bayes or a random forest. The initial model may also be a neural network model, for example a model based on a convolutional neural network (CNN), a deconvolutional network (DN), a deep neural network (DNN), a deep convolutional inverse graphics network (DCIGN), a region-based convolutional network (RCNN), a fast region-based convolutional network (Fast RCNN), or Bidirectional Encoder Representations from Transformers (BERT). Accordingly, the initial fault diagnosis model may be a machine learning model or a neural network model.

In some embodiments of the present invention, step 102 comprises: obtaining the initial model; inputting the label sample data into the initial model to obtain the training recognition result corresponding to the label sample data; determining a training loss value, through a preset loss function, from the training recognition result and the real result corresponding to the label sample data; and adjusting the model parameters of the initial model according to the training loss value until the initial model meets a preset convergence condition, to obtain the initial fault diagnosis model. The preset convergence condition may be that the training loss value is less than or equal to a preset loss threshold, or that the number of training iterations of the initial model is greater than or equal to a preset count threshold.
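
The training procedure and its two convergence conditions can be sketched as follows. This is a minimal illustration only: a tiny logistic-regression classifier stands in for the initial model, and the function name `train_until_converged`, the learning rate, the loss threshold and the iteration limit are all assumptions that do not come from the patent.

```python
import math

def train_until_converged(samples, labels, lr=0.5, loss_threshold=0.05, max_rounds=5000):
    """Train a logistic-regression stand-in for the initial model.

    Training stops when the mean cross-entropy loss is <= loss_threshold
    or when max_rounds parameter updates have been made. All values are
    illustrative assumptions.
    """
    n_features = len(samples[0])
    weights = [0.0] * n_features
    bias = 0.0
    loss = float("inf")
    rounds = 0
    eps = 1e-12
    while rounds < max_rounds:
        # Forward pass: predicted fault probability for each labeled sample.
        probs = []
        for x in samples:
            z = bias + sum(w * xi for w, xi in zip(weights, x))
            z = max(-30.0, min(30.0, z))  # clamp to keep exp() well behaved
            probs.append(1.0 / (1.0 + math.exp(-z)))
        # Training loss against the real results (0 = normal, 1 = fault).
        loss = -sum(
            y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
            for p, y in zip(probs, labels)
        ) / len(samples)
        if loss <= loss_threshold:
            break  # first convergence condition: loss threshold reached
        # Adjust the model parameters with one gradient-descent step.
        for j in range(n_features):
            grad = sum((p - y) * x[j] for p, y, x in zip(probs, labels, samples))
            weights[j] -= lr * grad / len(samples)
        bias -= lr * sum(p - y for p, y in zip(probs, labels)) / len(samples)
        rounds += 1  # second convergence condition is checked by the loop
    return weights, bias, loss, rounds
```

Calling `train_until_converged([[0.0], [0.2], [0.8], [1.0]], [0, 0, 1, 1])` returns the fitted parameters together with the last loss value and the number of updates performed, so the caller can see which of the two conditions ended training.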

In some embodiments of the invention, because the amount of label sample data in the sample operation data is small, an initial fault diagnosis model trained only on the label sample data generalizes poorly, and the accuracy of subsequent fault diagnosis results for the photovoltaic module is low. Steps 103 to 105 therefore use the unlabeled sample data to expand the training samples.

Step 103, inputting the unlabeled sample data into the initial fault diagnosis model for fault detection to obtain a prediction result corresponding to the unlabeled sample data.

The prediction result comprises whether the unlabeled sample data has a fault or not and the fault type when the fault exists.

Step 104, screening the unlabeled sample data according to the prediction result corresponding to the unlabeled sample data to obtain target unlabeled data.

In some embodiments of the present invention, step 104 comprises: according to the prediction result corresponding to the unlabeled sample data, selecting from the unlabeled sample data the first pseudo-label sample data, which is predicted to have a fault; selecting, from the first pseudo-label sample data, the second pseudo-label sample data corresponding to each fault type; and obtaining the target unlabeled data from the second pseudo-label sample data selected for each fault type.

In some embodiments of the present invention, all of the second pseudo-label sample data selected for each fault type may be set as the target unlabeled data. In other embodiments of the present invention, a preset number of the second pseudo-label sample data selected for each fault type may be set as the target unlabeled data.
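
The screening in step 104 can be sketched as follows. This is a hedged illustration: the function name `screen_unlabeled`, the `"normal"` marker for a no-fault prediction, and the rule of keeping the first `per_type_limit` samples of each type are assumptions, since the text does not say how the preset number of samples is chosen.

```python
from collections import defaultdict

NORMAL = "normal"  # hypothetical marker for a "no fault" prediction

def screen_unlabeled(samples, predictions, per_type_limit=None):
    """Return (sample, predicted fault type) pairs as target unlabeled data.

    1. Keep samples predicted as faulty (first pseudo-label sample data).
    2. Group them by predicted fault type (second pseudo-label sample data).
    3. Keep all of each group, or at most per_type_limit samples per type.
    """
    by_type = defaultdict(list)
    for sample, predicted in zip(samples, predictions):
        if predicted != NORMAL:
            by_type[predicted].append(sample)
    target = []
    for fault_type, group in by_type.items():
        kept = group if per_type_limit is None else group[:per_type_limit]
        target.extend((sample, fault_type) for sample in kept)
    return target
```

With `per_type_limit=None` the sketch corresponds to the first variant above (all second pseudo-label sample data is kept); with an integer it corresponds to the preset-number variant.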

Step 105, obtaining a sample training set based on the target unlabeled data and the label sample data.

In some embodiments of the present invention, step 105 comprises: setting the target unlabeled data as predicted label data, and combining the predicted label data with the label sample data to obtain the sample training set.

In some embodiments of the present invention, step 105 comprises: setting target unlabeled data as predicted labeled data, obtaining an initial sample training set according to a set of the predicted labeled data and labeled sample data, training an initial fault diagnosis model according to the initial sample training set to obtain a fault diagnosis model, performing fault prediction on the unlabeled sample data according to the fault diagnosis model to obtain a new prediction result corresponding to the unlabeled sample data, obtaining new target unlabeled data according to step 104 based on the new prediction result corresponding to the unlabeled sample data, and obtaining a sample training set according to a set consisting of the new target unlabeled data and the labeled sample data.
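
The two variants of step 105 can be sketched as one loop. This is a minimal sketch under assumptions: `train`, `predict` and `screen` are hypothetical caller-supplied callables standing in for steps 102, 103 and 104, and the one-dimensional nearest-centroid classifier below is only a toy used to make the example executable.

```python
def build_training_set(labeled, unlabeled, train, predict, screen, rounds=2):
    """labeled: list of (sample, label) pairs; unlabeled: list of samples.

    rounds=1 gives the simple union of predicted label data and label
    sample data; rounds=2 retrains once on that union and screens again.
    """
    training_set = list(labeled)
    for _ in range(rounds):
        model = train(training_set)                      # step 102 or retraining
        predictions = predict(model, unlabeled)          # step 103
        pseudo_labeled = screen(unlabeled, predictions)  # step 104
        training_set = list(labeled) + pseudo_labeled    # step 105
    return training_set

# Toy stand-ins (assumptions, not the patent's model).
def train_centroids(dataset):
    sums, counts = {}, {}
    for x, label in dataset:
        sums[label] = sums.get(label, 0.0) + x
        counts[label] = counts.get(label, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}

def predict_nearest(model, samples):
    return [min(model, key=lambda label: abs(x - model[label])) for x in samples]

def keep_faults(samples, predictions):
    return [(x, p) for x, p in zip(samples, predictions) if p != "normal"]

result = build_training_set(
    [(1.0, "normal"), (5.0, "short circuit")],
    [0.8, 4.6, 5.3],
    train_centroids, predict_nearest, keep_faults,
)
```

The label sample data is always kept unchanged, and only the pseudo-labeled part is replaced in each round, which matches the description that the final set consists of the new target unlabeled data and the label sample data.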

According to the sample generation method provided by the embodiment of the invention, target unlabeled data is screened from the unlabeled sample data according to the prediction result that the initial fault diagnosis model produces for the unlabeled sample data, and is used to expand the label sample data. This solves the problem of an insufficient number of label samples in the sample library of the photovoltaic power station, greatly improves the data quality of the station's training set, and thereby helps ensure the accuracy of the fault diagnosis model trained on the sample training set.

In some embodiments of the present invention, the volt-ampere characteristic curve of the photovoltaic power plant and the corresponding temperature and irradiance may be acquired according to the acquisition method shown in fig. 2, so as to obtain sample operation data of the photovoltaic module.

In some embodiments of the present invention, the acquired volt-ampere characteristic curve of a photovoltaic power station and the corresponding temperature and irradiance may have inconsistent data dimensions, missing curve data, unevenly distributed current and voltage sampling points, and data redundancy. Further data preprocessing can therefore be applied before the data is constructed into sample operation data, so as to ensure the fault identification accuracy of a model trained on the sample training set. Specifically, the method for obtaining the sample operation data includes steps a1 to a6:

Step a1, acquiring a volt-ampere characteristic curve of a photovoltaic power station and the corresponding temperature and irradiance to obtain initial operation data.

In some embodiments of the invention, the volt-ampere characteristic curve and the corresponding temperature and irradiance of the photovoltaic power plant may be collected at preset time intervals.

In some embodiments of the invention, the voltage-current characteristic curve and the corresponding temperature and irradiance of the photovoltaic power plant over a historical period of time may be obtained from an operational database.

Step a2, detecting a missing value according to initial operation data to obtain a missing degree; if the missing degree is larger than the preset degree threshold value, performing data filling, and executing the step a3 after the data filling; and if the missing degree is less than or equal to the preset degree threshold value, executing the step a3.

In some embodiments of the present invention, the missing value detection may be performed by comparing the volt-ampere characteristic curve with a pre-stored reference volt-ampere characteristic curve.

In some embodiments of the invention, data filling may be performed by interpolation, or by using the mean, median or mode of the volt-ampere characteristic curve. Other common data filling methods, such as mean imputation, hot-deck imputation and k-nearest-neighbor averaging, can also be used.
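
Step a2 can be sketched as follows. The sketch makes several assumptions: missing points are represented as `None`, the missing degree is taken to be the fraction of missing points (the text does not define it), the threshold value is arbitrary, and linear interpolation is chosen from the filling methods listed above.

```python
def missing_degree(values):
    """Fraction of entries that are missing (None); an assumed definition."""
    return sum(v is None for v in values) / len(values)

def fill_by_interpolation(values):
    """Fill None entries linearly between the nearest known neighbours.

    Gaps at either end copy the nearest known value.
    """
    known = [i for i, v in enumerate(values) if v is not None]
    if not known:
        raise ValueError("cannot fill a series with no known values")
    filled = list(values)
    for i, v in enumerate(values):
        if v is not None:
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:
            filled[i] = values[right]
        elif right is None:
            filled[i] = values[left]
        else:
            t = (i - left) / (right - left)
            filled[i] = values[left] + t * (values[right] - values[left])
    return filled

def preprocess_missing(values, degree_threshold=0.1):
    """Fill only when the missing degree exceeds the threshold (step a2)."""
    if missing_degree(values) > degree_threshold:
        return fill_by_interpolation(values)
    return list(values)  # at or below the threshold: pass on to step a3
```

For example, `preprocess_missing([8.0, None, 6.0, None, None, 0.0])` fills the three gaps, while a series whose missing degree is at or below the threshold is returned unchanged, as the step describes.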

Step a3, resampling the volt-ampere characteristic curve to obtain a preset number of volt-ampere data points.

In some embodiments of the invention, the open-circuit voltage Vo and the short-circuit current Isc are read from the volt-ampere characteristic curve, and the data is down-sampled: 20 voltages VRx are resampled at equal intervals in the range [0, Vo], and 20 currents IRx are resampled at equal intervals in the range [0, Isc]. The data is then completed by calculating the voltage values corresponding to the 20 resampled currents IRx and the current values corresponding to the 20 resampled voltages VRx. The resulting 40 resampled points are arranged in descending order of voltage to give a 40 × 2 array, which is set as the volt-ampere data points. Although the resampled curve contains only 40 sampling points, it still reflects well the fault characteristic information implied by the curve, while avoiding data redundancy and saving computation cost.
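
The resampling in step a3 can be sketched as follows. The assumptions are stated plainly: the measured curve is given as voltages strictly increasing from 0 to Vo with currents strictly decreasing from Isc to 0, the missing coordinates are obtained by piecewise-linear interpolation (the text does not name the method), and both endpoints of each range are included in the 20 equally spaced values.

```python
def interp(x, xs, ys):
    """Piecewise-linear interpolation; xs must be strictly increasing."""
    if x <= xs[0]:
        return ys[0]
    if x >= xs[-1]:
        return ys[-1]
    for k in range(1, len(xs)):
        if x <= xs[k]:
            t = (x - xs[k - 1]) / (xs[k] - xs[k - 1])
            return ys[k - 1] + t * (ys[k] - ys[k - 1])

def resample_iv_curve(voltages, currents, n=20):
    """Return 2 * n (voltage, current) points sorted by descending voltage."""
    v_open = voltages[-1]   # open-circuit voltage Vo (current is zero)
    i_short = currents[0]   # short-circuit current Isc (voltage is zero)
    v_grid = [v_open * k / (n - 1) for k in range(n)]    # VRx in [0, Vo]
    i_grid = [i_short * k / (n - 1) for k in range(n)]   # IRx in [0, Isc]
    # Current values at the resampled voltages.
    points = [(v, interp(v, voltages, currents)) for v in v_grid]
    # Voltage values at the resampled currents: invert the curve by
    # reversing both lists so that the current axis is increasing.
    rev_i, rev_v = currents[::-1], voltages[::-1]
    points += [(interp(i, rev_i, rev_v), i) for i in i_grid]
    points.sort(key=lambda p: p[0], reverse=True)
    return points

pts = resample_iv_curve([0.0, 10.0, 20.0, 30.0, 40.0], [8.0, 6.0, 4.0, 2.0, 0.0])
```

Sampling along both axes places points densely both on the flat part of the curve (equal voltage steps) and on the steep part near Vo (equal current steps), which is presumably why 40 points are enough to keep the shape of the curve.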

Step a4, carrying out data reconstruction based on the volt-ampere data points and the temperature and irradiance corresponding to the volt-ampere characteristic curve to obtain an operation array.

In some embodiments of the present invention, since temperature and irradiance have a significant impact on the operating condition of the photovoltaic module, these two pieces of environmental information need to be combined when analyzing a particular fault type. For this purpose, a data reconstruction method is adopted: the corresponding temperature and irradiance are inserted, as a 40 × 2 environment vector, into the 40 × 2 volt-ampere data points, and a 40 × 4 operation array is reconstructed to serve as one sample for photovoltaic fault diagnosis.
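
The reconstruction in step a4 amounts to repeating the two environment values on every row. A minimal sketch follows; the column order (voltage, current, temperature, irradiance) and the function name are assumptions, as the text does not specify them.

```python
def build_operation_array(iv_points, temperature, irradiance):
    """Append temperature and irradiance to each (voltage, current) row.

    With 40 volt-ampere data points this yields a 40 x 4 operation array.
    """
    return [[v, i, temperature, irradiance] for v, i in iv_points]

arr = build_operation_array([(40.0, 0.0), (0.0, 8.0)], temperature=25.0, irradiance=800.0)
```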

Step a5, determining the initial operation data that has a real identification result, setting the real identification result as a label, and adding the label to the operation array corresponding to that initial operation data.

Step a6, setting each operation array that has a label as label sample data and each operation array without a label as unlabeled sample data, to obtain the sample operation data.
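
Steps a5 and a6 can be sketched as a simple split. The record identifiers and the dictionary layout below are hypothetical, chosen only to show how arrays with a real identification result end up as label sample data and the rest as unlabeled sample data.

```python
def split_by_label(operation_arrays, true_results):
    """Split operation arrays into label and unlabeled sample data.

    operation_arrays: dict mapping a record id to its operation array.
    true_results: dict mapping a record id to its real identification
    result; records that were never marked are simply absent.
    """
    labeled, unlabeled = [], []
    for record_id, array in operation_arrays.items():
        if record_id in true_results:
            labeled.append((array, true_results[record_id]))  # step a5
        else:
            unlabeled.append(array)
    return labeled, unlabeled  # step a6: together, the sample operation data

labeled, unlabeled = split_by_label(
    {"r1": [[1.0]], "r2": [[2.0]], "r3": [[3.0]]},
    {"r2": "aging"},
)
```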

In some embodiments of the present invention, after the sample operation data is obtained, the label sample data may be input to the initial model for training according to step 102, so as to obtain an initial fault diagnosis model.

In some embodiments of the present invention, the label sample data in the sample operation data is a 40 × 4 two-dimensional array. Because the computing power of an actual photovoltaic power station is limited, a complex deep neural network may be impractical; the embodiment of the invention therefore builds the initial model on a convolutional neural network. FIG. 3 is a schematic structural diagram of the initial model provided by the embodiment of the present invention: the initial model comprises a two-dimensional convolutional layer, a dimensionality reduction layer, a one-dimensional convolutional layer, a max pooling layer, a fully connected layer, and a linear classification layer.

Specifically, the label sample data is input into the two-dimensional convolutional layer of the initial model for a convolution operation that performs preliminary feature extraction; the output of the two-dimensional convolutional layer is input to the dimensionality reduction layer for data dimension compression; the output of the dimensionality reduction layer is input to the one-dimensional convolutional layer to mine the fault characteristic information hidden in the data; the output of the one-dimensional convolutional layer is input to the max pooling layer; the output of the max pooling layer is input to the fully connected layer to obtain a probability vector of the sample fault type prediction; and the output of the fully connected layer is input to the linear classification layer for fault prediction, which outputs the training identification result of the label sample data.

The one-dimensional convolutional layer comprises a first convolutional unit, a second convolutional unit, a first pooling unit, a third convolutional unit, a second pooling unit and a fourth convolutional unit which are sequentially connected; the linear classification layer includes a fully connected classifier and a Softmax function.
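
The final stage of the model, a fully connected classifier followed by a Softmax function, can be sketched in isolation as follows. The feature values, weights, biases and class names are illustrative assumptions; the convolutional and pooling layers that would produce the feature vector are not modeled here.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def linear_classify(features, weights, biases, class_names):
    """Fully connected classifier followed by softmax.

    weights holds one row of coefficients per class. Returns the most
    probable class name and the full probability vector.
    """
    logits = [
        sum(w * f for w, f in zip(row, features)) + b
        for row, b in zip(weights, biases)
    ]
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return class_names[best], probs

name, probs = linear_classify(
    [1.0, 2.0],
    [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]],
    [0.0, 0.0, 0.0],
    ["normal", "short circuit", "open circuit"],
)
```

The probability vector returned here is what a confidence-based screening of pseudo labels would later compare against per-type thresholds.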

In some embodiments of the invention, the predetermined loss function may be a cross-entropy loss function.

In some embodiments of the present invention, the type imbalance of the sample operation data is taken into account; different weights are therefore assigned to different types of sample data to establish the preset loss function (formula not reproduced in this text).

Here, N represents the number of operating-state types of the photovoltaic module, y_n is the probability, in the predicted probability vector, for type n, γ_n is an adjustment factor, and α_n controls the weights of the different types of sample data. In some embodiments of the present invention, the corresponding adjustment factor may be set according to the number of samples of each fault type in the label sample data.

In some embodiments of the invention, a sample type that is classified accurately has a higher y_n value, so its corresponding γ_n is set to approach 0. Conversely, for a sample type that is classified inaccurately, the corresponding γ_n is set to approach 1. α_n is a predefined constant between 0 and 1 used to balance the different types of sample data. The preset loss function provided by the embodiment of the invention increases the weight of the sample types that are difficult to classify (the types with little label sample data), which means that the loss function pays more attention to those types. This avoids, to a certain extent, the neural network model deviating from the optimal parameters during training because of unbalanced sample operation data, and helps improve the accuracy of the model.
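
Because the formula itself is not available in this text, the following is only a sketch of one loss that is consistent with the description, not the patent's actual formula. It assumes a focal-loss-style per-sample term -α_n · γ_n · log(y_n) for the true type n, with γ_n = 1 - y_n chosen as one simple way to make the adjustment factor approach 0 for well-classified samples and 1 for poorly classified ones.

```python
import math

def weighted_loss(probs, true_index, alphas):
    """Per-sample loss -alpha_n * gamma_n * log(y_n) for the true type n.

    probs: predicted probability vector (y_1 .. y_N).
    alphas: per-type weights in (0, 1), larger for rare types.
    gamma_n = 1 - y_n is an assumed form of the adjustment factor.
    """
    y_n = probs[true_index]
    gamma_n = 1.0 - y_n  # near 0 when accurate, near 1 when inaccurate
    return -alphas[true_index] * gamma_n * math.log(max(y_n, 1e-12))

easy = weighted_loss([0.9, 0.1], 0, [0.5, 0.5])      # well-classified sample
hard = weighted_loss([0.3, 0.7], 0, [0.5, 0.5])      # poorly classified sample
rare_hard = weighted_loss([0.3, 0.7], 0, [0.9, 0.1])  # same, with a larger weight
```

Under these assumptions a well-classified sample contributes very little to the loss, while a poorly classified sample of a heavily weighted type contributes the most, which is the behaviour the paragraph above describes.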

In some embodiments of the invention, after an initial model is trained based on tag sample data to obtain an initial fault diagnosis model, considering that the initial fault diagnosis model is obtained by training based on tag sample data and has poor generalization, if the initial fault diagnosis model is used as a diagnosis model to perform fault diagnosis, the fault diagnosis result may be inaccurate, and in order to overcome the problems of overfitting and under-fitting of the model, initially limited tag sample data is fully expanded. In some embodiments of the present invention, the prediction result represents whether there is a fault in the unlabeled sample data and a corresponding fault type when there is a fault.

A photovoltaic power station operates normally most of the time, so there are far more samples in the normal state than in a fault state. In addition, because of the environment and the system, the various common faults occur with different frequencies, and the numbers of samples of different fault types differ greatly. As a result, the initial label sample data of different types is clearly unbalanced, and the label sample data and the unlabeled sample data may have clearly different distributions. Existing methods place high demands on the data quality of the initial label training set; with low-quality data, pseudo labels cannot be predicted accurately for the unlabeled sample data, so the expanded sample training set contains a large number of mislabeled samples. Moreover, the prior art does not consider a mechanism for removing wrongly pseudo-labeled samples, so these erroneous samples can greatly reduce the accuracy of the final fault diagnosis model. To overcome these defects and ensure the accuracy of the data in the final sample training set, after the prediction result corresponding to the unlabeled sample data is obtained, screening can be performed according to that prediction result, and the target unlabeled data is obtained by selecting, from the unlabeled sample data, the data whose prediction result meets the requirement. Specifically, the method for selecting the target unlabeled data comprises the following steps b1 to b2:

Step b1, determining a confidence threshold corresponding to each fault type according to the labels in the label sample data, where the label in the label sample data is the fault type label of the label sample data.

In some embodiments of the present invention, a fault type of the tag sample data may be determined according to a tag in the tag sample data, preset threshold data may be queried, and a confidence threshold corresponding to each fault type may be obtained, where the threshold data includes multiple fault types and a confidence threshold corresponding to each fault type.

In some embodiments of the present invention, the label sample data amount corresponding to each fault type in the label sample data may be counted according to the labels in the label sample data, and pre-stored mapping data between proportions and thresholds may be queried according to the proportion of the label sample data amount of each fault type in the total data amount of the label sample data, so as to obtain the confidence threshold corresponding to each fault type. The mapping data between proportions and thresholds comprises a plurality of proportion value intervals and the confidence threshold corresponding to each value interval.
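The proportion-to-threshold lookup described above can be sketched as follows. This is only an illustrative sketch: the interval bounds, the threshold values, and the names `INTERVAL_UPPER_BOUNDS`, `INTERVAL_THRESHOLDS` and `thresholds_from_proportions` are hypothetical and are not specified in the source.

```python
from bisect import bisect_right
from collections import Counter

# Hypothetical pre-stored mapping data (illustrative values only).
# Proportion intervals: [0, 0.05), [0.05, 0.20), [0.20, 0.50), [0.50, 1].
INTERVAL_UPPER_BOUNDS = [0.05, 0.20, 0.50]
INTERVAL_THRESHOLDS = [0.60, 0.70, 0.80, 0.90]


def thresholds_from_proportions(labels):
    """Map each fault type to a confidence threshold via its proportion."""
    counts = Counter(labels)          # label sample data amount per fault type
    total = len(labels)               # total data amount of the label sample data
    thresholds = {}
    for fault_type, count in counts.items():
        proportion = count / total
        # Index of the value interval that contains this proportion.
        interval = bisect_right(INTERVAL_UPPER_BOUNDS, proportion)
        thresholds[fault_type] = INTERVAL_THRESHOLDS[interval]
    return thresholds
```

With these illustrative values, fault types that make up a larger share of the label sample data receive a higher threshold, which matches the per-type threshold policy described in the following paragraph.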

Consider screening the pseudo labels with a single uniform confidence threshold. When there is significant sample imbalance in the initial label training set, if the uniform threshold is set too high, valid fault samples cannot be screened from the unlabeled data, and if the uniform threshold is set too low, some of the fault samples are wrongly identified as normal samples. This means that minority types are difficult to predict in the initial stage, so a large number of wrongly labeled pseudo-label samples are added to the training set during semi-supervised learning, which reduces the accuracy of the model. For this reason, the pseudo-label confidence threshold is set separately for each fault type: a higher confidence threshold is set for fault types with many samples in the label sample data, and a lower confidence threshold is set for fault types with few samples in the label sample data. In this way, high-confidence pseudo-label samples can be accurately screened out from a large amount of unlabeled sample data while balancing screening accuracy and screening efficiency, the fitting problem of network training that arises when conventional methods process unbalanced data is solved, and samples of the various fault types in the unlabeled sample data can be accurately identified. Specifically, the method for determining the confidence threshold includes:

(1) Determining, according to the labels in the label sample data, the number of fault types included in the label sample data and the label sample data amount corresponding to each fault type.

(2) Determining the confidence threshold corresponding to each fault type according to the number of fault types, the label sample data amount corresponding to each fault type, and the theoretical sample data amount corresponding to each fault type.

In some embodiments of the present invention, the confidence threshold corresponding to each fault type is calculated from the number N of fault types, the label sample data amount corresponding to each fault type, and the theoretical sample data amount θ corresponding to each fault type [formula omitted]. The result is the confidence threshold of pseudo-label type i in the t-th round of semi-supervised learning. It should be noted that the number of semi-supervised learning rounds of 20 in the above calculation formula of the confidence threshold is only an exemplary illustration, and the number of semi-supervised learning training rounds may be set according to the actual application scenario.

Step b2, performing data screening on the unlabeled sample data according to the confidence threshold corresponding to each fault type and the prediction result corresponding to the unlabeled sample data, to obtain the target unlabeled data. The prediction result represents the confidence that the unlabeled sample data belongs to each fault type.

In some embodiments of the invention, step b2 comprises: determining, according to the prediction result corresponding to the unlabeled sample data, the confidence that the unlabeled sample data belongs to each fault type, and comparing each confidence with the confidence threshold corresponding to that fault type; if the confidence for every fault type is smaller than the corresponding confidence threshold, determining that the unlabeled sample data is a normal sample; if there is a fault type whose confidence is greater than or equal to the corresponding confidence threshold, determining that the unlabeled sample data is a fault sample and setting that fault type as its pseudo label; and eliminating the normal samples from the unlabeled sample data, setting the fault samples as the target unlabeled data, and determining the pseudo label corresponding to the target unlabeled data.
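The screening in step b2 can be sketched as follows. This is a minimal sketch under stated assumptions: the function name `screen_unlabeled` and the data layout (one dictionary of per-fault-type confidences per sample) are hypothetical, and the rule of picking the highest-confidence fault type when several fault types pass their thresholds is an assumption, since the source does not say how such ties are resolved.

```python
def screen_unlabeled(predictions, thresholds):
    """Return (sample_index, pseudo_label, confidence) for each fault sample.

    predictions: list of dicts mapping fault type -> confidence, one per sample.
    thresholds:  dict mapping fault type -> confidence threshold (step b1).
    """
    target = []
    for index, confidences in enumerate(predictions):
        # Fault types whose confidence reaches their own threshold.
        passed = [(conf, fault_type)
                  for fault_type, conf in confidences.items()
                  if conf >= thresholds[fault_type]]
        if not passed:
            continue  # every confidence is below its threshold: normal sample, eliminated
        # Assumption: if several types pass, keep the most confident one.
        conf, fault_type = max(passed)
        target.append((index, fault_type, conf))
    return target
```

The returned list plays the role of the target unlabeled data together with its pseudo labels; samples not in the list are the ones treated as normal and eliminated.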

In some embodiments of the present invention, after the target unlabeled data is obtained, the target unlabeled data, labeled with its pseudo label, may be added to the label sample data to expand the label sample data. The initial fault diagnosis model is trained on the expanded label sample data to obtain a new initial fault diagnosis model; fault detection is performed with the new initial fault diagnosis model on the remaining unlabeled sample data (the unlabeled sample data other than the target unlabeled data) to obtain the prediction result corresponding to the remaining unlabeled sample data; new target unlabeled data is screened out from the remaining unlabeled sample data according to steps b1 to b2 and added to the expanded label sample data to obtain newly expanded label sample data; and the new initial fault diagnosis model is trained on the newly expanded label sample data. This is repeated, so that multiple rounds of sample expansion are performed.

Before each round of sample expansion, the number of historical training rounds is compared with a preset round number threshold. When the number of historical training rounds is greater than or equal to the preset round number threshold, the label sample data expanded in the previous round is set as the sample training set, and the initial fault diagnosis model obtained in the previous round of training is set as the fault diagnosis model. When the number of historical training rounds is smaller than the preset round number threshold, the initial fault diagnosis model obtained in the previous round of training is trained on the label sample data expanded in the previous round to obtain the current-round initial fault diagnosis model; fault detection is performed with the current-round initial fault diagnosis model on the unlabeled sample data remaining after the previous round of screening to obtain the current-round target unlabeled sample data; and the label sample data expanded in the previous round is expanded again with the current-round target unlabeled sample data to obtain the current-round expanded label sample data. The current round number is obtained as the number of historical training rounds plus 1. When the current round number is greater than or equal to the preset round number threshold, the current-round expanded label sample data is set as the sample training set and the current-round initial fault diagnosis model is set as the fault diagnosis model; when the current round number is smaller than the preset round number threshold, sample expansion continues according to the above sample expansion method.

For example, when the preset round number threshold is 20 and the number of historical training rounds is 19, the number of historical training rounds is smaller than the preset round number threshold, so one more round of training, fault detection and sample expansion is performed as described above, giving a current round number of 20. Because the current round number is equal to the preset round number threshold, the current-round expanded label sample data is set as the sample training set, and the current-round initial fault diagnosis model is set as the fault diagnosis model.

In some embodiments of the present invention, the label sample data used for training the initial fault diagnosis model is unbalanced, which may cause the recognition accuracy of the model to be uneven across fault types. To solve this problem, in an embodiment of the present invention, the number of samples of each fault type that need to be selected from the target unlabeled data is determined according to the label sample data amount corresponding to that fault type in the label sample data, so as to obtain the sample training set. Specifically, the method for determining the sample training set includes steps c1 to c5:

Step c1, determining the label sample data amount corresponding to each fault type in the label sample data and the total sample amount of the label sample data according to the labels in the label sample data, where the label in the label sample data is the fault type label of the label sample data.

Step c2, determining the sample expansion rate corresponding to each fault type according to the label sample data amount corresponding to each fault type, the total sample amount of the label sample data and a preset scale factor.

The sample expansion rate characterizes the amount of sample data by which each fault type in the label sample data needs to be increased.

In some embodiments of the invention, step c2 comprises: determining the ratio of the label sample data amount of each fault type to the total number of samples according to the label sample data amount corresponding to each fault type and the total sample amount of the label sample data, and performing an exponential operation on the ratio and a preset scale factor to obtain the sample expansion rate of each fault type. For example, the sample expansion rate corresponding to each fault type is calculated from the label sample data amount corresponding to each fault type, the total sample amount of the label sample data and the preset scale factor [formula omitted]. In the formula, t represents the number of semi-supervised learning rounds, and the remaining quantities are the number of samples with label type i in the training set in the t-th round of semi-supervised learning, the total number of samples of all types in the training set, the expansion scale factor β, and the sample expansion rate of type i in the t-th round of semi-supervised learning.
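The source formula for the sample expansion rate is not available here, so the sketch below only illustrates the general idea of an exponential operation on the class ratio and the scale factor. The specific form exp(-β · n_i / N) is an assumption chosen for illustration, not the formula of the invention, and the name `expansion_rates` is hypothetical.

```python
import math


def expansion_rates(class_counts, beta):
    """Illustrative expansion rate per fault type.

    class_counts: dict mapping fault type -> label sample data amount.
    beta:         preset scale factor (assumed positive here).

    Assumed form: rate_i = exp(-beta * n_i / N), so that fault types with a
    smaller share of the label sample data get a larger expansion rate.
    """
    total = sum(class_counts.values())
    return {fault_type: math.exp(-beta * count / total)
            for fault_type, count in class_counts.items()}
```

Under this assumed form the rate lies in (0, 1] and decreases as a fault type becomes more common, which is one way to make the expansion favor minority fault types.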

In some embodiments of the present invention, a ratio of the tag sample data size of each fault type in the total sample number may also be determined according to the tag sample data size corresponding to each fault type and the total sample number of the tag sample data, and pre-stored sample expansion data may be queried to obtain a sample expansion rate corresponding to the ratio of the tag sample data size of each fault type in the total sample number, where the sample expansion data includes a plurality of ratio values and a sample expansion rate corresponding to each ratio value.

In some embodiments of the present invention, the sample amount of the label sample data, the total sample amount of the label sample data, and the preset scale factor corresponding to each fault type may also be input into the preset expansion rate calculation model, so as to obtain the sample expansion rate corresponding to each fault type. The preset expansion rate calculation model can be a machine learning model, a neural network model or a probability calculation model.

Step c3, determining the pseudo label of the target unlabeled data according to the prediction result of the target unlabeled data, where the pseudo label is the fault type label predicted for the target unlabeled data.

In some embodiments of the present invention, the fault type of the target unlabeled data is determined according to the prediction result of the target unlabeled data, and the fault type of the target unlabeled data is set as the pseudo label of the target unlabeled data.

Step c4, selecting the predicted label data corresponding to each fault type from the target unlabeled data according to the sample expansion rate corresponding to each fault type and the pseudo label of the target unlabeled data.

In some embodiments of the present invention, the fault types present in the target unlabeled data and the number of samples corresponding to each fault type are counted according to the pseudo labels of the target unlabeled data. The target sample data amount that needs to be extracted for each fault type is then calculated from the number of samples of that fault type in the target unlabeled data and the sample expansion rate corresponding to that fault type. For each fault type, pseudo-label samples amounting to that target sample data amount are selected from the target unlabeled data, and the selected pseudo-label samples are set as the predicted label data corresponding to the fault type.

Step c5, obtaining a sample training set according to the predicted label data, the pseudo label of the predicted label data and the label sample data.

In some embodiments of the present invention, the predicted tag data may be labeled according to a pseudo tag of the predicted tag data, and the labeled predicted tag data is added to tag sample data to obtain a sample training set.
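Steps c3 to c5 can be sketched as follows. The sketch makes two assumptions that the source does not state: the target sample data amount of a fault type is taken as the floor of its expansion rate times the number of candidates of that type, and the candidates with the highest confidence are extracted first. The names `select_predicted_label_data` and `build_training_set` are hypothetical.

```python
import math
from collections import defaultdict


def select_predicted_label_data(candidates, rates):
    """Pick pseudo-labeled samples per fault type (step c4).

    candidates: list of (sample, pseudo_label, confidence) from the target unlabeled data.
    rates:      dict mapping fault type -> sample expansion rate (step c2).
    """
    by_type = defaultdict(list)
    for sample, pseudo_label, confidence in candidates:
        by_type[pseudo_label].append((confidence, sample))

    chosen = []
    for pseudo_label, items in by_type.items():
        # Assumption: quota = floor(rate * number of candidates of this type).
        quota = math.floor(rates.get(pseudo_label, 0.0) * len(items))
        # Assumption: prefer the most confident candidates.
        items.sort(key=lambda pair: pair[0], reverse=True)
        chosen.extend((sample, pseudo_label) for _, sample in items[:quota])
    return chosen


def build_training_set(label_sample_data, candidates, rates):
    """Add the selected predicted label data to the label sample data (step c5)."""
    return list(label_sample_data) + select_predicted_label_data(candidates, rates)
```

Here `label_sample_data` is a list of (sample, label) pairs, so the result mixes originally labeled samples with pseudo-labeled ones in a single training set.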

In some embodiments of the present invention, in order to improve the accuracy of the sample labels in the sample training set, multiple rounds of sample expansion may be performed. In each round, the initial fault diagnosis model obtained in the previous round of training is trained on the label sample data expanded in the previous round to obtain the current-round initial fault diagnosis model; fault detection is performed with the current-round initial fault diagnosis model on the unlabeled sample data remaining after the previous round of screening to obtain the current-round target unlabeled sample data; the current-round sample expansion rate of each fault type is determined based on the total amount of label sample data expanded in the previous round, the fault types, and the label sample data amount corresponding to each fault type; the current-round predicted label data is selected from the current-round target unlabeled sample data according to the current-round sample expansion rate of each fault type and the prediction result of the current-round target unlabeled sample data; and the current-round predicted label data is added to the label sample data expanded in the previous round to obtain the current-round expanded label sample data. The current-round initial fault diagnosis model is then trained on the current-round expanded label sample data to obtain the next-round initial fault diagnosis model. These steps are repeated until the number of rounds is greater than or equal to a preset round number threshold; the label sample data expanded in that round is set as the sample training set, and the initial fault diagnosis model of that round is set as the fault diagnosis model. The number of rounds refers to the number of sample expansions performed.

For example, when the preset round number threshold is 20, before each round of sample expansion the historical round number is compared with the preset round number threshold. If the historical round number is greater than or equal to the preset round number threshold, the label sample data expanded in the previous round is set as the sample training set, and the initial fault diagnosis model obtained in the previous round of training is set as the fault diagnosis model. If the historical round number is smaller than the preset round number threshold, sample expansion is performed according to the sample expansion method to obtain the current-round expanded label sample data and the current-round initial fault diagnosis model. After each round of sample expansion, the current round number is compared with the preset round number threshold. If the current round number is greater than or equal to the preset round number threshold, sample expansion stops, the current-round expanded label sample data is set as the sample training set, and the current-round initial fault diagnosis model is set as the fault diagnosis model; if the current round number is smaller than the preset round number threshold, the next round of sample expansion is performed. Specifically, the sample expansion method comprises the following steps:

(1) Obtaining an initial sample training set based on the target unlabeled data and the label sample data.

In some embodiments of the present invention, an initial sample training set may be obtained according to steps c1 to c5 based on target non-label data and label sample data.

(2) Training the initial fault diagnosis model according to the initial sample training set to obtain an intermediate fault diagnosis model.

(3) Inputting the remaining unlabeled sample data (the unlabeled sample data other than the target unlabeled data) into the intermediate fault diagnosis model to obtain the prediction result corresponding to the remaining unlabeled sample data.

(4) Screening the remaining unlabeled sample data according to the prediction result corresponding to the remaining unlabeled sample data to obtain the remaining target unlabeled data.

In some embodiments of the present invention, the remaining target unlabeled data may be screened from the remaining unlabeled sample data according to steps b1 to b2.

(5) Obtaining a sample training set based on the remaining target unlabeled data and the initial sample training set.

In some embodiments of the present invention, based on the remaining target unlabeled data and the initial sample training set, a new initial sample training set may be obtained repeatedly according to steps c1 to c5, a new round of sample expansion is performed according to the sample expansion method based on the new initial sample training set, and the number of sample expansion rounds is recorded, and when the number of sample expansion rounds is greater than or equal to a preset round number threshold, the currently obtained new initial sample training set is set as the sample training set, and the sample expansion is stopped.
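The round-based sample expansion in steps (1) to (5) can be sketched as a loop. This is a schematic sketch only: `train` and `predict_and_screen` are hypothetical callables standing in for model training and for the combination of fault detection, steps b1 to b2 and steps c1 to c5; samples are assumed to be hashable; and stopping early when no unlabeled data remains is an added assumption.

```python
def expand_samples(label_sample_data, unlabeled, train, predict_and_screen, max_rounds):
    """Run multi-round sample expansion.

    train(model, training_set) -> model
    predict_and_screen(model, remaining) -> list of (sample, pseudo_label)
    max_rounds: preset round number threshold.
    """
    model = train(None, label_sample_data)      # initial fault diagnosis model
    training_set = list(label_sample_data)
    remaining = list(unlabeled)
    rounds = 0
    while rounds < max_rounds and remaining:
        # Screen target unlabeled data from what is still unlabeled.
        targets = predict_and_screen(model, remaining)
        selected = {sample for sample, _ in targets}
        # Expand the training set and drop the selected samples from the pool.
        training_set.extend(targets)
        remaining = [sample for sample in remaining if sample not in selected]
        # Retrain to obtain the model used in the next round.
        model = train(model, training_set)
        rounds += 1
    return training_set, model, rounds
```

When the loop ends, `training_set` corresponds to the sample training set and `model` to the fault diagnosis model of the last completed round.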

In some embodiments of the present invention, because the recognition accuracy of the initial fault diagnosis model is limited during sample expansion, erroneous predicted label data may be added to the label sample data and affect the accuracy of the trained model. The predicted label data therefore needs to be screened again during sample expansion to further ensure the accuracy of the labels in the finally obtained sample training set. Specifically, the method includes steps d1 to d3:

Step d1, acquiring the number of training rounds of the intermediate fault diagnosis model.

Step d2, if the number of training rounds of the intermediate fault diagnosis model meets a preset round number threshold, obtaining a sample training set according to the remaining target unlabeled data and the initial sample training set.

In some embodiments of the present invention, if the number of training rounds is greater than or equal to the preset round number threshold, it is determined that the number of training rounds of the intermediate fault diagnosis model satisfies the preset round number threshold, and sample expansion is performed according to the remaining target unlabeled data and the initial sample training set in steps c1 to c5 to obtain a sample training set.

Step d3, if the number of training rounds of the intermediate fault diagnosis model does not meet the preset round number threshold, determining whether the number of training rounds of the intermediate fault diagnosis model meets the preset round number interval, and determining a sample training set according to the result of that determination.

In some embodiments of the present invention, if the number of training rounds is less than the preset round number threshold, it is determined that the number of training rounds of the intermediate fault diagnosis model does not satisfy the preset round number threshold.

In some embodiments of the present invention, in order to remove erroneous samples in time, when the number of training rounds of the intermediate fault diagnosis model does not meet the preset round number threshold, fault detection may be performed on the unlabeled sample data at every preset round number interval with the initial fault diagnosis model obtained in the current round of training, and the predicted label data may be re-screened based on the detection result of the unlabeled sample data. Because the later model recognizes faults more accurately, this removes the erroneous samples and improves the labeling accuracy of the sample training set. The preset round number interval may be the interval between the training round of the first sample expansion and the training round of the previous fault detection on the unlabeled sample data, or may be the interval between the current training round and the historical round of the previous fault detection on the unlabeled sample data.

For example, when the preset round number interval is the interval between the current training round and the historical round of the previous fault detection, the preset round number threshold is 20 and the preset round number interval is 4, then, starting from the first round of training, every 4 rounds the initial sample training set obtained by those 4 rounds of expansion is deleted, fault detection is performed on the unlabeled sample data with the initial fault diagnosis model obtained in the current round of training, and the predicted label data is re-screened based on the detection result to obtain a new initial sample training set. In this way the erroneous samples are eliminated by the more accurate initial fault diagnosis model, and the accuracy of the finally obtained sample training set is improved. Specifically, the sample secondary screening method comprises the following steps:

(1) If the training round number of the intermediate fault diagnosis model meets the preset round number interval, inputting unlabeled sample data into the intermediate fault diagnosis model to obtain second target unlabeled data, and obtaining a new initial sample training set based on the second target unlabeled data and the labeled sample data; and training the intermediate fault diagnosis model according to the new initial sample training set until the training round number of the intermediate fault diagnosis model is greater than or equal to a preset round number threshold value, and obtaining a sample training set.

(2) If the number of training rounds of the intermediate fault diagnosis model does not meet the preset round interval, updating the initial sample training set based on the residual target unlabeled data to obtain a new initial sample training set, training the intermediate fault diagnosis model according to the new initial sample training set, and inputting unlabeled sample data into the intermediate fault diagnosis model meeting the preset round interval when the number of training rounds of the intermediate fault diagnosis model meets the preset round interval until the number of training rounds of the intermediate fault diagnosis model is larger than or equal to the preset round threshold to obtain the sample training set.

In some embodiments of the present invention, a round number interval between the training round number of the intermediate fault diagnosis model and the training round number of the first sample expansion may be calculated, the round number interval is compared with a preset round number interval, and if the round number interval is equal to the preset round number interval, it is determined that the training round number of the intermediate fault diagnosis model satisfies the preset round number interval; and if the round number interval is larger than or smaller than the preset round number interval, determining that the training round number of the intermediate fault diagnosis model does not meet the preset round number interval.

In some embodiments of the present invention, the round number difference between the number of training rounds of the intermediate fault diagnosis model and the training round of the first sample expansion is calculated, and the remainder of dividing the round number difference by a preset round number difference is determined. If the remainder is 0, it is determined that the number of training rounds of the intermediate fault diagnosis model meets the preset round number interval; if the remainder is not 0, it is determined that the number of training rounds of the intermediate fault diagnosis model does not meet the preset round number interval. For example, when the preset round number difference is 4, if the number of training rounds of the intermediate fault diagnosis model is 4 or a multiple of 4, it is determined that the number of training rounds meets the preset round number interval, that is, erroneous sample elimination is performed once every time 3 rounds of sample expansion are completed; when the preset round number difference is 5, if the number of training rounds of the intermediate fault diagnosis model is 5 or a multiple of 5, it is determined that the number of training rounds meets the preset round number interval, that is, erroneous sample elimination is performed once every time 4 rounds of sample expansion are completed.
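The remainder check can be sketched as follows. The function name `meets_round_interval` is hypothetical; treating the first sample expansion as round 0 (so that rounds 4, 8, 12, ... qualify, as in the example above) and excluding a round number difference of 0 are assumptions made for this sketch.

```python
def meets_round_interval(training_round, first_expansion_round, preset_difference):
    """Return True when erroneous sample elimination is due in this round."""
    difference = training_round - first_expansion_round
    # Assumption: a difference of 0 (no expansion done yet) does not qualify.
    return difference > 0 and difference % preset_difference == 0


# Rounds, out of 20, in which re-screening would be triggered for a preset
# round number difference of 4, with the first expansion counted as round 0.
rescreen_rounds = [r for r in range(1, 21) if meets_round_interval(r, 0, 4)]
```

Under these assumptions, three ordinary expansion rounds are followed by one round with erroneous sample elimination, matching the first example in the paragraph above.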

In some embodiments of the invention, if the number of training rounds of the intermediate fault diagnosis model meets the preset round number interval, the unlabeled sample data is input into the intermediate fault diagnosis model to obtain second target unlabeled data, and a new initial sample training set is obtained according to steps c1 to c5 based on the second target unlabeled data and the label sample data; the intermediate fault diagnosis model is trained according to the new initial sample training set until the number of training rounds of the intermediate fault diagnosis model is greater than or equal to the preset round number threshold, and the sample training set is obtained.

In some embodiments of the present invention, if the number of training rounds of the intermediate fault diagnosis model does not satisfy the preset round number interval, updating the initial sample training set based on the remaining target unlabeled data according to the sample expansion method to obtain a new initial sample training set, training the intermediate fault diagnosis model according to the new initial sample training set, and when the number of training rounds of the intermediate fault diagnosis model satisfies the preset round number interval, inputting unlabeled sample data into the intermediate fault diagnosis model satisfying the preset round number interval until the number of training rounds of the intermediate fault diagnosis model is greater than or equal to the preset round number threshold, to obtain the sample training set.

In some embodiments of the present invention, when the number of training rounds of the intermediate fault diagnosis model is greater than or equal to the preset round number threshold, a sample training set is obtained and test samples are input into the current intermediate fault diagnosis model for testing. If the recognition accuracy of the test result is greater than or equal to a preset accuracy threshold, sample expansion is stopped, the sample training set is obtained, and the current intermediate fault diagnosis model is set as the fault diagnosis model; if the recognition accuracy of the test result is smaller than the preset accuracy threshold, the sample training set is input to the current intermediate fault diagnosis model for training to obtain the fault diagnosis model.

In some embodiments of the present invention, if the recognition accuracy of the test result is less than the preset accuracy threshold, the current intermediate fault diagnosis model is set as the initial model and the steps of the sample generation method above are repeated: the labeled sample data is input into the initial model for training to obtain the initial fault diagnosis model; fault detection is performed on the unlabeled sample data based on the initial fault diagnosis model to obtain the prediction result of the unlabeled sample data; the unlabeled sample data is screened according to the prediction result to obtain target unlabeled data; and a sample training set is obtained based on the target unlabeled data and the labeled sample data. This is repeated until the recognition accuracy of the current intermediate fault diagnosis model is greater than or equal to the preset accuracy threshold, so as to obtain the fault diagnosis model.

According to the embodiment of the invention, the target non-label data is selected based on the confidence coefficient threshold, and the non-label sample data is accurately added with the pseudo label by calculating the sample expansion rate and the sample secondary screening method, so that the expansion of the label sample data is realized, the data quality of the photovoltaic power station is greatly improved, the unbalance of different types of samples is gradually improved, the problems of over-fitting and under-fitting of network training when the unbalanced data is processed by the conventional method are solved, and the accuracy of subsequent fault diagnosis is further ensured.

In some embodiments of the present invention, after a sample training set is obtained, a fault diagnosis model may be trained based on the sample training set to obtain a trained fault diagnosis model corresponding to each photovoltaic power station, operation data of a photovoltaic module of each photovoltaic power station is identified based on the trained fault diagnosis model corresponding to the photovoltaic power station, and it is determined whether a fault exists in operation of the photovoltaic power station and a fault type corresponding to the fault when the fault exists.

In some embodiments of the present invention, it is considered that each photovoltaic power station can only collect samples of some fault types, which means that a fault diagnosis model established from the sample training set of a single photovoltaic power station can only identify some faults and cannot diagnose fault types that are absent from that sample training set. Based on this, the embodiment of the invention performs federated learning on the basis of the fault diagnosis models established by a plurality of photovoltaic power stations to obtain a joint diagnosis model, and each photovoltaic power station identifies the operation data of its photovoltaic modules on the basis of the joint diagnosis model to determine whether the operation of the photovoltaic power station has a fault and, when a fault exists, the corresponding fault type. Specifically, as shown in fig. 4, fig. 4 is a schematic flow chart of a method for obtaining a joint diagnosis model based on federated learning according to an embodiment of the present invention, where the method is applied to a server and specifically includes steps 401 to 403:

401, obtaining the fault diagnosis model parameters sent by each electronic device. The parameters of the fault diagnosis model include, but are not limited to, the weight of the fault diagnosis model, the number of network layers, and the size of the network layers, for example, when the fault diagnosis model is a model based on a convolutional network, the parameters of the fault diagnosis model include, but are not limited to, the weight of the fault diagnosis model, the number of convolutional layers, the size of convolutional kernel, the step size of convolution, the size of pooling kernel, the step size of pooling, the manner of pooling, and the like. The fault diagnosis model is obtained by training a sample training set obtained by the sample generation method; and each photovoltaic power station is provided with one piece of electronic equipment, and each piece of electronic equipment corresponds to at least one group of fault diagnosis model parameters.

And 402, aggregating the parameters of the fault diagnosis models to obtain initial combined model parameters.

In some embodiments of the present invention, each fault diagnosis model parameter may be accumulated to obtain an initial combined model parameter.

In some embodiments of the present invention, the weight of each fault diagnosis model may be determined, and the initial combined model parameter may be obtained by accumulating the weight of each fault diagnosis model and the fault diagnosis model parameter of the fault diagnosis model. In some embodiments of the present invention, the weight of each fault diagnosis model may be determined according to the sample data amount of the sample training set corresponding to the fault diagnosis model, for example, a ratio between the sample data amount of the sample training set corresponding to each fault diagnosis model and the total sample data amount of the sample training sets corresponding to all fault diagnosis models may be set as the weight of the fault diagnosis model.

And 403, obtaining a combined diagnosis model according to the initial combined model parameters.

In some embodiments of the present invention, model building may be performed according to the initial joint model parameters to obtain a joint diagnosis model.

In some embodiments of the invention, after the joint diagnosis model is obtained, it is sent to the electronic device deployed in each photovoltaic power station. After receiving the joint diagnosis model, the electronic device trains it according to the sample training set and inputs test samples into the joint diagnosis model for model testing to obtain the test precision; when the test precision is greater than or equal to a preset precision threshold, the trained joint diagnosis model is obtained.

In some embodiments of the present invention, since the sample training sets generated by the photovoltaic power stations differ in size, the weight of each fault diagnosis model parameter may be determined, when parameters are aggregated, according to the size of the sample training set generated by the corresponding photovoltaic power station. In addition, because the electronic devices deployed in the photovoltaic power stations differ both in the size of their sample training sets and in their computing capability, the time each electronic device takes to obtain its sample training set and send its fault diagnosis model parameters also differs. If parameter aggregation were performed only after the fault diagnosis model parameters of all electronic devices had been received, data transmission efficiency would be low and model parameter aggregation would take a long time. To solve these problems, in an embodiment of the present invention the number of received fault diagnosis model parameters is recorded, and model parameter aggregation is performed once that number reaches a preset proportion threshold of the number of all electronic devices. Specifically, the initial joint model parameter determination method includes steps e1 to e3:

step e1, determining a first number of received fault diagnosis model parameters.

And e2, if the first quantity is larger than a preset quantity threshold value, respectively determining the sample proportion of the training sample quantity of the fault diagnosis model corresponding to each electronic device.

In some embodiments of the present invention, if the first number is less than or equal to the predetermined number threshold, the waiting is continued.

In some embodiments of the invention, the preset number threshold may be obtained from the theoretical number of fault diagnosis model parameters participating in federated learning and the preset proportion threshold. The theoretical number of fault diagnosis model parameters participating in federated learning refers to the number of photovoltaic power stations participating in the joint modeling. The specific value of the preset proportion threshold is not limited in the embodiment of the present invention, and may be, for example, 60% or 70%.

In some embodiments of the invention, step e2 comprises:

(1) And determining the data volume of the sample training set of the fault diagnosis model corresponding to each electronic device.

(2) And summarizing the data quantity of the sample training set of the fault diagnosis model corresponding to each electronic device to obtain the total data quantity of the sample training set of the fault diagnosis model corresponding to each electronic device.

(3) And obtaining the sample proportion of the training sample amount of the fault diagnosis model corresponding to each electronic device by determining the proportion of the sample training set of the fault diagnosis model corresponding to each electronic device in the total data amount.

And e3, obtaining initial combined model parameters according to the sample proportion of the training sample amount of the fault diagnosis model corresponding to each electronic device and the fault diagnosis model parameters corresponding to each electronic device.

In some embodiments of the present invention, the initial joint model parameters may be calculated from the sample proportion of the training sample amount of the fault diagnosis model corresponding to each electronic device and the fault diagnosis model parameters corresponding to each electronic device by

$$G^{t} = \sum_{x=1}^{K} \frac{d_x}{D}\, g_x^{t}$$

where $d_x$ represents the data volume of the sample training set of photovoltaic power station $x$, $K$ and $D$ represent the first number of received fault diagnosis model parameters and the total data volume of the sample training sets, respectively, $g_x^{t}$ denotes the fault diagnosis model parameters sent by photovoltaic power station $x$ (this symbol is not defined in the surviving text and is assumed here), and $G^{t}$ represents the initial joint model parameters obtained after aggregation.
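The sample-weighted aggregation of step e3 can be sketched as follows, where each station's parameters are weighted by its share d_x / D of the total training data. This is a minimal sketch under stated assumptions: the model parameters of every station are assumed to be flattened into equal-length lists of floats, and the names `aggregate_parameters`, `station_params` and `station_sample_counts` are hypothetical.

```python
from typing import Dict, List


def aggregate_parameters(
    station_params: Dict[str, List[float]],
    station_sample_counts: Dict[str, int],
) -> List[float]:
    """Sample-weighted average of the received parameter vectors.

    Each received vector is weighted by d_x / D, where d_x is the station's
    training-set size and D is the total over the received stations.
    """
    if not station_params:
        raise ValueError("no fault diagnosis model parameters received")
    total = sum(station_sample_counts[s] for s in station_params)  # D
    if total <= 0:
        raise ValueError("total sample count must be positive")
    length = len(next(iter(station_params.values())))
    joint = [0.0] * length
    for station, params in station_params.items():
        if len(params) != length:
            raise ValueError("parameter vectors must have equal length")
        ratio = station_sample_counts[station] / total  # d_x / D
        for i, value in enumerate(params):
            joint[i] += ratio * value
    return joint


# Two stations: B holds three times as much training data as A.
joint = aggregate_parameters(
    {"A": [1.0, 2.0], "B": [3.0, 6.0]},
    {"A": 100, "B": 300},
)
print(joint)
```

Because the weights are computed only over the stations whose parameters were actually received, they always sum to one even when some stations have not reported yet.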

In some embodiments of the present invention, because the electronic devices deployed in the photovoltaic power stations differ in the size of their sample training sets and in computing capability, the time each electronic device takes to obtain its sample training set and send its fault diagnosis model parameters differs. Waiting until the fault diagnosis model parameters of all electronic devices have been received before performing parameter aggregation may result in low data transmission efficiency and a long model parameter aggregation time. Therefore, the fault diagnosis model parameters received within a preset time period may be aggregated according to the initial joint model parameter determination method to obtain the initial joint model parameters.

In some embodiments of the present invention, a first number of the fault diagnosis model parameters received within a preset time period may be determined, and when the first number is greater than a preset number threshold, aggregation may be performed according to steps e2 to e3 to obtain an initial combined model parameter.
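The count-based gate used in step e2 and in the time-window variant above can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, and rounding the product down to an integer is an assumption, since the source does not say how the number threshold is rounded.

```python
import math


def preset_number_threshold(station_count: int, proportion_threshold: float) -> int:
    """Number threshold from the count of participating stations and the proportion."""
    if station_count <= 0:
        raise ValueError("station_count must be positive")
    if not 0.0 < proportion_threshold <= 1.0:
        raise ValueError("proportion_threshold must be in (0, 1]")
    # Rounding down is an assumption; the source does not specify rounding.
    return math.floor(station_count * proportion_threshold)


def ready_to_aggregate(
    received_count: int, station_count: int, proportion_threshold: float
) -> bool:
    """Step e2 proceeds only when the first number is strictly greater than the threshold."""
    return received_count > preset_number_threshold(station_count, proportion_threshold)


# With 10 stations and a 50% proportion, aggregation starts at the 6th upload.
print(ready_to_aggregate(5, 10, 0.5), ready_to_aggregate(6, 10, 0.5))
```

The strict comparison mirrors the wording of steps e1 and e2, where the server keeps waiting while the first number is less than or equal to the preset number threshold.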

In some embodiments of the present invention, after the initial joint model parameters are obtained, an initial joint diagnosis model may be established based on them and sent to the electronic devices that sent fault diagnosis model parameters within the preset time period. Each electronic device tests the received initial joint diagnosis model. When the test accuracy is greater than or equal to a preset accuracy threshold, the initial joint diagnosis model is set as the joint diagnosis model; when the test accuracy is less than the preset accuracy threshold, feedback information is sent to the server, so that the server adjusts the initial joint model parameters based on the received feedback information to obtain an adjusted initial joint diagnosis model and sends the adjusted initial joint diagnosis model to the electronic devices. This is repeated until the test accuracy of the adjusted initial joint diagnosis model is greater than or equal to the preset accuracy threshold, whereupon the joint diagnosis model is obtained. Specifically, the method for establishing the joint diagnosis model comprises the following steps f1 to f3:

and f1, establishing and obtaining an initial combined diagnosis model according to the initial combined model parameters.

And f2, sending the initial joint diagnosis model to each electronic device, and receiving first target feedback information returned by each electronic device. The first target feedback information is generated according to the first test precision when the first test precision is smaller than a preset precision threshold value.

In some embodiments of the present invention, after receiving the initial joint diagnosis model, the electronic device trains the initial joint diagnosis model with the sample training set obtained by the sample generation method to obtain a trained initial joint diagnosis model; inputs test sample data into the trained initial joint diagnosis model for model testing to obtain a test result corresponding to the test sample data; obtains the first test precision based on the test result corresponding to the test sample data and the real identification result corresponding to the test sample data; and compares the first test precision with the preset precision threshold. If the first test precision is smaller than the preset precision threshold, first target feedback information is generated according to the difference between the first test precision and the preset precision threshold and sent to the server; if the first test precision is greater than or equal to the preset precision threshold, confirmation information is sent to the server, and the trained initial joint diagnosis model is set as the joint diagnosis model.

And f3, adjusting the initial combined model parameters according to the first target feedback information returned by each electronic device to obtain a combined diagnosis model.

In some embodiments of the present invention, when receiving first target feedback information returned by each electronic device, the server adjusts initial joint model parameters according to the first target feedback information returned by each electronic device, to obtain adjusted initial joint model parameters, obtains an intermediate joint diagnosis model based on the adjusted initial joint model parameters, sends the intermediate joint diagnosis model to the electronic device, executes step f2, to obtain second target feedback information returned by each electronic device, performs re-adjustment on the adjusted initial joint model parameters based on the second target feedback information, to obtain a new intermediate joint diagnosis model, sends the new intermediate joint diagnosis model to the electronic device, repeats the steps until a preset stop condition is reached, and sets the new intermediate joint diagnosis model at this time as the joint diagnosis model. The preset stop condition may be that the number of times of adjusting the initial joint model parameter is greater than or equal to a preset number threshold, or that the number of received confirmation information is greater than or equal to a preset threshold.
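The two preset stop conditions of the server-side adjustment loop can be sketched as follows. This is a simplified simulation, not the patented procedure: the actual parameter adjustment is omitted, the confirmation count is assumed to be evaluated per round (the source does not say whether it is cumulative), and the names `run_adjustment_rounds`, `max_adjustments` and `min_confirmations` are hypothetical.

```python
from typing import List, Tuple


def run_adjustment_rounds(
    rounds: List[List[str]],
    max_adjustments: int,
    min_confirmations: int,
) -> Tuple[int, str]:
    """Replay per-round device responses and report when the server would stop.

    Each inner list holds the response type ("confirmation" or "feedback")
    returned by every electronic device in one round.
    """
    adjustments = 0
    for responses in rounds:
        confirmations = responses.count("confirmation")
        if confirmations >= min_confirmations:
            return adjustments, "enough_confirmations"
        # The server would adjust the joint model parameters here (omitted).
        adjustments += 1
        if adjustments >= max_adjustments:
            return adjustments, "adjustment_limit"
    return adjustments, "no_more_responses"


result = run_adjustment_rounds(
    [["feedback", "feedback", "confirmation"],
     ["feedback", "confirmation", "confirmation"]],
    max_adjustments=5,
    min_confirmations=2,
)
print(result)
```

Either condition ends the loop, so a model that never satisfies enough devices still terminates once the adjustment budget is exhausted.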

In some embodiments of the present invention, in order to save cache and storage resources of the server, during establishment of the joint diagnosis model the server empties the received fault diagnosis model parameters each time it has sent the initial joint diagnosis model, established from the aggregated initial joint model parameters, to the electronic devices, and then waits for feedback from the electronic devices. This specifically includes: establishing an initial joint diagnosis model according to the initial joint model parameters and sending it to each electronic device; the electronic device trains the initial joint diagnosis model with the sample training set obtained by the sample generation method to obtain a trained initial joint diagnosis model, inputs test sample data into the trained initial joint diagnosis model for model testing to obtain a test result corresponding to the test sample data, obtains the first test precision based on the test result and the real identification result corresponding to the test sample data, and compares the first test precision with the preset precision threshold; if the first test precision is smaller than the preset precision threshold, the electronic device sets the parameters of the trained initial joint diagnosis model as its new fault diagnosis model parameters and sends the new fault diagnosis model parameters to the server; after receiving a number of new fault diagnosis model parameters that satisfies the preset number threshold, the server aggregates them again according to the initial joint model parameter determination method to obtain a new initial joint diagnosis model and sends the new initial joint diagnosis model to the electronic devices; this is repeated until the test precision of the currently obtained new initial joint diagnosis model is greater than or equal to the preset precision threshold, at which point the currently obtained new initial joint diagnosis model is set as the joint diagnosis model.
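The device-side test described in the embodiments above (computing the first test precision, comparing it with the preset precision threshold, and answering the server) can be sketched as follows. This is a minimal sketch under stated assumptions: precision is taken to be the fraction of test samples whose predicted fault type matches the real one, the response is modelled as a plain dictionary, and the names `evaluate_and_respond`, `gap` and the fault type strings are hypothetical.

```python
from typing import Dict, List, Union

Response = Dict[str, Union[str, float]]


def evaluate_and_respond(
    predicted: List[str], actual: List[str], precision_threshold: float
) -> Response:
    """Compute the test precision and build the message sent to the server."""
    if not actual or len(predicted) != len(actual):
        raise ValueError("predicted and actual must be non-empty and equal in length")
    correct = sum(p == a for p, a in zip(predicted, actual))
    precision = correct / len(actual)
    if precision >= precision_threshold:
        return {"type": "confirmation", "precision": precision}
    # Feedback carries the shortfall between the threshold and the precision.
    return {
        "type": "feedback",
        "precision": precision,
        "gap": precision_threshold - precision,
    }


response = evaluate_and_respond(
    predicted=["shading", "open_circuit", "shading", "normal"],
    actual=["shading", "open_circuit", "open_circuit", "normal"],
    precision_threshold=0.9,
)
print(response["type"], response["precision"])
```

In the variant where devices return retrained parameters instead of feedback, the same comparison would decide whether new fault diagnosis model parameters are uploaded.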

In some embodiments of the present invention, fault diagnosis model parameters that arrive after the preset time period has elapsed may be cached during aggregation and used as fault diagnosis model parameters in the next parameter aggregation. This avoids wasting computing resources and increases the aggregation frequency; in particular, for photovoltaic power stations whose electronic devices update their models more slowly, the asynchronous update and cache mechanism allows them to participate in parameter aggregation in subsequent rounds.
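The caching of late uploads can be sketched as follows. This is an illustrative sketch only: uploads are represented by a station identifier and an arrival time measured from the start of the round, and the names `split_by_deadline`, `deadline` and `cached` are hypothetical.

```python
from typing import List, Tuple

Upload = Tuple[str, float]  # (station id, arrival time in seconds since round start)


def split_by_deadline(
    uploads: List[Upload], cached: List[str], deadline: float
) -> Tuple[List[str], List[str]]:
    """Return (stations aggregated this round, stations cached for the next round)."""
    this_round = list(cached)      # late arrivals from the previous round join now
    next_cache: List[str] = []
    for station, arrival in uploads:
        if arrival <= deadline:
            this_round.append(station)
        else:
            next_cache.append(station)  # kept for the next aggregation, not discarded
    return this_round, next_cache


aggregated, carried_over = split_by_deadline(
    uploads=[("A", 10.0), ("B", 95.0), ("C", 40.0)],
    cached=["D"],
    deadline=60.0,
)
print(aggregated, carried_over)
```

A slow station such as B therefore misses the current aggregation but still contributes to the following one.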

The embodiment of the invention simultaneously considers the problems of actual field photovoltaic power stations: an insufficient number of labeled samples, limited and unbalanced fault types, low data quality, and an obvious distribution difference between labeled sample data and unlabeled sample data. By combining a small sample technology, it solves the problem of training set data quality in actual operation and maintenance of the photovoltaic power stations, greatly improves the training set data quality of the photovoltaic power stations, and improves the accuracy of subsequent diagnosis models through sample expansion. Through a federated learning mechanism, it realizes joint modeling of a plurality of photovoltaic power stations and optimized model caching, so that the fault types of the photovoltaic power stations are fully utilized while privacy and communication efficiency are guaranteed, and the generalization of the model is greatly improved.

In some embodiments of the present invention, after the joint diagnosis model is obtained, for the electronic devices deployed on each photovoltaic power station, fault diagnosis may be performed through the joint diagnosis model, so as to obtain a fault diagnosis result. Specifically, the fault diagnosis method based on the combined diagnosis model comprises the following steps:

(1) And collecting the operation data of the photovoltaic module to be detected.

(2) And inputting the operation data into the combined diagnosis model for fault diagnosis to obtain a fault diagnosis result of the photovoltaic module to be detected.
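The two diagnosis steps above can be sketched as follows. This is a minimal sketch under stated assumptions: the joint diagnosis model is represented by any callable that maps operation data to a confidence per fault type, the fault diagnosis result is taken to be the type with the highest confidence, and `diagnose`, `toy_joint_model` and the fault type names are hypothetical stand-ins.

```python
from typing import Callable, Dict, Sequence

Model = Callable[[Sequence[float]], Dict[str, float]]


def diagnose(operation_data: Sequence[float], joint_model: Model) -> str:
    """Return the fault type with the highest confidence for one module."""
    confidences = joint_model(operation_data)
    if not confidences:
        raise ValueError("the model returned no fault types")
    return max(confidences, key=confidences.get)


def toy_joint_model(operation_data: Sequence[float]) -> Dict[str, float]:
    """Stand-in for a trained joint diagnosis model (illustration only)."""
    low_current = operation_data[0] < 1.0
    return {
        "normal": 0.1 if low_current else 0.8,
        "open_circuit": 0.7 if low_current else 0.1,
        "shading": 0.2 if low_current else 0.1,
    }


print(diagnose([0.2, 35.1], toy_joint_model))
```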

According to the sample generation method provided by the embodiment of the invention, the problem of training set data quality in actual operation and maintenance of the photovoltaic power station is solved by combining a small sample technology, the training set data quality of the photovoltaic power station is greatly improved, and the accuracy of a subsequent diagnosis model is improved by sample expansion.

In order to better implement the sample generation method provided by the embodiment of the present invention, on the basis of the sample generation method, the embodiment of the present invention provides a sample generation apparatus, as shown in fig. 5, where fig. 5 is a schematic structural diagram of the sample generation apparatus provided by the embodiment of the present invention, and the sample generation apparatus shown includes:

the acquisition module is used for acquiring sample operation data of at least one photovoltaic module; the sample operation data comprises labeled sample data with a label and unlabeled sample data without the label;

the screening module is used for screening the non-label sample data according to the prediction result corresponding to the non-label sample data to obtain target non-label data;

In some embodiments of the invention, the screening module is configured to: determining a confidence threshold corresponding to each fault type according to a label in the label sample data; the label in the label sample data is a fault type label of the label sample data; according to the confidence coefficient threshold value corresponding to each fault type and the prediction result corresponding to the non-label sample data, performing data screening on the non-label sample data to determine to obtain target non-label data; the prediction result represents the confidence that the unlabeled sample data belongs to each fault type.
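The screening performed by the screening module can be sketched as follows. This is a minimal sketch under stated assumptions: a sample is kept when its highest predicted confidence reaches the confidence threshold of that predicted fault type, the thresholds are given as inputs rather than derived from the labeled data, and the names `screen_unlabeled` and the fault type strings are hypothetical.

```python
from typing import Dict, List, Tuple


def screen_unlabeled(
    predictions: List[Dict[str, float]],
    thresholds: Dict[str, float],
) -> List[Tuple[int, str]]:
    """Return (sample index, predicted fault type) for samples that pass screening."""
    selected: List[Tuple[int, str]] = []
    for index, confidences in enumerate(predictions):
        fault_type = max(confidences, key=confidences.get)
        # Assumed rule: compare the top confidence with that type's own threshold.
        if confidences[fault_type] >= thresholds[fault_type]:
            selected.append((index, fault_type))
    return selected


target = screen_unlabeled(
    predictions=[
        {"shading": 0.90, "open_circuit": 0.10},
        {"shading": 0.60, "open_circuit": 0.40},
        {"shading": 0.20, "open_circuit": 0.80},
    ],
    thresholds={"shading": 0.85, "open_circuit": 0.70},
)
print(target)
```

Using a separate threshold per fault type lets rarer fault types be admitted at a lower confidence than common ones, which is consistent with the stated goal of reducing class imbalance.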

In some embodiments of the invention, the screening module is configured to: determining the type number of the fault types included in the obtained label sample data and the label sample data volume corresponding to each fault type according to the label in the label sample data; and determining a confidence coefficient threshold value corresponding to each fault type according to the type number of the fault types, the label sample data size corresponding to each fault type and the theoretical sample data size corresponding to each fault type.

In some embodiments of the invention, the sample module is configured to: according to the label in the label sample data, determining the label sample data volume corresponding to each fault type in the label sample data and the total sample data volume of the label sample data; the label in the label sample data is a fault type label of the label sample data; determining to obtain a sample expansion rate corresponding to each fault type according to the label sample data amount corresponding to each fault type, the total sample amount of the label sample data and a preset proportional factor; determining a pseudo label of the target non-label data according to the prediction result of the target non-label data; the pseudo label is a fault type label predicted by target label-free data; selecting prediction tag data corresponding to each fault type from the target non-tag data according to the sample expansion rate corresponding to each fault type and the pseudo tag of the target non-tag data; and obtaining a sample training set according to the predicted label data, the pseudo label of the predicted label data and the label sample data.

In some embodiments of the invention, the sample module is configured to: obtaining an initial sample training set based on target label-free data and label sample data; training the initial fault diagnosis model according to the initial sample training set to obtain an intermediate fault diagnosis model; inputting the residual unlabeled sample data except the target unlabeled data in the unlabeled sample data into the intermediate fault diagnosis model to obtain a prediction result corresponding to the residual unlabeled sample data; screening the residual unlabeled sample data according to a prediction result corresponding to the residual unlabeled sample data to obtain residual target unlabeled data; and obtaining a sample training set based on the residual target unlabeled data and the initial sample training set.

In some embodiments of the invention, the sample module is configured to: acquiring the number of training rounds of the intermediate fault diagnosis model;

if the training round number of the intermediate fault diagnosis model meets a preset round number threshold value, obtaining a sample training set according to the residual target unlabeled data and the initial sample training set; and if the training round number of the intermediate fault diagnosis model does not meet the preset round number threshold value, determining whether the training round number of the intermediate fault diagnosis model meets the preset round number interval, and determining a sample training set according to the determination result of whether the training round number of the intermediate fault diagnosis model meets the preset round number interval.

In some embodiments of the invention, the sample module is configured to: if the number of training rounds of the intermediate fault diagnosis model satisfies the preset round number interval, input the unlabeled sample data into the intermediate fault diagnosis model to obtain second target unlabeled data, and obtain a new initial sample training set based on the second target unlabeled data and the labeled sample data; train the intermediate fault diagnosis model according to the new initial sample training set until the number of training rounds of the intermediate fault diagnosis model is greater than or equal to the preset round number threshold, and obtain the sample training set; if the number of training rounds of the intermediate fault diagnosis model does not satisfy the preset round number interval, update the initial sample training set based on the remaining target unlabeled data to obtain a new initial sample training set, train the intermediate fault diagnosis model according to the new initial sample training set, and, when the number of training rounds of the intermediate fault diagnosis model satisfies the preset round number interval, input the unlabeled sample data into the intermediate fault diagnosis model satisfying the preset round number interval, until the number of training rounds of the intermediate fault diagnosis model is greater than or equal to the preset round number threshold, to obtain the sample training set.

The sample generation device provided by the embodiment of the invention solves the problem of training set data quality in actual operation and maintenance of the photovoltaic power station by combining a small sample technology, greatly improves the training set data quality of the photovoltaic power station, and improves the accuracy of a subsequent diagnosis model through sample expansion.

An embodiment of the present invention further provides an electronic device, as shown in fig. 6, which shows a schematic structural diagram of the electronic device according to the embodiment of the present invention, specifically:

the electronic device may include components such as a processor 601 of one or more processing cores, memory 602 of one or more computer-readable storage media, a power supply 603, and an input unit 604. Those skilled in the art will appreciate that the electronic device configuration shown in fig. 6 does not constitute a limitation of the electronic device and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components. Wherein:

the processor 601 is a control center of the electronic device, connects various parts of the whole electronic device by using various interfaces and lines, and performs various functions of the electronic device and processes data by operating or executing software programs and/or modules stored in the memory 602 and calling data stored in the memory 602, thereby performing overall monitoring of the electronic device. Optionally, processor 601 may include one or more processing cores; preferably, the processor 601 may integrate an application processor, which mainly handles operating systems, user interfaces, application programs, etc., and a modem processor, which mainly handles wireless communications. It will be appreciated that the modem processor described above may not be integrated into the processor 601.

The memory 602 may be used to store software programs and modules, and the processor 601 executes various functional applications and data processing by running the software programs and modules stored in the memory 602. The memory 602 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required by at least one function (such as a sound playing function, an image playing function, etc.), and the like; the data storage area may store data created according to use of the electronic device, and the like. Further, the memory 602 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. Accordingly, the memory 602 may also include a memory controller to provide the processor 601 with access to the memory 602.

The electronic device further comprises a power supply 603 for supplying power to the various components, and preferably, the power supply 603 is logically connected to the processor 601 through a power management system, so that functions of managing charging, discharging, power consumption, and the like are realized through the power management system. The power supply 603 may also include any component of one or more dc or ac power sources, recharging systems, power failure detection circuitry, power converters or inverters, power status indicators, and the like.

The electronic device may further include an input unit 604, and the input unit 604 may be used to receive input numeric or character information and generate keyboard, mouse, joystick, optical or trackball signal inputs related to user settings and function control.

Although not shown, the electronic device may further include a display unit and the like, which are not described in detail herein. Specifically, in this embodiment, the processor 601 in the electronic device loads the executable file corresponding to the process of one or more application programs into the memory 602 according to the following instructions, and the processor 601 runs the application program stored in the memory 602, thereby implementing various functions as follows:

collecting sample operation data of at least one photovoltaic module, wherein the sample operation data comprises labeled sample data with a label and unlabeled sample data without a label;

training an initial model according to the labeled sample data to obtain an initial fault diagnosis model;

inputting the unlabeled sample data into the initial fault diagnosis model for fault detection to obtain a prediction result corresponding to the unlabeled sample data;

screening the unlabeled sample data according to the prediction result corresponding to the unlabeled sample data to obtain target unlabeled data;

and obtaining a sample training set based on the target unlabeled data and the labeled sample data.
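The five functions listed above can be illustrated with a minimal sketch. The sketch is non-limiting and rests on assumptions that are not part of the embodiments: the nearest-centroid stand-in model, the softmax scoring, the single threshold of 0.9, the fault type names, and the function names `train_centroids`, `predict_confidences`, and `generate_training_set` are all hypothetical.

```python
from collections import defaultdict
import math


def train_centroids(labeled):
    """Stand-in 'initial model' (assumption): one mean feature vector per fault type."""
    sums, counts = {}, defaultdict(int)
    for features, label in labeled:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] += 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}


def predict_confidences(model, features):
    """Hypothetical scoring: softmax over negative distances to each centroid."""
    scores = {label: -math.dist(features, c) for label, c in model.items()}
    top = max(scores.values())
    exps = {label: math.exp(score - top) for label, score in scores.items()}
    total = sum(exps.values())
    return {label: value / total for label, value in exps.items()}


def generate_training_set(labeled, unlabeled, threshold=0.9):
    model = train_centroids(labeled)                        # train the initial model
    training_set = list(labeled)
    for features in unlabeled:
        confidences = predict_confidences(model, features)  # predict on unlabeled data
        pseudo_label = max(confidences, key=confidences.get)
        if confidences[pseudo_label] >= threshold:          # screen by confidence
            training_set.append((features, pseudo_label))   # merge into the training set
    return training_set


# Illustrative data only: two features per sample, two fault types.
labeled = [((0.0, 0.0), "normal"), ((0.2, 0.1), "normal"),
           ((5.0, 5.0), "short_circuit"), ((5.2, 4.9), "short_circuit")]
unlabeled = [(0.1, 0.1), (5.1, 5.0), (2.6, 2.5)]
training_set = generate_training_set(labeled, unlabeled)
```

In this sketch the two unlabeled samples lying close to a class centroid are added with pseudo labels, while the ambiguous sample midway between the two classes is screened out.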

It will be understood by those skilled in the art that all or part of the steps of the methods of the above embodiments may be performed by instructions or by associated hardware controlled by the instructions, which may be stored in a computer readable storage medium and loaded and executed by a processor.

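To this end, embodiments of the present invention provide a storage medium in which a plurality of instructions are stored, and the instructions can be loaded by a processor to execute the steps of any one of the sample generation methods provided by the embodiments of the present invention. For example, the instructions may perform the steps of:

collecting sample operation data of at least one photovoltaic module, wherein the sample operation data comprises labeled sample data with a label and unlabeled sample data without a label;

training an initial model according to the labeled sample data to obtain an initial fault diagnosis model;

inputting the unlabeled sample data into the initial fault diagnosis model for fault detection to obtain a prediction result corresponding to the unlabeled sample data;

screening the unlabeled sample data according to the prediction result corresponding to the unlabeled sample data to obtain target unlabeled data;

and obtaining a sample training set based on the target unlabeled data and the labeled sample data.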

The above operations can be implemented in the foregoing embodiments, and are not described in detail herein.

The storage medium may include: read-only memory (ROM), random access memory (RAM), magnetic disks, optical disks, and the like.

Since the instructions stored in the storage medium can execute the steps of any sample generation method provided by the embodiments of the present invention, they can achieve the beneficial effects of any such method; these effects are detailed in the foregoing embodiments and are not repeated here.

The sample generation method, sample generation device, electronic device, and storage medium provided by the embodiments of the present invention have been described in detail above. Specific examples are used herein to explain the principle and implementation of the present invention, and the description of the above embodiments is intended only to help in understanding the method and core concept of the present invention. Meanwhile, those skilled in the art may, following the idea of the present invention, vary the specific embodiments and the scope of application. In summary, the content of this specification should not be construed as limiting the present invention.

Claims

1. A method of sample generation, the method comprising:

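collecting sample operation data of at least one photovoltaic module, wherein the sample operation data comprises labeled sample data with a label and unlabeled sample data without a label;

training an initial model according to the labeled sample data to obtain an initial fault diagnosis model;

inputting the unlabeled sample data into the initial fault diagnosis model for fault detection to obtain a prediction result corresponding to the unlabeled sample data;

screening the unlabeled sample data according to the prediction result corresponding to the unlabeled sample data to obtain target unlabeled data;

and obtaining a sample training set based on the target unlabeled data and the labeled sample data.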

2. The sample generation method according to claim 1, wherein the screening the unlabeled sample data according to the prediction result corresponding to the unlabeled sample data to obtain target unlabeled data comprises:

determining a confidence threshold corresponding to each fault type according to the labels in the labeled sample data, wherein each label in the labeled sample data is a fault type label of the labeled sample data;

and screening the unlabeled sample data according to the confidence threshold corresponding to each fault type and the prediction result corresponding to the unlabeled sample data to obtain the target unlabeled data, wherein the prediction result characterizes a confidence that the unlabeled sample data belongs to each of the fault types.
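As a non-limiting illustration of the screening recited in claim 2, the following sketch keeps an unlabeled sample only when its highest confidence reaches the threshold of the corresponding fault type. The function name `screen_by_class_threshold`, the fault type names, and all threshold and confidence values are hypothetical; how the thresholds themselves are determined is not modeled here.

```python
def screen_by_class_threshold(predictions, thresholds):
    """Keep samples whose top-class confidence reaches that class's threshold.

    predictions: list of (sample_id, {fault_type: confidence}) pairs.
    thresholds:  {fault_type: confidence threshold}; values are illustrative.
    """
    selected = []
    for sample_id, confidences in predictions:
        fault_type = max(confidences, key=confidences.get)
        if confidences[fault_type] >= thresholds[fault_type]:
            selected.append((sample_id, fault_type))
    return selected


# Hypothetical thresholds: stricter for the well-represented fault type.
thresholds = {"normal": 0.95, "hot_spot": 0.80}
predictions = [
    ("s1", {"normal": 0.97, "hot_spot": 0.03}),
    ("s2", {"normal": 0.90, "hot_spot": 0.10}),
    ("s3", {"normal": 0.15, "hot_spot": 0.85}),
]
target_unlabeled = screen_by_class_threshold(predictions, thresholds)
```

Here sample `s2` is rejected because 0.90 falls below the 0.95 threshold assumed for its predicted fault type, whereas `s3` passes the lower threshold assumed for the rarer fault type.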

3. The sample generation method according to claim 2, wherein the determining a confidence threshold corresponding to each fault type according to the labels in the labeled sample data comprises:

determining, according to the labels in the labeled sample data, the number of fault types included in the labeled sample data and the labeled sample data volume corresponding to each fault type;

and determining the confidence threshold corresponding to each fault type according to the number of fault types, the labeled sample data volume corresponding to each fault type, and a theoretical sample data volume corresponding to each fault type.

4. The sample generation method according to any one of claims 1 to 3, wherein the obtaining a sample training set based on the target unlabeled data and the labeled sample data comprises:

determining, according to the labels in the labeled sample data, the labeled sample data volume corresponding to each fault type in the labeled sample data and the total sample data volume of the labeled sample data, wherein each label in the labeled sample data is a fault type label of the labeled sample data;

determining a sample expansion rate corresponding to each fault type according to the labeled sample data volume corresponding to each fault type, the total sample data volume of the labeled sample data, and a preset scale factor;

determining a pseudo label of the target unlabeled data according to the prediction result of the target unlabeled data, wherein the pseudo label is the fault type label predicted for the target unlabeled data;

selecting, from the target unlabeled data, predicted label data corresponding to each fault type according to the sample expansion rate corresponding to each fault type and the pseudo label of the target unlabeled data;

and obtaining the sample training set according to the predicted label data, the pseudo label of the predicted label data, and the labeled sample data.
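As a non-limiting illustration of the selection recited in claim 4, the following sketch derives an expansion rate per fault type and then picks the most confident pseudo-labeled samples for each type. The claim does not give the formula in this section, so the rule `scale_factor * (1 - n_c / N)`, the quota `round(rate * N)`, the function names, and all data values are assumptions made only for the example.

```python
from collections import Counter


def expansion_rates(labels, scale_factor):
    """Hypothetical rule (assumption): rarer fault types get a larger rate."""
    counts = Counter(labels)
    total = len(labels)
    return {fault: scale_factor * (1 - n / total) for fault, n in counts.items()}


def select_pseudo_labeled(target_unlabeled, rates, total):
    """Pick the most confident samples per pseudo label, up to round(rate * total)."""
    chosen = []
    for fault, rate in rates.items():
        quota = round(rate * total)  # assumed conversion from rate to sample count
        candidates = [(conf, sid) for sid, label, conf in target_unlabeled
                      if label == fault]
        candidates.sort(reverse=True)  # highest confidence first
        chosen.extend((sid, fault) for _, sid in candidates[:quota])
    return chosen


# Illustrative labeled set: 8 "normal" samples and 2 "hot_spot" samples.
labels = ["normal"] * 8 + ["hot_spot"] * 2
rates = expansion_rates(labels, scale_factor=0.5)

# (sample id, pseudo label, confidence) for data that already passed screening.
target_unlabeled = [("u1", "normal", 0.99), ("u2", "normal", 0.97),
                    ("u3", "hot_spot", 0.91), ("u4", "hot_spot", 0.88)]
chosen = select_pseudo_labeled(target_unlabeled, rates, total=len(labels))
```

Under these assumed values the majority fault type receives a quota of one sample and the minority fault type a quota of four, so the expansion favors the under-represented type.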

5. The sample generation method according to claim 1, wherein the obtaining a sample training set based on the target unlabeled data and the labeled sample data comprises:

obtaining an initial sample training set based on the target unlabeled data and the labeled sample data;

training the initial fault diagnosis model according to the initial sample training set to obtain an intermediate fault diagnosis model;

inputting the remaining unlabeled sample data, other than the target unlabeled data, in the unlabeled sample data into the intermediate fault diagnosis model to obtain a prediction result corresponding to the remaining unlabeled sample data;

screening the remaining unlabeled sample data according to the prediction result corresponding to the remaining unlabeled sample data to obtain remaining target unlabeled data;

and obtaining the sample training set based on the remaining target unlabeled data and the initial sample training set.

6. The sample generation method according to claim 5, wherein the obtaining the sample training set based on the remaining target unlabeled data and the initial sample training set comprises:

acquiring the number of training rounds of the intermediate fault diagnosis model;

if the number of training rounds of the intermediate fault diagnosis model meets a preset round number threshold, obtaining the sample training set according to the remaining target unlabeled data and the initial sample training set;

and if the number of training rounds of the intermediate fault diagnosis model does not meet the preset round number threshold, determining whether the number of training rounds of the intermediate fault diagnosis model meets a preset round number interval, and determining the sample training set according to the result of that determination.

7. The sample generation method according to claim 6, wherein the determining the sample training set according to the result of determining whether the number of training rounds of the intermediate fault diagnosis model meets the preset round number interval comprises:

if the number of training rounds of the intermediate fault diagnosis model meets the preset round number interval, inputting the unlabeled sample data into the intermediate fault diagnosis model to obtain second target unlabeled data, obtaining a new initial sample training set based on the second target unlabeled data and the labeled sample data, and training the intermediate fault diagnosis model according to the new initial sample training set until the number of training rounds of the intermediate fault diagnosis model is greater than or equal to the preset round number threshold, to obtain the sample training set;

if the number of training rounds of the intermediate fault diagnosis model does not meet the preset round number interval, updating the initial sample training set based on the remaining target unlabeled data to obtain a new initial sample training set, training the intermediate fault diagnosis model according to the new initial sample training set, and, when the number of training rounds of the intermediate fault diagnosis model meets the preset round number interval, inputting the unlabeled sample data into the intermediate fault diagnosis model meeting the preset round number interval, until the number of training rounds of the intermediate fault diagnosis model is greater than or equal to the preset round number threshold, to obtain the sample training set.
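As a non-limiting illustration of the iteration recited in claims 5 to 7, the following sketch alternates between ordinary rounds, which add newly accepted samples from the remaining unlabeled data, and refresh rounds, which re-screen all unlabeled data and rebuild the training set from the labeled data. Reading "meets the preset round number interval" as "the round number is a multiple of `refresh_every`" is an assumption, as are the one-dimensional stand-in model, the `margin` screening rule, and every name and value in the example.

```python
def train(training_set):
    """Stand-in model (assumption): mean feature value per fault type, 1-D."""
    groups = {}
    for value, label in training_set:
        groups.setdefault(label, []).append(value)
    return {label: sum(vals) / len(vals) for label, vals in groups.items()}


def screen(model, samples, margin=1.0):
    """Keep samples within `margin` of the nearest class mean and pseudo-label them."""
    selected = []
    for value in samples:
        label = min(model, key=lambda name: abs(value - model[name]))
        if abs(value - model[label]) <= margin:
            selected.append((value, label))
    return selected


def iterative_expansion(labeled, unlabeled, max_rounds=4, refresh_every=2):
    model = train(labeled)
    training_set = list(labeled)
    remaining = list(unlabeled)
    for round_no in range(1, max_rounds + 1):
        if round_no % refresh_every == 0:
            # Refresh round: re-screen all unlabeled data, rebuild from labeled data.
            training_set = list(labeled) + screen(model, unlabeled)
        else:
            # Ordinary round: add only newly accepted samples from the remainder.
            training_set = training_set + screen(model, remaining)
        picked = {value for value, _ in training_set[len(labeled):]}
        remaining = [value for value in unlabeled if value not in picked]
        model = train(training_set)  # retrain the intermediate model
    return training_set, model


# Illustrative one-feature data with two fault types.
labeled = [(0.0, "normal"), (10.0, "fault")]
unlabeled = [0.8, 1.3, 9.5, 5.0]
training_set, model = iterative_expansion(labeled, unlabeled)
```

In this sketch the sample 1.3 is rejected in the first round but accepted at the first refresh round, after the retrained model has moved toward it, while the ambiguous sample 5.0 is never accepted.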

8. A sample generation device, the device comprising:

a collecting module, configured to collect sample operation data of at least one photovoltaic module, wherein the sample operation data comprises labeled sample data with a label and unlabeled sample data without a label;

a screening module, configured to screen the unlabeled sample data according to the prediction result corresponding to the unlabeled sample data to obtain target unlabeled data;

9. An electronic device, characterized in that the electronic device comprises:

at least one processor; and

a memory storing instructions executable by the at least one processor, the instructions enabling the at least one processor to perform the sample generation method of any one of claims 1 to 7.

10. A storage medium storing a plurality of instructions for causing a computer to perform the sample generation method according to any one of claims 1 to 7.