CN113077038A - Industrial data feature selection method and device, computer equipment and storage medium

Info

Publication number: CN113077038A
Application number: CN202110349281.1A
Authority: CN
Inventors: 姜善成; 邓乐平; 韩瑜; 古博; 丁北辰
Original assignee: Sun Yat Sen University
Current assignee: Sun Yat Sen University
Priority date: 2021-03-31
Filing date: 2021-03-31
Publication date: 2021-07-06

Abstract

The application relates to the technical field of industrial big data processing and provides an industrial data feature selection method and device, computer equipment and a storage medium. In the method, for the features whose gene codes differ between the first parent individual and the second parent individual, the codes of the offspring are determined according to the relative prediction accuracy of the two parents, so that the genes of the parent with higher prediction accuracy are more likely to be inherited by the offspring. The offspring thus obtains better genes as far as possible, the whole population is optimized toward better solutions more quickly, the optimization speed is improved while a certain flexibility is retained, and the key features in the industrial data are extracted quickly and effectively.

Description

Industrial data feature selection method and device, computer equipment and storage medium

Technical Field

The present application relates to the field of data processing technology of industrial big data, and in particular, to a method and an apparatus for selecting characteristics of industrial data, a computer device, and a storage medium.

Background

A large amount of data is generated in industrial scenarios such as the non-ferrous metal smelting process. The data includes an enterprise's internal manufacturing system data and external data, and is characterized by large volume, multiple sources, continuous sampling, low value density, high complexity and strong dynamics. These characteristics distinguish it from data streams such as Internet data, and make both the analysis difficulty and the required analysis precision relatively high. The multiple sources and the complexity lead to high dimensionality, so dimensionality reduction is needed in order to analyze and predict enterprise big data accurately.

Feature selection is a common method to reduce data dimensionality. The principle is that a new low-dimensional space is constructed by utilizing an original feature space, redundant features and irrelevant features are eliminated, and the dimensionality of data is effectively reduced. Feature selection is the process of selecting key feature subsets that can efficiently describe the input data while reducing the effects of noise or uncorrelated variables, providing good predictive results.

However, industrial data places high demands on the real-time performance and prediction accuracy of processing and analysis, and traditional feature extraction methods such as principal component analysis, linear discriminant analysis and partial least squares cannot meet these demands. It is therefore necessary to design a fast and effective feature selection method that accurately extracts the key features in the industrial data, i.e., selects the optimal feature subset. In addition, extracting the key features reduces the dimensionality of the data and lowers the computational requirements. This is a key link in establishing a service data cross-domain integration architecture based on a Service-Oriented Architecture (SOA) and in realizing cross-domain data fusion and integration for the non-ferrous metal smelting process industry.

Disclosure of Invention

In view of the above, it is necessary to provide an industrial data feature selection method, apparatus, computer device and storage medium for solving the above technical problems.

In a first aspect, an industrial data feature selection method is provided, and the method includes:

selecting a plurality of feature subsets from a feature set to be selected of the industrial data by using a feature selection mode represented by a plurality of individuals contained in the initial population; the initial population is used for characteristic screening;

selecting a first parent individual and a second parent individual from the plurality of individuals according to the prediction accuracy of each feature subset with respect to a set prediction result;

when the gene codes of the first parent individual and the second parent individual are the same, retaining that gene code as the gene code of the offspring, and when the gene codes of the first parent individual and the second parent individual are different, retaining the gene code of the parent individual with higher prediction accuracy as the gene code of the offspring, so as to obtain one offspring;

updating the initial population according to the prediction accuracy corresponding to the new feature subset selected based on one child;

and selecting a target feature set of the industrial data from the feature sets to be selected based on the updated initial population.

In one embodiment, before selecting the first parent individual and the second parent individual from the plurality of individuals according to the prediction accuracy of each feature subset with respect to a set prediction result, the method further includes:

and inputting the plurality of feature subsets into a pre-constructed neural network, so that the neural network outputs the prediction accuracy corresponding to each feature subset.

In one embodiment, the neural network comprises a plurality of hidden layers, the activation function of the neural network is the rectified linear unit (ReLU), and the parallel computation of stochastic gradient descent for the neural network is implemented based on the Hogwild! algorithm.

In one embodiment, inputting a plurality of feature subsets into a pre-constructed neural network comprises:

and binary coding the features in the feature subsets and inputting the features into the neural network.

In one embodiment, selecting a first parent individual and a second parent individual from the plurality of individuals according to the prediction accuracy of each feature subset with respect to a set prediction result includes:

based on a tournament selection strategy, obtaining the first parent individual and the second parent individual in sequence, where each parent is obtained by randomly selecting at least two individuals from the initial population and taking the individual with the highest prediction accuracy among them as the parent individual.

In one embodiment, updating the initial population according to the prediction accuracy corresponding to the new feature subset selected based on one child includes:

selecting a new characteristic subset from the characteristic set to be selected by utilizing the characteristic selection mode represented by the filial generation;

calculating the prediction accuracy of the filial generation according to the new characteristic subset;

and if the prediction accuracy of the filial generation is greater than the average prediction accuracy of the initial population, randomly replacing one individual with the accuracy lower than that of the filial generation in the initial population by the filial generation.

In one embodiment, before selecting a plurality of feature subsets from a candidate feature set of the industrial data by using a feature selection manner characterized by a plurality of individuals included in the initial population, the method further includes:

and converting the industrial data to be processed into a feature vector, and preprocessing the feature vector to obtain a feature set to be selected.

In a second aspect, an industrial data feature selection apparatus is provided, the apparatus comprising:

the characteristic subset selection module is used for selecting a plurality of characteristic subsets from a characteristic set to be selected of the industrial data by utilizing a characteristic selection mode represented by a plurality of individuals contained in the initial population; the initial population is used for characteristic screening;

the parent selection module is used for selecting a first parent individual and a second parent individual from the plurality of individuals according to the prediction accuracy of each feature subset with respect to a set prediction result;

the offspring generation module is used for retaining, when the gene codes of the first parent individual and the second parent individual are the same, that gene code as the gene code of the offspring, and retaining, when the gene codes of the first parent individual and the second parent individual are different, the gene code of the parent individual with higher prediction accuracy as the gene code of the offspring, so as to obtain one offspring;

the population optimization module is used for updating the initial population according to the prediction accuracy corresponding to the new feature subset selected based on one filial generation;

and the characteristic selection module is used for selecting a target characteristic set of the industrial data from the characteristic sets to be selected based on the updated initial population.

In a third aspect, there is provided a computer device comprising a memory storing a computer program and a processor implementing the steps of the method according to any one of the first aspect as described above when the processor executes the computer program.

In a fourth aspect, there is provided a computer readable storage medium having stored thereon a computer program which, when executed by a processor, carries out the steps of the method of any of the first aspects described above.

According to the industrial data feature selection method, the device, the computer equipment and the storage medium, for the features whose gene codes differ between the first parent individual and the second parent individual, the codes of the offspring are determined according to the relative prediction accuracy of the two parents, so that the genes of the parent with higher prediction accuracy are more likely to be inherited by the offspring. The offspring thus obtains better genes as far as possible, the whole population is optimized toward better solutions, the optimization speed is improved while a certain flexibility is retained, and the key features in the industrial data are extracted quickly and effectively.

Drawings

FIG. 1 is a schematic flow diagram of a method for industrial data selection in one embodiment;

FIG. 2 is a schematic flow diagram of a method for industrial data selection in one embodiment;

FIG. 3 is a flowchart illustrating the step of updating the initial population according to the prediction accuracy corresponding to the new feature subset selected based on one child under one embodiment;

FIG. 4 is a schematic flow diagram of a method for industrial data selection in one embodiment;

FIG. 5 is a schematic flow diagram of a method for industrial data selection in one embodiment;

FIG. 6 is a detailed flow diagram of a method for industrial data selection in one embodiment;

FIG. 7 is a block diagram of an industrial data selection device in one embodiment;

FIG. 8 is a diagram illustrating an internal structure of a computer device according to an embodiment.

Detailed Description

In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.

In an embodiment, as shown in fig. 1, an industrial data feature selection method is provided, and this embodiment is illustrated by applying the method to a terminal, and it is to be understood that the method may also be applied to a server, and may also be applied to a system including a terminal and a server, and is implemented by interaction between the terminal and the server. In this embodiment, the method includes the steps of:

Step S102, selecting a plurality of feature subsets from a feature set to be selected of industrial data by using a feature selection mode represented by a plurality of individuals included in an initial population; the initial population is used for feature screening.

The initial population is randomly generated and may include m individuals, each with n genes. The initial population serves as the initial feature space, and the features are encoded in 0-1 binary form: a gene with a value of "1" indicates that the corresponding input feature is selected, and "0" indicates that it is not selected. Thus, after binary encoding, the gene structure of each individual of the initial population can be represented by an n-bit binary string.

Specifically, each individual in the initial population is regarded as a feature selection mode, and features in the feature set to be selected are respectively selected to obtain feature subsets corresponding to different individuals.
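As an illustration of step S102, the following is a minimal sketch of generating a random 0-1 population and using each individual as a column mask. The function names `init_population` and `select_subset`, the use of NumPy, the toy data, and the rule that forbids all-zero individuals are assumptions made for this example and are not taken from the application.

```python
import numpy as np

def init_population(m, n, rng):
    """Return an m x n array of 0/1 genes; each row is one individual."""
    pop = rng.integers(0, 2, size=(m, n))
    # Assumption: an individual that selects no feature is not useful,
    # so one randomly chosen gene is switched on in any all-zero row.
    for row in pop:
        if row.sum() == 0:
            row[rng.integers(n)] = 1
    return pop

def select_subset(X, individual):
    """Keep the columns of X whose gene is 1."""
    return X[:, individual.astype(bool)]

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 6))                 # 5 samples, n = 6 candidate features
population = init_population(m=4, n=6, rng=rng)
subsets = [select_subset(X, ind) for ind in population]
```

Each entry of `subsets` has as many columns as its individual has genes equal to 1.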

Step S104, selecting a first parent individual and a second parent individual from the plurality of individuals according to the prediction accuracy of each feature subset with respect to a set prediction result.

In an alternative embodiment of the present application, the prediction accuracy is measured by the root mean square error (RMSE) of the set prediction result, calculated as:

$$\mathrm{RMSE} = \sqrt{\frac{1}{s}\sum_{i=1}^{s}\left(y_i - \hat{y}_i\right)^2}$$

where $y_i$ and $\hat{y}_i$ respectively denote the actual target value and the predicted target value of the i-th sample, and $s$ denotes the number of samples. A smaller RMSE corresponds to a higher prediction accuracy.
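The RMSE measure can be sketched in a few lines. The function name `rmse` and the sample values are illustrative only.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean square error between actual and predicted target values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Three samples; only the last prediction is off, by 2.
score = rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])   # sqrt(4 / 3)
```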

Two parent individuals are selected from the m individuals of the initial population using the above prediction accuracy.

Step S106, when the gene codes of the first parent individual and the second parent individual are the same, retaining that gene code as the gene code of the offspring, and when the gene codes of the first parent individual and the second parent individual are different, retaining the gene code of the parent individual with higher prediction accuracy as the gene code of the offspring, so as to obtain one offspring.

The higher the prediction accuracy of a parent individual, the higher the probability that its gene (0 or 1) is passed on to the offspring. Specifically, the gene codes of the first parent individual and the second parent individual are compared gene by gene, starting from the first gene. When $P_1[i] = P_2[i]$, $C[i] = P_1[i]$; otherwise, $C[i]$ is determined according to the following formula:

where $P[i]$ denotes the i-th gene (0 or 1) of individual $P$, $f_P$ denotes the RMSE value of individual $P$, and $C$ denotes the offspring produced.
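One possible reading of this crossover can be sketched as follows. Shared genes are copied unchanged; for differing genes, the sketch assumes that the gene of $P_1$ is taken with probability $f_{P_2} / (f_{P_1} + f_{P_2})$, so that the parent with the lower RMSE is favoured. This weighting, the function name `fitness_crossover` and the tie-breaking value 0.5 are assumptions for illustration and should not be read as the exact rule of the application.

```python
import numpy as np

def fitness_crossover(p1, p2, f_p1, f_p2, rng):
    """Build one child from two 0/1 parents with RMSE values f_p1 and f_p2."""
    p1 = np.asarray(p1)
    p2 = np.asarray(p2)
    total = f_p1 + f_p2
    # Assumed weighting: the lower a parent's RMSE, the more likely its gene wins.
    prob_p1 = 0.5 if total == 0 else f_p2 / total
    take_p1 = rng.random(p1.shape) < prob_p1
    # Genes on which both parents agree are inherited unchanged.
    return np.where(p1 == p2, p1, np.where(take_p1, p1, p2))

rng = np.random.default_rng(0)
p1 = np.array([1, 0, 1, 0, 1])
p2 = np.array([1, 1, 0, 0, 0])
child = fitness_crossover(p1, p2, f_p1=0.2, f_p2=0.5, rng=rng)
```

Because the choice is random rather than fixed, the child can differ from both parents, which keeps some diversity in the population.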

Step S108, updating the initial population according to the prediction accuracy corresponding to the new feature subset selected based on the one offspring.

The newly generated offspring yields a new feature subset; the corresponding prediction accuracy is calculated from this subset, and whether to update and optimize the initial population is decided based on that accuracy. In an optional embodiment of the present application, the offspring is checked for duplication against the existing individuals in the initial population, and its prediction accuracy is compared with the average prediction accuracy of the initial population, to determine whether to replace an individual in the initial population with the offspring; if not, a new offspring is generated, until the update condition is met.

Step S110, selecting a target feature set of the industrial data from the feature set to be selected based on the updated initial population.

Steps S104, S106 and S108 are performed repeatedly, so that new offspring are continuously generated and the population is continuously updated. When the maximum number of iterations or the target optimization value is reached, that is, when the number of iterations or the average fitness value of the population reaches the expected threshold, the final population is obtained. Each individual in the final population represents a feature selection mode; in general, the features selected by most individuals are taken as the optimal feature subset, that is, the key features in the industrial big data.
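Taking "the features selected by most individuals" can be sketched as a per-feature vote over the final population. The strict majority threshold (more than half of the individuals) and the name `majority_features` are assumptions for this example; the text does not fix a threshold.

```python
import numpy as np

def majority_features(population):
    """Indices of features whose gene is 1 in more than half of the individuals."""
    population = np.asarray(population)
    votes = population.sum(axis=0)
    return np.flatnonzero(votes > population.shape[0] / 2)

final_population = np.array([
    [1, 0, 1, 0],
    [1, 1, 1, 0],
    [1, 0, 0, 0],
    [0, 0, 1, 1],
])
selected = majority_features(final_population)   # features 0 and 2 each get 3 of 4 votes
```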

In the industrial data feature selection method, for the features whose gene codes differ between the first parent individual and the second parent individual, the codes of the offspring are determined according to the relative prediction accuracy of the two parents, so that the higher the prediction accuracy of a parent, the more likely its genes are inherited by the offspring. The offspring thus obtains better genes as far as possible, the whole population is optimized toward better solutions, the optimization speed is improved while a certain flexibility is retained, and the key features in the industrial data can be extracted quickly and effectively.

In one embodiment, as shown in fig. 2, before selecting the first parent individual and the second parent individual from the plurality of individuals according to the prediction accuracy of each feature subset with respect to a set prediction result, the industrial data feature selection method further includes:

Step S202, inputting the plurality of feature subsets into a pre-constructed neural network, so that the neural network outputs the prediction accuracy corresponding to each feature subset.

The prediction results of the neural network are used to evaluate the performance of each feature subset. In an alternative embodiment of the present application, the features whose gene code is 1 are retained as input elements of the neural network.

In the embodiment, the neural network is used as a training method and an evaluation index of the feature subset, so that the algorithm efficiency is improved, and the prediction error is reduced.

Specifically, the neural network is configured as follows:

a. the neural network has 3 hidden layers, since an overly complex network would slow down computation;

b. the activation function is the rectified linear unit (ReLU), defined as:

f(x) = max(0, x)

c. the Hogwild! algorithm is used to parallelize stochastic gradient descent.

Combining these three configurations greatly improves training efficiency and makes it possible to meet the real-time training and monitoring requirements of industrial big data.
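The network shape described above can be sketched as a forward pass with three ReLU hidden layers and a single regression output. The hidden layer width of 16, the weight initialization and the function names are assumptions for this example; training, and in particular Hogwild!-style parallel stochastic gradient descent, is not shown.

```python
import numpy as np

def relu(x):
    """Rectified linear unit: f(x) = max(0, x)."""
    return np.maximum(0.0, x)

def init_mlp(n_inputs, hidden=(16, 16, 16), rng=None):
    """Create (weight, bias) pairs for 3 hidden layers plus one output unit."""
    rng = np.random.default_rng(0) if rng is None else rng
    sizes = (n_inputs, *hidden, 1)
    return [(rng.normal(scale=0.1, size=(a, b)), np.zeros(b))
            for a, b in zip(sizes[:-1], sizes[1:])]

def forward(params, X):
    """Apply ReLU after every hidden layer; the output layer is linear."""
    h = X
    for W, b in params[:-1]:
        h = relu(h @ W + b)
    W, b = params[-1]
    return (h @ W + b).ravel()

params = init_mlp(n_inputs=4)
predictions = forward(params, np.ones((3, 4)))   # one prediction per sample
```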

In one embodiment, inputting a plurality of feature subsets into a pre-constructed neural network comprises: binary coding the features in the feature subsets and inputting them into the neural network. Specifically, the features whose gene code is 1 are retained as input elements of the neural network.

In one embodiment, selecting a first parent individual and a second parent individual from the plurality of individuals according to the prediction accuracy of each feature subset with respect to a set prediction result includes: based on a tournament selection strategy, obtaining the first parent individual and the second parent individual in sequence, where each parent is obtained by randomly selecting at least two individuals from the initial population and taking the individual with the highest prediction accuracy among them as the parent individual.

For example, 2 individuals are randomly selected from the initial population, and the one with the smaller RMSE value is retained as the first parent individual $P_1$; this step is then repeated to obtain the second parent individual $P_2$.
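The tournament step can be sketched as follows. The function name `tournament_select`, the tournament size parameter `k` and the RMSE values are illustrative; the sketch also allows the two tournaments to return the same individual, which the text does not explicitly rule out.

```python
import numpy as np

def tournament_select(rmse_values, rng, k=2):
    """Draw k distinct individuals at random and return the index of the one with the lowest RMSE."""
    candidates = rng.choice(len(rmse_values), size=k, replace=False)
    scores = np.asarray(rmse_values)[candidates]
    return int(candidates[np.argmin(scores)])

rng = np.random.default_rng(1)
rmse_values = [0.9, 0.4, 0.7, 0.2, 0.6]   # one RMSE per individual
p1_index = tournament_select(rmse_values, rng)   # first parent
p2_index = tournament_select(rmse_values, rng)   # second parent
```

With a tournament size of 2, the worst individual in the population can never be chosen as a parent, because it loses against any opponent.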

In one embodiment, as shown in fig. 3, updating the initial population according to the prediction accuracy corresponding to the new feature subset selected based on the one offspring includes: Step S302, selecting a new feature subset from the feature set to be selected by using the feature selection mode represented by the offspring; Step S304, calculating the prediction accuracy of the offspring according to the new feature subset; and Step S306, if the prediction accuracy of the offspring is greater than the average prediction accuracy of the initial population, randomly replacing with the offspring one individual in the initial population whose prediction accuracy is lower than that of the offspring.

Firstly, a new feature subset is selected from the candidate feature set of the industrial data by taking a newly generated descendant as a new feature selection mode. The feature with gene code 0 is discarded and the feature with gene code 1 is retained as input to the neural network to calculate the RMSE value of the offspring.

Then, whether to update the optimized population is judged. Specifically, whether the offspring is duplicated with the individuals in the initial population or not is compared, and whether the prediction accuracy of the offspring is improved or not is judged, namely whether the RMSE value of the offspring is smaller than the average RMSE value of the initial population or not is judged.

Finally, when the offspring does not duplicate any individual in the initial population and its prediction accuracy is improved, the offspring randomly replaces one individual in the initial population whose RMSE is greater than the average RMSE. Otherwise, a new offspring is generated, until the update condition is satisfied.
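The update rule can be sketched as a single function that either inserts the offspring or rejects it. The name `try_update`, the in-place modification of the arrays, and the handling of the case where no individual is above the average are assumptions for this example.

```python
import numpy as np

def try_update(population, rmse_values, child, child_rmse, rng):
    """Insert the child in place if it is new and better than average; return True on success.

    population is a 2-D 0/1 array and rmse_values a 1-D float array; both are modified in place.
    """
    if any(np.array_equal(child, ind) for ind in population):
        return False                      # duplicate of an existing individual
    mean_rmse = rmse_values.mean()
    if child_rmse >= mean_rmse:
        return False                      # no improvement over the population average
    worse = np.flatnonzero(rmse_values > mean_rmse)
    if worse.size == 0:
        return False                      # assumption: nothing above average to replace
    idx = rng.choice(worse)               # random individual with above-average RMSE
    population[idx] = child
    rmse_values[idx] = child_rmse
    return True

rng = np.random.default_rng(0)
population = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1]])
rmse_values = np.array([0.2, 0.5, 0.8])
inserted = try_update(population, rmse_values, np.array([1, 1, 0]), 0.3, rng)
```

In this toy population the average RMSE is 0.5, so only the third individual (RMSE 0.8) is eligible for replacement.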

In an embodiment, as shown in fig. 4, before selecting a plurality of feature subsets from a candidate feature set of the industrial data by using a feature selection manner characterized by a plurality of individuals included in the initial population, the method further includes:

Step S402, converting the industrial data to be processed into feature vectors, and preprocessing the feature vectors to obtain the feature set to be selected.

First, the collected data is converted into feature vectors. The feature vector of data sample $i$ is denoted $x_i = (x_{i1}, x_{i2}, \ldots, x_{ik}, \ldots, x_{in})$, and all $s$ samples form the sample set $T = \{(x_1, y_1), (x_2, y_2), \ldots, (x_i, y_i), \ldots, (x_s, y_s)\}$, where $n$ is the total number of features, $(x_i, y_i)$ is a sample point, and $x_{ik}$ is a feature point representing the k-th feature of sample $i$.

Then, character-type features without an ordering are processed by one-hot encoding, character-type features with an ordering are processed by numeric encoding, and continuous features are processed by numerical standardization. Numerical standardization prevents a feature from carrying too much or too little weight merely because of its scale: the data is transformed to have a mean of 0 and a standard deviation of 1, according to:

$$x_{ik\_std} = \frac{x_{ik} - \mathrm{mean}(k)}{\sigma(k)}$$

where mean(k) is the mean of the feature column corresponding to the k-th feature in the sample set, σ(k) is the standard deviation of that feature column, and $x_{ik\_std}$ is the standardized value of the k-th feature of sample $i$.
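The three preprocessing rules can be sketched on a toy data set. The column names, category values and their ordering are hypothetical, and the sketch assumes the population standard deviation and a non-constant continuous column.

```python
import numpy as np

def one_hot(column):
    """One-hot encode an unordered categorical column (categories sorted alphabetically)."""
    categories = sorted(set(column))
    return np.array([[1.0 if v == c else 0.0 for c in categories] for v in column])

def ordinal(column, order):
    """Map an ordered categorical column to 0, 1, 2, ... following the given order."""
    rank = {v: i for i, v in enumerate(order)}
    return np.array([float(rank[v]) for v in column])

def standardize(column):
    """Transform a continuous column to mean 0 and standard deviation 1."""
    column = np.asarray(column, dtype=float)
    return (column - column.mean()) / column.std()

furnace = ["A", "B", "A", "C"]                 # hypothetical unordered category
grade = ["low", "high", "medium", "low"]       # hypothetical ordered category
temperature = [1180.0, 1210.0, 1195.0, 1225.0] # hypothetical continuous reading

features = np.column_stack([
    one_hot(furnace),
    ordinal(grade, order=["low", "medium", "high"]),
    standardize(temperature),
])
```

The resulting matrix has three one-hot columns, one ordinal column and one standardized column per sample.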

In a specific implementation scenario of the application, the industrial data to be processed may include data provided by a non-ferrous metal processing enterprise. For example, the enterprise uses deployed digital tools and devices such as sensors, smart cameras, radio frequency identification and gateways, integrated with key technologies such as high-temperature heat pipes, image recognition and voice recognition, to collect production-site data, including device data, product identification data and factory environment data, together with related data such as the production process and the operating data and status of key equipment. The industrial data to be processed may also include data collected by the control systems of key non-ferrous smelting processes, such as pyrometallurgical smelting, waste heat boilers, flue gas dust collection, high-temperature smoke exhaust, pipeline dissolution, and extraction and separation.

All collected data can be classified into the following feature categories: general instrument features, such as online material composition detection, online flue gas composition detection, product appearance quality inspection, spectrum analyzers and fluorescence analyzers; pyrometallurgical acquisition features, such as furnace bath height detection, furnace and kiln thermal field image recognition, automatic solid material sampling and analysis, automatic stockpile shape monitoring, online melt temperature detection, online hearth temperature detection and online melt composition detection; hydrometallurgical acquisition features, such as autoclave liquid level detection, online analysis of solution/slurry composition, online pH detection, online potential detection, pipeline scale thickness determination, online particle size analyzers, mud layer detectors, online turbidimeters and pipeline wear detectors; electrometallurgical acquisition features, such as online cathode temperature detection, online cathode and anode current detection, online aluminum liquid and electrolyte level detection devices, online electrolytic cell temperature measurement devices, thermal imaging analyzers and automatic identification of electrode plate short circuits in electrolytic cells; and other features.

As shown in fig. 5, the method for selecting characteristics of industrial data of the present application may include two stages:

In the first stage, enterprise data is collected and preprocessed; specifically, character-type features without an ordering are processed by one-hot encoding, character-type features with an ordering are processed by numeric encoding, and continuous features are processed by numerical standardization.

In the second stage, feature selection based on the fitness value is adopted as the optimization strategy for the feature subset, a genetic algorithm is used as the main body of the feature optimization method, and a neural network is used as the training method and evaluation index for the feature subset. Specifically, the feature subsets obtained from the initialized population undergo fitness-based feature selection, the population is updated accordingly, and the optimal feature subset is obtained when the termination condition is met. The two stages jointly form the industrial data feature selection method, and the optimal feature subset is finally obtained by iteratively updating the population.

In one embodiment, as shown in fig. 6, a specific process of the method for selecting characteristics of industrial data provided by the present application includes:

The first step: importing the data collected by the enterprise, including but not limited to data collected by the control systems of key non-ferrous smelting processes such as pyrometallurgical smelting, waste heat boilers, flue gas dust collection, high-temperature smoke exhaust, pipeline dissolution, and extraction and separation;

The second step: extracting the various feature data according to the different data types, and preprocessing the data;

The third step: generating an initial population as the initial feature space, and encoding the features in 0-1 binary form;

The fourth step: taking each individual in the population generated in the third step as a feature selection mode, selecting from the features preprocessed in the second step to obtain different feature subsets, and using the prediction accuracy of a neural network both as the fitness function of the genetic algorithm and as the performance measure of the feature selection result;

The fifth step: selecting two parents from the population by tournament selection, and generating a new offspring through a crossover operation based on the fitness values;

The sixth step: using the generated offspring as a new feature selection mode to obtain a new feature subset, calculating the fitness value of this feature subset, and deciding whether to update the population;

The seventh step: repeating the fifth step and the sixth step until the maximum number of iterations is reached or the termination condition of the algorithm is met; the features selected by the individuals in the final population form the optimal feature subset.
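The third to seventh steps can be put together in one small loop. This sketch makes several substitutions that are assumptions, not the method of the application: the fitness is the RMSE of an ordinary least-squares fit (a cheap stand-in for the neural network), the crossover takes a differing gene from a parent with a probability weighted by the other parent's RMSE, the final subset is a strict majority vote, and the data is synthetic.

```python
import numpy as np

def fitness(X, y, individual):
    """RMSE of a least-squares fit on the selected columns (stand-in for the neural network)."""
    mask = individual.astype(bool)
    if not mask.any():
        return float(np.sqrt(np.mean((y - y.mean()) ** 2)))
    A = np.column_stack([X[:, mask], np.ones(len(y))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return float(np.sqrt(np.mean((y - A @ coef) ** 2)))

def tournament(scores, rng):
    """Binary tournament: return the index with the lower RMSE of two random individuals."""
    a, b = rng.choice(len(scores), size=2, replace=False)
    return a if scores[a] <= scores[b] else b

def run(X, y, m=10, max_iter=200, seed=0):
    rng = np.random.default_rng(seed)
    n = X.shape[1]
    pop = rng.integers(0, 2, size=(m, n))                 # third step
    scores = np.array([fitness(X, y, ind) for ind in pop])  # fourth step
    for _ in range(max_iter):                             # seventh step
        i, j = tournament(scores, rng), tournament(scores, rng)  # fifth step
        total = scores[i] + scores[j]
        prob_i = 0.5 if total == 0 else scores[j] / total  # assumed weighting
        child = np.where(rng.random(n) < prob_i, pop[i], pop[j])
        if any(np.array_equal(child, ind) for ind in pop):
            continue                                      # duplicate: try again
        child_score = fitness(X, y, child)                # sixth step
        mean_score = scores.mean()
        worse = np.flatnonzero(scores > mean_score)
        if child_score < mean_score and worse.size:
            k = rng.choice(worse)
            pop[k], scores[k] = child, child_score
    votes = pop.sum(axis=0)
    return np.flatnonzero(votes > m / 2), pop, scores

data_rng = np.random.default_rng(42)
X = data_rng.normal(size=(200, 8))
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + 0.1 * data_rng.normal(size=200)
selected, pop, scores = run(X, y)
```

Every accepted replacement swaps an above-average RMSE for a below-average one, so the average RMSE of the population can only decrease over the iterations.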

Compared with the prior art, the invention has the beneficial effects that:

(1) the invention can provide a key feature extraction strategy for other algorithms;

(2) the method combines the advantages of a genetic algorithm and a deep neural network algorithm, improves the algorithm efficiency and reduces the prediction error;

(3) the invention improves the traditional genetic algorithm, so that the offspring retains the joint information of feature combinations.

It should be understood that although the steps in the flow charts of fig. 1-6 are shown in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated herein, the steps are not strictly limited to the order shown and may be performed in other orders. Moreover, at least some of the steps in fig. 1-6 may include multiple sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times, and which are not necessarily performed in sequence but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.

In one embodiment, as shown in fig. 7, there is provided an industrial data feature selection apparatus including: a feature subset selection module 702, a parent selection module 704, a child generation module 706, a population optimization module 708, and a feature selection module 710, wherein:

a feature subset selection module 702, configured to select, by using a feature selection manner represented by a plurality of individuals included in the initial population, a plurality of feature subsets from a feature set to be selected of the industrial data; the initial population is used for characteristic screening;

a parent selection module 704, configured to select a first parent individual and a second parent individual from the plurality of individuals according to the prediction accuracy of each feature subset with respect to a set prediction result;

an offspring generation module 706, configured to retain, when the gene codes of the first parent individual and the second parent individual are the same, that gene code as the gene code of the offspring, and to retain, when the gene codes of the first parent individual and the second parent individual are different, the gene code of the parent individual with higher prediction accuracy as the gene code of the offspring, so as to obtain one offspring;

a population optimization module 708, configured to update the initial population according to a prediction accuracy corresponding to a new feature subset selected based on one child;

and the feature selection module 710 is configured to select a target feature set of the industrial data from the feature sets to be selected based on the updated initial population.

In one embodiment, the industrial data feature selection apparatus further comprises:

a prediction accuracy calculation module, configured to input the plurality of feature subsets into a pre-constructed neural network, so that the neural network outputs the prediction accuracy corresponding to each feature subset.

In one embodiment, the prediction accuracy calculation module is further configured to binary-encode the features in the feature subsets and input the binary-encoded features into the neural network.
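As a rough sketch of how a prediction accuracy might be obtained for each individual, the example below assumes that the binary code is a 0/1 mask over the feature set to be selected, where 1 means the feature is kept. The neural network itself is not modelled: `score_fn` is a hypothetical stand-in for training and scoring the network on the selected columns, and `apply_mask` and `evaluate_population` are illustrative names only.

```python
def apply_mask(rows, individual):
    """Keep only the feature columns whose gene is 1."""
    keep = [i for i, bit in enumerate(individual) if bit == 1]
    return [[row[i] for i in keep] for row in rows]


def evaluate_population(population, rows, labels, score_fn):
    """Return one prediction accuracy per individual.

    score_fn(selected_rows, labels) is a hypothetical callable standing in
    for the pre-constructed neural network; it must return a number.
    """
    return [score_fn(apply_mask(rows, individual), labels)
            for individual in population]


if __name__ == "__main__":
    rows = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
    labels = [0, 1]
    population = [[1, 0, 1], [0, 1, 0]]
    # Toy scorer for demonstration only: counts the selected columns.
    print(evaluate_population(population, rows, labels,
                              lambda selected, y: len(selected[0])))
```

Passing the scorer in as a parameter keeps the evolutionary loop independent of how the network is built or trained.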

In one embodiment, the parent selection module 704 is further configured to obtain the first parent individual and the second parent individual in sequence using a tournament selection strategy, in which at least two individuals are randomly selected from the initial population and the individual with the highest prediction accuracy among them is taken as a parent individual.
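The tournament selection described above can be sketched as follows. The tournament size `k`, the function names, and the choice to return indices into the population are assumptions made for illustration. The text does not say whether the two parents must be distinct, so this sketch allows the same individual to win both tournaments.

```python
import random


def tournament_select(accuracies, k=2, rng=random):
    """Return the index of the winner of one tournament.

    k individuals are drawn at random without replacement, and the one
    with the highest prediction accuracy wins.
    """
    if k < 2 or k > len(accuracies):
        raise ValueError("k must be between 2 and the population size")
    contenders = rng.sample(range(len(accuracies)), k)
    return max(contenders, key=lambda i: accuracies[i])


def select_parents(accuracies, k=2, rng=random):
    """Run two tournaments in sequence to obtain two parent indices."""
    first = tournament_select(accuracies, k, rng)
    second = tournament_select(accuracies, k, rng)
    return first, second


if __name__ == "__main__":
    accuracies = [0.50, 0.90, 0.70, 0.65]
    print(select_parents(accuracies, k=2))
```

A small `k` keeps selection pressure low, so weaker individuals still have some chance of becoming parents; setting `k` to the population size always returns the best individual.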

In one embodiment, the population optimization module 708 is further configured to: select a new feature subset from the feature set to be selected using the feature selection manner characterized by the offspring; calculate the prediction accuracy of the offspring according to the new feature subset; and, if the prediction accuracy of the offspring is greater than the average prediction accuracy of the initial population, replace with the offspring one randomly chosen individual in the initial population whose prediction accuracy is lower than that of the offspring.
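The replacement rule above can be illustrated with the following sketch. It assumes the population and its accuracies are kept in two parallel lists that are updated in place. The name `update_population` and the return convention (the replaced index, or `None` when the offspring is rejected) are illustrative assumptions.

```python
import random


def update_population(population, accuracies, child, child_acc, rng=random):
    """Insert the offspring if it beats the population's average accuracy.

    Returns the index of the replaced individual, or None if the
    offspring is discarded. Both lists are modified in place.
    """
    mean_acc = sum(accuracies) / len(accuracies)
    if child_acc <= mean_acc:
        return None

    # An offspring above the mean is always above at least one individual,
    # so this list is never empty here.
    weaker = [i for i, acc in enumerate(accuracies) if acc < child_acc]
    victim = rng.choice(weaker)
    population[victim] = list(child)
    accuracies[victim] = child_acc
    return victim


if __name__ == "__main__":
    population = [[1, 0], [0, 1], [1, 1]]
    accuracies = [0.5, 0.9, 0.7]
    print(update_population(population, accuracies, [1, 0], 0.8))
    print(population, accuracies)
```

Because the victim is drawn only from individuals less accurate than the offspring, the best individual in the population is never displaced by a worse one.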

In one embodiment, the industrial data feature selection apparatus further comprises a data preprocessing module, configured to convert the industrial data to be processed into feature vectors and preprocess the feature vectors to obtain the feature set to be selected.

For specific limitations of the industrial data feature selection apparatus, reference may be made to the above limitations of the industrial data feature selection method, which are not repeated here. The modules in the industrial data feature selection apparatus can be implemented wholly or partially by software, hardware, or a combination thereof. The modules can be embedded in, or independent of, a processor in the computer device in hardware form, or can be stored in a memory in the computer device in software form, so that the processor can call and execute the operations corresponding to the modules.

In one embodiment, a computer device is provided, which may be a server, and its internal structure may be as shown in fig. 8. The computer device includes a processor, a memory, and a network interface connected by a system bus. The processor of the computer device provides computing and control capabilities. The memory of the computer device comprises a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for running the operating system and the computer program stored in the non-volatile storage medium. The database of the computer device is used for storing industrial data. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program, when executed by the processor, implements an industrial data feature selection method.

Those skilled in the art will appreciate that the architecture shown in fig. 8 is merely a block diagram of part of the structure related to the disclosed solution and does not limit the computer devices to which the disclosed solution applies; a particular computer device may include more or fewer components than those shown, may combine certain components, or may have a different arrangement of components.

In one embodiment, a computer device is provided, comprising a memory and a processor, the memory having a computer program stored therein, the processor implementing the following steps when executing the computer program:

In one embodiment, the processor, when executing the computer program, further performs the steps of:

obtaining a first parent individual and a second parent individual in sequence using a tournament selection strategy, in which at least two individuals are randomly selected from the initial population and the individual with the highest prediction accuracy among them is taken as a parent individual.

In one embodiment, a computer-readable storage medium is provided, having a computer program stored thereon, which when executed by a processor, performs the steps of:

In one embodiment, the computer program when executed by the processor further performs the steps of:

It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing the relevant hardware. The computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the method embodiments described above. Any reference to memory, storage, database or other medium used in the embodiments provided herein can include at least one of non-volatile and volatile memory. Non-volatile memory may include read-only memory (ROM), magnetic tape, floppy disk, flash memory, optical storage, or the like. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM can take many forms, such as static random access memory (SRAM) or dynamic random access memory (DRAM), among others.

The technical features of the above embodiments can be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; however, any combination of the technical features should be considered to fall within the scope of this specification as long as the combination involves no contradiction.

The above embodiments express only several implementations of the present application, and although their description is relatively specific and detailed, it should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, all of which fall within the scope of protection of the present application. Therefore, the scope of protection of this patent shall be subject to the appended claims.

Claims

1. A method of industrial data feature selection, the method comprising:

selecting a plurality of feature subsets from a feature set to be selected of the industrial data by using a feature selection mode represented by a plurality of individuals contained in the initial population; the initial population is used for feature screening;

selecting a first parent individual and a second parent individual from the plurality of individuals according to the prediction accuracy of each feature subset with respect to a set prediction result;

when the gene codes of the first parent individual and the second parent individual are the same, retaining that gene code as the gene code of the offspring, and when the gene codes of the first parent individual and the second parent individual are different, retaining the gene code of the parent individual with the higher prediction accuracy as the gene code of the offspring, so as to obtain one offspring;

updating the initial population according to the prediction accuracy corresponding to a new feature subset selected based on the offspring;

2. The method of claim 1, wherein before the selecting a first parent individual and a second parent individual from the plurality of individuals according to the prediction accuracy of each feature subset with respect to a set prediction result, the method further comprises:

inputting the plurality of feature subsets into a pre-constructed neural network, so that the neural network outputs the prediction accuracy corresponding to each feature subset.

3. The method of claim 2, wherein the neural network comprises a plurality of hidden layers, wherein the activation function of the neural network is a linear rectification function, and wherein the parallel computation of the stochastic gradient descent of the neural network is implemented based on the Hogwild! algorithm.

4. The method of claim 2,

the inputting the plurality of feature subsets into a pre-constructed neural network comprises:

5. The method according to any one of claims 1 to 4, wherein the selecting a first parent individual and a second parent individual from the plurality of individuals according to the prediction accuracy of each feature subset with respect to a set prediction result comprises:

obtaining the first parent individual and the second parent individual in sequence using a tournament selection strategy, in which at least two individuals are randomly selected from the initial population and the individual with the highest prediction accuracy among them is taken as a parent individual.

6. The method of any one of claims 1 to 4, wherein the updating the initial population according to the prediction accuracy corresponding to a new feature subset selected based on the offspring comprises:

calculating the prediction accuracy of the offspring according to the new feature subset;

and if the prediction accuracy of the offspring is greater than the average prediction accuracy of the initial population, replacing with the offspring one randomly chosen individual in the initial population whose prediction accuracy is lower than that of the offspring.

7. The method according to any one of claims 1 to 4, wherein before the selecting a plurality of feature subsets from the feature set to be selected of the industrial data by using the feature selection mode represented by the plurality of individuals contained in the initial population, the method further comprises:

converting the industrial data to be processed into a feature vector, and preprocessing the feature vector to obtain the feature set to be selected.

8. An industrial data feature selection apparatus, the apparatus comprising:

a feature subset selection module, configured to select a plurality of feature subsets from a feature set to be selected of the industrial data by using a feature selection mode represented by a plurality of individuals contained in the initial population; the initial population is used for feature screening;

a parent selection module, configured to select a first parent individual and a second parent individual from the plurality of individuals according to the prediction accuracy of each feature subset with respect to a set prediction result;

an offspring generation module, configured to retain the gene code as the gene code of the offspring when the gene codes of the first parent individual and the second parent individual are the same, and to retain the gene code of the parent individual with the higher prediction accuracy as the gene code of the offspring when the gene codes of the first parent individual and the second parent individual are different, so as to obtain one offspring;

a population optimization module, configured to update the initial population according to the prediction accuracy corresponding to a new feature subset selected based on the offspring;

9. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor, when executing the computer program, implements the steps of the method of any of claims 1 to 7.

10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 7.