CN115600121B - Data hierarchical classification method and device, electronic equipment and storage medium

Info

Publication number: CN115600121B
Application number: CN202210446117.7A
Authority: CN
Inventors: 张明; 张儒; 郭震; 金云峰; 孙自飞; 甘雨; 路明标; 姜栋
Original assignee: Nanjing Tianfu Software Co ltd
Current assignee: Nanjing Tianfu Software Co ltd
Priority date: 2022-04-26
Filing date: 2022-04-26
Publication date: 2023-11-07
Anticipated expiration: 2042-04-26
Also published as: CN115600121A

Abstract

The disclosure relates to the technical field of data processing in ship profile design and provides a data hierarchical classification method and device, electronic equipment and a storage medium, applied to ship profile design. The method comprises the following steps: S101, pre-segmenting an original data set; S102, performing classification training on the sub-data sets; S103, verifying the data segmentation scheme; and S104, selecting the final data segmentation scheme. Given the objective limits on data scale in industrial design, and aiming at the problems that an industrial design data set may contain multiple mixed modes or have poor internal consistency, the method applies, for the first time, a data hierarchical classification method as a pre-processing step in ship profile design driven by industrial data sets. The pre-layering operation mines the multiple mixed modes inside the sample training set and purifies the data set, which improves the accuracy of data modeling and the designers' utilization of accumulated ship profile data. The method has a wide application range and effectively assists intelligent design of ship profiles.

Description

Data hierarchical classification method and device, electronic equipment and storage medium

Technical Field

The disclosure relates to the technical field of data processing in ship profile design, and in particular relates to a data hierarchical classification method and device, electronic equipment and storage medium.

Background

Conventional ship design is essentially empirical: the final decision depends largely on the subjective experience and knowledge structure of the decision maker. The expert consultation (Delphi) method takes the subjective judgment of experts as the decision basis and uses scores, indexes, ordinal numbers, comments and the like as evaluation criteria. It is a simple, theoretical and systematic method, but it is difficult to ensure the objectivity of the evaluation result. The analytic hierarchy process (AHP) is used to study multi-objective decision problems with more complex structures; it quantifies qualitative problems and thereby makes the evaluation result more scientific and reasonable. The method obtains a judgment matrix reflecting the relative importance of each attribute through pairwise comparison of evaluation indexes, so its reliability is high and its error is small. Its drawback is that the judgment matrix often fails to meet the consistency requirement because of the limits of the decision maker's knowledge structure, personal preference, judgment level and the like.

With the emergence and development of advanced intelligent technology, scientific and reasonable decision-making modes have been introduced into the design process. The concept of the decision support system (DSS) pushed decision theory to a new stage of development; DSS has achieved great results in fields such as systems engineering and management science, and is commonly used to solve decision problems in semi-structured and unstructured complex information systems. In recent years, advanced intelligent technologies such as online analytical processing (OLAP) and data mining (DM) based on the data warehouse (DW) have opened a new path for the development of DSS.

The ship type decision support system consists of a database and database management module, a model library and model library management module, a knowledge library and knowledge library management module, a data warehouse and data warehouse management module, a data mining module, a knowledge discovery module, a man-machine interaction module and the like. The data mining module and the knowledge discovery module are responsible for carrying out operations such as inquiring, analyzing, mining, selecting and evaluating on data, and mining decision information hidden in the data by adopting intelligent technologies such as genetic algorithm, neural network, statistical analysis, machine learning, fuzzy decision and the like.

Intelligent technology is data-driven, and how to use the ship-type data accumulated by enterprises to provide efficient references for designers is the main research topic of data mining. In the prior art, a ship type highly related to the design requirement is usually selected from the accumulated ship-type data as a parent (mother) ship type to guide the design. The utilization rate of the ship-type data is therefore extremely low: only excellent ship-type data highly related to the design requirement can be used, and the relationships among the selected ship types are not considered.

Introducing proxy model training based on artificial intelligence is one of the key technologies for solving the above problems. Because ship-type test or measured data is limited, the training samples of the proxy model may be simulation data samples provided by computational fluid dynamics (CFD) solvers, and the test or measured data may be used to correct the CFD model or its boundary conditions. With this technology, most of the data in the ship-type database can be used to guide designers in ship-type design, which greatly improves the utilization rate of ship-type data. Moreover, evaluating the proxy model takes far less time than a CFD simulation, so using the proxy model can greatly shorten the engineering design period.

Proxy model training based on artificial intelligence can effectively address the problems of low data utilization and long design periods, but training and using the proxy model raises problems of its own. For example, the limited number of training samples, the consistency of sample point types and the like make it harder to improve the training accuracy of the proxy model. This is particularly true in data-driven learning on industrial data sets, where the data from the early design stage of an industrial process has a high value density per data point and the data set is small. Learning algorithms such as machine learning generally suffer from the data-hungry problem and the curse of dimensionality: the stronger the nonlinear expressive power of a model, the higher its requirements on the scale and diversity of the training data. An algorithm model with only ordinary nonlinear expressive power, however, cannot effectively extract the complex mapping modes in the training data set, and its performance is insufficient to support the related applications.

Due to the objective limits on the size of industrial design-stage data sets, complex models such as deep learning cannot be used effectively. Only machine learning algorithms of a more statistical nature can be used, and their nonlinear expressive power is limited. In particular, when the industrial design-stage data set contains multiple mixed modes or has poor internal consistency, the modeling performance of the learning algorithm is further weakened.

Disclosure of Invention

The disclosure aims to at least solve one of the problems in the prior art, and provides a data hierarchical classification method and device, electronic equipment and storage medium.

In one aspect of the disclosure, a data hierarchical classification method is provided, which is applied to hull profile design, and includes the following steps:

pre-segmentation of the original dataset: according to the sample classification number specified by a user, carrying out clustering and layering treatment on an original data set by adopting a Gaussian mixture model (Gaussian Mixture Model, GMM), and dividing the original data set into a plurality of sub data sets corresponding to the sample classification number to obtain a current data dividing scheme, wherein the original data set is an industrial data set in a ship profile design;

classifying the training sub-data set: adding a sub-class label to each sub-data set to obtain a training data set, and training a random forest (RandomForest, RF) classifier based on the training data set to obtain a sub-class classifier;

verification of the data segmentation scheme: carrying out regression training on the original data set and the plurality of sub-data sets by using a regression algorithm model based on a gradient boosting decision tree (Gradient Boosting Decision Tree, GBDT) to obtain an original data set regression model and a plurality of sub-data set regression models, wherein each sub-data set regression model corresponds to one sub-data set, and determining the proxy performance of the current data segmentation scheme and the proxy performance of the original data set respectively by combining the subclass classifier and cross-validation;

selecting a final data segmentation scheme: evaluating the current data segmentation scheme based on the proxy performance of the current data segmentation scheme and the proxy performance of the original data set, and determining a final data segmentation result according to the evaluation result.

Optionally, the cross-validation includes K-fold cross-validation, and after obtaining the original dataset regression model and the plurality of sub-dataset regression models, determining the proxy performance of the current data partitioning scheme and the proxy performance of the original dataset respectively in combination with the sub-class classifier and the cross-validation includes:

dividing the original data set into K original data subsets randomly, taking one of the original data subsets as a test set in turn, taking the corresponding rest of the original data subsets as a training set, training and testing a plurality of sub-data set regression models and an original data set regression model based on the training set, the test set and a subclass classifier, and respectively obtaining errors corresponding to the plurality of sub-data set regression models and errors corresponding to the original data set regression model;

and determining the proxy performance of the current data segmentation scheme and the proxy performance of the original data set respectively based on errors corresponding to the multiple sub-data set regression models and errors corresponding to the original data set regression models.

Optionally, training and testing the multiple sub-dataset regression models and the original dataset regression model based on the training set, the test set and the subclass classifier, to obtain the errors corresponding to the multiple sub-dataset regression models and the error corresponding to the original dataset regression model respectively, includes:

training a plurality of sub-dataset regression models based on the training set;

judging the subclass category of each sample in the test set based on the subclass classifier, determining a sub-data set regression model corresponding to each sample from the trained multiple sub-data set regression models based on the judged subclass category, and respectively inputting each sample in the test set into the corresponding sub-data set regression model to obtain a predicted value corresponding to each sample;

and determining errors corresponding to the regression models of the plurality of sub-data sets based on the true values and the corresponding predicted values of the samples in the test set.

Optionally, determining the errors corresponding to the regression models of the plurality of sub-data sets based on the true values of the samples in the test set and the corresponding predicted values thereof includes:

determining errors corresponding to the regression models of the plurality of sub-data sets according to the following formula (1):

where j = 1, 2, …, K is the index of the test set; E_j is the relative mean absolute error (Relative Mean Absolute Error, RMAE) of the plurality of sub-dataset regression models on test set j; i = 1, 2, …, n is the sample index in test set j; n is the number of samples in test set j; y_i is the true value of the i-th sample in test set j; and ŷ_i is the predicted value corresponding to the i-th sample in test set j.
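Formula (1) itself is not reproduced in this text. As a hedged sketch only, assuming the RMAE averages the per-sample relative absolute error (the symbol definitions are also compatible with other normalizations), and writing ŷ_i for the predicted value of the i-th sample, it could take the form:

```latex
E_j = \frac{1}{n} \sum_{i=1}^{n} \left| \frac{y_i - \hat{y}_i}{y_i} \right|,
\qquad j = 1, 2, \ldots, K
```

Under this assumed form, every true value y_i must be nonzero for E_j to be defined.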

Optionally, determining the proxy performance of the current data segmentation scheme and the proxy performance of the original data set based on the errors corresponding to the multiple sub-data set regression models and the errors corresponding to the original data set regression models respectively includes:

determining proxy performance of the current data partitioning scheme according to the following equation (2):

wherein split_perf is the proxy performance of the current data segmentation scheme.

Optionally, training and testing the multiple sub-dataset regression models and the original dataset regression model based on the training set, the test set and the subclass classifier, to obtain the errors corresponding to the multiple sub-dataset regression models and the error corresponding to the original dataset regression model respectively, further includes:

training a regression model of the original data set based on the training set;

respectively inputting each sample in the test set into a trained original data set regression model to obtain a predicted value corresponding to each sample;

and determining the error corresponding to the original dataset regression model based on the true values of the samples in the test set and their corresponding predicted values.

Optionally, a random forest classifier is built based on a classification regression tree (Classification And Regression Tree, CART) model.

In another aspect of the present disclosure, there is provided a data hierarchical classification apparatus for use in hull form line design, the apparatus comprising:

the front segmentation module is used for front segmentation of the original data set: according to the sample classification number designated by a user, clustering and layering processing is carried out on an original data set by adopting a Gaussian mixture model, and the original data set is divided into a plurality of sub data sets corresponding to the sample classification number, so that a current data segmentation scheme is obtained, wherein the original data set is an industrial data set in a ship profile design;

the classification training module is used for classifying and training the sub-data set: adding a sub-class label to each sub-data set to obtain a training data set, and training the random forest classifier based on the training data set to obtain a sub-class classifier;

the verification module is used for verifying the data segmentation scheme: carrying out regression training on the original data set and the plurality of sub-data sets by using a regression algorithm model based on a gradient boosting decision tree to obtain an original data set regression model and a plurality of sub-data set regression models, wherein each sub-data set regression model corresponds to one sub-data set, and determining the proxy performance of the current data segmentation scheme and the proxy performance of the original data set respectively by combining the subclass classifier and cross-validation;

the selection module is used for selecting a final data segmentation scheme: evaluating the current data segmentation scheme based on the proxy performance of the current data segmentation scheme and the proxy performance of the original data set, and determining a final data segmentation result according to the evaluation result.

In another aspect of the present disclosure, there is provided an electronic device including:

at least one processor; and,

a memory communicatively coupled to the at least one processor; wherein,

the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the data hierarchical classification method described above.

In another aspect of the disclosure, a computer readable storage medium is provided, storing a computer program which, when executed by a processor, implements the data hierarchical classification method described above.

Compared with the prior art, and given the objective limits on data scale in industrial design problems, the disclosure addresses the problems that an industrial design-stage data set may contain multiple mixed modes or have poor internal consistency. It applies, for the first time, a data hierarchical classification method as a pre-processing step in ship profile design driven by industrial data sets. The pre-layering operation mines the multiple mixed modes inside the sample training set and purifies the data set, which improves the accuracy of data modeling and the designers' utilization of the ship profile data accumulated by enterprises. The method has a wide application range and effectively assists intelligent design of ship profiles.

Drawings

One or more embodiments are illustrated by way of example and not limitation in the figures of the accompanying drawings, in which like references indicate similar elements, and in which the figures do not depict a proportional limitation unless expressly stated otherwise.

FIG. 1 is a flow chart of a data hierarchical classification method according to an embodiment of the present disclosure;

FIG. 2 is a flow chart of a data hierarchical classification method according to another embodiment of the present disclosure;

FIG. 3 is a flow chart of a data hierarchical classification method according to another embodiment of the present disclosure;

FIG. 4 is a flow chart of a data hierarchical classification method according to another embodiment of the present disclosure;

FIG. 5 is a graph of the comparative effect of predicted and actual values for two modeling schemes provided by another embodiment of the present disclosure;

FIG. 6 is a graph of error versus result for three modeling schemes provided in accordance with another embodiment of the present disclosure;

FIG. 7 is a graphical illustration of a visual result of a test dataset provided by another embodiment of the present disclosure;

FIG. 8 is a schematic structural diagram of a data hierarchical classification device according to another embodiment of the present disclosure;

FIG. 9 is a schematic structural diagram of an electronic device according to another embodiment of the present disclosure.

Detailed Description

To make the objects, technical solutions and advantages of the embodiments of the present disclosure clearer, the embodiments are described in detail below with reference to the accompanying drawings. Those of ordinary skill in the art will understand that numerous technical details are set forth in the various embodiments to provide a better understanding of the present disclosure. The technical solutions claimed in the present disclosure can nevertheless be implemented without these technical details and with various changes and modifications based on the following embodiments. The division into embodiments below is for convenience of description and should not be construed as limiting the specific implementations of the disclosure; the embodiments may be combined with and refer to one another where no contradiction arises.

One embodiment of the present disclosure relates to a data hierarchical classification method applied to hull profile design, the flow of which is shown in fig. 1, comprising the following steps:

S101: pre-segmentation of the original data set: according to the sample classification number specified by the user, clustering and layering processing is carried out on the original data set by adopting a Gaussian mixture model, and the original data set is divided into a plurality of sub-data sets corresponding to the sample classification number to obtain a current data segmentation scheme, wherein the original data set is an industrial data set in ship profile design.

Specifically, the original data set D may be divided into split_n sub-data sets (D_1, D_2, ..., D_split_n) according to the sample classification number split_n by using the Gaussian mixture model, an unsupervised clustering algorithm, to obtain the current data segmentation scheme. That is, the current data segmentation scheme is the result of clustering and layering the original data set with a Gaussian mixture model according to the user-specified sample classification number split_n, which splits the original data set into split_n sub-data sets.
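As a minimal sketch of step S101, the following fits a Gaussian mixture by expectation-maximization and splits the data by most probable component. Several things here are assumptions for illustration and are not taken from the disclosure: the samples are reduced to a single numeric feature, the means are initialized at evenly spaced quantiles, and the function name `gmm_split_1d` is hypothetical. A practical implementation would more likely rely on an existing multivariate GMM library.

```python
import math


def gmm_split_1d(values, split_n, n_iter=100):
    """Split 1-D values into split_n sub-data sets with a Gaussian mixture (EM).

    Assumption: one feature per sample and quantile-based initialization.
    """
    ordered = sorted(values)
    n = len(ordered)
    # Initial means at evenly spaced quantiles; shared variance; equal weights.
    means = [ordered[int((k + 0.5) * n / split_n)] for k in range(split_n)]
    mean_all = sum(values) / n
    var_all = sum((v - mean_all) ** 2 for v in values) / n or 1.0
    variances = [var_all] * split_n
    weights = [1.0 / split_n] * split_n

    def density(v, k):
        # Weighted Gaussian density of value v under component k.
        coef = weights[k] / math.sqrt(2.0 * math.pi * variances[k])
        return coef * math.exp(-((v - means[k]) ** 2) / (2.0 * variances[k]))

    for _ in range(n_iter):
        # E-step: responsibility of each component for each sample.
        resp = []
        for v in values:
            dens = [density(v, k) for k in range(split_n)]
            total = sum(dens) or 1e-300
            resp.append([d / total for d in dens])
        # M-step: re-estimate weights, means and variances.
        for k in range(split_n):
            nk = sum(r[k] for r in resp) or 1e-300
            weights[k] = nk / n
            means[k] = sum(r[k] * v for r, v in zip(resp, values)) / nk
            var = sum(r[k] * (v - means[k]) ** 2 for r, v in zip(resp, values)) / nk
            variances[k] = max(var, 1e-6)  # floor keeps the density finite

    # Hard assignment: each sample joins its most probable component.
    subsets = [[] for _ in range(split_n)]
    for v in values:
        dens = [density(v, k) for k in range(split_n)]
        subsets[dens.index(max(dens))].append(v)
    return subsets


# Toy data with two clearly separated modes (illustrative values only).
D = [1.0, 10.0, 1.1, 10.1, 0.9, 9.9, 1.2, 10.2, 0.8, 9.8]
sub_datasets = gmm_split_1d(D, split_n=2)
```

Here `sub_datasets[0]` and `sub_datasets[1]` play the role of D_1 and D_2 for split_n = 2.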

S102: classification training on the sub-data sets: a subclass label is added to each sub-data set to obtain a training data set, and a random forest classifier is trained on the training data set to obtain a subclass classifier.

Specifically, the subclass label is denoted label = 1, 2, ..., split_n, and the training data set is denoted D_splited. After the subclass labels are added to the sub-data sets (D_1, D_2, ..., D_split_n), the resulting training data set can be represented as D_splited = {(D_1, label = 1), (D_2, label = 2), ..., (D_split_n, label = split_n)}. A random forest classifier, a supervised algorithm, is trained on this training data set to obtain the subclass classifier. The subclass classifier serves as an intermediate-stage classifier that judges which subclass a new data sample belongs to, so as to determine the regression model to be activated.

The random forest classifier is a special bootstrap aggregating (bagging) algorithm that uses the CART decision tree as the base model in the bagging strategy. First, m training sets are generated from the original data set by bootstrap sampling, and an independent decision tree is built for each training set. When a node selects a feature to split on, it does not search all features for the one that maximizes an index (such as information gain); instead it randomly draws a subset of the features, finds the best split among the drawn features, and applies it to the node. The random forest method thus samples both the samples and the features, which effectively reduces the risk of overfitting.
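The two sources of randomness described above can be sketched as follows. This is an illustration only, not a full random forest: tree construction is omitted, and the function names, the toy records and the values of m and the feature-subset size are assumptions.

```python
import random


def bootstrap_sample(dataset, rng):
    """Draw len(dataset) records with replacement (one bagging training set)."""
    return [rng.choice(dataset) for _ in dataset]


def random_feature_subset(n_features, n_pick, rng):
    """Pick the candidate features a node is allowed to split on."""
    return sorted(rng.sample(range(n_features), n_pick))


rng = random.Random(0)
# Each toy record: three feature values followed by a subclass label.
dataset = [(0.1, 0.2, 0.3, 1), (0.4, 0.5, 0.6, 2), (0.7, 0.8, 0.9, 1)]
m = 5  # number of training sets, one per decision tree
training_sets = [bootstrap_sample(dataset, rng) for _ in range(m)]
candidates = random_feature_subset(n_features=3, n_pick=2, rng=rng)
```

Each entry of `training_sets` would feed one decision tree, and a fresh `candidates` list would be drawn at every node before searching for the best split.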

S103: verification of the data segmentation scheme: regression training is carried out on the original data set and the plurality of sub-data sets respectively by using a regression algorithm model based on the gradient boosting decision tree to obtain an original data set regression model and a plurality of sub-data set regression models, wherein each sub-data set regression model corresponds to one sub-data set, and the proxy performance of the current data segmentation scheme and the proxy performance of the original data set are determined respectively by combining the subclass classifier and cross-validation.

Specifically, this step can use a regression algorithm model based on the gradient boosting decision tree to perform regression training on the original data set D and on the split_n sub-data sets (D_1, D_2, ..., D_split_n) respectively, obtaining the original data set regression model estimator_baseline and the split_n sub-data set regression models estimator_1, estimator_2, ..., estimator_split_n. With cross-validation as the flow logic, the proxy performance split_perf of the current data segmentation scheme and the proxy performance baseline_perf of the original data set are obtained respectively, thereby verifying whether the current data segmentation scheme can effectively improve the regression modeling performance.

S104: selecting a final data segmentation scheme: the current data segmentation scheme is evaluated based on the proxy performance of the current data segmentation scheme and the proxy performance of the original data set, and a final data segmentation result is determined according to the evaluation result.

Specifically, the current data segmentation scheme is evaluated by comparing its proxy performance split_perf with the proxy performance baseline_perf of the original data set. If split_perf > baseline_perf, the current data segmentation scheme performs better and the segmentation is effective, so the sub-data sets (D_1, D_2, ..., D_split_n) are output as the final data segmentation result. If split_perf > baseline_perf is not satisfied, the current data segmentation scheme does not perform better, the segmentation is invalid, and the original data set D is output.
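The selection rule of S104 reduces to a single comparison. The sketch below assumes, for illustration, that the unsplit result is returned as a one-element list so both branches have the same shape; the function name is hypothetical.

```python
def select_final_scheme(split_perf, baseline_perf, sub_datasets, original_dataset):
    """Return the sub-data sets if the split helps, else the original data set."""
    if split_perf > baseline_perf:
        return sub_datasets          # segmentation is effective
    return [original_dataset]        # segmentation is invalid; keep D whole


# Illustrative values only.
D = [1, 2, 3, 4]
parts = [[1, 2], [3, 4]]
kept_split = select_final_scheme(0.9, 0.8, parts, D)
kept_whole = select_final_scheme(0.8, 0.8, parts, D)
```

Note that a tie keeps the original data set, because the condition in the text is a strict inequality.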

Compared with the prior art, and given the objective limits on data scale in industrial design problems, this embodiment addresses the problems that an industrial design-stage data set may contain multiple mixed modes or have poor internal consistency. It applies, for the first time, a data hierarchical classification method as a pre-processing step in ship profile design driven by industrial data sets. The pre-layering operation mines the multiple mixed modes inside the sample training set and purifies the data set, which improves the accuracy of data modeling and the designers' utilization of the ship profile data accumulated by enterprises. The method has a wide application range and effectively assists intelligent design of ship profiles.

Illustratively, before step S101, an acquiring step may be further included, that is, acquiring the number of sample classifications specified by the user and the original data set.

Illustratively, the cross-validation includes K-fold cross-validation.

Specifically, the basic idea of K-fold cross-validation is to split the initial sample into K subsamples, hold out one subsample as data for validating the model, and use the other K-1 subsamples for training. The procedure is repeated K times so that each subsample is used for validation exactly once, and the K results are averaged or otherwise combined into a single estimate. The advantage of K-fold cross-validation is that the randomly generated subsamples are reused for both training and validation: every sample serves as training data and is also used once as test data, so the data is better utilized. K is generally between 2 and 10, and 10-fold cross-validation is the most common.
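The fold construction can be sketched in a few lines. The function name, the fixed seed and the interleaved assignment of shuffled indices to folds are assumptions made for this illustration.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Return k (train_indices, test_indices) pairs covering every sample once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)          # random partition
    folds = [indices[i::k] for i in range(k)]     # k near-equal subsets
    splits = []
    for j in range(k):
        test_idx = folds[j]
        # The training set is the union of all folds except fold j.
        train_idx = [i for f, fold in enumerate(folds) if f != j for i in fold]
        splits.append((train_idx, test_idx))
    return splits


splits = k_fold_indices(n_samples=10, k=5)
```

Each sample index appears in exactly one test fold and in the training set of the other K-1 rounds.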

Illustratively, after obtaining the original dataset regression model and the plurality of sub-dataset regression models, determining the proxy performance of the current data splitting scheme and the proxy performance of the original dataset, respectively, in combination with the sub-class classifier and cross-validation, includes:

dividing the original data set into K original data subsets randomly, taking one of the original data subsets as a test set in turn, taking the corresponding rest of the original data subsets as a training set, training and testing the multiple sub-data set regression models and the original data set regression models based on the training set, the test set and the subclass classifier, and respectively obtaining errors corresponding to the multiple sub-data set regression models and errors corresponding to the original data set regression models.

Specifically, the original data set D is randomly and equally divided into K original data subsets (D'_1, D'_2, ..., D'_K). D'_1, D'_2, ..., D'_K are taken in turn as the test set, and the corresponding remaining original data subsets, i.e. (D'_2, D'_3, ..., D'_K), (D'_1, D'_3, ..., D'_K), ..., (D'_1, D'_2, ..., D'_K-1), are taken as the training set. Using the subclass classifier, the split_n sub-data set regression models estimator_1, estimator_2, ..., estimator_split_n and the original data set regression model estimator_baseline are trained and tested, and the errors corresponding to the split_n sub-data set regression models and the error corresponding to the original data set regression model are obtained respectively.

The training set data can be better utilized by carrying out K-fold cross validation on the current data segmentation scheme, and the obtained evaluation result, namely the proxy performance of the current data segmentation scheme and the proxy performance of the original data set, can be as close as possible to the performance of the model on the test set.

Illustratively, training and testing the multiple sub-dataset regression models and the original dataset regression model based on the training set, the test set and the subclass classifier, to obtain the errors corresponding to the multiple sub-dataset regression models and the error corresponding to the original dataset regression model respectively, includes the following steps, as shown in FIG. 2:

S201: training the plurality of sub-dataset regression models based on the training set;

S202: judging the subclass of each sample in the test set based on the subclass classifier, determining the sub-dataset regression model corresponding to each sample from the trained sub-dataset regression models based on the judged subclass, and inputting each sample in the test set into its corresponding sub-dataset regression model to obtain a predicted value corresponding to each sample;

S203: determining the errors corresponding to the plurality of sub-dataset regression models based on the true values of the samples in the test set and their corresponding predicted values.

Specifically, since the K original data subsets are used in turn as the test set with the corresponding remaining original data subsets as the training set, the original data subsets D'_1, D'_2, ..., D'_K are taken as the test set one after another, with the corresponding remaining original data subsets, i.e. (D'_2, D'_3, ..., D'_K), (D'_1, D'_3, ..., D'_K), ..., (D'_1, D'_2, ..., D'_K-1), as the training set. Steps S201 to S203 are repeated K times to obtain the error of the split_n sub-data set regression models on each test set.
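The routing performed in S202 and S203 for a single fold might look like the sketch below. The classifier and regressors are hypothetical stand-in functions rather than trained random forest or GBDT models, and all names and numbers are illustrative assumptions.

```python
def evaluate_split_models(test_samples, subclass_classifier, sub_regressors):
    """Route each test sample to the regressor of its judged subclass.

    Returns a list of (true_value, predicted_value) pairs for one fold.
    """
    pairs = []
    for features, y_true in test_samples:
        label = subclass_classifier(features)        # S202: judge the subclass
        y_pred = sub_regressors[label](features)     # S202: activated regressor
        pairs.append((y_true, y_pred))               # S203 consumes these pairs
    return pairs


# Hypothetical stand-ins for the trained subclass classifier and regressors.
def toy_classifier(x):
    return 1 if x < 5.0 else 2


toy_regressors = {
    1: lambda x: 2.0 * x,          # stands in for estimator_1
    2: lambda x: 0.5 * x + 10.0,   # stands in for estimator_2
}

test_set = [(1.0, 2.0), (8.0, 14.0)]  # (feature, true value) pairs
pairs = evaluate_split_models(test_set, toy_classifier, toy_regressors)
```

The resulting pairs are what an error measure such as the RMAE would be computed from, once per fold.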

Illustratively, training and testing the multiple sub-dataset regression models and the original dataset regression model based on the training set, the test set and the subclass classifier, to obtain the errors corresponding to the multiple sub-dataset regression models and the error corresponding to the original dataset regression model respectively, further includes the following steps, as shown in FIG. 3:

S301: training the original dataset regression model based on the training set;

S302: inputting each sample in the test set into the trained original dataset regression model to obtain a predicted value corresponding to each sample;

S303: determining the error corresponding to the original dataset regression model based on the true values of the samples in the test set and their corresponding predicted values.

Specifically, since each of the K original data subsets must serve as the test set in turn, with the corresponding remaining original data subsets serving as the training set, the original data subsets $D'_1, D'_2, \ldots, D'_K$ are used in turn as the test set and the corresponding remaining original data subsets, i.e. $(D'_2, D'_3, \ldots, D'_K)$, $(D'_1, D'_3, \ldots, D'_K)$, ..., $(D'_1, D'_2, \ldots, D'_{K-1})$, are used as the training set. Steps S301 to S303 are repeated K times to obtain the errors of the original dataset regression model corresponding to each test set.

Illustratively, the errors corresponding to the plurality of sub-dataset regression models are determined, based on the true values of the samples in the test set and their corresponding predicted values, according to formula (1), where j = 1, 2, ..., K is the index of the test set, $E_j$ is the relative mean absolute error of the plurality of sub-dataset regression models on test set j, i = 1, 2, ..., n is the sample index in test set j, n is the number of samples in test set j, $y_i$ is the true value of the i-th sample in test set j, and $\hat{y}_i$ is the predicted value corresponding to the i-th sample in test set j.

It should be noted that the error corresponding to the original dataset regression model is obtained from the same formula by taking $E_j$ as the relative mean absolute error of the original dataset regression model on test set j, $y_i$ as the true value of the i-th sample in test set j, and $\hat{y}_i$ as the value predicted by the original dataset regression model for the i-th sample in test set j.
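As an illustration of the per-fold error, the sketch below assumes one common form of relative mean absolute error, namely the mean of $|y_i - \hat{y}_i| / |y_i|$ over the n samples of a test set. This form is an assumption chosen because it uses only the symbols defined above; the exact formula of the method may differ. The function name `relative_mae` is hypothetical.

```python
def relative_mae(y_true, y_pred):
    """Assumed RMAE: mean of |y_i - yhat_i| / |y_i| over one test set.

    Requires non-zero true values; the exact formula of the method may differ.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    return sum(abs(y - p) / abs(y) for y, p in zip(y_true, y_pred)) / len(y_true)
```

The same function serves both the sub-dataset regression models (with routed predictions) and the original dataset regression model, since only the source of the predicted values changes.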

Illustratively, the proxy performance of the current data partitioning scheme and the proxy performance of the original dataset are determined, based on the errors corresponding to the plurality of sub-dataset regression models and the errors corresponding to the original dataset regression model respectively, according to formula (2), where split is the proxy performance of the current data partitioning scheme.

It should be noted that the proxy performance of the original dataset is obtained from the same formula by replacing split with the proxy performance of the original dataset and $E_j$ with the relative mean absolute error of the original dataset regression model on test set j.

Illustratively, a random forest classifier is built based on a classification regression tree model.

Specifically, the principle of the classification and regression tree (CART) model is as follows:

Input: the training data set;

Output: a classification regression tree f(x);

In the input space of the training data set, each region is recursively divided into two sub-regions and the output value on each sub-region is determined, thereby constructing a binary decision tree:

1) Select the optimal splitting variable j and split point s by solving:

$$\min_{j,s}\left[\min_{c_1}\sum_{x_i\in R_1(j,s)}(y_i-c_1)^2+\min_{c_2}\sum_{x_i\in R_2(j,s)}(y_i-c_2)^2\right]$$

Traverse the variables j; for each fixed splitting variable j, scan the split points s and select the pair (j, s) that minimizes the above expression;

2) Divide the region using the selected pair (j, s) and determine the corresponding output values:

$$R_1(j,s)=\{x\mid x^{(j)}\le s\},\quad R_2(j,s)=\{x\mid x^{(j)}>s\}$$

$$\hat{c}_m=\frac{1}{N_m}\sum_{x_i\in R_m(j,s)}y_i,\quad x\in R_m,\ m=1,2$$

where $N_m$ is the number of training samples falling in region $R_m$;

3) Continue to apply steps 1) and 2) to the two sub-regions until a stopping condition is met;

4) Divide the input space into M regions $R_1, R_2, \ldots, R_M$ and generate the decision tree:

$$f(x)=\sum_{m=1}^{M}\hat{c}_m\,I(x\in R_m)$$
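Steps 1) to 4) can be sketched as a small least-squares regression tree. This is a minimal sketch under stated assumptions, not the implementation used by the method: the names `sse`, `best_split`, `build_tree` and `predict` are hypothetical, and the stopping condition (a depth limit, a minimum sample count, or a pure node) is chosen only for illustration.

```python
def sse(values):
    """Sum of squared deviations from the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_split(X, y):
    """Step 1: scan every variable j and split point s; return (loss, j, s) or None."""
    best = None
    for j in range(len(X[0])):
        # The largest value is skipped because it would leave region R2 empty.
        for s in sorted({row[j] for row in X})[:-1]:
            left = [t for row, t in zip(X, y) if row[j] <= s]
            right = [t for row, t in zip(X, y) if row[j] > s]
            loss = sse(left) + sse(right)
            if best is None or loss < best[0]:
                best = (loss, j, s)
    return best

def build_tree(X, y, max_depth=3, min_samples=2):
    """Steps 2-4: split recursively; a leaf stores the mean target of its region."""
    split = best_split(X, y) if max_depth > 0 and len(y) >= min_samples else None
    if split is None or sse(y) == 0.0:  # assumed stopping condition
        return sum(y) / len(y)
    _, j, s = split
    left = [(row, t) for row, t in zip(X, y) if row[j] <= s]
    right = [(row, t) for row, t in zip(X, y) if row[j] > s]
    return (j, s,
            build_tree([r for r, _ in left], [t for _, t in left], max_depth - 1, min_samples),
            build_tree([r for r, _ in right], [t for _, t in right], max_depth - 1, min_samples))

def predict(tree, x):
    """Walk from the root to the leaf whose region contains x."""
    while isinstance(tree, tuple):
        j, s, left, right = tree
        tree = left if x[j] <= s else right
    return tree
```

A random forest classifier would combine many such trees grown on resampled data with a classification impurity criterion instead of squared error; that part is not shown here.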

Illustratively, the Gaussian mixture model (GMM) in step S101 is a linear combination of a plurality of Gaussian distribution functions, given by:

$$p(x)=\sum_{k=1}^{K}\pi_k\,\mathcal{N}(x\mid\mu_k,\Sigma_k)$$

where $(\mu_k,\Sigma_k)$ are the parameters of the k-th Gaussian distribution function and $\pi_k$ is the probability that the current point is assigned to the k-th class. The core idea of the GMM algorithm is to adjust the parameter combination $(\pi_k,\mu_k,\Sigma_k)$ so that the likelihood of the GMM on the current data set is maximized, the likelihood being calculated as:

$$p(X\mid\pi,\mu,\Sigma)=\prod_{n=1}^{N}\sum_{k=1}^{K}\pi_k\,\mathcal{N}(x_n\mid\mu_k,\Sigma_k)$$

Solving the GMM involves the expectation-maximization (EM) algorithm, which is divided into two steps: the first step computes a rough value of the parameters to be estimated, and the second step maximizes the likelihood function using the parameter estimates from the first step. An intermediate latent variable $\gamma(z_{nk})$ is introduced, representing the posterior probability that the n-th point $x_n$ belongs to the k-th class:

$$\gamma(z_{nk})=\frac{\pi_k\,\mathcal{N}(x_n\mid\mu_k,\Sigma_k)}{\sum_{j=1}^{K}\pi_j\,\mathcal{N}(x_n\mid\mu_j,\Sigma_j)}$$

In the maximization step (M-step) of the EM algorithm, the partial derivatives of the likelihood with respect to the parameters $(\pi,\mu,\Sigma)$ are set to 0, which yields:

$$\mu_k=\frac{1}{N_k}\sum_{n=1}^{N}\gamma(z_{nk})\,x_n$$

$$\Sigma_k=\frac{1}{N_k}\sum_{n=1}^{N}\gamma(z_{nk})\,(x_n-\mu_k)(x_n-\mu_k)^{T}$$

$$\pi_k=\frac{N_k}{N}$$

where:

$$N_k=\sum_{n=1}^{N}\gamma(z_{nk})$$

The log-likelihood function of the GMM is then recalculated with the updated parameters $(\pi,\mu,\Sigma)$, namely:

$$\ln p(X\mid\pi,\mu,\Sigma)=\sum_{n=1}^{N}\ln\left\{\sum_{k=1}^{K}\pi_k\,\mathcal{N}(x_n\mid\mu_k,\Sigma_k)\right\}$$

Whether the parameters $(\pi,\mu,\Sigma)$ or the log-likelihood function have converged is then checked; if not, the iterative process is repeated. In this way, the iterative refinement of the GMM yields a mixture-distribution statistical model of the current mixed data set, and the training-set samples are classified based on that model.
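The EM iteration described above can be sketched for the one-dimensional case, where each covariance reduces to a scalar variance. This is a simplified illustration, not the method's implementation: the function names, the initialisation (equal weights, overall data variance, caller-supplied initial means) and the variance floor `min_var` are all assumptions.

```python
import math

def normal_pdf(x, mu, var):
    """Density of a 1-D Gaussian with mean mu and variance var."""
    return math.exp(-((x - mu) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def fit_gmm_1d(xs, init_means, max_iter=200, tol=1e-9, min_var=1e-6):
    """Fit a 1-D Gaussian mixture by EM; returns (weights, means, variances, labels)."""
    k, n = len(init_means), len(xs)
    means = list(init_means)
    overall_mean = sum(xs) / n
    overall_var = sum((x - overall_mean) ** 2 for x in xs) / n
    variances = [max(overall_var, min_var)] * k
    weights = [1.0 / k] * k
    prev_ll = -math.inf
    for _ in range(max_iter):
        # E-step: responsibilities gamma(z_nk) and the log-likelihood
        # under the current parameters.
        resp, log_lik = [], 0.0
        for x in xs:
            joint = [weights[j] * normal_pdf(x, means[j], variances[j]) for j in range(k)]
            total = sum(joint)
            log_lik += math.log(total)
            resp.append([p / total for p in joint])
        if abs(log_lik - prev_ll) < tol:  # convergence check on the log-likelihood
            break
        prev_ll = log_lik
        # M-step: closed-form updates of mu_k, Sigma_k (a variance here) and pi_k.
        for j in range(k):
            n_j = max(sum(r[j] for r in resp), 1e-12)  # guard against an empty component
            means[j] = sum(r[j] * x for r, x in zip(resp, xs)) / n_j
            var_j = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, xs)) / n_j
            variances[j] = max(var_j, min_var)
            weights[j] = n_j / n
    # Assign each sample to the component with the largest responsibility.
    labels = [max(range(k), key=lambda j: r[j]) for r in resp]
    return weights, means, variances, labels
```

The returned `labels` play the role of the subclass assignment that divides the original data set into sub-data sets.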

Illustratively, the gradient boosting decision tree (GBDT) algorithm of step S103 is an iterative decision tree algorithm. The algorithm is an additive combination of a series of regression trees (CART): each subsequent tree fits the "residual" between the prediction of the preceding trees and the target, and the results of all trees are accumulated to obtain the final answer. The principle of the GBDT algorithm is as follows, where L denotes the loss function:

1) Initialize the weak learner:

$$f_0(x)=\arg\min_{c}\sum_{i=1}^{N}L(y_i,c)$$

2) For m = 1, 2, ..., M:

(a) For each sample i = 1, 2, ..., N, compute the negative gradient, i.e. the residual:

$$r_{mi}=-\left[\frac{\partial L(y_i,f(x_i))}{\partial f(x_i)}\right]_{f(x)=f_{m-1}(x)}$$

(b) Take the residuals obtained in the previous step as the new true values of the samples, and use the data $(x_i, r_{mi})$, i = 1, 2, ..., N, as training data for the next tree to obtain a new regression tree $f_m(x)$, whose leaf node regions are $R_{jm}$, j = 1, 2, ..., J, where J is the number of leaf nodes of the regression tree.

(c) For each leaf region j = 1, 2, ..., J, compute the best fit value:

$$c_{jm}=\arg\min_{c}\sum_{x_i\in R_{jm}}L(y_i,f_{m-1}(x_i)+c)$$

(d) Update the strong learner:

$$f_m(x)=f_{m-1}(x)+\sum_{j=1}^{J}c_{jm}\,I(x\in R_{jm})$$

3) Obtain the final learner:

$$f(x)=f_M(x)=f_0(x)+\sum_{m=1}^{M}\sum_{j=1}^{J}c_{jm}\,I(x\in R_{jm})$$
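The boosting loop above can be sketched for the squared-error loss, where the negative gradient is the plain residual and the best fit value of a leaf is the mean residual in that leaf. This is a minimal sketch under stated assumptions: each tree is a one-split stump on a single feature, the function names are hypothetical, and `learning_rate` is a shrinkage factor that is not part of the steps above (set it to 1.0 to follow them literally).

```python
def fit_stump(xs, residuals):
    """Fit a two-leaf regression tree to the residuals by least squares."""
    best = None
    for s in sorted(set(xs))[:-1]:  # the largest value would leave the right leaf empty
        left = [r for x, r in zip(xs, residuals) if x <= s]
        right = [r for x, r in zip(xs, residuals) if x > s]
        c_left, c_right = sum(left) / len(left), sum(right) / len(right)
        loss = sum((r - c_left) ** 2 for r in left) + sum((r - c_right) ** 2 for r in right)
        if best is None or loss < best[0]:
            best = (loss, s, c_left, c_right)
    _, s, c_left, c_right = best
    return lambda x: c_left if x <= s else c_right

def fit_gbdt(xs, ys, n_trees=20, learning_rate=0.5):
    """Gradient boosting with squared loss; xs needs at least two distinct values."""
    f0 = sum(ys) / len(ys)              # step 1: constant that minimises squared loss
    trees, preds = [], [f0] * len(xs)
    for _ in range(n_trees):            # step 2
        residuals = [y - p for y, p in zip(ys, preds)]   # (a) negative gradient
        tree = fit_stump(xs, residuals)                  # (b) and (c) leaf means
        trees.append(tree)
        preds = [p + learning_rate * tree(x) for p, x in zip(preds, xs)]  # (d)
    # step 3: the final learner is the initial constant plus all (shrunken) trees
    return lambda x: f0 + learning_rate * sum(t(x) for t in trees)
```

With a shrinkage factor of 0.5, each round halves the remaining residual on a cleanly separable step function, which shows how the accumulated trees approach the target.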

To help those skilled in the art better understand the above embodiments, a specific example is described below.

As shown in fig. 4, a data hierarchical classification method applied to hull line design comprises the following steps:

pre-segmentation of the original dataset: according to the parameter configured by the user, namely the sample classification number n, a GMM is used to cluster and stratify the original data set D, dividing it into n sub-data sets $(D_1, D_2, \ldots, D_n)$ to obtain the current data segmentation scheme, wherein the original data set is an industrial data set in hull line design;

classifying the training sub-data set: subclass labels are added to the n sub-data sets $(D_1, D_2, \ldots, D_n)$ respectively to obtain the training data set $(D_1, \text{label}=1), (D_2, \text{label}=2), \ldots, (D_n, \text{label}=n)$, and a random forest (RF) classifier is trained on the training data set to obtain the subclass classifier;

verification of the data segmentation scheme: a GBDT-based regression algorithm model is used to train predictors (estimators) as proxies for the sub-data sets, yielding n sub-dataset regression models, each corresponding to one sub-data set; base classifier training, i.e. base_estimator training, is performed as a proxy for the unsplit (no-split) original data set, i.e. the full data set, yielding the original dataset regression model; the current segmentation scheme is tested by K-fold cross-validation to obtain its proxy performance split_perf, and the proxy performance baseline_perf of the original scheme, i.e. the original data set, is obtained by the same K-fold cross-validation;

selecting a final data segmentation scheme: determining whether split_perf > baseline_perf is satisfied; if so, the current segmentation scheme is valid and the sub-data sets $(D_1, D_2, \ldots, D_n)$ are output; if not, the current segmentation scheme is invalid and the original data set D is output.
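The final selection step can be sketched as follows. The aggregation formula for the proxy performance is not reproduced in this text, so the sketch assumes the negated mean of the per-fold errors, which makes a larger value better and is therefore consistent with the comparison split_perf > baseline_perf. Both that aggregation and the function names are assumptions made for illustration.

```python
def proxy_performance(fold_errors):
    """Assumed aggregation: negated mean fold error, so a larger value is better."""
    return -sum(fold_errors) / len(fold_errors)

def select_scheme(split_errors, baseline_errors, sub_datasets, original_dataset):
    """Keep the segmentation only if it beats the unsplit baseline."""
    split_perf = proxy_performance(split_errors)
    baseline_perf = proxy_performance(baseline_errors)
    if split_perf > baseline_perf:
        return "split", sub_datasets          # current segmentation scheme is valid
    return "no-split", [original_dataset]     # fall back to the original data set
```

Here `split_errors` and `baseline_errors` stand for the K per-fold errors of the sub-dataset regression models and of the original dataset regression model, obtained from the same K-fold cross-validation.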

The data hierarchical classification method shown in fig. 4 was tested; the original data set and the experimental results are as follows:

1) Original data set description: a test data set containing 2000 samples is selected for verification, in which the design parameters are $x_1, x_2, x_3$ and the target parameter is y.

2) Parameter setting: the number of sub-data sets to be segmented is set to 2, i.e. the sample classification number n is set to 2, meaning that the original data set is to be split into 2 sub-data sets; the number of folds for K-fold cross-validation is set to 10; and hyperparameter optimization of the model is enabled so that the independent per-segment modeling uses a more accurate GBDT tree model, allowing a more objective judgment of whether the current data hierarchical classification operation effectively improves modeling accuracy.

3) Evaluation index: the relative mean absolute error (RMAE) is selected as the index for evaluating model performance. It is computed from the true value $y_i$ and the corresponding predicted value $\hat{y}_i$ of each sample i = 1, 2, ..., n, where n is the number of samples. The smaller the RMAE, the higher the accuracy of the model.

4) Experimental results: the operating logic of the data hierarchical classification method is intuitive; for the test data set, with the hyperparameter optimization function not yet enabled, the performance improvement shown in Table 1 is obtained:

Table 1 Regression model errors based on the data hierarchical classification method

Type	RMAE value
baseline_estimator	9.81%
estimators (n = 2)	3.64%

That is, on the test data set, with n = 2 and the same cross-validation procedure giving a conservative estimate of model accuracy, merely introducing the data hierarchical classification method, without increasing the data scale or changing the machine learning algorithm, greatly improves model performance: the original estimation error of nearly 10% is reduced to 3.64%. A comparison of the predicted and true values of the two models is shown in fig. 5, where the basic scheme refers to the scheme obtained by base classifier training on the original data set, and the segmentation scheme refers to the current segmentation scheme.

When the sample classification number n is modified and the preferred segmentation function of the hierarchical classification is enabled, the hierarchical classification method recommends a segmentation scheme with n = 3, giving the performance statistics shown in Table 2:

Table 2 Regression model errors based on the data hierarchical classification method after enabling the preferred segmentation function

Type	RMAE value
baseline_estimator	9.81%
estimators (n = 3)	2.79%

Compared with the segmentation scheme specified by the user, the intelligent segmentation function of the data hierarchical classification method provides a further performance improvement. A comparison of the errors of the three regression models is shown in fig. 6, wherein the basic model is the regression model trained on the original data set without segmentation; the user-specified data segmentation model is the regression model trained after the original data set is segmented according to the scheme specified by the user; and the intelligent segmentation model is the regression model trained after the original data set is segmented intelligently by the data hierarchical classification method.
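The relative improvement implied by Tables 1 and 2 can be checked with a few lines of arithmetic. The helper name `relative_reduction` is hypothetical; the inputs are the RMAE values reported in the two tables.

```python
def relative_reduction(baseline_error, new_error):
    """Fraction of the baseline error removed by the new scheme."""
    return (baseline_error - new_error) / baseline_error

baseline = 9.81          # baseline_estimator RMAE, in percent
user_split = 3.64        # estimators (n = 2), Table 1
preferred_split = 2.79   # estimators (n = 3), Table 2

reduction_n2 = relative_reduction(baseline, user_split)
reduction_n3 = relative_reduction(baseline, preferred_split)
print(f"n = 2: {reduction_n2:.1%}, n = 3: {reduction_n3:.1%}")
```

The two reductions come out at roughly 62.9% and 71.6% of the baseline error, respectively.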

The visualization of the test data set in fig. 7 also shows that multiple sub-class modes clearly exist in the test data set, and that the data hierarchical classification method successfully improves the accuracy of data modeling by applying "divide and conquer" to the two sub-class modes.

Another embodiment of the present disclosure relates to a data hierarchical classification device applied to hull line design, as shown in fig. 8, comprising:

a pre-segmentation module 801, configured to pre-segment an original data set: according to the sample classification number designated by a user, clustering and layering processing is carried out on an original data set by adopting a Gaussian mixture model, and the original data set is divided into a plurality of sub data sets corresponding to the sample classification number, so that a current data segmentation scheme is obtained, wherein the original data set is an industrial data set in a ship profile design;

a classification training module 802, configured to perform classification training on the sub-data sets: adding a subclass label to each sub-data set to obtain a training data set, and training a random forest classifier based on the training data set to obtain a subclass classifier;

a verification module 803, configured to verify the data segmentation scheme: carrying out regression training on an original data set and a plurality of sub-data sets by using a regression algorithm model based on a gradient lifting decision tree to obtain an original data set regression model and a plurality of sub-data set regression models, wherein each sub-data set regression model corresponds to one sub-data set respectively, and determining the proxy performance of the current data segmentation scheme and the proxy performance of the original data set respectively by combining a sub-class classifier and cross verification;

a selection module 804, configured to select a final data segmentation scheme: and evaluating the current data segmentation scheme based on the proxy performance of the current data segmentation scheme and the proxy performance of the original data set, and determining a final data segmentation result according to the evaluation result.

The specific implementation method of the data hierarchical classification device provided by the embodiment of the present disclosure may be described with reference to the data hierarchical classification method provided by the embodiment of the present disclosure, which is not described herein again.

Another embodiment of the present disclosure relates to an electronic device, as shown in fig. 9, comprising:

at least one processor 901; and,

a memory 902 communicatively coupled to the at least one processor 901; wherein,

the memory 902 stores instructions executable by the at least one processor 901 to enable the at least one processor 901 to perform the data hierarchical classification method described in the above embodiments.

Where the memory and the processor are connected by a bus, the bus may comprise any number of interconnected buses and bridges, the buses connecting the various circuits of the one or more processors and the memory together. The bus may also connect various other circuits such as peripherals, voltage regulators, and power management circuits, which are well known in the art, and therefore, will not be described any further herein. The bus interface provides an interface between the bus and the transceiver. The transceiver may be one element or may be a plurality of elements, such as a plurality of receivers and transmitters, providing a means for communicating with various other apparatus over a transmission medium. The data processed by the processor is transmitted over the wireless medium via the antenna, which further receives the data and transmits the data to the processor.

The processor is responsible for managing the bus and general processing, and may also provide various functions including timing, peripheral interfaces, voltage regulation, power management and other control functions. The memory may be used to store data used by the processor in performing operations.

Another embodiment of the present disclosure relates to a computer readable storage medium storing a computer program which, when executed by a processor, implements the data hierarchical classification method described in the above embodiment.

That is, it will be understood by those skilled in the art that all or part of the steps of the method described in the above embodiments may be implemented by a program stored in a storage medium, including several instructions for causing a device (which may be a single-chip microcomputer, a chip or the like) or a processor (processor) to perform all or part of the steps of the method described in the various embodiments of the disclosure. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a ROM (Read-Only Memory), a RAM (Random Access Memory), a magnetic disk, an optical disk, or other various media capable of storing program codes.

It will be understood by those of ordinary skill in the art that the foregoing embodiments are specific embodiments for carrying out the present disclosure, and that various changes in form and details may be made therein without departing from the spirit and scope of the present disclosure.

Claims

1. The data layering classification method is applied to ship profile design and is characterized by comprising the following steps of:

pre-segmentation of the original dataset: according to the sample classification number specified by a user, clustering and layering processing is carried out on an original data set by adopting a Gaussian mixture model, and the original data set is divided into a plurality of sub-data sets corresponding to the sample classification number, so that a current data segmentation scheme is obtained, wherein the original data set is an industrial data set in a ship profile design;

classifying the training sub-data set: adding sub-class labels to each sub-data set respectively to obtain a training data set, and training the random forest classifier based on the training data set to obtain a sub-class classifier;

verification of the data segmentation scheme: performing regression training on the original data set and the plurality of sub-data sets by using a regression algorithm model based on a gradient lifting decision tree to obtain an original data set regression model and a plurality of sub-data set regression models, wherein each sub-data set regression model corresponds to one sub-data set, and the agent performance of the current data segmentation scheme and the agent performance of the original data set are respectively determined by combining the sub-class classifier and cross verification;

Selecting a final data segmentation scheme: evaluating the current data segmentation scheme based on the proxy performance of the current data segmentation scheme and the proxy performance of the original data set, and determining a final data segmentation result according to the evaluation result;

the cross-validation includes K-fold cross-validation, and after obtaining an original dataset regression model and a plurality of sub dataset regression models, the combining the sub-class classifier and the cross-validation respectively determines a proxy performance of the current data segmentation scheme and a proxy performance of the original dataset, including:

randomly dividing the original data set into K original data subsets, taking one of the original data subsets as a test set in turn, taking the corresponding rest of the original data subsets as a training set, training and testing the multiple sub-data set regression models and the original data set regression models based on the training set, the test set and the subclass classifier, and respectively obtaining errors corresponding to the multiple sub-data set regression models and errors corresponding to the original data set regression models;

determining the proxy performance of the current data segmentation scheme and the proxy performance of the original data set respectively based on errors corresponding to the multiple sub-data set regression models and errors corresponding to the original data set regression models;

Training and testing the multiple sub-dataset regression models and the original dataset regression model based on the training set, the testing set and the sub-category classifier to respectively obtain errors corresponding to the multiple sub-dataset regression models and errors corresponding to the original dataset regression models, wherein the training and testing comprises the following steps:

training the plurality of sub-dataset regression models based on the training set;

determining errors corresponding to the multiple sub-dataset regression models based on the true values of the samples in the test set and the predicted values corresponding to the true values;

the determining the errors corresponding to the multiple sub-dataset regression models based on the true values of the samples in the test set and the predicted values corresponding to the true values comprises:

Determining errors corresponding to the multiple sub-dataset regression models according to the following equation (1):

where j = 1, 2, …, K is the index of the test set, $E_j$ is the relative mean absolute error of the plurality of sub-dataset regression models on test set j, i = 1, 2, …, n is the sample index in test set j, n is the number of samples in test set j, $y_i$ is the true value of the i-th sample in test set j, and $\hat{y}_i$ is the predicted value corresponding to the i-th sample in test set j;

the determining the proxy performance of the current data segmentation scheme and the proxy performance of the original data set based on the errors corresponding to the multiple sub-data set regression models and the errors corresponding to the original data set regression models respectively includes:

wherein split is the proxy performance of the current data partitioning scheme.

2. The method of claim 1, wherein training and testing the plurality of sub-dataset regression models and the original dataset regression model based on the training set, the testing set, and the sub-class classifier, respectively, results in errors corresponding to the plurality of sub-dataset regression models and errors corresponding to the original dataset regression model, further comprising:

Training the original dataset regression model based on the training set;

and determining errors corresponding to the regression model of the original dataset based on the true values of the samples in the test set and the corresponding predicted values thereof.

3. A method according to claim 1 or 2, wherein the random forest classifier is built based on a classification regression tree model.

4. A data hierarchical classification device for use in hull form design, the device comprising:

the front segmentation module is used for front segmentation of the original data set: according to the sample classification number specified by a user, clustering and layering processing is carried out on an original data set by adopting a Gaussian mixture model, and the original data set is divided into a plurality of sub-data sets corresponding to the sample classification number, so that a current data segmentation scheme is obtained, wherein the original data set is an industrial data set in a ship profile design;

the classification training module is used for classifying and training the sub-data set: adding sub-class labels to each sub-data set respectively to obtain a training data set, and training the random forest classifier based on the training data set to obtain a sub-class classifier;

The verification module is used for verifying the data segmentation scheme: performing regression training on the original data set and the plurality of sub-data sets by using a regression algorithm model based on a gradient lifting decision tree to obtain an original data set regression model and a plurality of sub-data set regression models, wherein each sub-data set regression model corresponds to one sub-data set, and the agent performance of the current data segmentation scheme and the agent performance of the original data set are respectively determined by combining the sub-class classifier and cross verification;

a selection module, configured to select a final data segmentation scheme: evaluating the current data segmentation scheme based on the proxy performance of the current data segmentation scheme and the proxy performance of the original data set, and determining a final data segmentation result according to the evaluation result;

the cross-validation includes K-fold cross-validation, after obtaining an original dataset regression model and a plurality of sub dataset regression models, the validation module is configured to combine the sub-class classifier and the cross-validation to determine a proxy performance of the current data segmentation scheme and a proxy performance of the original dataset, respectively, including:

The verification module is used for dividing the original data set into K original data subsets at random, taking one of the original data subsets as a test set in turn, taking the corresponding other original data subsets as a training set, training and testing the multiple sub-data set regression models and the original data set regression model based on the training set, the test set and the subclass classifier, and respectively obtaining errors corresponding to the multiple sub-data set regression models and errors corresponding to the original data set regression model; determining the proxy performance of the current data segmentation scheme and the proxy performance of the original data set respectively based on errors corresponding to the multiple sub-data set regression models and errors corresponding to the original data set regression models;

the verification module is configured to train and test the multiple sub-dataset regression models and the original dataset regression model based on the training set, the test set and the sub-category classifier, to obtain errors corresponding to the multiple sub-dataset regression models and errors corresponding to the original dataset regression models, respectively, and includes:

the verification module is used for training the multiple sub-dataset regression models based on the training set; judging the subclass category of each sample in the test set based on the subclass classifier, determining a sub-data set regression model corresponding to each sample from the trained multiple sub-data set regression models based on the judged subclass category, and respectively inputting each sample in the test set into the corresponding sub-data set regression model to obtain a predicted value corresponding to each sample; determining errors corresponding to the multiple sub-dataset regression models based on the true values of the samples in the test set and the predicted values corresponding to the true values;

The verification module is configured to determine errors corresponding to the multiple sub-dataset regression models based on the true values of the samples in the test set and the predicted values corresponding to the true values, and includes:

the verification module is configured to determine errors corresponding to the regression models of the multiple sub-data sets according to the following formula (1):

the verification module is configured to determine, based on the errors corresponding to the multiple sub-dataset regression models and the errors corresponding to the original dataset regression models, a proxy performance of the current data segmentation scheme and a proxy performance of the original dataset, respectively, including:

the verification module is configured to determine the proxy performance of the current data partitioning scheme according to the following formula (2):

wherein split is the proxy performance of the current data partitioning scheme.

5. An electronic device, comprising:

at least one processor; and,

a memory communicatively coupled to the at least one processor; wherein,

the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the data hierarchical classification method of any one of claims 1 to 3.

6. A computer readable storage medium storing a computer program, wherein the computer program when executed by a processor implements the data hierarchical classification method of any one of claims 1 to 3.