CN115600121A - Data hierarchical classification method and device, electronic equipment and storage medium - Google Patents


Info

Publication number
CN115600121A
CN115600121A
Authority
CN
China
Prior art keywords
data set
original data
training
regression
data
Prior art date
Legal status
Granted
Application number
CN202210446117.7A
Other languages
Chinese (zh)
Other versions
CN115600121B (en)
Inventor
张明
张儒
郭震
金云峰
孙自飞
甘雨
路明标
姜栋
Current Assignee
Nanjing Tianfu Software Co ltd
Original Assignee
Nanjing Tianfu Software Co ltd
Priority date
Filing date
Publication date
Application filed by Nanjing Tianfu Software Co ltd
Priority to CN202210446117.7A
Publication of CN115600121A
Application granted
Publication of CN115600121B
Legal status: Active
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F30/00Computer-aided design [CAD]
    • G06F30/10Geometric CAD
    • G06F30/15Vehicle, aircraft or watercraft design
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F30/00Computer-aided design [CAD]
    • G06F30/20Design optimisation, verification or simulation
    • G06F30/27Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM] or training a model

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Geometry (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Hardware Design (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Computational Mathematics (AREA)
  • Aviation & Aerospace Engineering (AREA)
  • Automation & Control Theory (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The disclosure relates to the technical field of data processing in hull profile design, and provides a data hierarchical classification method and device, an electronic device, and a storage medium, applied to hull profile design. The method comprises the following steps: S101, pre-segmenting an original data set; S102, classifying and training the sub-data sets; S103, verifying the data segmentation scheme; S104, selecting a final data segmentation scheme. Given the objective limitation of data scale in industrial design, and aiming at the problem that an industrial design data set contains multiple mixed modes or has poor internal consistency, the data set is preprocessed with a data hierarchical classification method, applied for the first time to hull line design driven by an industrial data set. By mining the multiple mixed modes inside the sample training set, the front-placed layering operation purifies the quality of the data set, improves the accuracy of data modeling, and raises designers' utilization rate of accumulated hull form data; the method has a wide application range and effectively assists the intelligent design of hull lines.

Description

Data hierarchical classification method and device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of data processing technologies in hull profile design, and in particular, to a data hierarchical classification method and apparatus, an electronic device, and a storage medium.
Background
Traditional ship design is, in essence, empirical design: the final decision result depends to a great extent on the subjective experience and knowledge structure of the decision maker. The expert consulting (Delphi) method takes the subjective judgment of experts as the decision basis and uses scores, indexes, ordinal numbers, comments, and the like as evaluation criteria; it is a simple method lacking theory and systematicness, and the objective authenticity of its evaluation results is difficult to guarantee. The Analytic Hierarchy Process (AHP) is used to study multi-objective decision problems with more complex structures and can quantify qualitative problems, making the evaluation results more scientific and reasonable. By comparing the evaluation indexes pairwise, AHP obtains a judgment matrix reflecting the relative importance of each attribute, so its reliability is high and its error is small; its defect is that, limited by the decision maker's knowledge structure, personal preference, judgment level, and so on, the judgment matrix easily fails to meet the consistency requirement.
With the advent and development of advanced intelligent technology, scientific and reasonable decision-making modes have been introduced into the design process. The concept of the Decision Support System (DSS) pushed decision theory to a new stage of development; it has achieved great success in fields such as system engineering and management science, and is often used to solve decision problems in semi-structured and unstructured complex information systems. In recent years, the emergence of advanced intelligent technologies such as Online Analytical Processing (OLAP) and Data Mining (DM) based on the Data Warehouse (DW) has opened up a new approach for the development of DSS.
A ship form decision support system comprises a database and database management module, a model base and model base management module, a knowledge base and knowledge base management module, a data warehouse and data warehouse management module, a data mining module, a knowledge discovery module, a human-machine interaction module, and the like. The data mining and knowledge discovery modules are responsible for querying, analyzing, mining, selecting, and evaluating data, and for extracting the decision information hidden in the data by means of intelligent technologies such as genetic algorithms, neural networks, statistical analysis, machine learning, and fuzzy decision-making.
Such intelligent technology is mostly data-driven, and how to use the ship form data accumulated by an enterprise to provide efficient references for designers is the main research content of data mining. In the prior art, a ship form highly relevant to the design requirement is selected from the accumulated data as the parent hull (mother form) to guide the ship form design; however, the utilization rate of the ship form data is extremely low, only excellent ship forms highly relevant to the design requirement are used, and the interrelations among the selected ship forms are not considered.
The introduction of an agent model training technology based on an artificial intelligence technology is one of the key technologies for solving the problems. Considering that the ship model test or the actual measurement data is limited, the training sample of the agent model may be a simulation data sample provided by a Computational Fluid Dynamics (CFD) solution tool, and the test or actual measurement data may be used to correct the CFD solution model or the boundary condition. Through the technology, most data in the ship type database can be utilized, so that a designer is guided to carry out ship type design, and the utilization rate of the ship type data is greatly improved. Meanwhile, the evaluation time of the agent model is far shorter than the CFD simulation calculation time, and the engineering design period can be greatly shortened by using the agent model.
The agent (surrogate) model training technology based on artificial intelligence can effectively solve the problems of low data utilization rate and long design period, but the training and use of the agent model also face some problems. For example, the limited number of data training samples and the inconsistency of sample point classes increase the difficulty of improving the training precision of the agent model. This is particularly true in data-driven learning on industrial data sets: in the front-end 'design segment' of the industrial process, the data exhibit high single-point value density and small data set scale. Learning algorithms such as machine learning then generally suffer from data hunger (data-hungry) and the curse of dimensionality (dimension-curse): the stronger the nonlinear expression capability of a model, the higher its requirements on the scale and diversity of the training data; conversely, an algorithm model with only ordinary nonlinear expression capability cannot effectively extract the complex mapping modes in the training data set, and its effect is too weak to support the relevant applications.
Due to objective limitation of the scale of the data set in the industrial design stage, complex models such as deep learning cannot be effectively activated and used, only machine learning algorithms with stronger statistical attributes can be used, but the nonlinear expression capability of the machine learning algorithms is limited, and particularly under the condition that the data set in the industrial design stage has various mixed modes or the consistency in the data set is poor, the modeling effect of the learning algorithms is further weakened.
Disclosure of Invention
The present disclosure is directed to at least one of the problems in the prior art, and provides a data hierarchical classification method and apparatus, an electronic device, and a storage medium.
One aspect of the present disclosure provides a data hierarchical classification method, which is applied to hull contour design, and includes the following steps:
pre-segmentation of the original data set: according to the sample classification number specified by a user, carrying out clustering hierarchical processing on an original data set by adopting a Gaussian Mixture Model (GMM), and segmenting the original data set into a plurality of subdata sets corresponding to the sample classification number to obtain a current data segmentation scheme, wherein the original data set is an industrial data set in hull molded line design;
classifying the training subdata set: respectively adding a subclass label to each subdata set to obtain a training data set, and training a Random Forest (RF) classifier based on the training data set to obtain a subclass classifier;
verifying the data partitioning scheme: performing regression training on an original data set and a plurality of sub data sets respectively by using a regression algorithm model based on a Gradient Boosting Decision Tree (GBDT) to obtain an original data set regression model and a plurality of sub data set regression models, wherein each sub data set regression model corresponds to one sub data set respectively, and determining the proxy performance of the current data segmentation scheme and the proxy performance of the original data set respectively by combining a subclass classifier and cross validation;
selecting a final data partitioning scheme: and evaluating the current data segmentation scheme based on the proxy performance of the current data segmentation scheme and the proxy performance of the original data set, and determining a final data segmentation result according to the evaluation result.
Optionally, the cross validation includes K-fold cross validation, and after obtaining the regression model of the original data set and the regression models of the multiple sub data sets, determining the proxy performance of the current data partitioning scheme and the proxy performance of the original data set respectively by combining the subclass classifier and the cross validation, including:
dividing an original data set into K original data subsets randomly and equally, taking one original data subset as a test set in turn and taking the other corresponding original data subsets as a training set, training and testing a plurality of sub data set regression models and an original data set regression model based on the training set, the test set and a subclass classifier, and respectively obtaining errors corresponding to the plurality of sub data set regression models and errors corresponding to the original data set regression model;
and respectively determining the proxy performance of the current data segmentation scheme and the proxy performance of the original data set based on the errors corresponding to the multiple sub data set regression models and the errors corresponding to the original data set regression model.
Optionally, training and testing the multiple regression sub-data sets and the regression original data set model based on the training set, the testing set, and the subclass classifier to obtain errors corresponding to the multiple regression sub-data sets and errors corresponding to the regression original data set, respectively, including:
training a plurality of subdata set regression models based on the training set;
judging the subclass category of each sample in the test set based on a subclass classifier, determining a subdata set regression model corresponding to each sample from a plurality of trained subdata set regression models based on the judged subclass category, and inputting each sample in the test set into the corresponding subdata set regression model respectively to obtain a predicted value corresponding to each sample;
and determining errors corresponding to the regression models of the plurality of subdata sets based on the real values and the corresponding predicted values of the samples in the test set.
Optionally, determining errors corresponding to the regression models of the multiple sub data sets based on the true values and the predicted values corresponding to the true values of the samples in the test set, including:
determining errors corresponding to the regression models of the plurality of subdata sets according to the following formula (1):
E_j = (1/n) · Σ_{i=1}^{n} |y_i − ŷ_i| / |y_i|        (1)

wherein j = 1, 2, ..., K is the test set number, E_j is the Relative Mean Absolute Error (RMAE) of the plurality of sub data set regression models on test set j, i = 1, 2, ..., n is the sample number in test set j, y_i is the true value of the i-th sample in test set j, and ŷ_i is the predicted value corresponding to the i-th sample in test set j.
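As a numerical check, the RMAE of formula (1) can be evaluated directly. The sketch below assumes the error is the mean absolute deviation normalized by the true-value magnitude, and the values are illustrative only (NumPy assumed):

```python
import numpy as np

# y_i: true values of the samples in test set j (illustrative values).
y_true = np.array([2.0, 4.0, 5.0])
# Corresponding predicted values from the sub-data-set regression models.
y_pred = np.array([2.2, 3.6, 5.5])

# RMAE: mean absolute error, each term normalized by the true value.
E_j = np.mean(np.abs(y_true - y_pred) / np.abs(y_true))
print(round(float(E_j), 4))  # → 0.1
```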
Optionally, the determining, based on errors corresponding to the multiple regression models of the subdata set and errors corresponding to the regression model of the original data set, a proxy performance of the current data partitioning scheme and a proxy performance of the original data set respectively includes:
determining a proxy performance of the current data splitting scheme according to the following equation (2):
split_perf = 1 − (1/K) · Σ_{j=1}^{K} E_j        (2)

wherein split_perf is the proxy performance of the current data splitting scheme and E_j is the error obtained on test set j according to formula (1); the proxy performance baseline_perf of the original data set is determined in the same manner from the errors corresponding to the original data set regression model.
Optionally, training and testing the multiple regression sub-data sets and the regression model of the original data set based on the training set, the testing set and the subclass classifier to obtain errors corresponding to the multiple regression sub-data sets and errors corresponding to the regression model of the original data set, respectively, further comprising:
training an original data set regression model based on the training set;
respectively inputting each sample in the test set into a trained original data set regression model to obtain a predicted value corresponding to each sample;
and determining the error corresponding to the regression model of the original data set based on the true value and the corresponding predicted value of each sample in the test set.
Optionally, the random forest classifier is established based on a Classification And Regression Tree (CART) model.
In another aspect of the present disclosure, a data hierarchical classification apparatus is provided, which is applied to hull profile design, and the apparatus includes:
the pre-segmentation module is used for pre-segmenting the original data set: according to the sample classification number specified by a user, carrying out clustering hierarchical processing on an original data set by adopting a Gaussian mixture model, and dividing the original data set into a plurality of subdata sets corresponding to the sample classification number to obtain a current data division scheme, wherein the original data set is an industrial data set in the hull molded line design;
the classification training module is used for classifying and training the subdata set: respectively adding a subclass label to each subdata set to obtain a training data set, and training a random forest classifier based on the training data set to obtain a subclass classifier;
a verification module to verify a data partitioning scheme: performing regression training on the original data set and the plurality of subdata sets respectively by using a regression algorithm model based on a gradient lifting decision tree to obtain an original data set regression model and a plurality of subdata set regression models, wherein each subdata set regression model corresponds to one subdata set respectively, and determining the proxy performance of the current data segmentation scheme and the proxy performance of the original data set respectively by combining a subclass classifier and cross validation;
a selection module to select a final data partitioning scheme: and evaluating the current data segmentation scheme based on the proxy performance of the current data segmentation scheme and the proxy performance of the original data set, and determining a final data segmentation result according to the evaluation result.
In another aspect of the present disclosure, there is provided an electronic device including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method for hierarchical classification of data as described above.
In another aspect of the disclosure, a computer-readable storage medium is provided, in which a computer program is stored, and the computer program is executed by a processor to implement the data hierarchical classification method described above.
Compared with the prior art, the present disclosure starts from the objective limitation of data scale in industrial design problems and addresses the problem that the data set of an industrial design segment contains multiple mixed modes or has poor internal consistency. It preprocesses the data set with a data hierarchical classification method, applied for the first time to hull profile design driven by an industrial data set. By mining the multiple mixed modes inside the sample training set, the front-placed layering operation purifies the quality of the data set, improves the precision of data modeling, and raises designers' utilization rate of the hull form data accumulated by an enterprise; the method has a wide application range and effectively assists the intelligent design of the hull profile.
Drawings
One or more embodiments are illustrated by way of example in the accompanying drawings, in which like reference numerals denote similar elements; the figures are not drawn to scale unless otherwise specified.
Fig. 1 is a flowchart of a data hierarchical classification method according to an embodiment of the present disclosure;
fig. 2 is a flowchart of a data hierarchical classification method according to another embodiment of the present disclosure;
fig. 3 is a flowchart of a data hierarchical classification method according to another embodiment of the present disclosure;
fig. 4 is a flowchart of a data hierarchical classification method according to another embodiment of the present disclosure;
FIG. 5 is a graph illustrating the comparative effect of predicted values and actual values of two modeling schemes provided by another embodiment of the present disclosure;
FIG. 6 is a graph of error comparison results for three modeling solutions provided by another embodiment of the present disclosure;
FIG. 7 is a schematic diagram illustrating a visualization result of a test data set according to another embodiment of the present disclosure;
fig. 8 is a schematic structural diagram of a data hierarchical classification apparatus according to another embodiment of the present disclosure;
fig. 9 is a schematic structural diagram of an electronic device according to another embodiment of the present disclosure.
Detailed Description
To make the objects, technical solutions, and advantages of the embodiments of the present disclosure clearer, the embodiments are described in detail below with reference to the accompanying drawings. Those of ordinary skill in the art will appreciate that numerous technical details are set forth in the embodiments to provide a better understanding of the disclosure; however, the technical solutions claimed in the present disclosure can be implemented without these technical details, and with various changes and modifications based on the following embodiments. The division into embodiments is for convenience of description, should not limit the specific implementation of the present disclosure, and the embodiments may be combined and cross-referenced where not contradictory.
One embodiment of the present disclosure relates to a data hierarchical classification method, which is applied to hull contour design, and the flow of the data hierarchical classification method is shown in fig. 1, and includes the following steps:
s101: pre-segmentation of the original data set: according to the sample classification number specified by a user, carrying out clustering hierarchical processing on an original data set by adopting a Gaussian mixture model, and dividing the original data set into a plurality of subdata sets corresponding to the sample classification number to obtain a current data division scheme, wherein the original data set is an industrial data set in the hull molded line design.
Specifically, for the original data set D, the unsupervised clustering algorithm Gaussian mixture model may be used to divide D into split_n sub-data sets (D_1, D_2, ..., D_split_n) according to the user-specified sample classification number split_n, thereby obtaining the current data segmentation scheme. That is, the current data segmentation scheme clusters and layers the original data set with a Gaussian mixture model according to the sample classification number split_n specified by the user, segmenting the original data set into split_n sub-data sets.
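As a concrete illustration, step S101 can be sketched with scikit-learn's Gaussian mixture model. The synthetic data and the value of split_n below are illustrative assumptions, not part of the disclosure:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Synthetic "original data set" D with two mixed modes (illustrative only).
D = np.vstack([rng.normal(0.0, 1.0, (50, 3)),
               rng.normal(5.0, 1.0, (50, 3))])

split_n = 2  # user-specified sample classification number
gmm = GaussianMixture(n_components=split_n, random_state=0).fit(D)
labels = gmm.predict(D)  # cluster index assigned to every sample

# The current data segmentation scheme: sub-data sets D_1, ..., D_split_n.
sub_datasets = [D[labels == k] for k in range(split_n)]
print([len(s) for s in sub_datasets])
```

Every sample of D lands in exactly one sub-data set, so the sub-data set sizes always sum to the size of D.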
S102: classifying the training subdata set: and respectively adding a subclass label to each subdata set to obtain a training data set, and training the random forest classifier based on the training data set to obtain a subclass classifier.
Specifically, the subclass labels are denoted label = 1, 2, ..., split_n. After adding a subclass label to each sub-data set (D_1, D_2, ..., D_split_n), the resulting training data set can be represented as D_splited = {(D_1, label=1), (D_2, label=2), ..., (D_split_n, label=split_n)}. Based on this training data set, a random forest classifier, a supervised algorithm, is trained to obtain the subclass classifier, which serves as an intermediate-stage classifier that judges the subclass attribution of a newly arrived data sample so as to determine which regression model to activate.
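Step S102 can be sketched as follows with scikit-learn's random forest; the two synthetic sub-data sets stand in for (D_1, ..., D_split_n) from step S101 and are illustrative assumptions:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
# Illustrative sub-data sets D_1 and D_2 with distinct modes.
sub_datasets = [rng.normal(0.0, 1.0, (40, 3)),
                rng.normal(5.0, 1.0, (40, 3))]

# Training data set D_splited = {(D_1, label=1), ..., (D_split_n, label=split_n)}.
X = np.vstack(sub_datasets)
y = np.concatenate([np.full(len(Dk), k + 1) for k, Dk in enumerate(sub_datasets)])

# The subclass classifier: a random forest trained on the labeled subsets.
subclass_clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# It later judges the subclass of a new sample to pick the regression model.
print(subclass_clf.predict([[5.0, 5.0, 5.0]]))
```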
The random forest classifier is a special bootstrap aggregating algorithm (bagging) that uses the decision tree CART algorithm as the base model in the bagging strategy. First, m training sets are generated from the native data set by bootstrap sampling, and an independent decision tree is constructed for each training set. When a node searches for a feature to split on, it does not search all features for the one that maximizes the index (such as information gain); instead, it randomly extracts a subset of the features, finds the optimal split among the extracted features, and applies it to the node. The random forest method thus samples both the samples and the features, which effectively avoids the over-fitting problem.
S103: verifying the data partitioning scheme: performing regression training on the original data set and the plurality of subdata sets respectively by using a regression algorithm model based on a gradient lifting decision tree to obtain an original data set regression model and a plurality of subdata set regression models, wherein each subdata set regression model corresponds to one subdata set respectively, and determining the proxy performance of the current data segmentation scheme and the proxy performance of the original data set respectively by combining a subclass classifier and cross validation.
Specifically, this step may use a regression algorithm model based on a gradient boosting decision tree to perform regression training on the original data set D and on the split_n sub-data sets (D_1, D_2, ..., D_split_n) respectively, obtaining the original data set regression model estimator_baseline and the split_n sub-data-set regression models estimator_1, estimator_2, ..., estimator_split_n. Using cross validation as the flow logic, the proxy performance split_perf of the current data segmentation scheme and the proxy performance baseline_perf of the original data set are obtained respectively, thereby verifying whether the current data segmentation scheme can effectively improve the modeling regression effect.
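The regression-training part of step S103 can be sketched with scikit-learn's GBDT regressor; the 2-D inputs and the two response modes below are illustrative assumptions only:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(2)
# Illustrative sub-data sets with different input ranges and response modes.
X1, X2 = rng.uniform(0, 1, (60, 2)), rng.uniform(2, 3, (60, 2))
y1, y2 = X1.sum(axis=1), X2.sum(axis=1) ** 2

# estimator_baseline: one GBDT model fitted on the whole original data set D.
estimator_baseline = GradientBoostingRegressor(random_state=0).fit(
    np.vstack([X1, X2]), np.concatenate([y1, y2]))

# estimator_1, estimator_2: one GBDT model fitted per sub-data set.
estimator_1 = GradientBoostingRegressor(random_state=0).fit(X1, y1)
estimator_2 = GradientBoostingRegressor(random_state=0).fit(X2, y2)
```

Each sub-model only ever sees the mode of its own sub-data set, which is what lets the split outperform the single baseline model when the modes differ.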
S104: selecting a final data partitioning scheme: and evaluating the current data segmentation scheme based on the proxy performance of the current data segmentation scheme and the proxy performance of the original data set, and determining a final data segmentation result according to an evaluation result.
Specifically, the current data partitioning scheme is evaluated by comparing its proxy performance split_perf with the proxy performance baseline_perf of the original data set. If split_perf > baseline_perf holds, the current data partitioning scheme performs better, the partition is effective, and the sub-data sets (D_1, D_2, ..., D_split_n) are output as the final data segmentation result. If split_perf > baseline_perf does not hold, the current scheme performs worse, the partition is invalid, and the original data set D is output.
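The selection rule of step S104 reduces to a single comparison; variable names follow the description and the values in the usage lines are illustrative:

```python
def select_final_scheme(split_perf, baseline_perf, sub_datasets, original_dataset):
    """Return the sub-data sets if the partition is effective, else the original set."""
    if split_perf > baseline_perf:
        return sub_datasets      # partition effective: output (D_1, ..., D_split_n)
    return original_dataset      # partition invalid: output the original data set D

# Usage: effective split keeps the sub-data sets, ineffective split falls back to D.
assert select_final_scheme(0.9, 0.8, ["D1", "D2"], "D") == ["D1", "D2"]
assert select_final_scheme(0.7, 0.8, ["D1", "D2"], "D") == "D"
```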
Compared with the prior art, the present disclosure starts from the objective limitation of data scale in industrial design problems and addresses the problem that the data set of an industrial design segment contains multiple mixed modes or has poor internal consistency. The data set is preprocessed with a data hierarchical classification method, applied for the first time to hull profile design driven by an industrial data set. By mining the multiple mixed modes inside the sample training set, the front-placed layering operation purifies the quality of the data set, improves the accuracy of data modeling, and raises designers' utilization rate of the hull form data accumulated by an enterprise; the method has a wide application range and effectively assists the intelligent design of the hull profile.
Illustratively, before step S101, an obtaining step of obtaining a sample classification number specified by a user and a raw data set may be further included.
Illustratively, the cross validation includes K-fold cross validation.
Specifically, the basic idea of K-fold cross validation is that the initial sample set is divided into K sub-samples; one sub-sample is retained as the data for validating the model, and the other K−1 sub-samples are used for training. Cross validation is repeated K times so that each sub-sample is validated exactly once, and the K results are averaged (or combined in some other way) to obtain a single estimate. The advantage of K-fold cross validation is that the randomly generated sub-samples are used repeatedly for training and validation: every sample is guaranteed to serve as training data and also to have the opportunity to serve as test data, so the data can be better utilized. K is generally taken between 2 and 10, and 10-fold cross validation is the most commonly used.
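The fold structure described above can be sketched with scikit-learn's KFold splitter; the tiny data set is illustrative only:

```python
import numpy as np
from sklearn.model_selection import KFold

D = np.arange(20).reshape(10, 2)   # illustrative data set of 10 samples
K = 5
folds = list(KFold(n_splits=K, shuffle=True, random_state=0).split(D))

# Across the K rounds, every sample appears in exactly one test fold.
test_indices = np.concatenate([test for _, test in folds])
print(sorted(int(i) for i in test_indices))  # → [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```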
Illustratively, after obtaining the regression model of the original data set and the regression models of the plurality of sub data sets, respectively determining the proxy performance of the current data partitioning scheme and the proxy performance of the original data set by combining a subclass classifier and cross validation, the method includes:
and training and testing the multiple sub data set regression models and the original data set regression model based on the training set, the testing set and the subclass classifier to respectively obtain errors corresponding to the multiple sub data set regression models and errors corresponding to the original data set regression model.
Specifically, the original data set D is randomly and equally divided into K original data subsets (D'_1, D'_2, ..., D'_K). Each subset D'_1, D'_2, ..., D'_K is taken in turn as the test set, with the corresponding remaining subsets, namely (D'_2, D'_3, ..., D'_K), (D'_1, D'_3, ..., D'_K), ..., (D'_1, D'_2, ..., D'_{K-1}), as the training set. Using the subclass classifier, the split_n sub-data-set regression models estimator_1, estimator_2, ..., estimator_split_n and the original data set regression model estimator_baseline are trained and tested, and the errors corresponding to the split_n sub-data-set regression models and the error corresponding to the original data set regression model are obtained respectively.
And respectively determining the proxy performance of the current data segmentation scheme and the proxy performance of the original data set based on the errors corresponding to the multiple sub data set regression models and the errors corresponding to the original data set regression model.
By performing K-fold cross validation on the current data segmentation scheme, training set data can be better utilized, and the obtained evaluation result, namely the proxy performance of the current data segmentation scheme and the proxy performance of the original data set, can be as close as possible to the performance of the model on the test set.
Illustratively, training and testing the regression models of the multiple sub data sets and the regression model of the original data set based on the training set, the testing set and the subclass classifier to obtain errors corresponding to the regression models of the multiple sub data sets and errors corresponding to the regression model of the original data set, respectively, includes the following steps, as shown in fig. 2:
s201: training a plurality of subdata set regression models based on the training set;
s202: judging the subclass category of each sample in the test set based on a subclass classifier, determining a subdata set regression model corresponding to each sample from a plurality of trained subdata set regression models based on the judged subclass category, and inputting each sample in the test set into the corresponding subdata set regression model respectively to obtain a predicted value corresponding to each sample;
s203: and determining errors corresponding to the regression models of the plurality of subdata sets based on the real values of the samples in the test set and the predicted values corresponding to the real values.
Specifically, since the K original data subsets are taken in turn as the test set with the corresponding remaining original data subsets as the training set, steps S201 to S203 are repeated K times, once with each of D′_1, D′_2, ..., D′_K as the test set and the corresponding remaining subsets (D′_2, D′_3, ..., D′_K), (D′_1, D′_3, ..., D′_K), ..., (D′_1, D′_2, ..., D′_{K-1}) as the training set, to obtain the errors of the split_n sub-data-set regression models corresponding to each test set.
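Steps S201 to S202, routing each test sample to the regression model of its predicted subclass, can be sketched as follows. The data, the two-subclass setup, and the model choices are illustrative assumptions; the patent does not fix a concrete implementation.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
# Two artificial subclasses with different responses (assumption for the sketch)
X = rng.normal(size=(200, 3))
labels = (X[:, 0] > 0).astype(int)
y = np.where(labels == 0, X[:, 1], -X[:, 1])

# Subclass classifier and one trained regression model per subclass
clf = RandomForestClassifier(random_state=0).fit(X, labels)
sub_models = {
    k: DecisionTreeRegressor(random_state=0).fit(X[labels == k], y[labels == k])
    for k in (0, 1)
}

# S202: judge each test sample's subclass, then use the matching sub-model
X_test = rng.normal(size=(20, 3))
pred_labels = clf.predict(X_test)
preds = np.array([
    sub_models[k].predict(x.reshape(1, -1))[0]
    for k, x in zip(pred_labels, X_test)
])
```

The predicted values `preds` are then compared against the true values in step S203 to compute the per-fold error.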
Illustratively, training and testing the regression models of the multiple sub data sets and the regression model of the original data set based on the training set, the testing set and the subclass classifier to obtain errors corresponding to the regression models of the multiple sub data sets and errors corresponding to the regression model of the original data set, respectively, further includes the following steps, as shown in fig. 3:
s301: training an original data set regression model based on the training set;
s302: respectively inputting each sample in the test set into a trained original data set regression model to obtain a predicted value corresponding to each sample;
s303: and determining the error corresponding to the regression model of the original data set based on the true value and the corresponding predicted value of each sample in the test set.
Specifically, since the K original data subsets are taken in turn as the test set with the corresponding remaining original data subsets as the training set, steps S301 to S303 are repeated K times, once with each of D′_1, D′_2, ..., D′_K as the test set and the corresponding remaining subsets (D′_2, D′_3, ..., D′_K), (D′_1, D′_3, ..., D′_K), ..., (D′_1, D′_2, ..., D′_{K-1}) as the training set, to obtain the errors of the original-data-set regression model corresponding to each test set.
Illustratively, determining errors corresponding to the regression models of the plurality of subdata sets based on the real values and the corresponding predicted values of the samples in the test set includes:
determining errors corresponding to the regression models of the plurality of subdata sets according to the following formula (1):

E_j = (1/n) · Σ_{i=1}^{n} |y_i − ŷ_i| / |y_i|      (1)

wherein j = 1, 2, ..., K is the test set number, E_j is the relative mean absolute error of the multiple sub-data-set regression models corresponding to test set j, i = 1, 2, ..., n, where n is the number of samples in test set j, y_i is the true value of the i-th sample in test set j, and ŷ_i is the predicted value corresponding to the i-th sample in test set j.
In addition, by replacing E_j with the relative mean absolute error of the original-data-set regression model on test set j, y_i with the true value of the i-th sample in test set j, and ŷ_i with the predicted value corresponding to the i-th sample in test set j, the error corresponding to the original-data-set regression model is obtained.
Illustratively, determining the proxy performance of the current data partitioning scheme and the proxy performance of the original data set based on the errors corresponding to the regression models of the plurality of sub data sets and the errors corresponding to the regression model of the original data set, respectively, includes:
determining proxy performance of the current data partitioning scheme according to equation (2) below:

split_perf = 1 − (1/K) · Σ_{j=1}^{K} E_j      (2)

wherein split_perf is the proxy performance of the current data splitting scheme.

Note that by replacing split_perf with the proxy performance of the original data set and E_j with the relative mean absolute error of the original-data-set regression model on test set j, the proxy performance of the original data set is obtained.
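Formulas (1) and (2) can be sketched in a few lines. The toy values, and the "1 minus mean error" form of the proxy score (chosen so that a higher value means better performance, consistent with the selection step later), are assumptions for illustration.

```python
import numpy as np

def rmae(y_true, y_pred):
    """Relative mean absolute error of one test set (formula (1))."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs((y_true - y_pred) / y_true)))

def proxy_performance(fold_errors):
    """Combine the K per-fold errors into one proxy score (formula (2), assumed form)."""
    return 1.0 - float(np.mean(fold_errors))

# Toy example: two folds, each with two samples (assumed values)
fold_errors = [rmae([1.0, 2.0], [1.1, 1.8]), rmae([1.0, 2.0], [0.9, 2.2])]
split_perf = proxy_performance(fold_errors)
```

Running the same two functions on the baseline model's fold errors yields baseline_perf, so the two schemes are compared on an identical scale.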
Illustratively, the random forest classifier is built based on a classification regression tree model.
Specifically, the principle of classifying the regression tree CART model is as follows:
inputting: a training data set;
and (3) outputting: classifying the regression tree f (x);
in an input space where a training data set is located, recursively dividing each region into two sub-regions, determining an output value on each sub-region, and constructing a binary decision tree:
1) Select the optimal splitting variable j and splitting point s by solving:

min_{j,s} [ min_{c1} Σ_{x_i ∈ R1(j,s)} (y_i − c1)² + min_{c2} Σ_{x_i ∈ R2(j,s)} (y_i − c2)² ]

Traverse the variables j and, for each fixed splitting variable j, scan the splitting points s, selecting the pair (j, s) that minimizes the above expression.
2) Divide the region with the selected pair (j, s) and determine the corresponding output values:

R1(j,s) = {x | x^(j) ≤ s},  R2(j,s) = {x | x^(j) > s}

ĉ_m = (1/N_m) · Σ_{x_i ∈ R_m(j,s)} y_i,  m = 1, 2
3) Continuing to call the steps 1) and 2) for the two sub-areas until a stop condition is met;
4) Divide the input space into M regions R_1, R_2, ..., R_M and generate the decision tree:

f(x) = Σ_{m=1}^{M} ĉ_m · I(x ∈ R_m)
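Step 1), the exhaustive search for the best pair (j, s), can be sketched as follows on toy data. The data and the brute-force loop are illustrative assumptions; production CART implementations use more efficient sorted scans.

```python
import numpy as np

def best_split(X, y):
    """Return (j, s, err): the split minimising the two-region squared error."""
    best = (None, None, np.inf)
    for j in range(X.shape[1]):           # traverse splitting variables j
        for s in np.unique(X[:, j]):      # scan candidate splitting points s
            left, right = y[X[:, j] <= s], y[X[:, j] > s]
            if len(left) == 0 or len(right) == 0:
                continue
            # c1, c2 are the region means, the optimal constants per region
            err = ((left - left.mean()) ** 2).sum() \
                + ((right - right.mean()) ** 2).sum()
            if err < best[2]:
                best = (j, s, err)
    return best

# Toy data (assumption): one feature, a clean step at x = 1
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
j, s, err = best_split(X, y)
```

For this toy set the search recovers the obvious split, variable 0 at point 1.0 with zero residual error; steps 2) to 4) then recurse on each side.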
illustratively, the gaussian mixture model GMM in step S101 is a linear combination of a plurality of gaussian distribution functions, and is formulated as:
p(x) = Σ_{k=1}^{K} π_k · N(x | μ_k, Σ_k)

wherein (μ_k, Σ_k) are the parameters of the k-th Gaussian distribution function and π_k is the probability that the current point is selected as class k. The core idea of the GMM algorithm is to adjust the parameter combination (π_k, μ_k, Σ_k) so that the GMM model attains the maximum likelihood probability on the current data set, where the likelihood probability is calculated as:

ln p(X | π, μ, Σ) = Σ_{n=1}^{N} ln [ Σ_{k=1}^{K} π_k · N(x_n | μ_k, Σ_k) ]
the solution process of the GMM algorithm involves the Expectation-Maximization (EM) algorithm, which alternates two steps: the first step obtains a rough estimate of the parameters to be estimated, and the second step maximizes the likelihood function using the parameter estimates from the first step. An intermediate latent variable γ(z_nk) is introduced, representing the posterior probability that the n-th point x_n is assigned to class k:

γ(z_nk) = π_k · N(x_n | μ_k, Σ_k) / Σ_{j=1}^{K} π_j · N(x_n | μ_j, Σ_j)
according to the M-step of the EM algorithm, the partial derivatives of the likelihood with respect to the parameters (π, μ, Σ) are taken and set to 0, giving the following update formulas:

μ_k = (1/N_k) · Σ_{n=1}^{N} γ(z_nk) · x_n

Σ_k = (1/N_k) · Σ_{n=1}^{N} γ(z_nk) · (x_n − μ_k)(x_n − μ_k)^T

π_k = N_k / N

wherein: N_k = Σ_{n=1}^{N} γ(z_nk)
recalculating the log-likelihood function of the GMM model with the updated (π, μ, Σ) parameters, i.e.:

ln p(X | π, μ, Σ) = Σ_{n=1}^{N} ln [ Σ_{k=1}^{K} π_k · N(x_n | μ_k, Σ_k) ]
Whether the parameters (π, μ, Σ) or the log-likelihood function have converged is then checked; if not, the iteration process is repeated. At this point, based on the GMM iterative correction logic, a mixed distribution statistical model of the current mixed data set can be obtained, and classification of the training set samples is achieved based on this mixed distribution statistical model.
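The GMM clustering step can be sketched with scikit-learn's GaussianMixture, which runs the same EM iteration (E-step responsibilities γ(z_nk), M-step updates of π, μ, Σ, convergence check on the log-likelihood) internally. The library choice and the toy two-cluster data are assumptions for illustration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(2)
# Two well-separated Gaussian clusters (illustrative assumption)
X = np.vstack([
    rng.normal(loc=-5.0, size=(100, 2)),
    rng.normal(loc=+5.0, size=(100, 2)),
])

# Fit a 2-component mixture by EM until the log-likelihood converges
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
labels = gmm.predict(X)       # hard subclass assignment per sample
resp = gmm.predict_proba(X)   # posterior responsibilities gamma(z_nk)
```

The hard labels are what the pre-segmentation step uses to split the original data set into sub data sets; the responsibilities are the γ(z_nk) of the E-step.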
Illustratively, the gradient boosting decision tree GBDT algorithm in step S103 is an iterative decision tree algorithm. The algorithm is an additive combination of a series of regression trees (CART): each new tree fits the residual between the previous prediction and the target, and the outputs of all trees are accumulated to obtain the final answer. The principle of the GBDT algorithm is as follows:
1) Initialize the weak learner:

f_0(x) = argmin_c Σ_{i=1}^{N} L(y_i, c)
2) For m = 1, 2, ..., M:

(a) For each sample i = 1, 2, ..., N, calculate the negative gradient, i.e. the residual:

r_mi = −[∂L(y_i, f(x_i)) / ∂f(x_i)]_{f = f_{m−1}}
(b) Take the residuals obtained in the previous step as the new true values of the samples, and use the data (x_i, r_mi), i = 1, 2, ..., N, as training data for the next tree to obtain a new regression tree f_m(x) whose leaf node regions are R_jm, j = 1, 2, ..., J, wherein J is the number of leaf nodes of the regression tree.
(c) For each leaf region j = 1, 2, ..., J, calculate the best fit:

c_jm = argmin_c Σ_{x_i ∈ R_jm} L(y_i, f_{m−1}(x_i) + c)
(d) Update the strong learner:

f_m(x) = f_{m−1}(x) + Σ_{j=1}^{J} c_jm · I(x ∈ R_jm)
3) Obtain the final learner:

f(x) = f_M(x) = f_0(x) + Σ_{m=1}^{M} Σ_{j=1}^{J} c_jm · I(x ∈ R_jm)
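The boosting loop above can be sketched directly for squared-error loss, where the negative gradient reduces to the plain residual y − f(x). The depth, learning rate, round count, and sine-wave data are illustrative assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(3)
X = rng.uniform(-1, 1, size=(200, 1))
y = np.sin(3 * X[:, 0])

f = np.full_like(y, y.mean())   # step 1): initialise with the best constant
trees, lr = [], 0.1
for m in range(50):             # step 2): M boosting rounds
    r = y - f                   # (a) residuals = negative gradient of squared loss
    tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, r)  # (b)
    trees.append(tree)
    f += lr * tree.predict(X)   # (c)+(d): add the fitted tree's leaf values

final_error = float(np.mean(np.abs(y - f)))
```

Each round shrinks the residual a little, and the final learner is the initial constant plus the sum of all fitted trees, exactly the additive form in step 3).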
in order to enable those skilled in the art to better understand the above embodiments, a specific example is described below.
As shown in fig. 4, a data hierarchical classification method applied to hull contour design includes the following steps:
pre-segmentation of the original data set: according to the user-configured parameter, namely the sample classification number n, GMM is adopted to perform the clustering and layering operation on the original data set D, dividing it into n sub data sets and outputting the sub data sets (D_1, D_2, ..., D_n) obtained by dividing the original data set D, so as to obtain the current data segmentation scheme, wherein the original data set is an industrial data set in hull molded line design;

classifying and training the sub data sets: subclass labels are added to the n sub data sets (D_1, D_2, ..., D_n) respectively to obtain the training data set (D_1, label=1), (D_2, label=2), ..., (D_n, label=n), and based on this training data set, RF training of the data classification identifier is performed to obtain the subclass classifier;

verifying the data partitioning scheme: a GBDT-based regression algorithm model is used to train predictors (estimators) as sub-data-set proxies, obtaining n sub-data-set regression models, each corresponding to one sub data set, and base classifier training (base_estimator training) is performed on the undivided (no-split) original data set (namely, the full data set proxy) to obtain the original-data-set regression model; the current segmentation scheme is tested by K-fold cross validation, obtaining the proxy performance split_perf of the current segmentation scheme, and the same K-fold cross validation yields the proxy performance baseline_perf of the original scheme, namely the original data set;

selecting the final data partitioning scheme: judge whether split_perf > baseline_perf; if so, the current segmentation scheme is effective and the sub data sets (D_1, D_2, ..., D_n) obtained by segmenting the original data set are output; if not, the current segmentation scheme is invalid and the original data set D is output.
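The four steps above can be sketched end to end as follows: GMM pre-segmentation, an RF subclass classifier, per-subset GBDT regressors, and a cross-validated comparison against the unsplit baseline. All data, parameters, and the exact form of the proxy score (one minus the mean relative error) are illustrative assumptions rather than the patent's implementation.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestClassifier
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import KFold

def cv_rmae(fit_predict, X, y, k=5):
    """Mean relative MAE over k folds for a fit/predict closure."""
    errs = []
    for tr, te in KFold(n_splits=k, shuffle=True, random_state=0).split(X):
        pred = fit_predict(X[tr], y[tr], X[te])
        errs.append(np.mean(np.abs((y[te] - pred) / y[te])))
    return float(np.mean(errs))

rng = np.random.default_rng(4)
# Toy data with two hidden modes (illustrative assumption)
X = np.vstack([rng.normal(loc=-2.0, size=(150, 2)),
               rng.normal(loc=+2.0, size=(150, 2))])
y = np.where(X[:, 0] > 0, 5.0 + X[:, 1], -5.0 + X[:, 1])

n = 2  # user-specified sample classification number

def baseline_predict(X_tr, y_tr, X_te):
    """No-split baseline: one GBDT on the whole training fold."""
    return GradientBoostingRegressor(random_state=0).fit(X_tr, y_tr).predict(X_te)

def split_predict(X_tr, y_tr, X_te):
    """GMM pre-segmentation + RF subclass classifier + per-subset GBDT."""
    lab = GaussianMixture(n_components=n, random_state=0).fit_predict(X_tr)
    clf = RandomForestClassifier(random_state=0).fit(X_tr, lab)
    models = {c: GradientBoostingRegressor(random_state=0)
                 .fit(X_tr[lab == c], y_tr[lab == c]) for c in range(n)}
    out = np.empty(len(X_te))
    te_lab = clf.predict(X_te)
    for c in range(n):
        mask = te_lab == c
        if mask.any():
            out[mask] = models[c].predict(X_te[mask])
    return out

baseline_perf = 1.0 - cv_rmae(baseline_predict, X, y)
split_perf = 1.0 - cv_rmae(split_predict, X, y)
use_split = split_perf > baseline_perf  # accept the split only if it helps
```

The final comparison mirrors the selection step: the segmented scheme is kept only when its cross-validated proxy performance exceeds the baseline's.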
The data hierarchical classification method shown in fig. 4 is tested and verified, and the original data set and the experimental results are as follows:
1) Raw data set description: a test data set containing 2000 samples is selected for verification, with design parameters x_1, x_2, x_3 and target parameter y.
2) Setting parameters: the number of the sub data sets to be segmented is set to be 2, namely the sample classification number n is set to be 2, the original data set is segmented into 2 sub data sets, the number of the K-fold cross validation is set to be 10, and the hyper-parametric optimization of the model is started, so that the segmentation independent modeling work can be conveniently carried out on the more accurate GBDT model, and whether the current data hierarchical classification operation can effectively improve the modeling precision or not is objectively judged.
3) Evaluation index: RMAE is selected as the evaluation index for model performance, defined as follows:

RMAE = (1/n) · Σ_{i=1}^{n} |y_i − ŷ_i| / |y_i|

wherein i = 1, 2, ..., n is the sample number, n is the number of samples, y_i is the true value of sample i, and ŷ_i is the predicted value corresponding to sample i. The smaller the RMAE, the higher the accuracy of the model.
4) Experimental results: the operation logic of the data hierarchical classification method is intuitive. For the test data set, before the hyper-parameter optimization function is enabled, the performance improvement shown in Table 1 is obtained:
TABLE 1 regression model error based on data hierarchical classification method
Type                  RMAE value
baseline_estimator    9.81%
estimators (n=2)      3.64%
That is, on the test data set with n = 2, using a conservative estimate of model accuracy obtained through the same cross-validation operation, it can be seen that, without increasing the data scale or changing the machine learning algorithm, a large improvement in model performance is obtained simply by introducing the data hierarchical classification method: the original estimation error of nearly 10% is reduced to 3.64%. The comparison between the predicted values and the true values of the two models is shown in fig. 5, where the basic scheme in fig. 5 refers to the scheme obtained by training the base classifier on the original data set, and the segmentation scheme refers to the current segmentation scheme.
When the sample classification number n is modified and the preferred-segmentation function of hierarchical classification is enabled, it is found that the hierarchical classification method recommends a segmentation scheme of n = 3, with the performance statistics shown in Table 2 below:
TABLE 2 regression model error based on data hierarchical classification method after starting optimal segmentation function
Type                  RMAE value
baseline_estimator    9.81%
estimators (n=3)      2.79%
Compared with the segmentation scheme specified by the user, the intelligent segmentation function of the data hierarchical classification method provides further performance mining and improvement. Error comparisons of the three regression models are shown in fig. 6, wherein the basic model refers to the regression model trained without segmenting the original data set; the user-specified data segmentation model is the regression model obtained by training on the original data set after segmentation according to the user-specified scheme; and the intelligent segmentation model is the regression model obtained by training on the original data set after intelligent segmentation by the data hierarchical classification method.
From the visualization result of the test data set shown in fig. 7, it can also be seen that multiple subclass modes obviously exist in the test data set, and the data hierarchical classification method successfully improves the accuracy of data modeling by treating the two subclass modes in a "divide and conquer" manner.
Another embodiment of the present disclosure relates to a data hierarchical classification apparatus, which is applied to hull molded line design, as shown in fig. 8, and includes:
a pre-segmentation module 801, configured to pre-segment the original data set: according to the sample classification number specified by a user, carrying out clustering hierarchical processing on an original data set by adopting a Gaussian mixture model, and dividing the original data set into a plurality of subdata sets corresponding to the sample classification number to obtain a current data division scheme, wherein the original data set is an industrial data set in the hull molded line design;
a classification training module 802 for classifying and training the sub data sets: respectively adding a subclass label to each subdata set to obtain a training data set, and training a random forest classifier based on the training data set to obtain a subclass classifier;
a verification module 803, configured to verify the data partitioning scheme: performing regression training on the original data set and the plurality of subdata sets respectively by using a regression algorithm model based on a gradient lifting decision tree to obtain an original data set regression model and a plurality of subdata set regression models, wherein each subdata set regression model corresponds to one subdata set respectively, and determining the proxy performance of the current data segmentation scheme and the proxy performance of the original data set respectively by combining a subclass classifier and cross validation;
a selecting module 804, configured to select a final data partitioning scheme: and evaluating the current data segmentation scheme based on the proxy performance of the current data segmentation scheme and the proxy performance of the original data set, and determining a final data segmentation result according to the evaluation result.
The specific implementation method of the data hierarchical classification apparatus provided in the embodiments of the present disclosure may be referred to as the data hierarchical classification method provided in the embodiments of the present disclosure, and details are not repeated here.
Compared with the prior art, in which the data scale of industrial design problems is objectively limited and industrial design data sets often contain multiple mixed modes or have poor internal consistency, the present disclosure applies a data hierarchical classification method to preprocess the data set in industrial-data-driven hull molded line design for the first time. By mining the multiple mixed modes in the sample training set, the pre-layering operation purifies the quality of the data set, improves data modeling accuracy, and raises designers' utilization of the hull form data accumulated by enterprises; the method has a wide application range and effectively assists intelligent hull molded line design.
Another embodiment of the present disclosure relates to an electronic device, as shown in fig. 9, including:
at least one processor 901; and

a memory 902 communicatively connected to the at least one processor 901; wherein

the memory 902 stores instructions executable by the at least one processor 901, the instructions being executed by the at least one processor 901 to enable the at least one processor 901 to perform the data hierarchical classification method described in the above embodiments.
Where the memory and processor are connected by a bus, the bus may comprise any number of interconnected buses and bridges, the buses connecting together one or more of the various circuits of the processor and the memory. The bus may also connect various other circuits such as peripherals, voltage regulators, power management circuits, etc., which are well known in the art, and therefore, will not be described any further herein. A bus interface provides an interface between the bus and the transceiver. The transceiver may be one element or a plurality of elements, such as a plurality of receivers and transmitters, providing a means for communicating with various other apparatus over a transmission medium. The data processed by the processor is transmitted over a wireless medium via an antenna, which further receives the data and transmits the data to the processor.
The processor is responsible for managing the bus and general processing and may also provide various functions including timing, peripheral interfaces, voltage regulation, power management, and other control functions. And the memory may be used to store data used by the processor in performing operations.
Another embodiment of the present disclosure relates to a computer-readable storage medium storing a computer program, which when executed by a processor implements the data hierarchical classification method described in the above embodiments.
That is, as can be understood by those skilled in the art, all or part of the steps in the method according to the foregoing embodiments may be implemented by a program instructing related hardware, where the program is stored in a storage medium and includes several instructions to enable a device (which may be a single chip, a chip, or the like) or a processor (processor) to execute all or part of the steps in the method according to each embodiment of the present disclosure. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a ROM (Read-Only Memory), a RAM (Random Access Memory), a magnetic disk, or an optical disk.
It will be understood by those of ordinary skill in the art that the foregoing embodiments are specific to implementations of the present disclosure, and that various changes in form and details may be made therein without departing from the spirit and scope of the present disclosure in practice.

Claims (10)

1. A data hierarchical classification method is applied to hull profile design and is characterized by comprising the following steps:
pre-segmentation of the original data set: according to a sample classification number designated by a user, carrying out clustering hierarchical processing on an original data set by adopting a Gaussian mixture model, and segmenting the original data set into a plurality of subdata sets corresponding to the sample classification number to obtain a current data segmentation scheme, wherein the original data set is an industrial data set in hull molded line design;
classifying the training subdata set: respectively adding subclass labels to each subdata set to obtain a training data set, and training a random forest classifier based on the training data set to obtain a subclass classifier;
verifying the data partitioning scheme: performing regression training on the original data set and the plurality of subdata sets respectively by using a regression algorithm model based on a gradient lifting decision tree to obtain an original data set regression model and a plurality of subdata set regression models, wherein each subdata set regression model corresponds to one subdata set respectively, and determining the proxy performance of the current data segmentation scheme and the proxy performance of the original data set respectively by combining the subclass classifier and cross validation;
selecting a final data partitioning scheme: and evaluating the current data segmentation scheme based on the proxy performance of the current data segmentation scheme and the proxy performance of the original data set, and determining a final data segmentation result according to an evaluation result.
2. The method of claim 1, wherein the cross validation comprises K-fold cross validation, and wherein the determining, in conjunction with the subclass classifier and cross validation, the proxy performance of the current data splitting scheme and the proxy performance of the original data set after obtaining an original data set regression model and a plurality of sub data set regression models comprises:
dividing the original data set into K original data subsets randomly and equally, taking one original data subset as a test set and the corresponding other original data subsets as training sets in turn, training and testing the multiple sub data set regression models and the original data set regression model based on the training sets, the test sets and the subclass classifiers, and respectively obtaining errors corresponding to the multiple sub data set regression models and errors corresponding to the original data set regression model;
and respectively determining the proxy performance of the current data segmentation scheme and the proxy performance of the original data set based on the errors corresponding to the multiple sub data set regression models and the errors corresponding to the original data set regression model.
3. The method of claim 2, wherein training and testing the multiple sub data set regression models and the original data set regression model based on the training set, the testing set, and the subclass classifier to obtain errors corresponding to the multiple sub data set regression models and errors corresponding to the original data set regression model, respectively, comprises:
training the plurality of regression models of subdata sets based on the training set;
judging the subclass category of each sample in the test set based on the subclass classifier, determining a subdata set regression model corresponding to each sample from the trained subdata set regression models based on the judged subclass category, and inputting each sample in the test set into the corresponding subdata set regression model respectively to obtain a predicted value corresponding to each sample;
and determining errors corresponding to the regression models of the plurality of subdata sets based on the real values of the samples in the test set and the predicted values corresponding to the real values.
4. The method of claim 3, wherein determining the error for the plurality of regression models for the subset based on the actual value and the predicted value for each sample in the test set comprises:
determining errors corresponding to the regression models of the plurality of subdata sets according to the following formula (1):

E_j = (1/n) · Σ_{i=1}^{n} |y_i − ŷ_i| / |y_i|      (1)

wherein j = 1, 2, ..., K is the test set number, E_j is the relative mean absolute error of the multiple sub-data-set regression models corresponding to test set j, i = 1, 2, ..., n, where n is the number of samples in test set j, y_i is the true value of the i-th sample in test set j, and ŷ_i is the predicted value corresponding to the i-th sample in test set j.
5. The method of claim 4, wherein determining the proxy performance of the current data partitioning scheme and the proxy performance of the original data set based on the errors corresponding to the plurality of sub data set regression models and the errors corresponding to the original data set regression model, respectively, comprises:
determining a proxy performance of the current data partitioning scheme according to equation (2) below:

split_perf = 1 − (1/K) · Σ_{j=1}^{K} E_j      (2)

wherein split_perf is the proxy performance of the current data partitioning scheme.
6. The method of claim 2, wherein the training and testing the multiple regression sub data set models and the regression original data set model based on the training set, the testing set, and the subclass classifier to obtain errors corresponding to the multiple regression sub data set models and errors corresponding to the regression original data set model, respectively, further comprises:
training the regression model of the original data set based on the training set;
respectively inputting each sample in the test set into a trained original data set regression model to obtain a predicted value corresponding to each sample;
and determining the error corresponding to the regression model of the original data set based on the real value and the predicted value corresponding to each sample in the test set.
7. The method of any one of claims 1 to 6, wherein the random forest classifier is built based on a classification regression tree model.
8. A data hierarchical classification device is applied to hull profile design, and is characterized by comprising:
the pre-segmentation module is used for pre-segmenting the original data set: according to a sample classification number specified by a user, carrying out clustering hierarchical processing on an original data set by adopting a Gaussian mixture model, and dividing the original data set into a plurality of subdata sets corresponding to the sample classification number to obtain a current data division scheme, wherein the original data set is an industrial data set in the hull molded line design;
the classification training module is used for classifying and training the subdata set: respectively adding subclass labels to each subdata set to obtain a training data set, and training a random forest classifier based on the training data set to obtain a subclass classifier;
a verification module to verify a data partitioning scheme: performing regression training on the original data set and the plurality of subdata sets respectively by using a regression algorithm model based on a gradient lifting decision tree to obtain an original data set regression model and a plurality of subdata set regression models, wherein each subdata set regression model corresponds to one subdata set respectively, and determining the proxy performance of the current data segmentation scheme and the proxy performance of the original data set respectively by combining the subclass classifier and cross validation;
a selection module to select a final data partitioning scheme: and evaluating the current data segmentation scheme based on the proxy performance of the current data segmentation scheme and the proxy performance of the original data set, and determining a final data segmentation result according to an evaluation result.
9. An electronic device, comprising:
at least one processor; and

a memory communicatively coupled to the at least one processor; wherein

the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of data hierarchical classification of any of claims 1 to 7.
10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the method of data hierarchical classification according to one of claims 1 to 7.
CN202210446117.7A 2022-04-26 2022-04-26 Data hierarchical classification method and device, electronic equipment and storage medium Active CN115600121B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210446117.7A CN115600121B (en) 2022-04-26 2022-04-26 Data hierarchical classification method and device, electronic equipment and storage medium


Publications (2)

Publication Number Publication Date
CN115600121A true CN115600121A (en) 2023-01-13
CN115600121B CN115600121B (en) 2023-11-07

Family

ID=84841991

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210446117.7A Active CN115600121B (en) 2022-04-26 2022-04-26 Data hierarchical classification method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115600121B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
AU2020101453A4 (en) * 2020-07-23 2020-08-27 China Communications Construction Co., Ltd. An Intelligent Optimization Method of Durable Concrete Mix Proportion Based on Data mining
CN112396130A (en) * 2020-12-09 2021-02-23 中国能源建设集团江苏省电力设计院有限公司 Intelligent identification method and system for rock stratum in static sounding test, computer equipment and medium
CN113159220A (en) * 2021-05-14 2021-07-23 中国人民解放军军事科学院国防工程研究院工程防护研究所 Random forest based concrete penetration depth empirical algorithm evaluation method and device
CN113256066A (en) * 2021-04-23 2021-08-13 新疆大学 PCA-XGboost-IRF-based job shop real-time scheduling method
WO2021164228A1 (en) * 2020-02-17 2021-08-26 平安科技(深圳)有限公司 Method and system for selecting augmentation strategy for image data
CN113609779A (en) * 2021-08-16 2021-11-05 深圳力维智联技术有限公司 Modeling method, device and equipment for distributed machine learning


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Wu Qiong; Li Yuntian; Zheng Xianwei: "Optimization of the random forest algorithm for classification on imbalanced training sets", Industrial Control Computer, no. 07 *
Xiong Bingyan; Wang Guoyin; Deng Weibin: "An under-sampling method for imbalanced data based on sample weights", Journal of Computer Research and Development, no. 11 *

Also Published As

Publication number Publication date
CN115600121B (en) 2023-11-07

Similar Documents

Publication Publication Date Title
US20220076150A1 (en) Method, apparatus and system for estimating causality among observed variables
Reynolds et al. Clustering rules: a comparison of partitioning and hierarchical clustering algorithms
CN112069310B (en) Text classification method and system based on active learning strategy
CN113255573B (en) Pedestrian re-identification method based on mixed cluster center label learning and storage medium
CN113344019A (en) K-means algorithm for improving decision value selection initial clustering center
CN102117411A (en) Method and system for constructing multi-level classification model
CN111325264A (en) Multi-label data classification method based on entropy
CN113807900A (en) RF order demand prediction method based on Bayesian optimization
Leon-Alcaide et al. An evolutionary approach for efficient prototyping of large time series datasets
CN115146062A (en) Intelligent event analysis method and system fusing expert recommendation and text clustering
CN113516019B (en) Hyperspectral image unmixing method and device and electronic equipment
KR100895481B1 (en) Method for Region Based on Image Retrieval Using Multi-Class Support Vector Machine
CN116029379B (en) Method for constructing air target intention recognition model
CN112084294A (en) Whole vehicle electromagnetic compatibility grading management method based on artificial intelligence
CN115600102B (en) Abnormal point detection method and device based on ship data, electronic equipment and medium
CN115600121A (en) Data hierarchical classification method and device, electronic equipment and storage medium
CN116484244A (en) Automatic driving accident occurrence mechanism analysis method based on clustering model
CN114692746A (en) Information entropy based classification method of fuzzy semi-supervised support vector machine
CN113656707A (en) Financing product recommendation method, system, storage medium and equipment
Chakrapani et al. Predicting performance analysis of system configurations to contrast feature selection methods
Abdelatif et al. Optimization of the organized Kohonen map by a new model of preprocessing phase and application in clustering
CN111108516A (en) Evaluating input data using a deep learning algorithm
Liu Extracting Rules from Trained Machine Learning Models with Applications in Bioinformatics
CN116844649B (en) Interpretable cell data analysis method based on gene selection
US20230297651A1 (en) Cost equalization spectral clustering

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant