CN115600121A - Data hierarchical classification method and device, electronic equipment and storage medium - Google Patents


Info

Publication number
CN115600121A
CN115600121A
Authority
CN
China
Prior art keywords
data set
original data
training
regression
data
Prior art date
Legal status
Granted
Application number
CN202210446117.7A
Other languages
Chinese (zh)
Other versions
CN115600121B (en)
Inventor
张明
张儒
郭震
金云峰
孙自飞
甘雨
路明标
姜栋
Current Assignee
Nanjing Tianfu Software Co ltd
Original Assignee
Nanjing Tianfu Software Co ltd
Priority date
Filing date
Publication date
Application filed by Nanjing Tianfu Software Co ltd
Priority to CN202210446117.7A
Publication of CN115600121A
Application granted
Publication of CN115600121B
Legal status: Active
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F30/00Computer-aided design [CAD]
    • G06F30/10Geometric CAD
    • G06F30/15Vehicle, aircraft or watercraft design
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F30/00Computer-aided design [CAD]
    • G06F30/20Design optimisation, verification or simulation
    • G06F30/27Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM] or training a model

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Geometry (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Hardware Design (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Computational Mathematics (AREA)
  • Aviation & Aerospace Engineering (AREA)
  • Automation & Control Theory (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The disclosure relates to the technical field of data processing in hull profile design, and provides a data hierarchical classification method and device, an electronic device, and a storage medium, applied to hull profile design. The method comprises the following steps: S101, pre-segmenting an original data set; S102, classifying and training the sub-data sets; S103, verifying the data segmentation scheme; S104, selecting a final data segmentation scheme. Given the objective limitation of data scale in industrial design, and aiming at the problem that an industrial design data set contains multiple mixed modes or has poor internal consistency, the data set is preprocessed with a data hierarchical classification method, applied for the first time to hull line design driven by an industrial data set. By mining the multiple mixed modes inside the sample training set, the front-placed layering operation purifies the quality of the data set, improves the accuracy of data modeling, and raises designers' utilization rate of accumulated hull form data; the method has a wide application range and effectively assists the intelligent design of hull lines.

Description

Data hierarchical classification method and device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of data processing technologies in hull profile design, and in particular, to a data hierarchical classification method and apparatus, an electronic device, and a storage medium.
Background
Traditional ship design is, in essence, empirical design: the final decision result depends to a great extent on the subjective experience and knowledge structure of the decision maker. The expert consulting (Delphi) method takes the subjective judgment of experts as the decision basis and uses scores, indexes, ordinal numbers, comments, and the like as evaluation criteria; it is a simple method lacking theory and systematicness, and the objective authenticity of its evaluation results is difficult to guarantee. The Analytic Hierarchy Process (AHP) is used to study multi-objective decision problems with more complex structures and can quantify qualitative problems, making the evaluation results more scientific and reasonable. By comparing the evaluation indexes pairwise, AHP obtains a judgment matrix reflecting the relative importance of each attribute, so its reliability is high and its error is small; its defect is that, limited by the decision maker's knowledge structure, personal preference, judgment level, and so on, the judgment matrix easily fails to meet the consistency requirement.
With the advent and development of advanced intelligent technology, scientific and reasonable decision-making modes have been introduced into the design process. The concept of the Decision Support System (DSS) pushed decision theory to a new stage of development; it has achieved great success in fields such as system engineering and management science, and is often used to solve decision problems in semi-structured and unstructured complex information systems. In recent years, the emergence of advanced intelligent technologies such as Online Analytical Processing (OLAP) and Data Mining (DM) based on the Data Warehouse (DW) has opened up a new approach for the development of DSS.
A ship form decision support system comprises a database and database management module, a model base and model base management module, a knowledge base and knowledge base management module, a data warehouse and data warehouse management module, a data mining module, a knowledge discovery module, a human-machine interaction module, and the like. The data mining and knowledge discovery modules are responsible for querying, analyzing, mining, selecting, and evaluating data, and for extracting the decision information hidden in the data by means of intelligent technologies such as genetic algorithms, neural networks, statistical analysis, machine learning, and fuzzy decision-making.
Such intelligent technology is mostly data-driven, and how to use the ship form data accumulated by an enterprise to provide efficient references for designers is the main research content of data mining. In the prior art, a ship form highly relevant to the design requirement is selected from the accumulated data as the parent hull (mother form) to guide the ship form design; however, the utilization rate of the ship form data is extremely low, only excellent ship forms highly relevant to the design requirement are used, and the interrelations among the selected ship forms are not considered.
The introduction of an agent model training technology based on an artificial intelligence technology is one of the key technologies for solving the problems. Considering that the ship model test or the actual measurement data is limited, the training sample of the agent model may be a simulation data sample provided by a Computational Fluid Dynamics (CFD) solution tool, and the test or actual measurement data may be used to correct the CFD solution model or the boundary condition. Through the technology, most data in the ship type database can be utilized, so that a designer is guided to carry out ship type design, and the utilization rate of the ship type data is greatly improved. Meanwhile, the evaluation time of the agent model is far shorter than the CFD simulation calculation time, and the engineering design period can be greatly shortened by using the agent model.
The agent (surrogate) model training technology based on artificial intelligence can effectively solve the problems of low data utilization rate and long design period, but the training and use of the agent model also face some problems. For example, the limited number of data training samples and the inconsistency of sample point classes increase the difficulty of improving the training precision of the agent model. This is particularly true in data-driven learning on industrial data sets: in the front-end 'design segment' of the industrial process, the data exhibit high single-point value density and small data set scale. Learning algorithms such as machine learning then generally suffer from data hunger (data-hungry) and the curse of dimensionality (dimension-curse): the stronger the nonlinear expression capability of a model, the higher its requirements on the scale and diversity of the training data; conversely, an algorithm model with only ordinary nonlinear expression capability cannot effectively extract the complex mapping modes in the training data set, and its effect is too weak to support the relevant applications.
Due to objective limitation of the scale of the data set in the industrial design stage, complex models such as deep learning cannot be effectively activated and used, only machine learning algorithms with stronger statistical attributes can be used, but the nonlinear expression capability of the machine learning algorithms is limited, and particularly under the condition that the data set in the industrial design stage has various mixed modes or the consistency in the data set is poor, the modeling effect of the learning algorithms is further weakened.
Disclosure of Invention
The present disclosure is directed to at least one of the problems in the prior art, and provides a data hierarchical classification method and apparatus, an electronic device, and a storage medium.
One aspect of the present disclosure provides a data hierarchical classification method, which is applied to hull contour design, and includes the following steps:
pre-segmentation of the original data set: according to the sample classification number specified by a user, carrying out clustering hierarchical processing on an original data set by adopting a Gaussian Mixture Model (GMM), and segmenting the original data set into a plurality of subdata sets corresponding to the sample classification number to obtain a current data segmentation scheme, wherein the original data set is an industrial data set in hull molded line design;
classifying the training subdata set: respectively adding a subclass label to each subdata set to obtain a training data set, and training a Random Forest (RF) classifier based on the training data set to obtain a subclass classifier;
verifying the data partitioning scheme: performing regression training on an original data set and a plurality of sub data sets respectively by using a regression algorithm model based on a Gradient Boosting Decision Tree (GBDT) to obtain an original data set regression model and a plurality of sub data set regression models, wherein each sub data set regression model corresponds to one sub data set respectively, and determining the proxy performance of the current data segmentation scheme and the proxy performance of the original data set respectively by combining a subclass classifier and cross validation;
selecting a final data partitioning scheme: and evaluating the current data segmentation scheme based on the proxy performance of the current data segmentation scheme and the proxy performance of the original data set, and determining a final data segmentation result according to the evaluation result.
Optionally, the cross validation includes K-fold cross validation, and after obtaining the regression model of the original data set and the regression models of the multiple sub data sets, determining the proxy performance of the current data partitioning scheme and the proxy performance of the original data set respectively by combining the subclass classifier and the cross validation, including:
dividing an original data set into K original data subsets randomly and equally, taking one original data subset as a test set in turn and taking the other corresponding original data subsets as a training set, training and testing a plurality of sub data set regression models and an original data set regression model based on the training set, the test set and a subclass classifier, and respectively obtaining errors corresponding to the plurality of sub data set regression models and errors corresponding to the original data set regression model;
and respectively determining the proxy performance of the current data segmentation scheme and the proxy performance of the original data set based on the errors corresponding to the multiple sub data set regression models and the errors corresponding to the original data set regression model.
Optionally, training and testing the multiple regression sub-data sets and the regression original data set model based on the training set, the testing set, and the subclass classifier to obtain errors corresponding to the multiple regression sub-data sets and errors corresponding to the regression original data set, respectively, including:
training a plurality of subdata set regression models based on the training set;
judging the subclass category of each sample in the test set based on a subclass classifier, determining a subdata set regression model corresponding to each sample from a plurality of trained subdata set regression models based on the judged subclass category, and inputting each sample in the test set into the corresponding subdata set regression model respectively to obtain a predicted value corresponding to each sample;
and determining errors corresponding to the regression models of the plurality of subdata sets based on the real values and the corresponding predicted values of the samples in the test set.
Optionally, determining errors corresponding to the regression models of the multiple sub data sets based on the true values and the predicted values corresponding to the true values of the samples in the test set, including:
determining errors corresponding to the regression models of the plurality of subdata sets according to the following formula (1):
E_j = (1/n) · Σ_{i=1}^{n} |y_i − ŷ_i| / |y_i|        (1)

wherein j = 1, 2, ..., K is the test set number, E_j is the Relative Mean Absolute Error (RMAE) of the plurality of sub data set regression models on test set j, i = 1, 2, ..., n is the sample number in test set j, y_i is the true value of the i-th sample in test set j, and ŷ_i is the predicted value corresponding to the i-th sample in test set j.
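As a numerical check, the RMAE of formula (1) can be evaluated directly. The sketch below assumes the error is the mean absolute deviation normalized by the true-value magnitude, and the values are illustrative only (NumPy assumed):

```python
import numpy as np

# y_i: true values of the samples in test set j (illustrative values).
y_true = np.array([2.0, 4.0, 5.0])
# Corresponding predicted values from the sub-data-set regression models.
y_pred = np.array([2.2, 3.6, 5.5])

# RMAE: mean absolute error, each term normalized by the true value.
E_j = np.mean(np.abs(y_true - y_pred) / np.abs(y_true))
print(round(float(E_j), 4))  # → 0.1
```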
Optionally, the determining, based on errors corresponding to the multiple regression models of the subdata set and errors corresponding to the regression model of the original data set, a proxy performance of the current data partitioning scheme and a proxy performance of the original data set respectively includes:
determining a proxy performance of the current data splitting scheme according to the following equation (2):
split_perf = 1 − (1/K) · Σ_{j=1}^{K} E_j        (2)

wherein split_perf is the proxy performance of the current data splitting scheme and E_j is the error obtained on test set j according to formula (1); the proxy performance baseline_perf of the original data set is determined in the same manner from the errors corresponding to the original data set regression model.
Optionally, training and testing the multiple regression sub-data sets and the regression model of the original data set based on the training set, the testing set and the subclass classifier to obtain errors corresponding to the multiple regression sub-data sets and errors corresponding to the regression model of the original data set, respectively, further comprising:
training an original data set regression model based on the training set;
respectively inputting each sample in the test set into a trained original data set regression model to obtain a predicted value corresponding to each sample;
and determining the error corresponding to the regression model of the original data set based on the true value and the corresponding predicted value of each sample in the test set.
Optionally, the random forest classifier is established based on a Classification And Regression Tree (CART) model.
In another aspect of the present disclosure, a data hierarchical classification apparatus is provided, which is applied to hull profile design, and the apparatus includes:
the pre-segmentation module is used for pre-segmenting the original data set: according to the sample classification number specified by a user, carrying out clustering hierarchical processing on an original data set by adopting a Gaussian mixture model, and dividing the original data set into a plurality of subdata sets corresponding to the sample classification number to obtain a current data division scheme, wherein the original data set is an industrial data set in the hull molded line design;
the classification training module is used for classifying and training the subdata set: respectively adding a subclass label to each subdata set to obtain a training data set, and training a random forest classifier based on the training data set to obtain a subclass classifier;
a verification module to verify a data partitioning scheme: performing regression training on the original data set and the plurality of subdata sets respectively by using a regression algorithm model based on a gradient lifting decision tree to obtain an original data set regression model and a plurality of subdata set regression models, wherein each subdata set regression model corresponds to one subdata set respectively, and determining the proxy performance of the current data segmentation scheme and the proxy performance of the original data set respectively by combining a subclass classifier and cross validation;
a selection module to select a final data partitioning scheme: and evaluating the current data segmentation scheme based on the proxy performance of the current data segmentation scheme and the proxy performance of the original data set, and determining a final data segmentation result according to the evaluation result.
In another aspect of the present disclosure, there is provided an electronic device including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method for hierarchical classification of data as described above.
In another aspect of the disclosure, a computer-readable storage medium is provided, in which a computer program is stored, and the computer program is executed by a processor to implement the data hierarchical classification method described above.
Compared with the prior art, the present disclosure starts from the objective limitation of data scale in industrial design problems and addresses the problem that the data set of an industrial design segment contains multiple mixed modes or has poor internal consistency. It preprocesses the data set with a data hierarchical classification method, applied for the first time to hull profile design driven by an industrial data set. By mining the multiple mixed modes inside the sample training set, the front-placed layering operation purifies the quality of the data set, improves the precision of data modeling, and raises designers' utilization rate of the hull form data accumulated by an enterprise; the method has a wide application range and effectively assists the intelligent design of the hull profile.
Drawings
One or more embodiments are illustrated by way of example in the accompanying drawings, in which like reference numerals denote similar elements; the figures are not drawn to scale unless otherwise specified.
Fig. 1 is a flowchart of a data hierarchical classification method according to an embodiment of the present disclosure;
fig. 2 is a flowchart of a data hierarchical classification method according to another embodiment of the present disclosure;
fig. 3 is a flowchart of a data hierarchical classification method according to another embodiment of the present disclosure;
fig. 4 is a flowchart of a data hierarchical classification method according to another embodiment of the present disclosure;
FIG. 5 is a graph illustrating the comparative effect of predicted values and actual values of two modeling schemes provided by another embodiment of the present disclosure;
FIG. 6 is a graph of error comparison results for three modeling solutions provided by another embodiment of the present disclosure;
FIG. 7 is a schematic diagram illustrating a visualization result of a test data set according to another embodiment of the present disclosure;
fig. 8 is a schematic structural diagram of a data hierarchical classification apparatus according to another embodiment of the present disclosure;
fig. 9 is a schematic structural diagram of an electronic device according to another embodiment of the present disclosure.
Detailed Description
To make the objects, technical solutions, and advantages of the embodiments of the present disclosure clearer, the embodiments are described in detail below with reference to the accompanying drawings. Those of ordinary skill in the art will appreciate that numerous technical details are set forth in the embodiments to provide a better understanding of the disclosure; however, the technical solutions claimed in the present disclosure can be implemented without these technical details, and with various changes and modifications based on the following embodiments. The division into embodiments is for convenience of description, should not limit the specific implementation of the present disclosure, and the embodiments may be combined and cross-referenced where not contradictory.
One embodiment of the present disclosure relates to a data hierarchical classification method, which is applied to hull contour design, and the flow of the data hierarchical classification method is shown in fig. 1, and includes the following steps:
s101: pre-segmentation of the original data set: according to the sample classification number specified by a user, carrying out clustering hierarchical processing on an original data set by adopting a Gaussian mixture model, and dividing the original data set into a plurality of subdata sets corresponding to the sample classification number to obtain a current data division scheme, wherein the original data set is an industrial data set in the hull molded line design.
Specifically, for the original data set D, the unsupervised clustering algorithm Gaussian mixture model may be used to divide D into split_n sub-data sets (D_1, D_2, ..., D_split_n) according to the user-specified sample classification number split_n, thereby obtaining the current data segmentation scheme. That is, the current data segmentation scheme clusters and layers the original data set with a Gaussian mixture model according to the sample classification number split_n specified by the user, segmenting the original data set into split_n sub-data sets.
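As a concrete illustration, step S101 can be sketched with scikit-learn's Gaussian mixture model. The synthetic data and the value of split_n below are illustrative assumptions, not part of the disclosure:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Synthetic "original data set" D with two mixed modes (illustrative only).
D = np.vstack([rng.normal(0.0, 1.0, (50, 3)),
               rng.normal(5.0, 1.0, (50, 3))])

split_n = 2  # user-specified sample classification number
gmm = GaussianMixture(n_components=split_n, random_state=0).fit(D)
labels = gmm.predict(D)  # cluster index assigned to every sample

# The current data segmentation scheme: sub-data sets D_1, ..., D_split_n.
sub_datasets = [D[labels == k] for k in range(split_n)]
print([len(s) for s in sub_datasets])
```

Every sample of D lands in exactly one sub-data set, so the sub-data set sizes always sum to the size of D.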
S102: classifying the training subdata set: and respectively adding a subclass label to each subdata set to obtain a training data set, and training the random forest classifier based on the training data set to obtain a subclass classifier.
Specifically, the subclass labels are denoted label = 1, 2, ..., split_n. After adding a subclass label to each sub-data set (D_1, D_2, ..., D_split_n), the resulting training data set can be represented as D_splited = {(D_1, label=1), (D_2, label=2), ..., (D_split_n, label=split_n)}. Based on this training data set, a random forest classifier, a supervised algorithm, is trained to obtain the subclass classifier, which serves as an intermediate-stage classifier that judges the subclass attribution of a newly arrived data sample so as to determine which regression model to activate.
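Step S102 can be sketched as follows with scikit-learn's random forest; the two synthetic sub-data sets stand in for (D_1, ..., D_split_n) from step S101 and are illustrative assumptions:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
# Illustrative sub-data sets D_1 and D_2 with distinct modes.
sub_datasets = [rng.normal(0.0, 1.0, (40, 3)),
                rng.normal(5.0, 1.0, (40, 3))]

# Training data set D_splited = {(D_1, label=1), ..., (D_split_n, label=split_n)}.
X = np.vstack(sub_datasets)
y = np.concatenate([np.full(len(Dk), k + 1) for k, Dk in enumerate(sub_datasets)])

# The subclass classifier: a random forest trained on the labeled subsets.
subclass_clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# It later judges the subclass of a new sample to pick the regression model.
print(subclass_clf.predict([[5.0, 5.0, 5.0]]))
```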
The random forest classifier is a special bootstrap aggregating algorithm (bagging) that uses the decision tree CART algorithm as the base model in the bagging strategy. First, m training sets are generated from the native data set by bootstrap sampling, and an independent decision tree is constructed for each training set. When a node searches for a feature to split on, it does not search all features for the one that maximizes the index (such as information gain); instead, it randomly extracts a subset of the features, finds the optimal split among the extracted features, and applies it to the node. The random forest method thus samples both the samples and the features, which effectively avoids the over-fitting problem.
S103: verifying the data partitioning scheme: performing regression training on the original data set and the plurality of subdata sets respectively by using a regression algorithm model based on a gradient lifting decision tree to obtain an original data set regression model and a plurality of subdata set regression models, wherein each subdata set regression model corresponds to one subdata set respectively, and determining the proxy performance of the current data segmentation scheme and the proxy performance of the original data set respectively by combining a subclass classifier and cross validation.
Specifically, this step may use a regression algorithm model based on a gradient boosting decision tree to perform regression training on the original data set D and on the split_n sub-data sets (D_1, D_2, ..., D_split_n) respectively, obtaining the original data set regression model estimator_baseline and the split_n sub-data-set regression models estimator_1, estimator_2, ..., estimator_split_n. Using cross validation as the flow logic, the proxy performance split_perf of the current data segmentation scheme and the proxy performance baseline_perf of the original data set are obtained respectively, thereby verifying whether the current data segmentation scheme can effectively improve the modeling regression effect.
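The regression-training part of step S103 can be sketched with scikit-learn's GBDT regressor; the 2-D inputs and the two response modes below are illustrative assumptions only:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(2)
# Illustrative sub-data sets with different input ranges and response modes.
X1, X2 = rng.uniform(0, 1, (60, 2)), rng.uniform(2, 3, (60, 2))
y1, y2 = X1.sum(axis=1), X2.sum(axis=1) ** 2

# estimator_baseline: one GBDT model fitted on the whole original data set D.
estimator_baseline = GradientBoostingRegressor(random_state=0).fit(
    np.vstack([X1, X2]), np.concatenate([y1, y2]))

# estimator_1, estimator_2: one GBDT model fitted per sub-data set.
estimator_1 = GradientBoostingRegressor(random_state=0).fit(X1, y1)
estimator_2 = GradientBoostingRegressor(random_state=0).fit(X2, y2)
```

Each sub-model only ever sees the mode of its own sub-data set, which is what lets the split outperform the single baseline model when the modes differ.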
S104: selecting a final data partitioning scheme: and evaluating the current data segmentation scheme based on the proxy performance of the current data segmentation scheme and the proxy performance of the original data set, and determining a final data segmentation result according to an evaluation result.
Specifically, the current data partitioning scheme is evaluated by comparing its proxy performance split_perf with the proxy performance baseline_perf of the original data set. If split_perf > baseline_perf holds, the current data partitioning scheme performs better, the partition is effective, and the sub-data sets (D_1, D_2, ..., D_split_n) are output as the final data segmentation result. If split_perf > baseline_perf does not hold, the current scheme performs worse, the partition is invalid, and the original data set D is output.
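The selection rule of step S104 reduces to a single comparison; variable names follow the description and the values in the usage lines are illustrative:

```python
def select_final_scheme(split_perf, baseline_perf, sub_datasets, original_dataset):
    """Return the sub-data sets if the partition is effective, else the original set."""
    if split_perf > baseline_perf:
        return sub_datasets      # partition effective: output (D_1, ..., D_split_n)
    return original_dataset      # partition invalid: output the original data set D

# Usage: effective split keeps the sub-data sets, ineffective split falls back to D.
assert select_final_scheme(0.9, 0.8, ["D1", "D2"], "D") == ["D1", "D2"]
assert select_final_scheme(0.7, 0.8, ["D1", "D2"], "D") == "D"
```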
Compared with the prior art, the present disclosure starts from the objective limitation of data scale in industrial design problems and addresses the problem that the data set of an industrial design segment contains multiple mixed modes or has poor internal consistency. The data set is preprocessed with a data hierarchical classification method, applied for the first time to hull profile design driven by an industrial data set. By mining the multiple mixed modes inside the sample training set, the front-placed layering operation purifies the quality of the data set, improves the accuracy of data modeling, and raises designers' utilization rate of the hull form data accumulated by an enterprise; the method has a wide application range and effectively assists the intelligent design of the hull profile.
Illustratively, before step S101, an obtaining step of obtaining a sample classification number specified by a user and a raw data set may be further included.
Illustratively, the cross validation includes K-fold cross validation.
Specifically, the basic idea of K-fold cross validation is that the initial sample set is divided into K sub-samples; one sub-sample is retained as the data for validating the model, and the other K−1 sub-samples are used for training. Cross validation is repeated K times so that each sub-sample is validated exactly once, and the K results are averaged (or combined in some other way) to obtain a single estimate. The advantage of K-fold cross validation is that the randomly generated sub-samples are used repeatedly for training and validation: every sample is guaranteed to serve as training data and also to have the opportunity to serve as test data, so the data can be better utilized. K is generally taken between 2 and 10, and 10-fold cross validation is the most commonly used.
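The fold structure described above can be sketched with scikit-learn's KFold splitter; the tiny data set is illustrative only:

```python
import numpy as np
from sklearn.model_selection import KFold

D = np.arange(20).reshape(10, 2)   # illustrative data set of 10 samples
K = 5
folds = list(KFold(n_splits=K, shuffle=True, random_state=0).split(D))

# Across the K rounds, every sample appears in exactly one test fold.
test_indices = np.concatenate([test for _, test in folds])
print(sorted(int(i) for i in test_indices))  # → [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```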
Illustratively, after obtaining the regression model of the original data set and the regression models of the plurality of sub data sets, respectively determining the proxy performance of the current data partitioning scheme and the proxy performance of the original data set by combining a subclass classifier and cross validation, the method includes:
and training and testing the multiple sub data set regression models and the original data set regression model based on the training set, the testing set and the subclass classifier to respectively obtain errors corresponding to the multiple sub data set regression models and errors corresponding to the original data set regression model.
Specifically, the original data set D is randomly and equally divided into K original data subsets (D'_1, D'_2, ..., D'_K). Each subset D'_1, D'_2, ..., D'_K is taken in turn as the test set, with the corresponding remaining subsets, namely (D'_2, D'_3, ..., D'_K), (D'_1, D'_3, ..., D'_K), ..., (D'_1, D'_2, ..., D'_{K-1}), as the training set. Using the subclass classifier, the split_n sub-data-set regression models estimator_1, estimator_2, ..., estimator_split_n and the original data set regression model estimator_baseline are trained and tested, and the errors corresponding to the split_n sub-data-set regression models and the error corresponding to the original data set regression model are obtained respectively.
And respectively determining the proxy performance of the current data segmentation scheme and the proxy performance of the original data set based on the errors corresponding to the multiple sub data set regression models and the errors corresponding to the original data set regression model.
By performing K-fold cross validation on the current data segmentation scheme, training set data can be better utilized, and the obtained evaluation result, namely the proxy performance of the current data segmentation scheme and the proxy performance of the original data set, can be as close as possible to the performance of the model on the test set.
Illustratively, training and testing the regression models of the multiple sub data sets and the regression model of the original data set based on the training set, the testing set and the subclass classifier to obtain errors corresponding to the regression models of the multiple sub data sets and errors corresponding to the regression model of the original data set, respectively, includes the following steps, as shown in fig. 2:
s201: training a plurality of subdata set regression models based on the training set;
s202: judging the subclass category of each sample in the test set based on a subclass classifier, determining a subdata set regression model corresponding to each sample from a plurality of trained subdata set regression models based on the judged subclass category, and inputting each sample in the test set into the corresponding subdata set regression model respectively to obtain a predicted value corresponding to each sample;
s203: and determining errors corresponding to the regression models of the plurality of subdata sets based on the real values of the samples in the test set and the predicted values corresponding to the real values.
Specifically, since the K original data subsets are taken in turn as the test set with the corresponding remaining original data subsets as the training set, steps S201 to S203 are repeated K times, once with each of D′_1, D′_2, ..., D′_K as the test set and the corresponding remaining subsets (D′_2, D′_3, ..., D′_K), (D′_1, D′_3, ..., D′_K), ..., (D′_1, D′_2, ..., D′_{K-1}) as the training set, to obtain the errors of the split_n sub-data-set regression models corresponding to each test set.
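Steps S201 to S202, routing each test sample to the regression model of its predicted subclass, can be sketched as follows. The data, the two-subclass setup, and the model choices are illustrative assumptions; the patent does not fix a concrete implementation.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
# Two artificial subclasses with different responses (assumption for the sketch)
X = rng.normal(size=(200, 3))
labels = (X[:, 0] > 0).astype(int)
y = np.where(labels == 0, X[:, 1], -X[:, 1])

# Subclass classifier and one trained regression model per subclass
clf = RandomForestClassifier(random_state=0).fit(X, labels)
sub_models = {
    k: DecisionTreeRegressor(random_state=0).fit(X[labels == k], y[labels == k])
    for k in (0, 1)
}

# S202: judge each test sample's subclass, then use the matching sub-model
X_test = rng.normal(size=(20, 3))
pred_labels = clf.predict(X_test)
preds = np.array([
    sub_models[k].predict(x.reshape(1, -1))[0]
    for k, x in zip(pred_labels, X_test)
])
```

The predicted values `preds` are then compared against the true values in step S203 to compute the per-fold error.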
Illustratively, training and testing the regression models of the multiple sub data sets and the regression model of the original data set based on the training set, the testing set and the subclass classifier to obtain errors corresponding to the regression models of the multiple sub data sets and errors corresponding to the regression model of the original data set, respectively, further includes the following steps, as shown in fig. 3:
s301: training an original data set regression model based on the training set;
s302: respectively inputting each sample in the test set into a trained original data set regression model to obtain a predicted value corresponding to each sample;
s303: and determining the error corresponding to the regression model of the original data set based on the true value and the corresponding predicted value of each sample in the test set.
Specifically, since the K original data subsets are taken in turn as the test set with the corresponding remaining original data subsets as the training set, steps S301 to S303 are repeated K times, once with each of D′_1, D′_2, ..., D′_K as the test set and the corresponding remaining subsets (D′_2, D′_3, ..., D′_K), (D′_1, D′_3, ..., D′_K), ..., (D′_1, D′_2, ..., D′_{K-1}) as the training set, to obtain the errors of the original-data-set regression model corresponding to each test set.
Illustratively, determining errors corresponding to the regression models of the plurality of subdata sets based on the real values and the corresponding predicted values of the samples in the test set includes:
determining errors corresponding to the regression models of the plurality of subdata sets according to the following formula (1):

E_j = (1/n) · Σ_{i=1}^{n} |y_i − ŷ_i| / |y_i|      (1)

wherein j = 1, 2, ..., K is the test set number, E_j is the relative mean absolute error of the multiple sub-data-set regression models corresponding to test set j, i = 1, 2, ..., n, where n is the number of samples in test set j, y_i is the true value of the i-th sample in test set j, and ŷ_i is the predicted value corresponding to the i-th sample in test set j.
In addition, by replacing E_j with the relative mean absolute error of the original-data-set regression model on test set j, y_i with the true value of the i-th sample in test set j, and ŷ_i with the predicted value corresponding to the i-th sample in test set j, the error corresponding to the original-data-set regression model is obtained.
Illustratively, determining the proxy performance of the current data partitioning scheme and the proxy performance of the original data set based on the errors corresponding to the regression models of the plurality of sub data sets and the errors corresponding to the regression model of the original data set, respectively, includes:
determining proxy performance of the current data partitioning scheme according to equation (2) below:

split_perf = 1 − (1/K) · Σ_{j=1}^{K} E_j      (2)

wherein split_perf is the proxy performance of the current data splitting scheme.

Note that by replacing split_perf with the proxy performance of the original data set and E_j with the relative mean absolute error of the original-data-set regression model on test set j, the proxy performance of the original data set is obtained.
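Formulas (1) and (2) can be sketched in a few lines. The toy values, and the "1 minus mean error" form of the proxy score (chosen so that a higher value means better performance, consistent with the selection step later), are assumptions for illustration.

```python
import numpy as np

def rmae(y_true, y_pred):
    """Relative mean absolute error of one test set (formula (1))."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs((y_true - y_pred) / y_true)))

def proxy_performance(fold_errors):
    """Combine the K per-fold errors into one proxy score (formula (2), assumed form)."""
    return 1.0 - float(np.mean(fold_errors))

# Toy example: two folds, each with two samples (assumed values)
fold_errors = [rmae([1.0, 2.0], [1.1, 1.8]), rmae([1.0, 2.0], [0.9, 2.2])]
split_perf = proxy_performance(fold_errors)
```

Running the same two functions on the baseline model's fold errors yields baseline_perf, so the two schemes are compared on an identical scale.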
Illustratively, the random forest classifier is built based on a classification regression tree model.
Specifically, the principle of classifying the regression tree CART model is as follows:
inputting: a training data set;
and (3) outputting: classifying the regression tree f (x);
in an input space where a training data set is located, recursively dividing each region into two sub-regions, determining an output value on each sub-region, and constructing a binary decision tree:
1) Select the optimal splitting variable j and splitting point s by solving:

min_{j,s} [ min_{c1} Σ_{x_i ∈ R1(j,s)} (y_i − c1)² + min_{c2} Σ_{x_i ∈ R2(j,s)} (y_i − c2)² ]

Traverse the variables j and, for each fixed splitting variable j, scan the splitting points s, selecting the pair (j, s) that minimizes the above expression.
2) Divide the region with the selected pair (j, s) and determine the corresponding output values:

R1(j,s) = {x | x^(j) ≤ s},  R2(j,s) = {x | x^(j) > s}

ĉ_m = (1/N_m) · Σ_{x_i ∈ R_m(j,s)} y_i,  m = 1, 2
3) Continuing to call the steps 1) and 2) for the two sub-areas until a stop condition is met;
4) Divide the input space into M regions R_1, R_2, ..., R_M and generate the decision tree:

f(x) = Σ_{m=1}^{M} ĉ_m · I(x ∈ R_m)
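Step 1), the exhaustive search for the best pair (j, s), can be sketched as follows on toy data. The data and the brute-force loop are illustrative assumptions; production CART implementations use more efficient sorted scans.

```python
import numpy as np

def best_split(X, y):
    """Return (j, s, err): the split minimising the two-region squared error."""
    best = (None, None, np.inf)
    for j in range(X.shape[1]):           # traverse splitting variables j
        for s in np.unique(X[:, j]):      # scan candidate splitting points s
            left, right = y[X[:, j] <= s], y[X[:, j] > s]
            if len(left) == 0 or len(right) == 0:
                continue
            # c1, c2 are the region means, the optimal constants per region
            err = ((left - left.mean()) ** 2).sum() \
                + ((right - right.mean()) ** 2).sum()
            if err < best[2]:
                best = (j, s, err)
    return best

# Toy data (assumption): one feature, a clean step at x = 1
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
j, s, err = best_split(X, y)
```

For this toy set the search recovers the obvious split, variable 0 at point 1.0 with zero residual error; steps 2) to 4) then recurse on each side.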
illustratively, the gaussian mixture model GMM in step S101 is a linear combination of a plurality of gaussian distribution functions, and is formulated as:
p(x) = Σ_{k=1}^{K} π_k · N(x | μ_k, Σ_k)

wherein (μ_k, Σ_k) are the parameters of the k-th Gaussian distribution function and π_k is the probability that the current point is selected as class k. The core idea of the GMM algorithm is to adjust the parameter combination (π_k, μ_k, Σ_k) so that the GMM model attains the maximum likelihood probability on the current data set, where the likelihood probability is calculated as:

ln p(X | π, μ, Σ) = Σ_{n=1}^{N} ln [ Σ_{k=1}^{K} π_k · N(x_n | μ_k, Σ_k) ]
the solution process of the GMM algorithm involves the Expectation-Maximization (EM) algorithm, which alternates two steps: the first step obtains a rough estimate of the parameters to be estimated, and the second step maximizes the likelihood function using the parameter estimates from the first step. An intermediate latent variable γ(z_nk) is introduced, representing the posterior probability that the n-th point x_n is assigned to class k:

γ(z_nk) = π_k · N(x_n | μ_k, Σ_k) / Σ_{j=1}^{K} π_j · N(x_n | μ_j, Σ_j)
according to the M-step of the EM algorithm, the partial derivatives of the likelihood with respect to the parameters (π, μ, Σ) are taken and set to 0, giving the following update formulas:

μ_k = (1/N_k) · Σ_{n=1}^{N} γ(z_nk) · x_n

Σ_k = (1/N_k) · Σ_{n=1}^{N} γ(z_nk) · (x_n − μ_k)(x_n − μ_k)^T

π_k = N_k / N

wherein: N_k = Σ_{n=1}^{N} γ(z_nk)
recalculating the log-likelihood function of the GMM model with the updated (π, μ, Σ) parameters, i.e.:

ln p(X | π, μ, Σ) = Σ_{n=1}^{N} ln [ Σ_{k=1}^{K} π_k · N(x_n | μ_k, Σ_k) ]
Whether the parameters (π, μ, Σ) or the log-likelihood function have converged is then checked; if not, the iteration process is repeated. At this point, based on the GMM iterative correction logic, a mixed distribution statistical model of the current mixed data set can be obtained, and classification of the training set samples is achieved based on this mixed distribution statistical model.
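The GMM clustering step can be sketched with scikit-learn's GaussianMixture, which runs the same EM iteration (E-step responsibilities γ(z_nk), M-step updates of π, μ, Σ, convergence check on the log-likelihood) internally. The library choice and the toy two-cluster data are assumptions for illustration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(2)
# Two well-separated Gaussian clusters (illustrative assumption)
X = np.vstack([
    rng.normal(loc=-5.0, size=(100, 2)),
    rng.normal(loc=+5.0, size=(100, 2)),
])

# Fit a 2-component mixture by EM until the log-likelihood converges
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
labels = gmm.predict(X)       # hard subclass assignment per sample
resp = gmm.predict_proba(X)   # posterior responsibilities gamma(z_nk)
```

The hard labels are what the pre-segmentation step uses to split the original data set into sub data sets; the responsibilities are the γ(z_nk) of the E-step.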
Illustratively, the gradient boosting decision tree GBDT algorithm in step S103 is an iterative decision tree algorithm. The algorithm is an additive combination of a series of regression trees (CART): each new tree fits the residual between the previous prediction and the target, and the outputs of all trees are accumulated to obtain the final answer. The principle of the GBDT algorithm is as follows:
1) Initialize the weak learner:

f_0(x) = argmin_c Σ_{i=1}^{N} L(y_i, c)
2) For m = 1, 2, ..., M:

(a) For each sample i = 1, 2, ..., N, calculate the negative gradient, i.e. the residual:

r_mi = −[∂L(y_i, f(x_i)) / ∂f(x_i)]_{f = f_{m−1}}
(b) Take the residuals obtained in the previous step as the new true values of the samples, and use the data (x_i, r_mi), i = 1, 2, ..., N, as training data for the next tree to obtain a new regression tree f_m(x) whose leaf node regions are R_jm, j = 1, 2, ..., J, wherein J is the number of leaf nodes of the regression tree.
(c) For each leaf region j = 1, 2, ..., J, calculate the best fit:

c_jm = argmin_c Σ_{x_i ∈ R_jm} L(y_i, f_{m−1}(x_i) + c)
(d) Update the strong learner:

f_m(x) = f_{m−1}(x) + Σ_{j=1}^{J} c_jm · I(x ∈ R_jm)
3) Obtain the final learner:

f(x) = f_M(x) = f_0(x) + Σ_{m=1}^{M} Σ_{j=1}^{J} c_jm · I(x ∈ R_jm)
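The boosting loop above can be sketched directly for squared-error loss, where the negative gradient reduces to the plain residual y − f(x). The depth, learning rate, round count, and sine-wave data are illustrative assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(3)
X = rng.uniform(-1, 1, size=(200, 1))
y = np.sin(3 * X[:, 0])

f = np.full_like(y, y.mean())   # step 1): initialise with the best constant
trees, lr = [], 0.1
for m in range(50):             # step 2): M boosting rounds
    r = y - f                   # (a) residuals = negative gradient of squared loss
    tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, r)  # (b)
    trees.append(tree)
    f += lr * tree.predict(X)   # (c)+(d): add the fitted tree's leaf values

final_error = float(np.mean(np.abs(y - f)))
```

Each round shrinks the residual a little, and the final learner is the initial constant plus the sum of all fitted trees, exactly the additive form in step 3).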
in order to enable those skilled in the art to better understand the above embodiments, a specific example is described below.
As shown in fig. 4, a data hierarchical classification method applied to hull contour design includes the following steps:
pre-segmentation of the original data set: according to the user-configured parameter, namely the sample classification number n, GMM is adopted to perform the clustering and layering operation on the original data set D, dividing it into n sub data sets and outputting the sub data sets (D_1, D_2, ..., D_n) obtained by dividing the original data set D, so as to obtain the current data segmentation scheme, wherein the original data set is an industrial data set in hull molded line design;

classifying and training the sub data sets: subclass labels are added to the n sub data sets (D_1, D_2, ..., D_n) respectively to obtain the training data set (D_1, label=1), (D_2, label=2), ..., (D_n, label=n), and based on this training data set, RF training of the data classification identifier is performed to obtain the subclass classifier;

verifying the data partitioning scheme: a GBDT-based regression algorithm model is used to train predictors (estimators) as sub-data-set proxies, obtaining n sub-data-set regression models, each corresponding to one sub data set, and base classifier training (base_estimator training) is performed on the undivided (no-split) original data set (namely, the full data set proxy) to obtain the original-data-set regression model; the current segmentation scheme is tested by K-fold cross validation, obtaining the proxy performance split_perf of the current segmentation scheme, and the same K-fold cross validation yields the proxy performance baseline_perf of the original scheme, namely the original data set;

selecting the final data partitioning scheme: judge whether split_perf > baseline_perf; if so, the current segmentation scheme is effective and the sub data sets (D_1, D_2, ..., D_n) obtained by segmenting the original data set are output; if not, the current segmentation scheme is invalid and the original data set D is output.
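The four steps above can be sketched end to end as follows: GMM pre-segmentation, an RF subclass classifier, per-subset GBDT regressors, and a cross-validated comparison against the unsplit baseline. All data, parameters, and the exact form of the proxy score (one minus the mean relative error) are illustrative assumptions rather than the patent's implementation.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestClassifier
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import KFold

def cv_rmae(fit_predict, X, y, k=5):
    """Mean relative MAE over k folds for a fit/predict closure."""
    errs = []
    for tr, te in KFold(n_splits=k, shuffle=True, random_state=0).split(X):
        pred = fit_predict(X[tr], y[tr], X[te])
        errs.append(np.mean(np.abs((y[te] - pred) / y[te])))
    return float(np.mean(errs))

rng = np.random.default_rng(4)
# Toy data with two hidden modes (illustrative assumption)
X = np.vstack([rng.normal(loc=-2.0, size=(150, 2)),
               rng.normal(loc=+2.0, size=(150, 2))])
y = np.where(X[:, 0] > 0, 5.0 + X[:, 1], -5.0 + X[:, 1])

n = 2  # user-specified sample classification number

def baseline_predict(X_tr, y_tr, X_te):
    """No-split baseline: one GBDT on the whole training fold."""
    return GradientBoostingRegressor(random_state=0).fit(X_tr, y_tr).predict(X_te)

def split_predict(X_tr, y_tr, X_te):
    """GMM pre-segmentation + RF subclass classifier + per-subset GBDT."""
    lab = GaussianMixture(n_components=n, random_state=0).fit_predict(X_tr)
    clf = RandomForestClassifier(random_state=0).fit(X_tr, lab)
    models = {c: GradientBoostingRegressor(random_state=0)
                 .fit(X_tr[lab == c], y_tr[lab == c]) for c in range(n)}
    out = np.empty(len(X_te))
    te_lab = clf.predict(X_te)
    for c in range(n):
        mask = te_lab == c
        if mask.any():
            out[mask] = models[c].predict(X_te[mask])
    return out

baseline_perf = 1.0 - cv_rmae(baseline_predict, X, y)
split_perf = 1.0 - cv_rmae(split_predict, X, y)
use_split = split_perf > baseline_perf  # accept the split only if it helps
```

The final comparison mirrors the selection step: the segmented scheme is kept only when its cross-validated proxy performance exceeds the baseline's.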
The data hierarchical classification method shown in fig. 4 is tested and verified, and the original data set and the experimental results are as follows:
1) Raw data set description: a test data set containing 2000 samples is selected for verification, with design parameters x_1, x_2, x_3 and target parameter y.
2) Setting parameters: the number of the sub data sets to be segmented is set to be 2, namely the sample classification number n is set to be 2, the original data set is segmented into 2 sub data sets, the number of the K-fold cross validation is set to be 10, and the hyper-parametric optimization of the model is started, so that the segmentation independent modeling work can be conveniently carried out on the more accurate GBDT model, and whether the current data hierarchical classification operation can effectively improve the modeling precision or not is objectively judged.
3) Evaluation index: RMAE is selected as the evaluation index for model performance, defined as follows:

RMAE = (1/n) · Σ_{i=1}^{n} |y_i − ŷ_i| / |y_i|

wherein i = 1, 2, ..., n is the sample number, n is the number of samples, y_i is the true value of sample i, and ŷ_i is the predicted value corresponding to sample i. The smaller the RMAE, the higher the accuracy of the model.
4) Experimental results: the operation logic of the data hierarchical classification method is intuitive. For the test data set, before the hyper-parameter optimization function is enabled, the performance improvement shown in Table 1 is obtained:
TABLE 1 regression model error based on data hierarchical classification method
Type                  RMAE value
baseline_estimator    9.81%
estimators (n=2)      3.64%
That is, on the test data set with n = 2, using a conservative estimate of model accuracy obtained through the same cross-validation operation, it can be seen that, without increasing the data scale or changing the machine learning algorithm, a large improvement in model performance is obtained simply by introducing the data hierarchical classification method: the original estimation error of nearly 10% is reduced to 3.64%. The comparison between the predicted values and the true values of the two models is shown in fig. 5, where the basic scheme in fig. 5 refers to the scheme obtained by training the base classifier on the original data set, and the segmentation scheme refers to the current segmentation scheme.
When the sample classification number n is modified and the preferred-segmentation function of hierarchical classification is enabled, it is found that the hierarchical classification method recommends a segmentation scheme of n = 3, with the performance statistics shown in Table 2 below:
TABLE 2 regression model error based on data hierarchical classification method after starting optimal segmentation function
Type                  RMAE value
baseline_estimator    9.81%
estimators (n=3)      2.79%
Compared with the segmentation scheme specified by the user, the intelligent segmentation function of the data hierarchical classification method provides further performance mining and improvement. Error comparisons of the three regression models are shown in fig. 6, wherein the basic model refers to the regression model trained without segmenting the original data set; the user-specified data segmentation model is the regression model obtained by training on the original data set after segmentation according to the user-specified scheme; and the intelligent segmentation model is the regression model obtained by training on the original data set after intelligent segmentation by the data hierarchical classification method.
From the visualization result of the test data set shown in fig. 7, it can also be seen that multiple subclass modes obviously exist in the test data set, and the data hierarchical classification method successfully improves the accuracy of data modeling by treating the two subclass modes in a "divide and conquer" manner.
Another embodiment of the present disclosure relates to a data hierarchical classification apparatus, which is applied to hull molded line design, as shown in fig. 8, and includes:
a pre-segmentation module 801, configured to pre-segment the original data set: according to the sample classification number specified by a user, carrying out clustering hierarchical processing on an original data set by adopting a Gaussian mixture model, and dividing the original data set into a plurality of subdata sets corresponding to the sample classification number to obtain a current data division scheme, wherein the original data set is an industrial data set in the hull molded line design;
a classification training module 802 for classifying and training the sub data sets: respectively adding a subclass label to each subdata set to obtain a training data set, and training a random forest classifier based on the training data set to obtain a subclass classifier;
a verification module 803, configured to verify the data partitioning scheme: performing regression training on the original data set and the plurality of subdata sets respectively by using a regression algorithm model based on a gradient lifting decision tree to obtain an original data set regression model and a plurality of subdata set regression models, wherein each subdata set regression model corresponds to one subdata set respectively, and determining the proxy performance of the current data segmentation scheme and the proxy performance of the original data set respectively by combining a subclass classifier and cross validation;
a selecting module 804, configured to select a final data partitioning scheme: and evaluating the current data segmentation scheme based on the proxy performance of the current data segmentation scheme and the proxy performance of the original data set, and determining a final data segmentation result according to the evaluation result.
The specific implementation method of the data hierarchical classification apparatus provided in the embodiments of the present disclosure may be referred to as the data hierarchical classification method provided in the embodiments of the present disclosure, and details are not repeated here.
Compared with the prior art, in which the data scale of industrial design problems is objectively limited and industrial design data sets often contain multiple mixed modes or have poor internal consistency, the present disclosure applies a data hierarchical classification method to preprocess the data set in industrial-data-driven hull molded line design for the first time. By mining the multiple mixed modes in the sample training set, the pre-layering operation purifies the quality of the data set, improves data modeling accuracy, and raises designers' utilization of the hull form data accumulated by enterprises; the method has a wide application range and effectively assists intelligent hull molded line design.
Another embodiment of the present disclosure relates to an electronic device, as shown in fig. 9, including:
at least one processor 901; and

a memory 902 communicatively connected to the at least one processor 901; wherein

the memory 902 stores instructions executable by the at least one processor 901, the instructions being executed by the at least one processor 901 to enable the at least one processor 901 to perform the data hierarchical classification method described in the above embodiments.
Where the memory and processor are connected by a bus, the bus may comprise any number of interconnected buses and bridges, the buses connecting together one or more of the various circuits of the processor and the memory. The bus may also connect various other circuits such as peripherals, voltage regulators, power management circuits, etc., which are well known in the art, and therefore, will not be described any further herein. A bus interface provides an interface between the bus and the transceiver. The transceiver may be one element or a plurality of elements, such as a plurality of receivers and transmitters, providing a means for communicating with various other apparatus over a transmission medium. The data processed by the processor is transmitted over a wireless medium via an antenna, which further receives the data and transmits the data to the processor.
The processor is responsible for managing the bus and general processing and may also provide various functions including timing, peripheral interfaces, voltage regulation, power management, and other control functions. And the memory may be used to store data used by the processor in performing operations.
Another embodiment of the present disclosure relates to a computer-readable storage medium storing a computer program, which when executed by a processor implements the data hierarchical classification method described in the above embodiments.
That is, as can be understood by those skilled in the art, all or part of the steps in the method according to the foregoing embodiments may be implemented by a program instructing related hardware, where the program is stored in a storage medium and includes several instructions to enable a device (which may be a single chip, a chip, or the like) or a processor (processor) to execute all or part of the steps in the method according to each embodiment of the present disclosure. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a ROM (Read-Only Memory), a RAM (Random Access Memory), a magnetic disk, or an optical disk.
It will be understood by those of ordinary skill in the art that the foregoing embodiments are specific to implementations of the present disclosure, and that various changes in form and details may be made therein without departing from the spirit and scope of the present disclosure in practice.

Claims (10)

1. A data hierarchical classification method is applied to hull profile design and is characterized by comprising the following steps:
pre-segmentation of the original data set: according to a sample classification number designated by a user, carrying out clustering hierarchical processing on an original data set by adopting a Gaussian mixture model, and segmenting the original data set into a plurality of subdata sets corresponding to the sample classification number to obtain a current data segmentation scheme, wherein the original data set is an industrial data set in hull molded line design;
classifying the training subdata set: respectively adding subclass labels to each subdata set to obtain a training data set, and training a random forest classifier based on the training data set to obtain a subclass classifier;
verifying the data partitioning scheme: performing regression training on the original data set and the plurality of subdata sets respectively by using a regression algorithm model based on a gradient lifting decision tree to obtain an original data set regression model and a plurality of subdata set regression models, wherein each subdata set regression model corresponds to one subdata set respectively, and determining the proxy performance of the current data segmentation scheme and the proxy performance of the original data set respectively by combining the subclass classifier and cross validation;
selecting a final data partitioning scheme: and evaluating the current data segmentation scheme based on the proxy performance of the current data segmentation scheme and the proxy performance of the original data set, and determining a final data segmentation result according to an evaluation result.
2. The method of claim 1, wherein the cross validation comprises K-fold cross validation, and wherein the determining, in conjunction with the subclass classifier and cross validation, the proxy performance of the current data splitting scheme and the proxy performance of the original data set after obtaining an original data set regression model and a plurality of sub data set regression models comprises:
dividing the original data set into K original data subsets randomly and equally, taking one original data subset as a test set and the corresponding other original data subsets as training sets in turn, training and testing the multiple sub data set regression models and the original data set regression model based on the training sets, the test sets and the subclass classifiers, and respectively obtaining errors corresponding to the multiple sub data set regression models and errors corresponding to the original data set regression model;
and respectively determining the proxy performance of the current data segmentation scheme and the proxy performance of the original data set based on the errors corresponding to the multiple sub data set regression models and the errors corresponding to the original data set regression model.
3. The method of claim 2, wherein training and testing the multiple sub data set regression models and the original data set regression model based on the training set, the testing set, and the subclass classifier to obtain errors corresponding to the multiple sub data set regression models and errors corresponding to the original data set regression model, respectively, comprises:
training the plurality of regression models of subdata sets based on the training set;
judging the subclass category of each sample in the test set based on the subclass classifier, determining a subdata set regression model corresponding to each sample from the trained subdata set regression models based on the judged subclass category, and inputting each sample in the test set into the corresponding subdata set regression model respectively to obtain a predicted value corresponding to each sample;
and determining errors corresponding to the regression models of the plurality of subdata sets based on the real values of the samples in the test set and the predicted values corresponding to the real values.
4. The method of claim 3, wherein determining the error for the plurality of regression models for the subset based on the actual value and the predicted value for each sample in the test set comprises:
determining errors corresponding to the regression models of the plurality of subdata sets according to the following formula (1):

E_j = (1/n) · Σ_{i=1}^{n} |y_i − ŷ_i| / |y_i|      (1)

wherein j = 1, 2, ..., K is the test set number, E_j is the relative mean absolute error of the multiple sub-data-set regression models corresponding to test set j, i = 1, 2, ..., n, where n is the number of samples in test set j, y_i is the true value of the i-th sample in test set j, and ŷ_i is the predicted value corresponding to the i-th sample in test set j.
5. The method of claim 4, wherein determining the proxy performance of the current data partitioning scheme and the proxy performance of the original data set based on the errors corresponding to the plurality of sub data set regression models and the errors corresponding to the original data set regression model, respectively, comprises:
determining a proxy performance of the current data partitioning scheme according to equation (2) below:

split_perf = 1 − (1/K) · Σ_{j=1}^{K} E_j      (2)

wherein split_perf is the proxy performance of the current data partitioning scheme.
6. The method of claim 2, wherein the training and testing the multiple regression sub data set models and the regression original data set model based on the training set, the testing set, and the subclass classifier to obtain errors corresponding to the multiple regression sub data set models and errors corresponding to the regression original data set model, respectively, further comprises:
training the regression model of the original data set based on the training set;
respectively inputting each sample in the test set into a trained original data set regression model to obtain a predicted value corresponding to each sample;
and determining the error corresponding to the regression model of the original data set based on the real value and the predicted value corresponding to each sample in the test set.
7. The method of any one of claims 1 to 6, wherein the random forest classifier is built based on a classification regression tree model.
8. A data hierarchical classification device is applied to hull profile design, and is characterized by comprising:
the pre-segmentation module is used for pre-segmenting the original data set: according to a sample classification number specified by a user, carrying out clustering hierarchical processing on an original data set by adopting a Gaussian mixture model, and dividing the original data set into a plurality of subdata sets corresponding to the sample classification number to obtain a current data division scheme, wherein the original data set is an industrial data set in the hull molded line design;
the classification training module is used for classifying and training the subdata set: respectively adding subclass labels to each subdata set to obtain a training data set, and training a random forest classifier based on the training data set to obtain a subclass classifier;
a verification module to verify a data partitioning scheme: performing regression training on the original data set and the plurality of subdata sets respectively by using a regression algorithm model based on a gradient lifting decision tree to obtain an original data set regression model and a plurality of subdata set regression models, wherein each subdata set regression model corresponds to one subdata set respectively, and determining the proxy performance of the current data segmentation scheme and the proxy performance of the original data set respectively by combining the subclass classifier and cross validation;
a selection module to select a final data partitioning scheme: and evaluating the current data segmentation scheme based on the proxy performance of the current data segmentation scheme and the proxy performance of the original data set, and determining a final data segmentation result according to an evaluation result.
9. An electronic device, comprising:
at least one processor; and

a memory communicatively coupled to the at least one processor; wherein

the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of data hierarchical classification of any of claims 1 to 7.
10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the method of data hierarchical classification according to one of claims 1 to 7.
CN202210446117.7A 2022-04-26 2022-04-26 Data hierarchical classification method and device, electronic equipment and storage medium Active CN115600121B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210446117.7A CN115600121B (en) 2022-04-26 2022-04-26 Data hierarchical classification method and device, electronic equipment and storage medium


Publications (2)

Publication Number Publication Date
CN115600121A true CN115600121A (en) 2023-01-13
CN115600121B CN115600121B (en) 2023-11-07

Family

ID=84841991

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210446117.7A Active CN115600121B (en) 2022-04-26 2022-04-26 Data hierarchical classification method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115600121B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
AU2020101453A4 (en) * 2020-07-23 2020-08-27 China Communications Construction Co., Ltd. An Intelligent Optimization Method of Durable Concrete Mix Proportion Based on Data mining
CN112396130A (en) * 2020-12-09 2021-02-23 中国能源建设集团江苏省电力设计院有限公司 Intelligent identification method and system for rock stratum in static sounding test, computer equipment and medium
CN113159220A (en) * 2021-05-14 2021-07-23 中国人民解放军军事科学院国防工程研究院工程防护研究所 Random forest based concrete penetration depth empirical algorithm evaluation method and device
CN113256066A (en) * 2021-04-23 2021-08-13 新疆大学 PCA-XGboost-IRF-based job shop real-time scheduling method
WO2021164228A1 (en) * 2020-02-17 2021-08-26 平安科技(深圳)有限公司 Method and system for selecting augmentation strategy for image data
CN113609779A (en) * 2021-08-16 2021-11-05 深圳力维智联技术有限公司 Modeling method, device and equipment for distributed machine learning


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Wu Qiong; Li Yuntian; Zheng Xianwei: "Optimization of the random forest algorithm for classification on imbalanced training sets", Industrial Control Computer, no. 07 *
Xiong Bingyan; Wang Guoyin; Deng Weibin: "An under-sampling method for imbalanced data based on sample weights", Journal of Computer Research and Development, no. 11 *

Also Published As

Publication number Publication date
CN115600121B (en) 2023-11-07

Similar Documents

Publication Publication Date Title
US20220076150A1 (en) Method, apparatus and system for estimating causality among observed variables
Reynolds et al. Clustering rules: a comparison of partitioning and hierarchical clustering algorithms
CN112069310B (en) Text classification method and system based on active learning strategy
CN113255573B (en) Pedestrian re-identification method based on mixed cluster center label learning and storage medium
CN113344019A (en) K-means algorithm for improving decision value selection initial clustering center
CN102117411A (en) Method and system for constructing multi-level classification model
CN111325264A (en) Multi-label data classification method based on entropy
CN113807900A (en) RF order demand prediction method based on Bayesian optimization
Leon-Alcaide et al. An evolutionary approach for efficient prototyping of large time series datasets
CN115146062A (en) Intelligent event analysis method and system fusing expert recommendation and text clustering
CN113516019B (en) Hyperspectral image unmixing method and device and electronic equipment
KR100895481B1 (en) Method for Region Based on Image Retrieval Using Multi-Class Support Vector Machine
CN116029379B (en) Method for constructing air target intention recognition model
CN112084294A (en) Whole vehicle electromagnetic compatibility grading management method based on artificial intelligence
CN115600102B (en) Abnormal point detection method and device based on ship data, electronic equipment and medium
CN115600121A (en) Data hierarchical classification method and device, electronic equipment and storage medium
CN116484244A (en) Automatic driving accident occurrence mechanism analysis method based on clustering model
CN114692746A (en) Information entropy based classification method of fuzzy semi-supervised support vector machine
CN113656707A (en) Financing product recommendation method, system, storage medium and equipment
Chakrapani et al. Predicting performance analysis of system configurations to contrast feature selection methods
Abdelatif et al. Optimization of the organized Kohonen map by a new model of preprocessing phase and application in clustering
CN111108516A (en) Evaluating input data using a deep learning algorithm
Liu Extracting Rules from Trained Machine Learning Models with Applications in Bioinformatics
CN116844649B (en) Interpretable cell data analysis method based on gene selection
US20230297651A1 (en) Cost equalization spectral clustering

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant