CN115600102A - Abnormal point detection method and device based on ship data, electronic device and medium - Google Patents
Abnormal point detection method and device based on ship data, electronic device and medium
- Publication number
- CN115600102A CN202210444344.6A
- Authority
- CN
- China
- Prior art keywords
- ship data
- data set
- sample
- ship
- samples
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F30/00—Computer-aided design [CAD]
- G06F30/10—Geometric CAD
- G06F30/15—Vehicle, aircraft or watercraft design
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02P—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
- Y02P90/00—Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
- Y02P90/30—Computing systems specially adapted for manufacturing
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Geometry (AREA)
- General Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Aviation & Aerospace Engineering (AREA)
- Computational Mathematics (AREA)
- Mathematical Analysis (AREA)
- Mathematical Optimization (AREA)
- Pure & Applied Mathematics (AREA)
- Computer Hardware Design (AREA)
- Evolutionary Computation (AREA)
- General Engineering & Computer Science (AREA)
- Automation & Control Theory (AREA)
- Complex Calculations (AREA)
Abstract
The disclosure relates to the technical field of data processing in hull line design and provides an abnormal point detection method and device based on ship data, an electronic device, and a medium. The method comprises the following steps: pre-partitioning a ship data set; performing regression training on the sub-data sets; calculating the trend accuracy; and determining the detected abnormal points. To address the problem that industrial design data sets contain multiple mixed modes or have poor internal consistency, the disclosure applies abnormal point detection, for the first time, to hull line design driven by industrial data sets. It takes adaptability to actual industrial practice into account and uses the trend accuracy of each sample as the criterion for screening abnormal points. This improves the precision of data modeling and the extent to which designers can use the hull data accumulated by an enterprise; the method is widely applicable and effectively assists the intelligent design of hull lines.
Description
Technical Field
The present disclosure relates to the field of data processing technology in hull line design, and in particular, to a method and an apparatus for detecting an abnormal point based on ship data, an electronic device, and a medium.
Background
Existing ship design is largely empirical, and the final decision depends heavily on the subjective experience and knowledge of the decision maker. Common design methods include the expert consultation method (Delphi), the Analytic Hierarchy Process (AHP), and the Decision Support System (DSS).
The expert consultation method takes the subjective judgment of experts as the basis for decisions and uses scores, indexes, ordinal numbers, comments and the like as evaluation criteria. It is a simple method that lacks theoretical grounding and systematic rigor, and it is difficult to ensure that the evaluation result is objective and true.
The analytic hierarchy process is used to study multi-objective decision problems with a more complex structure. It can quantify qualitative problems, so that the evaluation result tends to be more scientific and reasonable. The method obtains a judgment matrix reflecting the relative importance of each attribute by comparing the evaluation indexes pairwise, so its reliability is high and its error is small. Its drawback is that, owing to the limitations of the decision maker's knowledge structure, personal preference, judgment level and the like, the judgment matrix is difficult to make consistent.
The concept of the decision support system has pushed decision theory to a new stage of development. It has achieved great success in fields such as systems engineering and management science and is commonly used to solve decision problems in semi-structured and unstructured complex information systems. In recent years, the emergence of advanced intelligent technologies based on the Data Warehouse (DW), such as Online Analytical Processing (OLAP) and Data Mining (DM), has opened up a new approach for the development of DSS.
A ship-type decision support system comprises a database and database management module, a model base and model base management module, a knowledge base and knowledge base management module, a data warehouse and data warehouse management module, a data mining module, a knowledge discovery module, a man-machine interaction module and the like. The data mining and knowledge discovery modules are responsible for querying, analyzing, mining, selecting and evaluating data and for mining hidden decision information. Common mining methods include genetic algorithms, neural networks, statistical analysis, machine learning, fuzzy decision-making and other intelligent technologies.
How to use the ship-type data accumulated by an enterprise to provide an efficient reference for designers is a main research topic of data mining. In the prior art, a ship type highly related to the design requirement is selected from the accumulated ship-type data as a parent ship to guide the ship-type design. However, the utilization rate of the ship-type data is extremely low: only excellent ship-type data highly related to the design requirement is used, and the relationships among the selected ship types are not considered.
Introducing surrogate model (agent model) training based on artificial intelligence is one of the key technologies for solving these problems. Because ship model test data and actual measurement data are limited, the training samples of the surrogate model may be simulation data samples provided by a Computational Fluid Dynamics (CFD) solver, while the test or measurement data is often used to correct the CFD solution model or its boundary conditions. With this technology, most of the data in the ship-type database can be used to guide designers in ship-type design, which greatly improves the utilization rate of the ship-type data. In addition, evaluating the surrogate model takes far less time than a CFD simulation, so using the surrogate model can greatly shorten the engineering design cycle.
Surrogate model training based on artificial intelligence can effectively address low data utilization and long design cycles, but training and using the surrogate model also raise some problems. For example, poor consistency of sample point categories, abnormal points, data noise and the like all make it difficult to improve the precision of the surrogate model; with a small data set in particular, the negative influence of abnormal points on the precision of the surrogate model is further amplified. Abnormal point detection methods can be classified in many ways, and two classifications are currently widely accepted. One is based on whether user-labeled anomaly tags are available and comprises supervised, unsupervised and semi-supervised methods; the other is based on the assumptions made about normal points and outliers and comprises statistical methods, proximity-based methods, and clustering-based methods. Although these abnormal point detection algorithms can detect abnormal points, they all have limitations and cannot adapt to the characteristics of ship data, such as limited scale and high parameter dimensionality.
Disclosure of Invention
The present disclosure aims to solve at least one of the problems in the prior art, and provides a method and an apparatus for detecting abnormal points based on ship data, an electronic device, and a medium.
In one aspect of the present disclosure, a method for detecting abnormal points based on ship data is provided, which includes the following steps:
pre-partitioning the ship data set: equally dividing a ship data set provided by a user into a plurality of sub-data sets;
regression training on the sub-data sets: taking each sub-data set in turn as a test set and the remaining sub-data sets as the corresponding training set, and training a regression algorithm model on each training set to obtain a plurality of corresponding regression models;
calculating the trend accuracy: testing each regression model on its corresponding test set, and calculating the trend accuracy of each sample in each test set based on the test results to obtain the trend accuracy of each sample in the ship data set;
determining the detected abnormal points: based on the trend accuracy of each sample in the ship data set and a preset number of abnormal points, selecting the corresponding samples from the ship data set according to a preset selection rule to obtain the abnormal points in the ship data set.
Optionally, the trend accuracy of each sample in the test set is represented by the following formula (1):
$$\mathrm{TrendAcc}(i)=\frac{\langle c(y_i),\,c(\hat{y}_i)\rangle}{\lVert c(y_i)\rVert\,\lVert c(\hat{y}_i)\rVert} \tag{1}$$
wherein $i = 1, 2, \ldots, n$ is the sample number in the test set, $n$ is the number of samples in the test set, $y_i$ is the true value of sample $i$ in the test set, $\hat{y}_i$ is the predicted value corresponding to sample $i$ in the test set, $c(y_i)$ is the vector formed from the magnitude relationships between $y_i$ and the true values of the other samples in the test set, $c(\hat{y}_i)$ is the vector formed from the magnitude relationships between $\hat{y}_i$ and the predicted values corresponding to the other samples in the test set, $\langle c(y_i), c(\hat{y}_i)\rangle$ is the inner product of $c(y_i)$ and $c(\hat{y}_i)$, and $\lVert c(y_i)\rVert$ and $\lVert c(\hat{y}_i)\rVert$ are the L2 norms of $c(y_i)$ and $c(\hat{y}_i)$, respectively.
Optionally, $c(y_i)$ is represented by the following formula (2):
$$c(y_i)=\{c(y_i,y_j)\mid j=1,2,\ldots,n\} \tag{2}$$
wherein $y_j$ is the true value of a sample $j$ other than sample $i$ in the test set, and $c(y_i,y_j)$ is an intermediate variable represented by the following formula (3):
[Formula (3): equation image not available in this text.]
Optionally, selecting the corresponding samples from the ship data set according to a preset selection rule, based on the trend accuracy of each sample in the ship data set and the preset number of abnormal points, to obtain the abnormal points in the ship data set includes:
selecting, according to the magnitude of the trend accuracy, the preset number of samples with the smallest trend accuracy from the ship data set, and taking the selected samples as the abnormal points in the ship data set.
Optionally, selecting the preset number of samples with the smallest trend accuracy from the ship data set according to the magnitude of the trend accuracy includes:
sorting all samples in the ship data set by trend accuracy; and
selecting, from the sorted samples, the preset number of samples with the smallest trend accuracy.
Optionally, the regression algorithm model is established based on a Gradient Boosting Decision Tree (GBDT).
Optionally, equally dividing the ship data set provided by the user into a plurality of sub-data sets includes:
equally dividing the ship data set into a plurality of sub-data sets by random sampling.
In another aspect of the present disclosure, there is provided an abnormal point detection apparatus based on ship data, including:
a segmentation module configured to pre-partition the ship data set: equally dividing a ship data set provided by a user into a plurality of sub-data sets;
a training module configured to perform regression training on the sub-data sets: taking each sub-data set in turn as a test set and the remaining sub-data sets as the corresponding training set, and training a regression algorithm model on each training set to obtain a plurality of corresponding regression models;
a calculation module configured to calculate the trend accuracy: testing each regression model on its corresponding test set, and calculating the trend accuracy of each sample in each test set based on the test results to obtain the trend accuracy of each sample in the ship data set;
a determining module configured to determine the detected abnormal points: based on the trend accuracy of each sample in the ship data set and a preset number of abnormal points, selecting the corresponding samples from the ship data set according to a preset selection rule to obtain the abnormal points in the ship data set.
In another aspect of the present disclosure, there is provided an electronic device including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor, the instructions enabling the at least one processor to perform the above-described abnormal point detection method based on ship data.
In another aspect of the present disclosure, a computer-readable storage medium is provided, which stores a computer program that, when executed by a processor, implements the above-described abnormal point detection method based on ship data.
Compared with the prior art, the present disclosure addresses the problem that industrial design data sets contain multiple mixed modes or have poor internal consistency by applying abnormal point detection, for the first time, to hull line design driven by industrial data sets. It takes adaptability to actual industrial practice into account and uses the trend accuracy of each sample as the criterion for screening abnormal points. This improves the precision of data modeling and the extent to which designers can use the hull data accumulated by an enterprise; the method is widely applicable and effectively assists the intelligent design of hull lines.
Drawings
One or more embodiments are illustrated by way of example in the accompanying drawings, in which like reference numerals refer to similar elements; the drawings are not to scale unless otherwise specified.
Fig. 1 is a flowchart of an abnormal point detection method based on ship data according to an embodiment of the present disclosure;
Fig. 2 is a flowchart of an abnormal point detection method based on ship data according to another embodiment of the present disclosure;
Fig. 3 is a histogram of eval_CWTWC for the ship data set after data with obvious anomalies have been manually removed, according to another embodiment of the present disclosure;
Fig. 4 is a histogram of eval_CWTWC for the ship data set after the abnormal point detection method has been applied, according to another embodiment of the present disclosure;
Fig. 5 is a graph of the change in MSE performance for eval_CWTWC when random_state = 117, according to another embodiment of the present disclosure;
Fig. 6 is a graph of the change in MSE performance for eval_CWTWC when random_state = 721, according to another embodiment of the present disclosure;
Fig. 7 is a graph of the change in MSE performance for eval_CWTWC when random_state = 1, according to another embodiment of the present disclosure;
Fig. 8 is a graph of the change in MSE performance for eval_CWTWC when random_state = 1102, according to another embodiment of the present disclosure;
Fig. 9 is a schematic structural diagram of an abnormal point detection apparatus based on ship data according to another embodiment of the present disclosure;
Fig. 10 is a schematic structural diagram of an electronic device according to another embodiment of the present disclosure.
Detailed Description
To make the objects, technical solutions and advantages of the embodiments of the present disclosure clearer, the embodiments are described in detail below with reference to the accompanying drawings. Those of ordinary skill in the art will appreciate that numerous technical details are set forth in the various embodiments to provide a better understanding of the disclosure; however, the technical solutions claimed in the present disclosure can be implemented without these technical details and with various changes and modifications based on the following embodiments. The following embodiments are divided for convenience of description and should not limit the specific implementation of the present disclosure; the embodiments may be combined with and refer to one another where there is no contradiction.
One embodiment of the present disclosure relates to a method for detecting an abnormal point based on ship data, a flow of which is shown in fig. 1, and the method includes:
S101: Pre-partitioning the ship data set: a ship data set provided by a user is equally divided into a plurality of sub-data sets. The present embodiment does not limit the specific method of division, as long as the ship data set is equally divided into a plurality of sub-data sets.
Preferably, equally dividing the ship data set provided by the user into a plurality of sub-data sets includes:
equally dividing the ship data set into a plurality of sub-data sets by random sampling.
Dividing the ship data set by random sampling prevents subjective factors from influencing how the sub-data sets are formed.
For example, this step may use the data set division of the K-fold cross-validation (K-Fold CV) algorithm to randomly and equally divide the ship data set provided by the user into K sub-data sets $(D_1, D_2, \ldots, D_K)$.
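The random equal division described above can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions: the function name `split_equally` and the `seed` parameter are hypothetical, and when the number of samples is not a multiple of K the fold sizes differ by at most one.

```python
import random

def split_equally(n_samples, k, seed=0):
    """Randomly assign sample indices 0..n_samples-1 to k near-equal folds."""
    if not 1 <= k <= n_samples:
        raise ValueError("k must be between 1 and n_samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # seeded so the split is repeatable
    # Striding hands every k-th shuffled index to the same fold, so fold
    # sizes differ by at most one when n_samples is not a multiple of k.
    return [indices[f::k] for f in range(k)]

# Hypothetical usage: 1211 samples split into K = 5 sub-data sets.
folds = split_equally(n_samples=1211, k=5)
print([len(fold) for fold in folds])
```

Working with index lists rather than copies of the data keeps the original sample numbering available, which is needed later to report which samples are abnormal.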
S102: Regression training on the sub-data sets: each sub-data set is taken in turn as a test set and the remaining sub-data sets as the corresponding training set, and a regression algorithm model is trained on each training set to obtain a plurality of corresponding regression models.
For example, for the K sub-data sets $(D_1, D_2, \ldots, D_K)$, one sub-data set may be taken in turn as the test set and the remaining K-1 sub-data sets as the corresponding training set, which yields K test sets and their corresponding training sets. A regression algorithm model is then trained on each training set to obtain a regression model for each training set.
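The rotation of test and training sets can be illustrated as follows. This is a minimal sketch; `rotate_folds` is a hypothetical helper name, and the folds are assumed to be lists of sample indices.

```python
def rotate_folds(folds):
    """Yield (train_indices, test_indices), using each fold once as the test set."""
    for t, test in enumerate(folds):
        # The training set is the union of all folds except the current test fold.
        train = [i for f, fold in enumerate(folds) if f != t for i in fold]
        yield train, list(test)

# Hypothetical usage with K = 3 folds of two samples each.
example_folds = [[0, 3], [1, 4], [2, 5]]
for train, test in rotate_folds(example_folds):
    print("train:", train, "test:", test)
```

Each sample appears in exactly one test set, so after K rounds every sample has been predicted by a model that never saw it during training.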
Illustratively, the regression algorithm model is built based on a gradient boosting decision tree.
The GBDT algorithm is an iterative decision tree algorithm formed as an additive combination of a series of Classification And Regression Trees (CART): each new tree fits the residual between the target and the prediction of the preceding trees, and the conclusions of all the trees are accumulated to obtain the final answer. The GBDT algorithm fits gradient values (continuous values) at each iteration, so the decision tree used is the CART regression tree. For the regression tree algorithm, the candidate split points comprise all possible values of all features, so the most important task of the algorithm is to find the optimal split point. In a classification tree, the optimal split point is judged by entropy or the Gini coefficient; because the sample labels in a regression tree are continuous values, the squared error is more suitable than entropy or the Gini coefficient and judges the degree of fit well.
S103: Calculating the trend accuracy: each regression model is tested on its corresponding test set, and the trend accuracy of each sample in each test set is calculated based on the test results to obtain the trend accuracy of each sample in the ship data set.
Specifically, in this embodiment each sub-data set is taken in turn as a test set, so the test sets together make up the whole ship data set. The trend accuracy of every sample in the ship data set is therefore obtained by calculating the trend accuracy of each sample in each test set.
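Illustratively, the trend accuracy of each sample in the test set is represented by the following formula (1):
$$\mathrm{TrendAcc}(i)=\frac{\langle c(y_i),\,c(\hat{y}_i)\rangle}{\lVert c(y_i)\rVert\,\lVert c(\hat{y}_i)\rVert} \tag{1}$$
wherein $i = 1, 2, \ldots, n$ is the sample number in the test set, $n$ is the number of samples in the test set, $y_i$ is the true value of sample $i$ in the test set, $\hat{y}_i$ is the predicted value corresponding to sample $i$ in the test set, $c(y_i)$ is the vector formed from the magnitude relationships between $y_i$ and the true values of the other samples in the test set, $c(\hat{y}_i)$ is the vector formed from the magnitude relationships between $\hat{y}_i$ and the predicted values corresponding to the other samples in the test set, $\langle c(y_i), c(\hat{y}_i)\rangle$ is the inner product of $c(y_i)$ and $c(\hat{y}_i)$, and $\lVert c(y_i)\rVert$ and $\lVert c(\hat{y}_i)\rVert$ are the L2 norms of $c(y_i)$ and $c(\hat{y}_i)$, respectively.
Illustratively, $c(y_i)$ is represented by the following formula (2):
$$c(y_i)=\{c(y_i,y_j)\mid j=1,2,\ldots,n\} \tag{2}$$
wherein $y_j$ is the true value of a sample $j$ other than sample $i$ in the test set, and $c(y_i,y_j)$ is an intermediate variable represented by the following formula (3):
[Formula (3): equation image not available in this text.]
Likewise, $c(\hat{y}_i)$ is represented by the following formula (4):
$$c(\hat{y}_i)=\{c(\hat{y}_i,\hat{y}_j)\mid j=1,2,\ldots,n\} \tag{4}$$
wherein $\hat{y}_j$ is the predicted value corresponding to a sample $j$ other than sample $i$ in the test set, and $c(\hat{y}_i,\hat{y}_j)$ is an intermediate variable represented by the following formula (5):
[Formula (5): equation image not available in this text.]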
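A worked sketch may help make the trend accuracy concrete. It treats the trend accuracy as the cosine similarity between the two comparison vectors, as the description of the inner product and L2 norms suggests. The pairwise comparison itself is an assumption: the text available here does not show the intermediate variables, so the sketch uses a sign comparison (+1 if larger, -1 if smaller, 0 if equal). The function names are hypothetical.

```python
import math

def sign(a, b):
    """ASSUMED pairwise comparison: +1 if a > b, -1 if a < b, 0 if equal."""
    return (a > b) - (a < b)

def trend_accuracy(y_true, y_pred):
    """Per-sample cosine similarity between true and predicted ordering vectors."""
    n = len(y_true)
    scores = []
    for i in range(n):
        # Including j == i only adds a zero component, which changes nothing.
        c_true = [sign(y_true[i], y_true[j]) for j in range(n)]
        c_pred = [sign(y_pred[i], y_pred[j]) for j in range(n)]
        dot = sum(a * b for a, b in zip(c_true, c_pred))
        norm = math.sqrt(sum(a * a for a in c_true) * sum(b * b for b in c_pred))
        # A zero norm means the sample ties with every other sample; score it 0.
        scores.append(dot / norm if norm else 0.0)
    return scores

# The last sample is predicted lowest although its true value is highest.
print(trend_accuracy([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 0.0]))
```

Under this assumption a sample whose predicted rank among its test-set peers matches its true rank scores 1, and a sample whose rank is completely reversed scores -1.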
S104: Determining the detected abnormal points: based on the trend accuracy of each sample in the ship data set and the preset number of abnormal points, the corresponding samples are selected from the ship data set according to a preset selection rule to obtain the abnormal points in the ship data set.
Specifically, the preset number of abnormal points may be specified by a user or set according to actual requirements, which is not limited in this embodiment. The preset selection rule may be to take a preset number of samples whose trend accuracy is below a preset threshold as the abnormal points in the ship data set, or to take the preset number of samples with the smallest trend accuracy as the abnormal points in the ship data set, or it may be another rule determined from the trend accuracy of each sample in the ship data set and the preset number of abnormal points, which is not limited in this embodiment.
Preferably, selecting the corresponding samples from the ship data set according to a preset selection rule, based on the trend accuracy of each sample in the ship data set and the preset number of abnormal points, to obtain the abnormal points in the ship data set includes:
selecting, according to the magnitude of the trend accuracy, the preset number of samples with the smallest trend accuracy from the ship data set, and taking the selected samples as the abnormal points in the ship data set.
Taking the preset number of samples with the smallest trend accuracy as the abnormal points in the ship data set can improve the accuracy of abnormal point detection.
Illustratively, selecting the preset number of samples with the smallest trend accuracy from the ship data set according to the magnitude of the trend accuracy includes:
sorting all samples in the ship data set by trend accuracy; and
selecting, from the sorted samples, the preset number of samples with the smallest trend accuracy.
Specifically, the samples in the ship data set may be sorted by trend accuracy in either descending or ascending order, which is not limited in this embodiment. When the samples are sorted in descending order, the last preset number of samples are those with the smallest trend accuracy. When the samples are sorted in ascending order, the first preset number of samples are those with the smallest trend accuracy.
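The sort-and-select rule can be sketched as below, using ascending order. The function name `select_outliers` is hypothetical, and ties are broken by sample index because Python's sort is stable, which is a choice of this sketch rather than something the text specifies.

```python
def select_outliers(trend_acc, m):
    """Return the indices of the m samples with the smallest trend accuracy."""
    if not 0 <= m <= len(trend_acc):
        raise ValueError("m must be between 0 and the number of samples")
    # Ascending sort of sample indices by score; the first m are the outliers.
    order = sorted(range(len(trend_acc)), key=lambda i: trend_acc[i])
    return order[:m]

# Hypothetical scores for four samples; samples 1 and 2 score lowest.
print(select_outliers([0.9, -0.2, 0.5, 1.0], m=2))
```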
Compared with the prior art, this embodiment addresses the problem that industrial design data sets contain multiple mixed modes or have poor internal consistency by applying abnormal point detection, for the first time, to hull line design driven by industrial data sets. It takes adaptability to actual industrial practice into account and uses the trend accuracy of each sample as the criterion for screening abnormal points. This improves the precision of data modeling and the extent to which designers can use the hull data accumulated by an enterprise; the method is widely applicable and effectively assists the intelligent design of hull lines.
Illustratively, the principle of the classification and regression tree (CART) model in the above embodiment is as follows:
Input: a training data set;
Output: a regression tree $f(x)$;
In the input space where the training data set is located, each region is recursively divided into two sub-regions and the output value on each sub-region is determined, constructing a binary decision tree:
1) Select the optimal splitting variable $j$ and the optimal split point $s$ by solving:
$$\min_{j,s}\left[\min_{c_1}\sum_{x_i\in R_1(j,s)}(y_i-c_1)^2+\min_{c_2}\sum_{x_i\in R_2(j,s)}(y_i-c_2)^2\right]$$
Traverse the variables $j$; for each fixed splitting variable $j$, scan the split points $s$, and select the pair $(j,s)$ that minimizes the above expression;
2) Divide the region using the selected pair $(j,s)$ and determine the corresponding output values:
$$R_1(j,s)=\{x\mid x^{(j)}\le s\},\quad R_2(j,s)=\{x\mid x^{(j)}>s\}$$
$$\hat{c}_m=\frac{1}{N_m}\sum_{x_i\in R_m(j,s)}y_i,\quad m=1,2$$
where $N_m$ is the number of training samples in $R_m(j,s)$;
3) Continue to apply steps 1) and 2) to the two sub-regions until a stop condition is met;
4) Divide the input space into $M$ regions $R_1, R_2, \ldots, R_M$ and generate the decision tree:
$$f(x)=\sum_{m=1}^{M}\hat{c}_m\,I(x\in R_m)$$
where $I(\cdot)$ is the indicator function.
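Step 1) of the regression tree procedure, the exhaustive search for the best splitting variable and split point under squared error, can be sketched as follows. This is an illustrative brute-force version, not an optimized one; `best_split` is a hypothetical name, and candidate split points are taken to be the observed feature values.

```python
def best_split(X, y):
    """Return (j, s, loss): the feature index and threshold minimizing squared error."""
    def sse(values):
        # Sum of squared deviations from the mean, i.e. the loss when the
        # region's output value is the mean of its targets.
        mean = sum(values) / len(values)
        return sum((v - mean) ** 2 for v in values)

    best = None
    for j in range(len(X[0])):                 # traverse splitting variables
        for s in sorted({row[j] for row in X}):  # scan candidate split points
            left = [t for row, t in zip(X, y) if row[j] <= s]
            right = [t for row, t in zip(X, y) if row[j] > s]
            if not left or not right:
                continue                       # a split must leave both sides non-empty
            loss = sse(left) + sse(right)
            if best is None or loss < best[2]:
                best = (j, s, loss)
    return best

# One feature; the targets jump between x = 2 and x = 3.
print(best_split([[1.0], [2.0], [3.0], [4.0]], [0.0, 0.0, 10.0, 10.0]))
```

Applying the same search recursively to the two resulting regions, until a stop condition such as a maximum depth is met, yields the full tree.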
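Illustratively, the principle of the GBDT algorithm in the above embodiment is as follows, where $L$ denotes the loss function and $I(\cdot)$ the indicator function:
1) Initialize the weak learner:
$$f_0(x)=\arg\min_{c}\sum_{i=1}^{N}L(y_i,c)$$
2) For $m = 1, 2, \ldots, M$:
(a) For each sample $i = 1, 2, \ldots, N$, calculate the negative gradient, i.e. the residual:
$$r_{mi}=-\left[\frac{\partial L(y_i,f(x_i))}{\partial f(x_i)}\right]_{f(x)=f_{m-1}(x)}$$
(b) Take the residual obtained in the previous step as the new true value of the sample, and use the data $(x_i, r_{mi})$, $i = 1, 2, \ldots, N$, as the training data of the next tree to obtain a new regression tree $f_m(x)$ whose leaf node regions are $R_{jm}$, $j = 1, 2, \ldots, J$, where $J$ is the number of leaf nodes of the regression tree.
(c) For each leaf region $j = 1, 2, \ldots, J$, calculate the best-fit value:
$$\gamma_{jm}=\arg\min_{\gamma}\sum_{x_i\in R_{jm}}L(y_i,f_{m-1}(x_i)+\gamma)$$
(d) Update the strong learner:
$$f_m(x)=f_{m-1}(x)+\sum_{j=1}^{J}\gamma_{jm}\,I(x\in R_{jm})$$
3) Obtain the final learner:
$$f(x)=f_M(x)=f_0(x)+\sum_{m=1}^{M}\sum_{j=1}^{J}\gamma_{jm}\,I(x\in R_{jm})$$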
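The boosting loop can be illustrated with a deliberately small sketch. Several simplifications are assumptions of this example and not part of the described method: a single input feature, depth-one trees (stumps), the squared-error loss $L = \tfrac{1}{2}(y - f)^2$ (for which the negative gradient is exactly the residual and the best leaf value is the mean residual), and a `learning_rate` shrinkage factor that the text does not mention. The function names are hypothetical.

```python
def fit_stump(x, r):
    """Fit a one-split regression tree to residuals r on a single feature x."""
    best = None
    for s in sorted(set(x))[:-1]:              # the largest value would leave the right side empty
        left = [ri for xi, ri in zip(x, r) if xi <= s]
        right = [ri for xi, ri in zip(x, r) if xi > s]
        cl, cr = sum(left) / len(left), sum(right) / len(right)
        loss = sum((v - cl) ** 2 for v in left) + sum((v - cr) ** 2 for v in right)
        if best is None or loss < best[0]:
            best = (loss, s, cl, cr)
    if best is None:                           # all x equal: fall back to a constant
        mean = sum(r) / len(r)
        return lambda v: mean
    _, s, cl, cr = best
    return lambda v: cl if v <= s else cr

def fit_gbdt(x, y, n_trees=50, learning_rate=0.5):
    """Gradient boosting with stumps under squared-error loss (illustrative only)."""
    f0 = sum(y) / len(y)                       # step 1: the constant minimizing squared error
    pred = [f0] * len(y)
    trees = []
    for _ in range(n_trees):                   # step 2
        residuals = [yi - pi for yi, pi in zip(y, pred)]   # (a) negative gradient
        tree = fit_stump(x, residuals)                     # (b), (c) leaf value = mean residual
        trees.append(tree)
        pred = [pi + learning_rate * tree(xi) for xi, pi in zip(x, pred)]  # (d)
    return lambda v: f0 + learning_rate * sum(t(v) for t in trees)         # step 3

model = fit_gbdt([1, 2, 3, 4, 5, 6], [1.0, 1.0, 1.0, 5.0, 5.0, 5.0])
print(model(2), model(5))
```

In practice a library implementation with deeper trees and tuned hyper-parameters would be used; the sketch only mirrors the structure of steps 1) to 3).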
To enable those skilled in the art to better understand the above embodiments, a specific example is described below.
As shown in Fig. 2, an abnormal point detection method based on ship data includes the following steps:
(1) Pre-partitioning the ship data set: the data in a ship data set D provided by a user are randomly and equally divided into K parts using the K-Fold CV algorithm, the parts being numbered 1, 2, …, K, to obtain K sub-data sets $(D_1, D_2, \ldots, D_K)$.
(2) Regression training on the sub-data sets and calculating the trend accuracy: let i = 1. Take the i-th part, i.e. sub-data set $D_i$, as the test set Test and the remaining sub-data sets as the training set Train; perform regression fitting on Train using the GBDT algorithm and calculate the trend accuracy of each sample in the current test set Test. Then let i = i + 1 and repeat the regression training and trend accuracy calculation until i > K.
(3) Determining the detected abnormal points: according to the preset number m input by the user and the trend accuracy of each sample in the whole ship data set D, select the m samples with the smallest trend accuracy as the detected abnormal points in the ship data set D.
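Steps (1) to (3) can be strung together in a short end-to-end sketch. To keep it self-contained, the regressor is swapped for a one-feature least-squares line (`fit_line`), which is NOT the GBDT model the example uses; any function that takes training inputs and targets and returns a predictor can be passed as `fit`. The sign-based pairwise comparison is likewise an assumption, and all names are hypothetical.

```python
import math
import random

def fit_line(x, y):
    """Stand-in regressor (not GBDT): least-squares line on one feature."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx if sxx else 0.0
    return lambda v: my + slope * (v - mx)

def trend_accuracy(y_true, y_pred):
    """Cosine similarity of sign-comparison vectors (the comparison is ASSUMED)."""
    sign = lambda a, b: (a > b) - (a < b)
    n, out = len(y_true), []
    for i in range(n):
        ct = [sign(y_true[i], y_true[j]) for j in range(n)]
        cp = [sign(y_pred[i], y_pred[j]) for j in range(n)]
        norm = math.sqrt(sum(a * a for a in ct) * sum(b * b for b in cp))
        out.append(sum(a * b for a, b in zip(ct, cp)) / norm if norm else 0.0)
    return out

def detect_outliers(x, y, k, m, seed=0, fit=fit_line):
    """Return (indices of the m lowest-scoring samples, all trend accuracies)."""
    idx = list(range(len(x)))
    random.Random(seed).shuffle(idx)
    folds = [idx[f::k] for f in range(k)]                  # step (1)
    scores = [0.0] * len(x)
    for t, test in enumerate(folds):                       # step (2)
        train = [i for f, fold in enumerate(folds) if f != t for i in fold]
        model = fit([x[i] for i in train], [y[i] for i in train])
        acc = trend_accuracy([y[i] for i in test], [model(x[i]) for i in test])
        for i, a in zip(test, acc):
            scores[i] = a
    order = sorted(range(len(x)), key=lambda i: scores[i])  # step (3)
    return order[:m], scores

xs = [float(i) for i in range(20)]
ys = [2.0 * v for v in xs]
ys[0] = 1000.0                                             # one corrupted target value
print(detect_outliers(xs, ys, k=4, m=1))
```

Because the trend accuracy is computed within each test set, a sample's score depends on which other samples share its fold, so results can vary with the random seed.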
The technical scheme shown in FIG. 2 was tested and verified on a test data set. The data set and experimental results are as follows:
1. Example data:
(1) Description of the data set: a ship data set of wave-making resistance calculated with the SHIPFLOW software was selected to test the abnormal point detection method shown in FIG. 2. The design parameters are draft (ship draught), halfbeam (half beam), height (height), loa (length overall) and bulbLengthChange (change in the protruding length of the bulbous bow); the target parameter is the wave-making resistance eval_CWTWC1; and the number of samples is 1211.
(2) Example parameter settings: the number of abnormal points to be deleted in each round, i.e. the preset number of abnormal points, is set to 1, and hyper-parameter optimization of the model is enabled, so that a more accurate GBDT tree model is built and it can be judged objectively whether the current abnormal point detection operation actually improves the modeling precision.
(3) Evaluation index: the mean square error (MSE) is chosen as the evaluation index of model performance and is defined as follows:
$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2$$
where n is the number of cases participating in the training of the surrogate model, ŷ_i is the surrogate model's estimate of the target variable for the i-th case, and y_i is the true value of the target variable for the i-th case. The smaller the MSE, the higher the accuracy of the model.
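The evaluation index can be computed directly from paired true and estimated values. The short sketch below assumes the standard definition of the mean square error; the function name is illustrative.

```python
def mean_squared_error(y_true, y_pred):
    """Mean of the squared differences between estimates and true values."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((p - t) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
```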
2. Abnormal point detection result:
(1) Data set analysis: owing to problems with the grid parameter settings of the SHIPFLOW software and with the plausibility of the hull deformations, some solutions that are wrong by a significant order of magnitude may occur when a large number of samples is computed. For the ship data set used in this experiment, the typical order of magnitude of eval_CWTWC is 1e-4. Therefore, before the abnormal point detection method was applied, all data of excessively large magnitude and other obviously abnormal values, such as null values, were deleted from the ship data set.
After the obviously abnormal data had been removed manually, a histogram of eval_CWTWC was drawn for the ship data set, as shown in FIG. 3. It can be seen that the ship data set as a whole lies essentially within a reasonable distribution range.
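The manual pre-cleaning described above can be approximated by a simple filter. The sketch below is only an illustration: the text gives the typical magnitude (1e-4) but no cut-off, so the `max_ratio` threshold is an assumption, as are the function name and the representation of samples as dictionaries.

```python
import math


def drop_obvious_anomalies(rows, target="eval_CWTWC", typical=1e-4, max_ratio=100.0):
    """Keep rows whose target is present, finite, and not far above the typical magnitude.

    max_ratio is an assumed cut-off; the source does not state one.
    """
    kept = []
    for row in rows:
        value = row.get(target)
        if value is None or (isinstance(value, float) and math.isnan(value)):
            continue  # null or NaN target value
        if abs(value) > typical * max_ratio:
            continue  # magnitude far above the typical 1e-4
        kept.append(row)
    return kept
```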
(2) Abnormal point detection: after a number of abnormal points had been removed using the abnormal point detection method, a histogram of eval_CWTWC was drawn for the ship data set, as shown in FIG. 4. It can be seen directly that some of the abnormal points in the ship data set have been removed.
During the experiment, abnormal points were removed incrementally and the improvement in surrogate model precision (the decrease in MSE) was visualized. Several experiments were run with different random number seeds (random_state) in the GBDT algorithm; for example, the test results for random_state values of 117, 721, 1 and 1102 are shown in FIG. 5, FIG. 6, FIG. 7 and FIG. 8, respectively. As can be seen in these figures, the MSE decreases significantly as abnormal points are incrementally removed. After a certain number of abnormal points have been removed, the MSE reaches a local minimum; if removal continues, the MSE rises again to some extent. This indicates that the model used by the abnormal point detection method achieves a good fit once the abnormal points are removed. The results on this actual industrial data set further verify the performance of the abnormal point detection method in mining potential abnormal points in industrial data sets.
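One possible way to act on this observation is to remove one detected point per round and stop as soon as the MSE no longer decreases. The sketch below is an interpretation, not a procedure stated in the source: the stopping rule, the function name, and the two callables `detect_one` (returns the index of the sample to drop) and `evaluate_mse` (returns the model's MSE on the given samples) are all assumptions.

```python
def remove_until_mse_stops_improving(samples, detect_one, evaluate_mse, max_removed=50):
    """Drop one detected sample per round while the MSE keeps decreasing.

    Returns the remaining samples and the MSE history (one entry per accepted state).
    """
    history = [evaluate_mse(samples)]
    removed = 0
    while removed < max_removed and len(samples) > 2:
        worst = detect_one(samples)                    # index of the next candidate
        trial = samples[:worst] + samples[worst + 1:]  # data set without that sample
        mse = evaluate_mse(trial)
        if mse >= history[-1]:                         # no improvement: keep the sample
            break
        samples = trial
        history.append(mse)
        removed += 1
    return samples, history
```

Because the MSE curve depends on the random seed, a more cautious variant would average the curve over several seeds before choosing where to stop.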
Another embodiment of the present disclosure relates to an abnormal point detecting apparatus based on ship data, as shown in fig. 9, including:
a segmentation module 901, configured to pre-segment the ship data set: equally dividing a ship data set provided by a user into a plurality of subdata sets;
a training module 902 for regression training the subdata set: taking one subdata set as a test set and the corresponding other subdata sets as training sets in turn, and respectively training a regression algorithm model on each training set to obtain a plurality of corresponding regression models;
a calculating module 903, configured to calculate a trend accuracy: respectively testing the regression models based on each test set, and respectively calculating the trend accuracy of each sample in each test set based on the test result to obtain the trend accuracy of each sample in the ship data set;
a determining module 904, configured to determine the detected abnormal point: based on the trend accuracy rate of each sample in the ship data set and the preset number of the abnormal points, selecting the corresponding sample from the ship data set according to a preset selection rule to obtain the abnormal points in the ship data set.
For the specific implementation of the ship data-based abnormal point detection device provided in this embodiment, reference may be made to the ship data-based abnormal point detection method provided in the embodiments of the present disclosure; details are not repeated here.
Compared with the prior art, and addressing the problem that data sets from the industrial design stage contain multiple mixed patterns or have poor internal consistency, this embodiment applies abnormal point detection to industrial-data-driven ship hull form design for the first time. Taking applicability in actual industry into account, it uses the trend accuracy of each sample as the criterion for screening abnormal points. This improves the precision of data modeling and the utilization of the hull form data accumulated by designers and enterprises, has a wide range of application, and effectively assists the intelligent design of ship hull forms.
Another embodiment of the present disclosure relates to an electronic device, as shown in fig. 10, including:
at least one processor 1001; and
a memory 1002 communicatively coupled to the at least one processor 1001; wherein,
the memory 1002 stores instructions executable by the at least one processor 1001, and the instructions are executed by the at least one processor 1001 to enable the at least one processor 1001 to execute the abnormal point detecting method based on ship data according to the above-described embodiment.
The memory and the processor are connected by a bus. The bus may comprise any number of interconnected buses and bridges, and links together the various circuits of the one or more processors and the memory. The bus may also connect various other circuits, such as peripherals, voltage regulators and power management circuits, which are well known in the art and are therefore not described further here. A bus interface provides an interface between the bus and a transceiver. The transceiver may be one element or a plurality of elements, such as a plurality of receivers and transmitters, providing a means for communicating with various other apparatus over a transmission medium. Data processed by the processor is transmitted over a wireless medium via an antenna; the antenna also receives data and passes it to the processor.
The processor is responsible for managing the bus and general processing and may also provide various functions including timing, peripheral interfaces, voltage regulation, power management, and other control functions. And the memory may be used to store data used by the processor in performing operations.
Another embodiment of the present disclosure relates to a computer-readable storage medium storing a computer program that, when executed by a processor, implements the abnormal point detection method based on ship data according to the above-described embodiment.
That is, as those skilled in the art can understand, all or part of the steps of the methods in the foregoing embodiments may be implemented by a program instructing the relevant hardware. The program is stored in a storage medium and includes several instructions that cause a device (which may be a single-chip microcomputer, a chip, or the like) or a processor to execute all or part of the steps of the methods of the embodiments of the present disclosure. The aforementioned storage medium includes any medium capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
It will be understood by those of ordinary skill in the art that the foregoing embodiments are specific to implementations of the present disclosure, and that various changes in form and details may be made therein without departing from the spirit and scope of the present disclosure in practice.
Claims (10)
1. An abnormal point detection method based on ship data is characterized by comprising the following steps:
pre-segmenting the ship data set: equally dividing a ship data set provided by a user into a plurality of subdata sets;
regression training of the subdata set: taking one subdata set as a test set and the corresponding other subdata sets as training sets in turn, and respectively training a regression algorithm model on each training set to obtain a plurality of corresponding regression models;
calculating the trend accuracy: respectively testing the regression models based on the test sets, and respectively calculating the trend accuracy of each sample in each test set based on the test result to obtain the trend accuracy of each sample in the ship data set;
determining the detected abnormal point: and selecting corresponding samples from the ship data set according to a preset selection rule based on the trend accuracy of each sample in the ship data set and the preset number of abnormal points to obtain the abnormal points in the ship data set.
2. The ship data-based anomaly point detection method according to claim 1, wherein the trend accuracy rate of each sample in the test set is represented by the following formula (1):
wherein i = 1, 2, ..., n is the index of a sample in the test set, n is the number of samples in the test set, y_i is the true value of sample i in the test set, ŷ_i is the predicted value corresponding to sample i in the test set, c(y_i) is the vector of magnitude relationships between y_i and the actual values of the other samples in the test set, c(ŷ_i) is the vector of magnitude relationships between ŷ_i and the predicted values corresponding to the other samples in the test set, ⟨c(y_i), c(ŷ_i)⟩ is the inner product of c(y_i) and c(ŷ_i), and ‖c(y_i)‖ and ‖c(ŷ_i)‖ are the L2 norms of c(y_i) and c(ŷ_i), respectively.
3. The ship data-based abnormal point detection method according to claim 2, wherein c(y_i) is represented by the following formula (2):
c(y_i) = {c(y_i, y_j) | j = 1, 2, ..., n} (2)
wherein y_j is the actual value of a sample j other than sample i in the test set, and c(y_i, y_j) is an intermediate variable represented by the following formula (3):
4. The ship data-based abnormal point detection method according to any one of claims 1 to 3, wherein the selecting of the corresponding samples from the ship data set according to a preset selection rule, based on the trend accuracy of each sample in the ship data set and the preset number of abnormal points, to obtain the abnormal points in the ship data set comprises:
selecting, from the ship data set and according to the magnitude of the trend accuracy, the preset number of samples with the lowest trend accuracy, and taking the selected samples as the abnormal points in the ship data set.
5. The method according to claim 4, wherein the selecting of the preset number of samples with the lowest trend accuracy from the ship data set according to the magnitude of the trend accuracy comprises:
sorting all samples in the ship data set according to the trend accuracy;
and selecting, from the sorted samples, the preset number of samples with the lowest trend accuracy.
6. The ship data-based outlier detection method of any of claims 1-3, wherein the regression algorithm model is built based on a gradient-boosting decision tree.
7. The method of any one of claims 1 to 3, wherein the equally dividing the ship data set provided by the user into a plurality of sub data sets comprises:
and equally dividing the ship data set into the plurality of subdata sets by adopting a random sampling method.
8. An anomaly detection device based on ship data, characterized in that the device comprises:
a segmentation module for pre-segmenting the ship data set: equally dividing a ship data set provided by a user into a plurality of subdata sets;
a training module for regression training on the subdata sets: taking one subdata set as a test set and the corresponding other subdata sets as training sets in turn, and respectively training a regression algorithm model on each training set to obtain a plurality of corresponding regression models;
a calculation module to calculate a trend accuracy: respectively testing the regression models based on the test sets, and respectively calculating the trend accuracy of each sample in each test set based on the test result to obtain the trend accuracy of each sample in the ship data set;
a determining module for determining the detected abnormal points: and selecting corresponding samples from the ship data set according to a preset selection rule based on the trend accuracy of each sample in the ship data set and the preset number of abnormal points to obtain the abnormal points in the ship data set.
9. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the ship data-based abnormal point detection method of any one of claims 1-7.
10. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the abnormal point detecting method based on ship data according to any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210444344.6A CN115600102B (en) | 2022-04-26 | 2022-04-26 | Abnormal point detection method and device based on ship data, electronic equipment and medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN115600102A (en) | 2023-01-13 |
CN115600102B (en) | 2023-11-21 |
Family
ID=84841822
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210444344.6A, Active, CN115600102B (en) | 2022-04-26 | 2022-04-26 | Abnormal point detection method and device based on ship data, electronic equipment and medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115600102B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116701846A (en) * | 2023-08-04 | 2023-09-05 | 长江水利委员会长江科学院 | Hydropower station dispatching operation data cleaning method based on unsupervised learning |
Also Published As
Publication number | Publication date |
---|---|
CN115600102B (en) | 2023-11-21 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||