CN113032378B

CN113032378B - Ship behavior pattern mining method based on clustering algorithm and pattern mining

Info

Publication number: CN113032378B
Application number: CN202110247443.0A
Authority: CN
Inventors: 李永; 陈菲娅
Original assignee: Beijing University of Technology
Current assignee: Beijing University of Technology
Priority date: 2021-03-05
Filing date: 2021-03-05
Publication date: 2024-07-19
Anticipated expiration: 2041-03-05
Also published as: CN113032378A

Abstract

A ship behavior pattern mining method based on a clustering algorithm and pattern mining, relating to the field of ship pattern mining. The method mainly comprises: obtaining the tracks stored in a ship track database; applying two processing stages, data cleaning and track compression; clustering the processed track data to generate clusters; and taking the clusters as frequent item sets for frequent sequence mining. The invention optimizes the key-point clusters in the data and makes effective use of the clustering algorithm to mine various kinds of ship data, so that it adapts to the uneven density of large-scale ship data and improves the quality and accuracy of behavior pattern mining.

Description

Ship behavior pattern mining method based on clustering algorithm and pattern mining

Technical Field

In view of the characteristics of ship tracks, the invention designs a ship behavior pattern mining method using an improved clustering algorithm and a pattern mining algorithm, thereby mining the behavior tracks of ships during navigation. The invention relates to the field of pattern mining, and in particular to pattern mining for ships.

Background

With the development in recent years of emerging technologies such as big data, cloud services and artificial intelligence, China's marine development has also become increasingly information-based and data-driven. Meanwhile, basic technologies such as positioning systems, communication equipment and sensor networks have become more precise and more widely applied. These developments allow large-scale data to be collected accurately and stored reliably. The complex, large-scale data of ships has attracted wide attention from government, national defense units and related enterprises, which hope to make full use of it through various technical methods. Ship behavior pattern mining based on a clustering algorithm and pattern mining uses clustering and mining to form a template of a target's historical behavior from ship track data, and can thus provide important support and reference for the application and processing of ship data.

Trajectory data is the set of points of a moving object in space-time; each point carries information such as the object's longitude and latitude, altitude, speed and time, and is a snapshot of the moving object's activity. With the continuing development of navigation information technology, ships are required to carry the Automatic Identification System (AIS), which generates massive AIS data containing rich ship information. Between the original track and the application of behavior patterns, the raw track data collected by the equipment is processed by data cleaning, data preprocessing and similar means to filter out noise, redundancy, stay points and the like. The data is then used for pattern mining to form a more accurate and reliable historical behavior track template of the moving target, showing the activity regularities of the target individual or group. Analyzing track data by these technical means reveals the activity regularities of individual moving objects and of groups, and can further support track prediction, traffic planning, and the monitoring of sea and air targets.

Cluster analysis is an effective means of data mining and has been widely used in pattern recognition, data analysis, image processing and other fields. Frequent pattern mining is one of the directions of trajectory data pattern mining.

In choosing a clustering method, it is taken into account that ship track points rarely repeat exactly and that the density distribution of ship data is irregular, while density-based clustering methods have low complexity and can cluster irregularly shaped data sets according to the distribution of data density. Density-based clustering is therefore judged better suited to ship data, which is spatially scattered, irregular in shape and large in scale. DBSCAN is a typical density-based spatial clustering algorithm: it groups regions of high point density into clusters, filters out low-density regions, and can find clusters of arbitrary shape in a data set containing noise. However, DBSCAN is very sensitive to its parameters Eps and MinPts; poorly chosen values lead to poor or even incorrect clustering results.

The invention preprocesses the original ship data set, clusters it with an improved DBSCAN algorithm, takes the resulting clusters as frequent items, and mines these frequent items with a pattern mining algorithm to obtain the behavior template sequences of the ship.

Disclosure of Invention

The invention provides a ship behavior pattern mining method based on an improved clustering algorithm and a pattern mining algorithm.

The invention provides a method for mining ship behavior patterns and optimizes the key-point clusters within it, so that the clustering algorithm can be used effectively to mine various kinds of ship data. The method thereby adapts to the uneven density of large-scale ship data and improves the quality and accuracy of behavior pattern mining.

The invention adopts the following technical scheme and implementation steps:

A ship behavior pattern mining method based on a clustering algorithm and pattern mining.

The original track data contains redundancy and noise, which greatly hinder subsequent analysis, so the original track data must be preprocessed to obtain track data suitable for analysis. The preprocessed track data is then clustered, frequent item sets are defined, and frequent patterns are mined.

The method is characterized by comprising the following steps:

(1) A data cleaning and compression method, designed for the large space-time span and gentle steering amplitude of the original target tracks:

① The data set must be cleaned because the collected raw data may contain outliers. The main rules are as follows:

If the time interval between the start point and the end point of a track segment is more than one day (24 hours), the track is split at that segment.

A maximum ship speed v_max is defined; v_max is set to 110 km/h according to the ship overview and knowledge graph published by National Defense Industry Press. Assuming that track point p(lon_p, lat_p, t_p) is the point immediately before track point q(lon_q, lat_q, t_q) in the track segment, with t_p < t_q, the speed between the two points is calculated as:

$$v_{pq} = \frac{\mathrm{Haversine}(\mathrm{lat}_p, \mathrm{lon}_p, \mathrm{lat}_q, \mathrm{lon}_q)}{t_q - t_p} \quad (1)$$

where lon_p is the longitude of track point p, lat_p is its latitude, and t_p is the time at which p was generated; similarly, lon_q is the longitude of track point q, lat_q is its latitude, and t_q is the time at which q was generated. Haversine(lat_p, lon_p, lat_q, lon_q) is the distance between the two latitude-longitude points computed with the Haversine formula. If v_pq ≥ v_max, the speed at point q is abnormal; q is defined as an outlier and deleted.
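A minimal Python sketch of the speed rule above, under stated assumptions: points are (lon, lat, t) tuples with t in seconds, the mean Earth radius of 6371 km is an assumed constant, and the names `haversine_km` and `drop_speed_outliers` are illustrative rather than taken from the invention. Each point is compared with the last retained point, which is one possible reading of the rule.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius, not specified in the text
V_MAX_KMH = 110.0         # maximum ship speed stated in the text


def haversine_km(lat_p, lon_p, lat_q, lon_q):
    """Great-circle distance in km between two points given in degrees."""
    phi_p, phi_q = math.radians(lat_p), math.radians(lat_q)
    d_phi = phi_q - phi_p
    d_lam = math.radians(lon_q - lon_p)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi_p) * math.cos(phi_q) * math.sin(d_lam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def drop_speed_outliers(points, v_max=V_MAX_KMH):
    """points: list of (lon, lat, t_seconds) sorted by time.

    A point q is kept only if its speed relative to the last kept
    point p is below v_max (assumption: compare with the last kept point).
    """
    if not points:
        return []
    kept = [points[0]]
    for lon_q, lat_q, t_q in points[1:]:
        lon_p, lat_p, t_p = kept[-1]
        hours = abs(t_q - t_p) / 3600.0
        if hours == 0:
            continue  # identical timestamps: speed undefined, dropped here
        v_pq = haversine_km(lat_p, lon_p, lat_q, lon_q) / hours
        if v_pq < v_max:
            kept.append((lon_q, lat_q, t_q))
    return kept
```

For example, a point one degree of latitude away (about 111 km) reported one hour later exceeds 110 km/h and would be dropped.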

② Track data is generally collected at intervals of seconds, so its volume is large. This volume slows down data analysis, and many of the track points are unnecessary for the analysis. To simplify computation, the track data is compressed. The Douglas-Peucker (DP) algorithm is a commonly used track compression algorithm; using it requires an offset threshold, threshold, on the distance from a point to the straight line of the track. In the experiment, a total of 102584 track points from 8 targets were selected, and the compression ratio, compression time and post-compression track error were compared under different thresholds. As the compression threshold increases, the compression ratio and the compression error of the track increase, while the compression time decreases. To balance computation time against computation error, the track compression threshold is set to 0.8 km, which gives a relatively accurate compression result in a short computation time.

The main flow is as follows:

All points on the track segment are arranged in time order.

1) Connect the first point A and the last point B of the track segment with a straight line AB; this line is the chord of the track segment.

2) Find the point C on the track segment that is farthest from the line AB, and compute the distance d between C and AB.

3) Compare d with the preset threshold threshold. If d is smaller than threshold, the straight line segment is taken as the approximation of the track segment, and processing of this segment is complete.

4) If d is greater than or equal to threshold, split the track segment at C into two sub-segments AC and CB, and apply steps 1) to 3) to each of them.

When all sub-segments have been processed, the polyline formed by connecting the split points in sequence is taken as the approximation of the track.
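The flow above can be sketched in Python as follows. This is an illustrative sketch, not the invention's implementation: it assumes the track points have already been projected to planar (x, y) coordinates in kilometres, so that the 0.8 km threshold can be compared with a Euclidean point-to-line distance. The function names are hypothetical.

```python
import math


def point_to_chord_distance(pt, a, b):
    """Perpendicular distance from pt to the straight line through a and b."""
    (x, y), (xa, ya), (xb, yb) = pt, a, b
    dx, dy = xb - xa, yb - ya
    length = math.hypot(dx, dy)
    if length == 0.0:
        # A and B coincide: fall back to the distance to that single point.
        return math.hypot(x - xa, y - ya)
    return abs(dy * (x - xa) - dx * (y - ya)) / length


def douglas_peucker(points, threshold):
    """Compress a time-ordered list of (x, y) points following steps 1) to 4)."""
    if len(points) < 3:
        return list(points)
    a, b = points[0], points[-1]              # step 1: chord AB
    idx, d_max = 0, -1.0
    for i in range(1, len(points) - 1):       # step 2: farthest point C
        d = point_to_chord_distance(points[i], a, b)
        if d > d_max:
            idx, d_max = i, d
    if d_max < threshold:                     # step 3: chord is good enough
        return [a, b]
    left = douglas_peucker(points[:idx + 1], threshold)   # step 4: AC
    right = douglas_peucker(points[idx:], threshold)      # step 4: CB
    return left[:-1] + right                  # C appears once in the result
```

With `threshold=0.8`, the track `[(0, 0), (1, 0.1), (2, 0), (3, 2), (4, 0)]` keeps the sharp turn at `(3, 2)` and drops the near-collinear point `(1, 0.1)`. The recursion depth grows with the number of retained split points, so very long tracks may call for an iterative variant.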

(2) Designing an improved DBSCAN clustering method:

① Parameter Eps neighborhood: the Eps neighborhood of an object p is the region centered on p with radius Eps, namely:

$$N_{Eps}(p) = \{\, q \in D \mid \mathrm{Dist}(p, q) \le Eps \,\} \quad (2)$$

where D is the data set; Dist(p, q) is the distance between object p and object q; and N_Eps(p) is the set of points of data set D that lie within the hypersphere of radius Eps centered on object p.

The neighborhood value Eps is obtained using kernel density estimation.

For a ship track data set D containing n independent and identically distributed sample points x_1, x_2, x_3, …, x_n, let the probability density function of D be f(x). The kernel density estimate of f(x) has the form:

$$\hat{f}_h(x) = \frac{1}{n}\sum_{i=1}^{n} K_h(x - x_i) = \frac{1}{nh}\sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h}\right)$$

where $K(\cdot)$ is the kernel function (non-negative, integrating to 1, with mean 0), i = 1, 2, …, n; $K_h(\cdot)$ is the scaled kernel, with $K_h(x) = \frac{1}{h}K\!\left(\frac{x}{h}\right)$; h is the bandwidth, also called the window (h > 0); and n is the number of samples.

The bandwidth is a free parameter that has a large impact on the resulting estimate. To choose h, the mean integrated squared error (MISE) between the estimated probability density function $\hat{f}_h(x)$ and the true probability density function f(x) is used, expressed as

$$\mathrm{MISE}(h) = E\!\left[\int \left(\hat{f}_h(x) - f(x)\right)^2 dx\right]$$

where E(·) denotes the mathematical expectation of the quantity in brackets.

Under weak assumptions,

$$\mathrm{MISE}(h) = \mathrm{AMISE}(h) + o\!\left(\frac{1}{nh} + h^4\right)$$

where o(·) denotes a higher-order infinitesimal of the quantity in brackets and AMISE is the asymptotic mean integrated squared error, given by

$$\mathrm{AMISE}(h) = \frac{R(K)}{nh} + \frac{1}{4}\, m_2(K)^2\, h^4\, R(f'')$$

with $R(g) = \int g(x)^2\,dx$ and $m_2(K) = \int x^2 K(x)\,dx$.

To minimize MISE(h), the problem is converted into finding an extremum:

$$\frac{\partial}{\partial h}\mathrm{AMISE}(h) = -\frac{R(K)}{nh^2} + m_2(K)^2\, h^3\, R(f'') = 0$$

which gives the optimal bandwidth

$$h_{\mathrm{AMISE}} = \frac{R(K)^{1/5}}{m_2(K)^{2/5}\, R(f'')^{1/5}\, n^{1/5}}$$

This expression for the bandwidth h contains the second derivative f''(x) of the probability density function f(x), which represents the degree of concavity and convexity of the density at each point. Taking the Gaussian kernel as the kernel function for the kernel density estimate, the optimal choice of h (the bandwidth that minimizes the mean integrated squared error) is

$$h = \left(\frac{4\hat{\sigma}^5}{3n}\right)^{1/5} \approx 1.06\,\hat{\sigma}\,n^{-1/5}$$

where $\hat{\sigma}$ is the standard deviation of the samples. For data set D, the optimal kernel density estimation bandwidth h is obtained from the number of samples in the data set, and h is used as the initial Eps value for clustering the data set.
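As an illustration of the bandwidth choice, the sketch below computes the standard Gaussian-kernel rule of thumb h = (4σ̂⁵ / (3n))^(1/5), which is assumed here to be the closed form the passage refers to. It treats the samples as one-dimensional and uses the n−1 sample standard deviation; how two-dimensional longitude and latitude data are reduced to such samples is not specified in the text, so that step is left to the caller. All function names are hypothetical.

```python
import math
import statistics


def rule_of_thumb_bandwidth(samples):
    """Bandwidth h = (4 * sigma^5 / (3 * n))^(1/5) for a Gaussian kernel."""
    n = len(samples)
    sigma = statistics.stdev(samples)  # sample standard deviation (n - 1)
    return (4.0 * sigma ** 5 / (3.0 * n)) ** 0.2


def gaussian_kde(x, samples, h):
    """Kernel density estimate at x: (1 / (n h)) * sum K((x - x_i) / h)."""
    n = len(samples)
    total = sum(math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in samples)
    return total / (n * h * math.sqrt(2.0 * math.pi))


samples = [1.0, 2.0, 3.0, 4.0, 5.0]
eps_initial = rule_of_thumb_bandwidth(samples)  # used as the initial Eps
```

The resulting `eps_initial` would then be passed to the clustering step as the neighborhood radius.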

② The MinPts density threshold parameter is the threshold on the number of samples within the Eps neighborhood of a sample.

Data set D is traversed and, for each data point, the number M of objects in its Eps neighborhood is recorded; M serves as the basis for the density distribution of the data set. The data point with the largest M in D is selected as the first core object D₁, and its M value is taken as the initial MinPts for clustering the first cluster. After that cluster is complete, the data object with the largest M among the objects not yet clustered is selected as the core object for the next round of clustering. The density threshold MinPts of clusters other than the first cluster is obtained by dynamic updating

where M(n) is the M value of the current core object; M(max) is the M value of D₁; the superscript 2 in the update denotes the power of 2; and MinPts' is the density threshold before the update. MinPts is updated once per cluster, following the density value M of the neighborhood of the core object that starts each cluster.
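The M values and the first core object D₁ can be computed as in the following sketch. It assumes planar points and Euclidean distance, and counts the point itself as a member of its own Eps neighborhood, which follows from Dist(p, p) = 0 ≤ Eps in definition (2). The function names are illustrative.

```python
import math


def neighbor_counts(data, eps):
    """M value of every point: the size of its Eps neighborhood."""
    return [sum(1 for q in data if math.dist(p, q) <= eps) for p in data]


def first_core_object(data, eps):
    """Return (index of D1, initial MinPts), where D1 has the largest M."""
    m = neighbor_counts(data, eps)
    d1 = max(range(len(data)), key=m.__getitem__)
    return d1, m[d1]


data = [(0, 0), (0, 1), (1, 0), (5, 5)]
d1, initial_minpts = first_core_object(data, eps=1.0)
```

Here the point `(0, 0)` has three objects in its neighborhood, so it becomes D₁ and the initial MinPts is 3. The quadratic scan is kept for clarity; a spatial index would be the usual choice for large AIS data sets.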

(3) Sequence mining is performed on the frequent item sets.

Clustering the ship data set yields frequent items, each represented by a cluster center and the distance range of its cluster, together with tracks formed by the cluster centers. The PrefixSpan (Prefix-Projected Pattern Growth) algorithm is then used to mine frequently occurring partial-order rules from the large number of partial-order sequences, which gives the behavior patterns of the ship.
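A compact PrefixSpan sketch is given below for illustration. It assumes each track has already been turned into a sequence of single items (for example cluster identifiers), so itemset-valued elements are not handled, and `min_support` is a hypothetical parameter that the text does not specify.

```python
def prefixspan(sequences, min_support):
    """Return {pattern (tuple): support} for all frequent sequential patterns."""
    results = {}

    def mine(prefix, projected):
        # projected holds (sequence index, start position) pairs, i.e. the
        # suffixes that remain after matching the current prefix.
        counts = {}
        for seq_idx, start in projected:
            seen = set()
            for item in sequences[seq_idx][start:]:
                if item not in seen:          # count each sequence once
                    seen.add(item)
                    counts[item] = counts.get(item, 0) + 1
        for item, support in sorted(counts.items()):
            if support < min_support:
                continue
            pattern = prefix + (item,)
            results[pattern] = support
            # Project each suffix past the first occurrence of the item.
            new_projected = []
            for seq_idx, start in projected:
                seq = sequences[seq_idx]
                for pos in range(start, len(seq)):
                    if seq[pos] == item:
                        new_projected.append((seq_idx, pos + 1))
                        break
            mine(pattern, new_projected)

    mine((), [(i, 0) for i in range(len(sequences))])
    return results


tracks = [["A", "B", "C"], ["A", "C"], ["B", "A", "C"]]
patterns = prefixspan(tracks, min_support=2)
```

In this toy input, the ordered pattern `("A", "C")` occurs in all three tracks, while `("B", "C")` occurs in two.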

The main features of the invention are as follows:

(1) On the basis of a clustering algorithm and pattern mining, the invention mines patterns in ship behavior. It provides cleaning, compression and clustering, and uses the clustering result as the input to template mining. The method is matched to the characteristics of ship data, namely redundancy, large volume, gentle steering angles and tracks that rarely repeat exactly, so that ship behavior is mined effectively and the navigation regularities of the ship are obtained.

(2) The clustering method is improved relative to existing clustering algorithms. The improved algorithm makes the clustering process fit the sample characteristics more closely and yields higher-quality clustering results, so that the final behavior pattern mining is more accurate, ship data is used more fully, and a more accurate reference is provided for the use of ocean data.

Drawings

FIG. 1 is a schematic view of the overall structure of the present invention;

FIG. 2 is a schematic diagram of the clustering method of the present invention.

Detailed Description

The invention takes ship AIS data as the original data and mines ship behavior patterns using a clustering algorithm and a pattern mining algorithm, with the following technical scheme and implementation steps.

The ship behavior pattern mining method based on the clustering algorithm and pattern mining specifically comprises the following steps:

1. Clean and compress the original ship data set, as follows:

Step 1: Traverse the track data set, split every track at segments whose start and end points are more than one day (24 hours) apart, and store the split track data set.

Step 2: For the split track data set, traverse all sub-segments of each track. With the maximum speed v_max set to 110 km/h, if the speed over a track segment is greater than this value, take the start point of that segment as the end point of the previous track and its end point as the start point of the next track. Outliers are thereby isolated into tracks of length 1, and they are removed by discarding all tracks of length 1.

Step 3: Compress the cleaned data set track by track using the Douglas-Peucker algorithm, and store the compressed track data set.
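Steps 1 and 2 share one mechanism: split a track wherever a condition holds between consecutive points, then discard tracks of length 1. A minimal sketch follows, assuming points are (lon, lat, t) tuples with t in seconds. The helper names are hypothetical, and the speed condition of Step 2 would be supplied as another predicate of the same shape.

```python
DAY_SECONDS = 24 * 3600


def split_where(points, should_split):
    """Split a time-ordered point list wherever should_split(prev, cur) holds."""
    tracks, current = [], []
    for pt in points:
        if current and should_split(current[-1], pt):
            tracks.append(current)
            current = []
        current.append(pt)
    if current:
        tracks.append(current)
    return tracks


def gap_exceeds_one_day(prev, cur):
    """Step 1 condition: more than 24 hours between consecutive points."""
    return cur[2] - prev[2] > DAY_SECONDS


def drop_singletons(tracks):
    """Step 2 clean-up: tracks of length 1 are treated as outliers."""
    return [t for t in tracks if len(t) > 1]


points = [(0, 0, 0), (0, 0, 3600), (0, 0, 100000),
          (0, 0, 103600), (0, 0, 300000)]
tracks = drop_singletons(split_where(points, gap_exceeds_one_day))
```

The five sample points split into three tracks, and the final single-point track is discarded.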

2. Cluster the processed track data set using the improved DBSCAN clustering method

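Step 1: Set the initial Eps neighborhood threshold.

From the kernel density estimate

$$\hat{f}_h(x) = \frac{1}{n}\sum_{i=1}^{n} K_h(x - x_i) = \frac{1}{nh}\sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h}\right)$$

the optimal bandwidth h is selected by minimizing the mean integrated squared error

$$\mathrm{MISE}(h) = E\!\left[\int \left(\hat{f}_h(x) - f(x)\right)^2 dx\right]$$

for which the asymptotic form AMISE is

$$\mathrm{AMISE}(h) = \frac{R(K)}{nh} + \frac{1}{4}\, m_2(K)^2\, h^4\, R(f'')$$

with $R(g) = \int g(x)^2\,dx$ and $m_2(K) = \int x^2 K(x)\,dx$.

According to the above formula, the minimizing value of h is found by locating the extremum of MISE(h):

$$\frac{\partial}{\partial h}\mathrm{AMISE}(h) = -\frac{R(K)}{nh^2} + m_2(K)^2\, h^3\, R(f'') = 0$$

Using kernel density estimation with a Gaussian kernel function, the best choice of h is

$$h = \left(\frac{4\hat{\sigma}^5}{3n}\right)^{1/5} \approx 1.06\,\hat{\sigma}\,n^{-1/5}$$

where $\hat{\sigma}$ is the standard deviation of the samples and n is the number of objects in the data set. The value of h is assigned to the neighborhood threshold Eps used for clustering.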

Step 2: Traverse the data set D and record, for each data point, the number M of objects in its Eps neighborhood.

Step 3: Select the data point with the largest M as the first core object, cluster the first cluster with DBSCAN using the parameters Eps and MinPts, and mark the clustered data points until this cluster is complete.

Step 4: From the data objects not yet clustered, select the one with the largest M as the core object, and dynamically obtain the density threshold MinPts for clusters other than the first cluster

Cluster using the parameters Eps and MinPts.

Step 5: Repeat Step 4 until none of the remaining objects can serve as a core object.

Step 6: Record the track formed by the cluster centers, together with each cluster center and the range of its cluster.
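The control flow of Steps 2 to 5 can be sketched as below. Because the MinPts update formula is not reproduced in this text, the sketch takes the update as a caller-supplied function; `placeholder_update` is a purely hypothetical stand-in used only to make the example runnable and should not be read as the invention's formula. Planar points and Euclidean distance are also assumptions.

```python
import math


def neighbors(data, i, eps):
    """Indices of all points in the Eps neighborhood of point i."""
    return [j for j in range(len(data)) if math.dist(data[i], data[j]) <= eps]


def expand_cluster(data, seed, eps, min_pts, labels, cluster_id):
    """Standard DBSCAN expansion from seed over still-unlabelled points."""
    labels[seed] = cluster_id
    stack = [seed]
    while stack:
        i = stack.pop()
        nbrs = neighbors(data, i, eps)
        if len(nbrs) < min_pts:
            continue  # border point: stays in the cluster, does not expand it
        for j in nbrs:
            if labels[j] is None:
                labels[j] = cluster_id
                stack.append(j)


def placeholder_update(min_pts_prev, m_n, m_max):
    """HYPOTHETICAL update rule, NOT the formula of the source text."""
    return max(2, round(min_pts_prev * (m_n / m_max) ** 2))


def improved_dbscan(data, eps, update_minpts=placeholder_update):
    """Return one label per point; None marks points left unclustered."""
    m = [len(neighbors(data, i, eps)) for i in range(len(data))]   # Step 2
    labels = [None] * len(data)
    m_max = max(m)
    min_pts = m_max                       # Step 3: initial MinPts = M of D1
    cluster_id = 0
    while True:
        free = [i for i in range(len(data)) if labels[i] is None]
        if not free:
            break
        core = max(free, key=lambda i: m[i])          # Step 4: densest point
        if cluster_id > 0:
            min_pts = update_minpts(min_pts, m[core], m_max)
        if m[core] < min_pts:
            break                         # Step 5: no core object remains
        expand_cluster(data, core, eps, min_pts, labels, cluster_id)
        cluster_id += 1
    return labels


data = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5),
        (10, 10), (10, 11), (11, 10), (50, 50)]
labels = improved_dbscan(data, eps=1.5)
```

With these sample points, the dense group becomes the first cluster, the sparser group is found after MinPts is lowered, and the isolated point stays unlabelled. Swapping in the actual update rule only requires passing a different `update_minpts`.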

3. Perform pattern mining on the clustering result.

Taking the cluster centers and cluster ranges as frequent items, the PrefixSpan algorithm is applied to the tracks formed by the cluster centers to obtain time-ordered frequent sequences. These sequences are the behavior templates of the ship, that is, the regularities of operation during navigation.
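One way to prepare input for the sequence mining step is to map each track onto the clusters it passes through. The sketch below is an assumption about that preprocessing, not a procedure stated in the text: clusters are represented as hypothetical (center, radius) pairs, points outside every cluster are skipped, and consecutive repeats of the same cluster are collapsed.

```python
import math


def track_to_items(track, clusters):
    """Convert a time-ordered list of (x, y) points into cluster indices.

    clusters: list of (center_xy, radius) pairs (assumed representation).
    """
    items = []
    for pt in track:
        for cid, (center, radius) in enumerate(clusters):
            if math.dist(pt, center) <= radius:
                if not items or items[-1] != cid:
                    items.append(cid)   # record each visit once
                break                   # first matching cluster wins
    return items


clusters = [((0, 0), 1.0), ((5, 5), 1.0)]
track = [(0.1, 0), (0.2, 0.1), (3, 3), (5, 5.2), (0, 0.5)]
sequence = track_to_items(track, clusters)
```

The sample track visits cluster 0, then cluster 1, then cluster 0 again, and the resulting index sequences are what a sequential pattern miner such as PrefixSpan would consume.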

Claims

1. The ship behavior pattern mining method based on the clustering algorithm and pattern mining is characterized by comprising the following steps of:

(1) Data cleaning and compression:

① Cleaning the data set is necessary because the collected raw data may have outliers; the rules are:

If the time interval between the start point and the end point of a track segment is greater than 24 hours, the track is split at that segment;

A maximum speed v_max is defined and set to 110 km/h; assuming that track point p(lon_p, lat_p, t_p) is the point immediately before track point q(lon_q, lat_q, t_q) in the track segment, with t_p < t_q, the speed between the two points is calculated as:

$$v_{pq} = \frac{\mathrm{Haversine}(\mathrm{lat}_p, \mathrm{lon}_p, \mathrm{lat}_q, \mathrm{lon}_q)}{t_q - t_p} \quad (1)$$

where lon_p is the longitude of track point p, lat_p is its latitude, and t_p is the time at which p was generated; similarly, lon_q is the longitude of track point q, lat_q is its latitude, and t_q is the time at which q was generated; Haversine(lat_p, lon_p, lat_q, lon_q) is the distance between the two latitude-longitude points computed with the Haversine formula; if v_pq ≥ v_max, the speed at point q is abnormal, q is defined as an outlier, and the point is deleted;

② Compressing the track data, and setting a track compression threshold value threshold to be 0.8km in order to balance a calculation time and a calculation error;

The flow is as follows:

arranging all points on the track segment in time order;

1) Connecting the first point A and the last point B of the track segment with a straight line AB, the line being the chord of the track segment;

2) Finding the point C on the track segment that is farthest from the line AB, and computing the distance d between C and AB;

3) Comparing d with the preset threshold threshold, and if d is smaller than threshold, taking the straight line segment as the approximation of the track segment, processing of the segment then being complete;

4) If d is greater than or equal to threshold, splitting the track segment at C into two sub-segments AC and CB, and applying steps 1) to 3) to each of them;

When all sub-segments have been processed, the polyline formed by connecting the split points in sequence is taken as the approximation of the track;

(2) Designing an improved DBSCAN clustering method:

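① Parameter Eps neighborhood: the Eps neighborhood of an object p is the region centered on p with radius Eps, namely:

$$N_{Eps}(p) = \{\, q \in D \mid \mathrm{Dist}(p, q) \le Eps \,\} \quad (2)$$

where D is the data set; Dist(p, q) is the distance between object p and object q; and N_Eps(p) is the set of points of data set D that lie within the hypersphere of radius Eps centered on object p;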

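The neighborhood value Eps is obtained using kernel density estimation; for a ship track data set D containing n independent and identically distributed sample points x_1, x_2, x_3, …, x_n, with probability density function f(x), the kernel density estimate of f(x) has the form:

$$\hat{f}_h(x) = \frac{1}{n}\sum_{i=1}^{n} K_h(x - x_i) = \frac{1}{nh}\sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h}\right)$$

where $K(\cdot)$ is the kernel function, which is non-negative, integrates to 1 and has mean 0, i = 1, 2, …, n; $K_h(\cdot)$ is the scaled kernel, with $K_h(x) = \frac{1}{h}K\!\left(\frac{x}{h}\right)$; h is the bandwidth, also called the window, h > 0; n is the number of samples;

For the selection of h, the mean integrated squared error between the estimated probability density function $\hat{f}_h(x)$ and the true probability density function f(x) is used, expressed as

$$\mathrm{MISE}(h) = E\!\left[\int \left(\hat{f}_h(x) - f(x)\right)^2 dx\right]$$

where E(·) denotes the mathematical expectation of the quantity in brackets;

Under weak assumptions,

$$\mathrm{MISE}(h) = \mathrm{AMISE}(h) + o\!\left(\frac{1}{nh} + h^4\right), \qquad \mathrm{AMISE}(h) = \frac{R(K)}{nh} + \frac{1}{4}\, m_2(K)^2\, h^4\, R(f'')$$

with $R(g) = \int g(x)^2\,dx$ and $m_2(K) = \int x^2 K(x)\,dx$;

To minimize MISE(h), the problem is converted into finding an extremum:

$$\frac{\partial}{\partial h}\mathrm{AMISE}(h) = -\frac{R(K)}{nh^2} + m_2(K)^2\, h^3\, R(f'') = 0$$

which gives the optimal bandwidth

$$h_{\mathrm{AMISE}} = \frac{R(K)^{1/5}}{m_2(K)^{2/5}\, R(f'')^{1/5}\, n^{1/5}}$$

This expression for the bandwidth h contains the second derivative f''(x) of the probability density function f(x), which represents the degree of concavity and convexity of the density at each point; taking the Gaussian kernel as the kernel function for the kernel density estimate, the optimal choice of h, that is, the bandwidth that minimizes the mean integrated squared error, is

$$h = \left(\frac{4\hat{\sigma}^5}{3n}\right)^{1/5} \approx 1.06\,\hat{\sigma}\,n^{-1/5}$$

where $\hat{\sigma}$ is the standard deviation of the samples; for data set D, the optimal kernel density estimation bandwidth h is obtained from the number of samples in the data set, and h is used as the initial Eps value for clustering the data set;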

② The MinPts density threshold parameter is the threshold on the number of samples within the Eps neighborhood of a sample;

Data set D is traversed and, for each data point, the number M of objects in its Eps neighborhood is recorded, M serving as the basis for the density distribution of the data set; the data point with the largest M in D is selected as the first core object D₁, and its M value is taken as the initial MinPts for clustering the first cluster; after that cluster is complete, the data object with the largest M among the objects not yet clustered is selected as the core object for the next round of clustering; the density threshold MinPts of clusters other than the first cluster is obtained by dynamic updating

where M(n) is the M value of the current core object; M(max) is the M value of D₁; the superscript 2 in the update denotes the power of 2; MinPts' is the density threshold before the update; MinPts is updated once per cluster, following the density value M of the neighborhood of the core object that starts each cluster;

(3) Sequence mining is carried out on frequent item sets;

Clustering the ship data set yields frequent items, each represented by a cluster center and the distance range of its cluster, together with tracks formed by the cluster centers; frequently occurring partial-order rules are mined from the partial-order sequences to obtain the behavior patterns of the ship.