CN113780395B - Mass high-dimensional AIS trajectory data clustering method - Google Patents


Info

Publication number
CN113780395B
Authority
CN
China
Prior art keywords
track
clustering
data
dimensional
ais
Prior art date
Legal status
Active
Application number
CN202111012775.7A
Other languages
Chinese (zh)
Other versions
CN113780395A (en)
Inventor
廖泓舟
代翔
潘磊
高翔
崔莹
陈伟晴
Current Assignee
Southwest Electronic Technology Institute No 10 Institute of Cetc
Original Assignee
Southwest Electronic Technology Institute No 10 Institute of Cetc
Priority date
Filing date
Publication date
Application filed by Southwest Electronic Technology Institute No 10 Institute of Cetc
Priority to CN202111012775.7A
Publication of CN113780395A
Priority to PCT/CN2022/083839 (WO2023029461A1)
Application granted
Publication of CN113780395B


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/23 - Clustering techniques
    • G06F 18/232 - Non-hierarchical techniques
    • G06F 18/2321 - Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F 18/23213 - Non-hierarchical techniques using statistics or function optimisation with fixed number of clusters, e.g. K-means clustering
    • G06F 18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F 18/22 - Matching criteria, e.g. proximity measures
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks


Abstract

The invention discloses a clustering method for massive high-dimensional AIS trajectory data that offers high accuracy and fast execution. The method is realized by the following technical scheme: split the data into multiple tracks according to heading information, preprocess the AIS trajectory data, and perform linear interpolation and data completion; feed the preprocessed AIS trajectory data into an autoencoder network for reconstruction training and output dimension-reduced trajectory feature embedding points; cluster the embedding points with a Euclidean-distance-based k-means algorithm to obtain initial cluster centers; attach the pretrained encoder to a clustering layer to build a deep clustering network, compute the soft assignment probability of each trajectory feature embedding point to the initial cluster centers and the auxiliary probability of belonging to a given cluster, minimize the KL divergence between the two distributions by gradient descent, and stop the clustering process when the change in cluster assignments between consecutive iterations falls below a set value, yielding the final clustering result.

Description

Method for clustering massive high-dimensional AIS (Automatic Identification System) trajectory data
Technical Field
The invention relates to data clustering technology, in particular to an Automatic Identification System (AIS) trajectory clustering method based on deep embedded clustering, used to solve the problem of clustering massive high-dimensional ship AIS trajectory data.
Background
A spatio-temporal trajectory is a recorded sequence of a moving object's positions and times. As an important class of spatio-temporal object data, trajectories are widely used in traffic-flow pattern and feature research, resource allocation, sea-ice monitoring, and similar fields; by analyzing various trajectory data, similarity features can be extracted and meaningful trajectory patterns discovered. Ship trajectory data is one kind of spatio-temporal trajectory data, recording a ship's voyage and the corresponding behavior characteristics. With the wide shipborne deployment of the Automatic Identification System (AIS), ship trajectory data has become easy to acquire. AIS trajectory data, which includes attributes such as ship position, time, speed, course, and rate of turn, is a primary data source for analyzing ship aggregation characteristics, and mining the valuable information contained in this massive data is significant for studying ship traffic behavior patterns and analyzing traffic-flow characteristics.
AIS is short for the ship Automatic Identification System. Consisting of shore-based (base-station) facilities and shipborne equipment, it is a digital navigation-aid system that integrates network, modern communication, computer, and electronic information-display technologies. A ship's AIS continuously broadcasts information about the ship, which can be received by any AIS receiver. AIS receiving stations are usually deployed on land and can therefore only receive data from ships within roughly 60 km of the station; satellite AIS, which carries an AIS receiver on a satellite, can receive AIS messages from ships worldwide without regional limitation. As ship traffic density in port waters keeps rising, navigation conditions in those waters become more complex, placing higher demands on vessel traffic management capability.
Ship AIS trajectory data mainly comprises static information, dynamic information, and voyage-related information, and is mainly obtained through AIS base stations. During navigation, position, speed, and similar information is usually read directly from the global positioning system, transmitted by the shipborne AIS transmitter, and received by nearby ships or shore-based AIS receivers. Errors can arise at several points in this chain: manual input by the crew, transmission of the AIS message, and storage after collection. Raw AIS data therefore commonly suffers from time disorder, abnormal values, missing values, and unequal numbers of track points, so the data must be preprocessed before use to improve its quality. Ship AIS trajectory preprocessing generally covers the following aspects:
1) Missing-data handling. This mainly targets static fields in the trajectory data, such as ship name, beam, and ship type, which can be checked against a ship registry or a maritime authority's ship database. When dynamic data is missing, the record is usually treated as erroneous.
2) Dimensionality reduction. Trajectory data carries many attributes, not all of which are needed; unnecessary attributes can be dropped according to the research at hand, yielding a reduced representation of the data set. For example, when only the spatial information of a track is studied, only the position and ship-name attributes need be kept.
3) Numerical concept hierarchies. Attributes such as ship length, beam, and tonnage can be organized into concept hierarchies as the application requires. For example, container ships can be divided into ultra-large, medium, and small classes by ship length, mapping the numeric attribute onto hierarchical concepts.
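As a concrete sketch of the numerical concept layering above, the following snippet bins container ships by length. The numeric thresholds are hypothetical, chosen only for illustration; the text does not specify cut-off values.

```python
def classify_container_ship(length_m: float) -> str:
    """Map a numeric ship length onto a concept-hierarchy class.

    The thresholds (300 m, 200 m) are illustrative assumptions, not
    values taken from this document or any maritime standard.
    """
    if length_m >= 300:
        return "ultra-large"
    if length_m >= 200:
        return "medium"
    return "small"
```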
Clustering divides objects with similar behavior into the same group so that within-group differences are as small as possible and between-group differences as large as possible. The aim of ship AIS trajectory clustering is to apply a clustering algorithm to trajectory data, find track clusters with similar motion-evolution patterns, reveal latent relationships among ship tracks, and analyze traffic-flow characteristics or individual ship behavior. Distance-based AIS trajectory clustering essentially partitions objects by the similarity of their trajectory data, and the resulting partition optimizes an evaluation function that represents clustering quality, so how to evaluate the distance or similarity between trajectories is one of the key problems of clustering. Clustering is an unsupervised data-mining method: the original data set is divided into clusters by a similarity measure between objects, with high similarity within clusters and low similarity between clusters. Trajectory clustering first obtains the degree of similarity between tracks by analyzing and comparing their feature information, then groups highly similar tracks into one class. Cluster analysis of ship AIS trajectory data can effectively support typical-route extraction, anomalous-track discovery, track prediction, traffic-flow analysis, and similar technologies, with important application value for navigation safety and port entry/exit efficiency.
However, compared with common pedestrian and vehicle tracks, ship AIS trajectory data contains, besides spatio-temporal attributes, attribute information such as speed over ground, course over ground, heading, navigation status, and ship type; it is large in volume and high in feature dimension, making it typical spatio-temporal trajectory big data.
A conventional ship AIS trajectory clustering method mainly comprises two steps: (1) similarity measurement, which quantifies the similarity between tracks; and (2) clustering, which groups similar tracks into one class.
Similarity is usually measured as a distance between two tracks, commonly the Euclidean distance, the Hausdorff distance (HD), the dynamic time warping distance (DTW), or the Fréchet distance (FD). Clustering algorithms mainly include partition-based algorithms represented by K-means, hierarchical algorithms represented by BIRCH, grid-based algorithms represented by STING, spectral clustering algorithms, and density-based algorithms represented by DBSCAN. Distance-based inter-track similarity measurement is the most commonly used approach; among its variants, algorithms based on the Hausdorff distance, the longest common subsequence (LCSS), and the edit distance (ED) are widely used. K-means requires the number of clusters to be specified in advance and its results are often severely affected by the initial cluster centers; work that does not remedy these defects leaves the clustering of ship track information exposed to them. Vries et al. treat a ship track as a time series, compute track similarity with DTW and ED, combine this with trajectory compression, and cluster ship AIS tracks with kernel k-means. Although ED can be used to compute inter-track similarity in ship trajectory clustering and overcomes the gap problem of DTW, it remains computationally heavy and sensitive to anomalous tracks.
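The Hausdorff distance mentioned above can be sketched in a few lines; this is a naive O(|A|·|B|) implementation for 2-D track points, written for clarity rather than speed.

```python
import math

def _point_dist(p, q):
    # Euclidean distance between two 2-D track points
    return math.hypot(p[0] - q[0], p[1] - q[1])

def directed_hausdorff(A, B):
    # For each point of A, take its distance to the nearest point of B,
    # then keep the largest of those minima.
    return max(min(_point_dist(a, b) for b in B) for a in A)

def hausdorff(A, B):
    # The (symmetric) Hausdorff distance is the larger directed distance.
    return max(directed_hausdorff(A, B), directed_hausdorff(B, A))
```

Two parallel segments one unit apart, for example, have Hausdorff distance 1.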
Malwinia et al. proposed a spectral-clustering method for ship motion-pattern identification based on the one-way distance, defined as the mean of the minimum distances from each point on one ship track to the points of another. Exploiting the interference resistance of the one-way distance, the method constructs a one-way-distance similarity measure for ship AIS tracks, obtains a similarity matrix, learns the spatial distribution of the tracks with a spectral clustering algorithm, and derives the ships' normal motion patterns. However, because ship tracks are sampled at high frequency and the data volume is large, the one-way distances must be computed pair by pair, so the computational load is heavy. Distance-based clustering is simple and easy to implement for ship tracks, but owing to the shortcomings of distance-based similarity measures it still tends to lose local track-feature information.
The DBSCAN algorithm treats every object as a particle, yet ships in real traffic flow differ in size, so in small water areas the clustering result cannot faithfully reflect the true traffic conditions. Liu et al. improved DBSCAN by taking the non-spatial attributes of trajectory data (such as speed and course) into account: they added two input parameters, the maximum speed variation (MaxSpd) and the maximum course variation (MaxDir), redefined the core object by jointly considering an object's neighborhood, speed over ground, and course over ground, and adjusted MaxSpd and MaxDir according to the International Maritime Organization's definition of a fairway, thereby clustering ship tracks and extracting their main fairways. Compared with distance-based ship trajectory clustering, DBSCAN and its improved variants can discover ship-track clusters of arbitrary shape, are robust to anomalous tracks, and produce a cluster structure independent of the order of the sample tracks. They also have drawbacks: when track density is uneven and inter-cluster distances vary, clustering quality degrades sharply; the neighborhood radius and minimum point count are chosen subjectively and strongly influence the result; and as the volume of track data grows, large memory support is required and I/O consumption is also high.
Because most trajectory-clustering methods operate directly in the original data space, their accuracy and efficiency are low on large, high-dimensional ship AIS data; moreover, since similarity measurement and the clustering task are performed separately, there is no guarantee that the extracted track features suit the clustering task, which harms clustering quality. Even DBSCAN, the most widely applied algorithm in trajectory clustering, which can find clusters of arbitrary shape and is insensitive to noise, still requires manual feature selection and two preset parameters, the radius (eps) and the minimum number of contained points (minPts); when track density is uneven, its clustering effect is poor and its computational efficiency low. Each family of methods has its own character: distance-based methods are theoretically simple and easy to implement but tend to lose local track information during clustering; density-based methods can cluster ship tracks of arbitrary shape but perform poorly when cluster densities are uneven; statistics-based methods rest on mature mathematics but suffer high computational complexity. Since ship AIS trajectory data is multidimensional spatio-temporal data of large volume, its cluster analysis still faces open technical problems: how to process massive trajectory data efficiently, how to better express the multidimensional attributes of AIS data within the clustering, and how to cluster AIS trajectories while fully accounting for natural conditions such as wind, current, and visibility.
Conventional clustering algorithms can be classified into partitional algorithms (e.g., K-means), graph-based algorithms (e.g., spectral clustering), hierarchy-based algorithms (e.g., AGNES), and so on. The two most widely applied are K-means and spectral clustering. K-means assigns each data point to a category by determining cluster centers and computing the distance from the point to each center. General clustering algorithms such as K-means and GMM are fast and broadly applicable, but their distance measure is confined to the original data space, and they are often ineffective when the input dimension is high: computing Euclidean distances directly on raw high-dimensional data, such as pixel-level image data, is inefficient and time-consuming. The traditional pipeline of reducing dimensionality first and clustering afterwards can only learn a linear embedding of the original data, so many important features are lost. Spectral clustering is among the most popular clustering algorithms; it is simple to implement and often outperforms traditional algorithms such as K-means. Its main idea is to regard all data as points in space connected by weighted edges, with lower weights between distant points and higher weights between close points; the graph formed by all points and edges is then cut so that the total weight of edges between different subgraphs is as low as possible while the total weight within each subgraph is as high as possible, achieving the clustering goal.
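The assign-then-update loop of K-means described above can be sketched as follows: a minimal Lloyd iteration for 2-D points in plain Python, with the seed and iteration count chosen arbitrarily for illustration.

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Minimal K-means (Lloyd's algorithm) for 2-D points."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)      # initial centers: random data points
    for _ in range(iters):
        # Assignment step: each point goes to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda i: (p[0] - centers[i][0]) ** 2
                                + (p[1] - centers[i][1]) ** 2)
            clusters[j].append(p)
        # Update step: each center moves to the mean of its cluster.
        for i, c in enumerate(clusters):
            if c:
                centers[i] = (sum(p[0] for p in c) / len(c),
                              sum(p[1] for p in c) / len(c))
    return centers
```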
Although spectral clustering performs well on high-dimensional data sets, the memory and computation needed for the eigen-decomposition of the affinity matrix grow sharply as the data set grows. Traditional clustering algorithms can hardly achieve an ideal effect on high-dimensional data sets; spectral clustering, for instance, may demand prohibitive memory and computation on high-dimensional data and is unsuitable for clustering real large data sets. Methods exist for reducing the dimensionality of the original high-dimensional data, but the common ones handle only linear structure and cannot map the nonlinear relationships in the data. In response, clustering algorithms based on deep learning have attracted considerable research interest in recent years. Deep learning is widely used in computer vision, image processing, and related fields and has proven effective on high-dimensional data. Learning deep-network parameters normally relies on supervised labels; in unsupervised clustering, no labels are available to guide parameter updates. Most current trajectory-clustering methods remain confined to the original data space: when the trajectory data volume is large, both effect and efficiency are low, and because similarity measurement is separated from the clustering task, the extracted features are not guaranteed to suit it, degrading clustering accuracy and efficiency.
In the big-data era, data volume expands rapidly, data grows more complex, and its dimensionality increases. In a two-dimensional map space, the room available for points that were moderately far apart in the high-dimensional space is not much larger than the room available for points that were close together. In other words, even points far apart in the high-dimensional space do not get much extra space in the low-dimensional map. Points that were at medium or long range in high dimensions therefore pile up in the low-dimensional space; this is known as the crowding problem. Crowding means the clusters squeeze together and cannot be distinguished. For instance, data may be expressed well when reduced to 10 dimensions but admit no faithful mapping at two dimensions: 11 mutually equidistant points are representable in 10 dimensions, but in two dimensions at most 3 points can be mutually equidistant, so no faithful map exists. As the dimension rises, most data points concentrate near the surface of an m-dimensional sphere, and the distribution of their distances from a given point x_i is highly uneven; carrying this distance relationship directly into a low dimension produces crowding. One direct consequence is that clusters separated in the high-dimensional space are not clearly separated in the low dimension (though they may split into small patches), as seen when visualizing the MNIST data set with SNE.
t-SNE finds structure in data from the probability distribution of random walks on a neighborhood graph. It focuses on the local structure of the data and tends to extract local clusters, which makes it very effective for visualizing high-dimensional data containing multiple manifolds at once (such as the MNIST data set). Stochastic Neighbor Embedding (SNE) first converts the high-dimensional Euclidean distances between data points into conditional probabilities that represent similarity; to minimize the total difference between the conditional probabilities, SNE minimizes the KL divergence by gradient descent. Although SNE gives a good visualization, it is hard to optimize and suffers from the crowding problem. The SNE cost function concentrates on the local structure of the data during mapping and is very difficult to optimize; t-SNE adopts a heavy-tailed distribution in the low-dimensional space, which alleviates both the crowding problem and SNE's optimization problem. The algorithm computes the corresponding conditional probabilities and tries to minimize the sum of the differences between the high- and low-dimensional probabilities, which involves heavy computation and high demands on system resources: t-SNE is quadratic in time and space in the number of data points. Comparing the accuracy achieved by t-SNE with PCA and other linear dimension-reduction models shows that t-SNE gives better results, because the algorithm defines soft boundaries between the local and global structure of the data. t-SNE is currently the most effective method for data dimension reduction and visualization, but its drawbacks are equally obvious: large memory footprint and long running time. Because its cost function is non-convex, repeated runs give different results, and the best result must be selected over multiple runs.
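The heavy-tailed low-dimensional similarities that let t-SNE relieve the crowding problem can be sketched as follows: a plain-Python computation of the Student-t (one degree of freedom) joint probabilities q_ij used in the low-dimensional map, written for clarity rather than speed.

```python
def tsne_low_dim_affinities(Y):
    """Student-t joint probabilities q_ij over all ordered pairs i != j.

    Y is a list of low-dimensional points. The heavy 1/(1 + d^2) tail
    gives moderately distant pairs more room in the map than a Gaussian
    would, which is what mitigates the crowding problem.
    """
    n = len(Y)
    num = {}
    total = 0.0
    for i in range(n):
        for j in range(n):
            if i != j:
                d2 = sum((a - b) ** 2 for a, b in zip(Y[i], Y[j]))
                num[(i, j)] = 1.0 / (1.0 + d2)
                total += num[(i, j)]
    return {pair: v / total for pair, v in num.items()}
```

Nearby pairs receive more probability mass than distant ones, and the q_ij sum to 1.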
Disclosure of Invention
To meet the clustering requirements of massive high-dimensional AIS trajectory data and overcome the defects of the prior art, the invention aims to provide a clustering method for such data with high accuracy and fast execution.
The above object of the invention is achieved by the following technical solution: a method for clustering massive high-dimensional AIS trajectory data, characterized by comprising the following steps:
1) Preprocessing the AIS trajectory data: extract the ship trajectory data, take the track points sharing an MMSI number as one trajectory, and split it into several tracks according to heading information; delete abnormal points that deviate from the rest of the track's points, compute the number of points to be inserted after deletion, and fill the gaps left by deleted abnormal points, together with the missing values present in the raw AIS data, by linear interpolation and data completion; then normalize the interpolated AIS data, mapping each attribute component of the track points into the range 0-1;
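The interpolation and normalization of step 1) can be sketched for a single numeric attribute as follows. This is a minimal illustration assuming interior gaps only (deleted abnormal points are marked None), not the full multi-attribute pipeline of the method.

```python
def fill_gaps(values):
    """Linearly interpolate runs of None between two known samples.

    Assumes the first and last entries are present (interior gaps only).
    """
    out = list(values)
    i, n = 0, len(out)
    while i < n:
        if out[i] is None:
            j = i
            while j < n and out[j] is None:
                j += 1                      # find the end of the None run
            lo, hi = out[i - 1], out[j]     # known neighbours of the gap
            for k in range(i, j):
                t = (k - (i - 1)) / (j - (i - 1))
                out[k] = lo + t * (hi - lo)
            i = j
        else:
            i += 1
    return out

def min_max_normalize(values):
    """Map an attribute component into the range 0-1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]
```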
2) Pretraining the autoencoder network: pretrain an autoencoder network consisting of an encoder and a decoder by feeding it the preprocessed AIS trajectory data through repeated iterations; after many iterations the network completes the "input, dimension reduction, feature, dimension restoration, reconstruction" process, the encoder's network parameters are successfully initialized, and the dimension-reduced trajectory feature embedding points Z_i are output;
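Step 2) can be illustrated with a deliberately tiny NumPy autoencoder: a single tanh encoder layer and a linear decoder trained by full-batch gradient descent on reconstruction MSE. This is a sketch of the "input, dimension reduction, reconstruction" loop only; the latent size, learning rate, and epoch count are arbitrary assumptions, and the actual network of the method is deeper and trained on AIS track vectors.

```python
import numpy as np

def train_autoencoder(X, d_latent=2, lr=0.1, epochs=300, seed=0):
    """One-hidden-layer autoencoder: tanh encoder, linear decoder.

    Returns the latent embeddings Z and the per-epoch reconstruction losses.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W_e = rng.normal(0.0, 0.1, (d, d_latent)); b_e = np.zeros(d_latent)
    W_d = rng.normal(0.0, 0.1, (d_latent, d)); b_d = np.zeros(d)
    losses = []
    for _ in range(epochs):
        Z = np.tanh(X @ W_e + b_e)          # encode: dimension reduction
        X_hat = Z @ W_d + b_d               # decode: reconstruction
        err = X_hat - X
        losses.append(float((err ** 2).mean()))
        # Backpropagate the MSE through decoder and encoder.
        dX_hat = 2.0 * err / err.size
        dW_d = Z.T @ dX_hat;  db_d = dX_hat.sum(0)
        dZ = dX_hat @ W_d.T
        dpre = dZ * (1.0 - Z ** 2)          # tanh derivative
        dW_e = X.T @ dpre;    db_e = dpre.sum(0)
        W_e -= lr * dW_e; b_e -= lr * db_e
        W_d -= lr * dW_d; b_d -= lr * db_d
    return np.tanh(X @ W_e + b_e), losses
```

After training, the encoder half alone maps each input to its low-dimensional embedding Z_i.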
3) Initializing the cluster centers: using a Euclidean-distance-based k-means algorithm, cluster the low-dimensional trajectory feature space extracted by the encoder part of the autoencoder to obtain the initialized cluster centers μ_j;
4) Constructing the deep clustering network: attach the pretrained encoder to a clustering layer to build a deep clustering network. Following the t-SNE idea from machine learning, convert the Euclidean distance between a data point and a cluster center into a conditional probability representing the probability of assigning the point to that center, and compute the soft assignment probability of each trajectory feature embedding point Z_i to each initialized cluster center μ_j together with the initial target distribution of the clustering. Meanwhile, take the KL divergence as the loss function of the deep clustering network and construct the auxiliary target distribution used to measure the samples against the clustering distribution. Using a gradient-descent algorithm, compute the gradients of the loss function L with respect to each trajectory feature embedding point Z_i and each cluster center μ_j, shrink the distance between the two target distributions to form a probability distribution, and stop the clustering process when the change in cluster assignments between two consecutive iterations falls below a set value, obtaining the final clustering result.
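The soft assignment, auxiliary target distribution, and KL loss of step 4) can be written down compactly. This NumPy sketch follows the standard Deep Embedded Clustering formulation (Student-t kernel with α = 1), shown outside any network for clarity.

```python
import numpy as np

def soft_assign(Z, mu, alpha=1.0):
    """q_ij: Student-t soft assignment of embedding z_i to centre mu_j."""
    d2 = ((Z[:, None, :] - mu[None, :, :]) ** 2).sum(-1)   # (n, k) squared dists
    q = (1.0 + d2 / alpha) ** (-(alpha + 1.0) / 2.0)
    return q / q.sum(1, keepdims=True)                     # normalise per point

def target_distribution(q):
    """p_ij: auxiliary target, q squared and normalised by cluster frequency."""
    w = q ** 2 / q.sum(0)
    return w / w.sum(1, keepdims=True)

def kl_divergence(p, q):
    """KL(P || Q), the clustering loss minimised by gradient descent."""
    return float((p * np.log(p / q)).sum())
```

Training repeats: recompute q from the current embeddings and centres, periodically refresh p, and descend on KL(P || Q), which sharpens q toward confident assignments.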
Compared with the prior art, the invention has the following beneficial effects:
the method takes trace points with the same MMSI number in AIS data as a trace, divides the trace points into a plurality of traces according to course information, deletes abnormal points which belong to the trace and deviate from all trace points, calculates the number of the trace points which need to be inserted after deleting the abnormal points, and performs linear interpolation filling and data completion on the trace point vacancy which can appear after deleting the abnormal points and the missing value of the original AIS data; performing normalization processing on the AIS data after interpolation completion, and mapping each attribute component in the track points to a range of 0-1; the problems that the track similarity measurement and the feature extraction are difficult, the clustering precision and the computing efficiency are low and the like in the conventional AIS track clustering method are solved. Thereby improving the final clustering effect.
The invention uses an autoencoder network consisting of an encoder and a decoder to extract and reduce the dimensionality of AIS trajectory features: the preprocessed trajectories are fed into the network, and after repeated iterations the network completes the "input, dimension reduction, feature, dimension restoration, reconstruction" process, the encoder's parameters are initialized, and the dimension-reduced trajectory features are output. The encoder part of the trained autoencoder can map the original massive high-dimensional AIS trajectory data into a 10-dimensional feature space and represent it there. Compared with traditional PCA dimension reduction or manual feature engineering, the autoencoder automatically learns a good set of feature representations.
The method extracts the encoder part of the trained autoencoder and appends a deep embedded clustering layer. The dimension-reduced track feature embedding points are clustered with a Euclidean-distance k-means algorithm to obtain initialized cluster centers, and the soft assignment probability of each feature embedding point to those centers is computed as the initial target distribution. An auxiliary target distribution is then constructed, the distance between the two distributions is measured with the KL divergence, the network is trained by loop iteration, and the network parameters and clustering parameters are updated simultaneously. Based on Deep Embedded Clustering (DEC), the feature extraction capability of a deep neural network maps the original data space to a low-dimensional feature space, the feature representation of a track is learned automatically in that space, and the KL divergence serves as the cluster assignment loss function; the clustering objective is optimized iteratively, so data feature representation and cluster assignment are realized at the same time, ensuring clustering accuracy while improving computational efficiency. It also has the advantage of a low complexity of O(nK), where K is the number of cluster centers.
Aiming at the problem that traditional clustering algorithms cannot handle high-dimensional big data well, after autoencoder training is completed, the encoder is taken out, feature-based track clustering is performed, and a soft-assignment clustering layer is initialized. The similarity between each embedding point and the cluster centers is measured, the soft assignment between them is computed and normalized using the feature representation and cluster assignment of the deep neural network, and an auxiliary target distribution and a loss function are constructed to train the clustering. A gradient descent algorithm computes the gradients of the loss function L with respect to each feature embedding point z_i and each cluster center μ_j; the data space is thereby learned and mapped to a low-dimensional feature space in which the clustering objective is iteratively optimized, and the clustering process stops when the change in cluster assignment between two consecutive iterations falls below a set value, yielding the final clustering result. The method is simple to implement and clearly effective, can be applied to different track clustering scenarios, and provides a new solution for clustering massive high-dimensional trajectory big data.
The ship AIS track data clustering method based on deep embedded clustering requires no similarity measure set by experience; similarity measurement and cluster assignment are carried out simultaneously, ensuring that both the feature representation and the cluster assignment of the track data achieve good results. Compared with the prior art, the method has the following beneficial effects:
The method meets the clustering requirements of massive, high-dimensional AIS track big data. The track features are extracted by the autoencoder inside DEC, which is simple to realize with low implementation complexity, and the extracted features express most of the information in the original AIS tracks; they can therefore be applied to different algorithms, improving algorithm efficiency while preserving accuracy. For obtaining the initial cluster centers, any common clustering algorithm can be used, such as the classic K-means, DBSCAN or STING algorithms. In practical application the K-means algorithm is simple and efficient, so it is adopted here to solve for the initial cluster centers, which facilitates an efficient implementation.
Drawings
FIG. 1 is a flow chart of the present invention for implementing mass high-dimensional AIS trajectory data clustering;
FIG. 2 is a schematic diagram of AIS trajectory clustering based on DEC;
FIG. 3 is a diagram of a self-encoder network architecture;
FIG. 4 is a diagram of a deep clustering network architecture;
FIG. 5 is an AIS trace extraction graph of the present invention;
FIG. 6 is an AIS data exception point deletion map of the present invention;
FIG. 7 is an AIS data interpolation graph of the present invention;
FIG. 8 is a visualization of the AIS data after preprocessing of the present invention;
FIG. 9 is a diagram of AIS data deep embedding clustering effect of the present invention;
FIG. 10 is cluster decomposition diagram 1 of the AIS data deep embedded clustering of the present invention;
FIG. 11 is cluster decomposition diagram 2 of the AIS data deep embedded clustering of the present invention;
FIG. 12 is cluster decomposition diagram 3 of the AIS data deep embedded clustering of the present invention;
the conception, specific structure and technical effects of the present invention will be further described in conjunction with the accompanying drawings and embodiments, so that the objects, features and effects of the present invention can be fully understood.
Detailed Description
See fig. 1-5. According to the invention, the following steps are adopted:
1) Preprocessing AIS track data: extracting ship track data, taking track points with the same MMSI number as a track, dividing the track points into a plurality of tracks according to course information, deleting abnormal points which belong to the track and deviate from all track points, calculating the number of track points which need to be inserted after the abnormal points are deleted, and performing linear interpolation filling and data completion on track point vacancies which can occur after the abnormal points are deleted and missing values existing in original AIS data; performing normalization processing on the AIS data after interpolation completion, and mapping each attribute component in the track points to a range of 0-1;
2) Pre-training the self-encoder network: pre-training a self-encoder network consisting of an encoder and a decoder, inputting the preprocessed AIS track data into the network for loop iteration; after multiple iterations the network completes the "input - dimension reduction - feature - dimension raising - reconstruction" process, part of the encoder's network parameters are successfully initialized, and the dimension-reduced track feature embedding points z_i are output;
3) Initializing the clustering centers: clustering the low-dimensional track feature space set extracted by the encoder part of the autoencoder with a Euclidean-distance k-means algorithm to obtain the initialized cluster centers μ_j;
4) Constructing a deep clustering network: the pre-trained encoder is extended with a clustering layer to build a deep clustering network. Based on the t-SNE idea in machine learning, the Euclidean distance between each data point and each cluster center is converted into a conditional probability representing the probability that the data point is assigned to that center; the soft assignment probability of each track feature embedding point z_i to each initialized cluster center μ_j is computed as the initial cluster target distribution, and an auxiliary target distribution is constructed to measure which cluster each sample belongs to. To bring the two distributions close together, the KL divergence between them serves as the loss function of the deep clustering network. A gradient descent algorithm computes the gradients of the loss function L with respect to each track feature embedding point z_i and each cluster center μ_j; when the change in cluster assignment between two consecutive iterations is smaller than a set value, the clustering process stops and the final clustering result is obtained.
The specific implementation steps can be divided into four parts: 1) Preprocessing AIS track data; 2) Pre-training a self-encoder network, and extracting track characteristics; 3) Initializing a clustering center; 4) Constructing a deep clustering network for clustering;
In AIS data preprocessing, when the same ship repeatedly travels back and forth within an area, its points are divided into several tracks according to course information. With a track point p_i = (t, lon, lat, sog, head), the i-th track is expressed as

T_i = (p_i1, p_i2, p_i3, …, p_in)    (1)

where i = 1, 2, …, n; n is the number of track points contained in the track; t is the acquisition time of the track point; lon is the longitude; lat is the latitude; sog is the speed over ground; and head is the heading (bow direction).
Deleting abnormal points. Abnormal points belonging to a track, such as points with a negative speed or points deviating from all other track points, are deleted. In addition, any track containing fewer points than half the average point count of all tracks is deleted entirely and does not participate in the later track clustering.
Data interpolation. The track point gaps that may appear after abnormal points are deleted, and the missing values already present in the original AIS data, must be filled by linear interpolation. When the time interval between two adjacent track points exceeds a given threshold, the number of track points to be inserted is computed and interpolation is performed. First, the time interval between the two track points to be interpolated yields the number of points to insert:

N = ⌊t(p_b − p_a) / t_threshold⌋ − 1    (2)

After the number N of points to insert is obtained, the longitude, latitude, speed over ground and heading of the track are interpolated, and the missing ship track data p_i within the time period are computed as

p_i = p_a + (p_b − p_a) · i / (N + 1),  i = 1, 2, …, N    (3)

where t(p_b − p_a) is the time interval between track points p_b and p_a, and t_threshold is a predefined time threshold.
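The gap-filling step above can be sketched in NumPy as follows. This is an illustrative sketch, not the patent's implementation: the function name, the 60-second threshold and the layout of a track point as a plain 5-vector (t, lon, lat, sog, head) are assumptions.

```python
import numpy as np

T_THRESHOLD = 60.0  # assumed sampling threshold in seconds (hypothetical value)

def interpolate_gap(p_a, p_b, t_threshold=T_THRESHOLD):
    """Insert N evenly spaced points between track points p_a and p_b.

    Each point is (t, lon, lat, sog, head). N follows eq. (2):
    N = floor(dt / t_threshold) - 1, and each inserted point is a
    linear blend of p_a and p_b per eq. (3).
    """
    p_a, p_b = np.asarray(p_a, float), np.asarray(p_b, float)
    dt = p_b[0] - p_a[0]
    n = int(dt // t_threshold) - 1            # number of points to insert
    if n <= 0:
        return np.empty((0, 5))
    fracs = np.arange(1, n + 1) / (n + 1)     # interpolation fractions i/(N+1)
    return p_a + fracs[:, None] * (p_b - p_a)

# a 180 s gap with a 60 s threshold -> 2 inserted points at t = 60 and t = 120
gap = interpolate_gap((0, 10.0, 20.0, 5.0, 90.0), (180, 10.6, 20.6, 8.0, 90.0))
```

Because every attribute is blended with the same fraction, the speed and heading columns are interpolated with the same linear rule as the position columns.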
Data completion. Since the sampling rate of AIS data varies with ship speed, the track lengths are not all the same. To meet the input requirements of the neural network in the subsequent DEC, ship tracks of different lengths must be converted to a fixed length, and the longest track in the track data set is taken as the standard length. Considering that ships running the same route share the same start and end positions, completion is applied at both ends: only the time attribute of each padded track point is changed, while the other attributes remain unchanged. After track completion, all tracks have the same standard length.
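A minimal sketch of this two-end completion, assuming the padded points replicate the first and last real points and only their time attribute (column 0) is shifted by an assumed constant step `dt`:

```python
import numpy as np

def pad_track(track, std_len, dt=1.0):
    """Pad a track of shape (n, 5) to std_len rows by replicating end points.

    Only the time attribute (column 0) of the padded points changes, as in
    the two-end completion scheme; dt is an assumed time step.
    """
    track = np.asarray(track, float)
    deficit = std_len - len(track)
    head = deficit // 2
    tail = deficit - head
    front = np.repeat(track[:1], head, axis=0)
    front[:, 0] = track[0, 0] - dt * np.arange(head, 0, -1)   # times before start
    back = np.repeat(track[-1:], tail, axis=0)
    back[:, 0] = track[-1, 0] + dt * np.arange(1, tail + 1)   # times after end
    return np.vstack([front, track, back])

t = np.array([[10.0, 1, 2, 3, 4], [11.0, 1, 2, 3, 4]])
padded = pad_track(t, 5)   # pads a 2-point track to the standard length 5
```

After padding, every track in the set has `std_len` rows and can be flattened into a fixed-size input vector for the autoencoder.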
Data normalization. To accelerate network training and improve computational efficiency, each attribute component of the track points is mapped into the range 0-1. The interpolated and completed AIS data are normalized to obtain the normalized longitude lon, latitude lat, speed over ground sog and heading head, with the normalized attribute value x' given by

x' = (x − x_min) / (x_max − x_min)    (4)

where x is the attribute value before normalization (longitude lon, latitude lat, speed over ground sog or heading head), x_max is the maximum value of that attribute, x_min is its minimum value, and x' is the normalized value. At this point the attribute values of all track points are mapped into the range 0-1.
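The min-max normalization of eq. (4), applied per attribute column, can be sketched as (the column layout below is an illustrative assumption):

```python
import numpy as np

def minmax_normalize(X):
    """Map each attribute column of X into [0, 1] via eq. (4):
    x' = (x - x_min) / (x_max - x_min), computed column-wise."""
    X = np.asarray(X, float)
    x_min = X.min(axis=0)
    x_max = X.max(axis=0)
    return (X - x_min) / (x_max - x_min)

# toy points: columns are (t, lon, lat); real data would also carry sog, head
pts = np.array([[0.0, -124.0, 48.1],
                [50.0, -118.0, 48.3],
                [100.0, -112.0, 48.5]])
norm = minmax_normalize(pts)
```

Each column's minimum maps to 0 and its maximum to 1, so no attribute dominates the network input purely because of its physical scale.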
Pre-training the self-encoder network and extracting track features. The preprocessed tracks are fed into the autoencoder network for training: after the original AIS track is preprocessed, the track points are trained through multiple loop iterations of the network, completing the "input - dimension reduction - feature - dimension raising - reconstruction" process and forming the track feature data

Trj_i = (p_i1, p_i2, …, p_im)    (5)

After training over multiple iterations is complete, i.e. the input and the output are arbitrarily close, the autoencoder network has completed the "input - dimension reduction - feature - dimension raising - reconstruction" process and part of the encoder's network parameters are successfully initialized. At this point the output of the autoencoder is the dimension-reduced feature of the track Trj_i. The encoder can then be regarded as a neural network mapping a high-dimensional data space to a low-dimensional data space, represented by

f(Trj_i, θ) = z_i    (6)

where p_i is the i-th track point; i = 1, 2, …, m; m is the number of track points contained in the track; f is a nonlinear mapping function; t is the acquisition time of the track point; θ is a nonlinear mapping parameter learnable in the neural network; and z_i is the embedding point of track Trj_i in the low-dimensional feature space after mapping by the encoder network, i.e. the track feature to be clustered subsequently.
See fig. 3. The self-encoder network structure is shown in fig. 3. The autoencoder network is pre-trained to initialize the network parameters and extract the track features. First, an autoencoder must be trained, comprising: an encoder, which encodes the input track data and maps high-dimensional track data features to low-dimensional ones, and a decoder, which, conversely, recovers the original input data from the low-dimensional features of the autoencoder network. In this embodiment the autoencoder is a 9-layer network: layer 1 is the input feature dimension of a ship track, 682 dimensions; layers 2 and 3 are 500-dimensional; layer 4 is 200-dimensional; layer 5 is 10-dimensional; layer 6 is 200-dimensional; layers 7 and 8 are 500-dimensional; and layer 9 restores the data feature dimension of 682. The intermediate layers use the ReLU function as the activation function, and the network outputs a 10-dimensional feature. To measure the difference between the input vector and the output vector, the neural network is trained with the mean square error (MSE) as loss function, and all neural networks used in the experiment are fully connected.
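The layer widths above can be sanity-checked with a plain NumPy forward pass. This is only a shape sketch of an untrained network under the stated 682-500-500-200-10-200-500-500-682 reading of the architecture (the exact widths of the hidden layers are our assumption); MSE training by backpropagation is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
DIMS = [682, 500, 500, 200, 10, 200, 500, 500, 682]  # assumed layer widths
weights = [rng.normal(0, 0.01, (a, b)) for a, b in zip(DIMS[:-1], DIMS[1:])]
biases = [np.zeros(b) for b in DIMS[1:]]

def forward(x):
    """One track vector through the untrained autoencoder.
    ReLU on hidden layers, linear output; returns (embedding z, reconstruction)."""
    z = None
    for k, (W, b) in enumerate(zip(weights, biases)):
        x = x @ W + b
        if k < len(weights) - 1:
            x = np.maximum(x, 0.0)   # ReLU activation on intermediate layers
        if x.shape[-1] == 10:
            z = x                    # 10-d bottleneck = the track feature
    return z, x

z, recon = forward(rng.normal(size=682))
```

The 10-dimensional bottleneck output `z` is what the subsequent clustering stage consumes; the 682-dimensional reconstruction only matters during pre-training.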
Initializing the clustering centers: to obtain the initialized cluster centers, initial centers are found with a Euclidean-distance k-means algorithm. The track feature set Z is clustered by Euclidean-distance K-means into K clusters, the center of each cluster being μ_j, 1 ≤ j ≤ K. The track data set (Trj_1, Trj_2, …) yields, through autoencoder feature extraction, the low-dimensional feature space set Z = (z_1, z_2, …), and the initial center set is μ = (μ_1, μ_2, …, μ_K).
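A minimal Lloyd's k-means sketch for this initialization step (the toy two-blob data standing in for the 10-dimensional track features, and the deterministic seed points, are illustrative assumptions; in practice a library implementation with random restarts would be used):

```python
import numpy as np

def kmeans(Z, init_mu, n_iter=50):
    """Plain Lloyd's k-means with Euclidean distance; init_mu (K, d) are the
    starting centers, and the converged centers initialize the clustering layer."""
    mu = np.asarray(init_mu, float).copy()
    for _ in range(n_iter):
        d2 = ((Z[:, None, :] - mu[None, :, :]) ** 2).sum(-1)  # squared distances
        labels = d2.argmin(axis=1)                             # nearest center
        for j in range(len(mu)):
            if np.any(labels == j):
                mu[j] = Z[labels == j].mean(axis=0)            # recompute center
    return mu

rng = np.random.default_rng(1)
# two well-separated blobs standing in for 10-d track feature embeddings
Z = np.vstack([rng.normal(0.0, 0.1, (20, 10)),
               rng.normal(5.0, 0.1, (20, 10))])
mu = kmeans(Z, init_mu=Z[[0, 20]])   # K = 2, one seed point per blob
```

Only the converged centers `mu` are kept; the hard k-means labels themselves are discarded, since the DEC stage replaces them with soft assignments.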
See fig. 4. The deep clustering network structure is shown in fig. 4. The encoder part of the pre-trained autoencoder network is taken out and a clustering layer is appended, forming the deep clustering network. To measure the similarity between tracks, two distributions are constructed based on the t-SNE idea in machine learning, and iterative clustering is realized by shortening the distance between them. First, the Euclidean distance between each data point and each cluster center is converted into a conditional probability representing the probability of assigning the data point to that center; the probability q_ij of assigning track feature embedding point z_i to initialized cluster center μ_j, also called the soft assignment probability, serves as the initial target distribution of the clustering:

q_ij = (1 + ||z_i − μ_j||² / α)^(−(α+1)/2) / Σ_j' (1 + ||z_i − μ_j'||² / α)^(−(α+1)/2)    (7)

where z_i is the embedding point of track Trj_i in the low-dimensional feature space after mapping by the encoder network, μ_j is the center of the j-th cluster, and α is the degree of freedom of the t-distribution, typically set to 1.
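The Student's t soft assignment of eq. (7) is a few lines of NumPy; the toy embeddings and centers below are illustrative:

```python
import numpy as np

def soft_assign(Z, mu, alpha=1.0):
    """Student's t soft assignment of embeddings Z (n, d) to centers mu (K, d),
    eq. (7): q_ij proportional to (1 + ||z_i - mu_j||^2 / alpha)^(-(alpha+1)/2),
    normalized over the centers."""
    d2 = ((Z[:, None, :] - mu[None, :, :]) ** 2).sum(-1)      # squared distances
    q = (1.0 + d2 / alpha) ** (-(alpha + 1.0) / 2.0)
    return q / q.sum(axis=1, keepdims=True)                   # rows sum to 1

Z = np.array([[0.0, 0.0], [4.0, 0.0]])    # two toy embedding points
mu = np.array([[0.0, 0.0], [4.0, 0.0]])   # two toy cluster centers
q = soft_assign(Z, mu)
```

Each row of `q` is a probability distribution over the K clusters; a point sitting on a center still retains a small probability mass for the other clusters, which is what makes the assignment "soft".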
Then the auxiliary target distribution p_ij is constructed. Following stochastic neighbor embedding (SNE), the high-dimensional Euclidean distances between data points are converted into conditional probabilities representing similarity, giving the probability p_ij of assigning track i to cluster center μ_j:

p_ij = (q_ij² / f_j) / Σ_j' (q_ij'² / f_j')    (8)

where the soft cluster frequency is

f_j = Σ_i q_ij    (9)
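Eqs. (8)-(9) can be sketched directly from a soft assignment matrix; the toy `q` below is illustrative:

```python
import numpy as np

def target_distribution(q):
    """Auxiliary target distribution of DEC, eqs. (8)-(9):
    p_ij = (q_ij^2 / f_j) / sum_j'(q_ij'^2 / f_j'), with soft cluster
    frequency f_j = sum_i q_ij. Squaring emphasizes confident assignments,
    and dividing by f_j keeps large clusters from dominating."""
    f = q.sum(axis=0)                        # soft cluster frequencies
    w = q ** 2 / f
    return w / w.sum(axis=1, keepdims=True)  # renormalize rows

q = np.array([[0.9, 0.1],
              [0.6, 0.4]])
p = target_distribution(q)
```

Because `p` is recomputed from `q` itself, the network has a self-training target: confident rows of `q` are pushed to be even more confident.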
A loss function is constructed to train the clustering. To bring the soft assignment probabilities q_ij of the clustering layer close to the auxiliary target distribution p_ij, the difference between the two distributions is measured with the relative entropy (KL divergence), which, as the loss function of the deep clustering network, can be expressed as

L = KL(P ‖ Q) = Σ_i Σ_j p_ij log(p_ij / q_ij)    (10)
using a gradient descent algorithm, the loss function L is determined separately for each feature insertion point z i And cluster center μ j As shown in the following formula:
Figure BDA0003239561180000124
in the training process, in order to learn the nonlinear mapping parameter theta (step S2 is only pre-training) and the clustering center mu at the same time j (K-mean yields only the initial cluster center).
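The gradients of eq. (11) can be sketched in NumPy as below. This is a sketch of one DEC update, with hypothetical toy embeddings and centers; in practice ∂L/∂z_i is further backpropagated through the encoder to update θ, which is omitted here.

```python
import numpy as np

def dec_gradients(Z, mu, p, q, alpha=1.0):
    """Gradients of L = KL(P||Q) w.r.t. embeddings z_i and centers mu_j, eq. (11)."""
    diff = Z[:, None, :] - mu[None, :, :]               # (n, K, d) differences
    w = 1.0 / (1.0 + (diff ** 2).sum(-1) / alpha)       # (n, K) t-kernel weights
    coef = (alpha + 1.0) / alpha * w * (p - q)          # (n, K)
    grad_z = (coef[:, :, None] * diff).sum(axis=1)      # dL/dz_i, shape (n, d)
    grad_mu = -(coef[:, :, None] * diff).sum(axis=0)    # dL/dmu_j, shape (K, d)
    return grad_z, grad_mu

def soft_q(Z, mu, alpha=1.0):
    """Soft assignment of eq. (7)."""
    d2 = ((Z[:, None, :] - mu[None, :, :]) ** 2).sum(-1)
    u = (1.0 + d2 / alpha) ** (-(alpha + 1.0) / 2.0)
    return u / u.sum(axis=1, keepdims=True)

Z = np.array([[0.0, 0.0], [4.0, 0.0], [4.2, 0.3]])   # toy embeddings
mu = np.array([[0.1, 0.0], [3.9, 0.1]])              # toy centers
q = soft_q(Z, mu)
f = q.sum(axis=0)
p = q ** 2 / f
p = p / p.sum(axis=1, keepdims=True)                 # fixed target distribution
grad_z, grad_mu = dec_gradients(Z, mu, p, q)
```

A gradient step `mu -= lr * grad_mu` moves the centers, and `grad_z` is the error signal fed back into the encoder; iterating these steps while periodically refreshing `p` is the DEC training loop.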
See fig. 6. The following takes real ship AIS data of a certain sea area as an example. The experimental data source consists of AIS data whose raw records contain attributes such as the maritime mobile service identity MMSI, the data reception time BaseDateTime, the latitude LAT, the longitude LON, the speed over ground SOG, the course over ground COG, the Heading, the navigation state Status and the ship type VesselType; the original AIS data .csv files are stored in a database. Since a ship's route diverges after it leaves port and heads out to sea, the AIS track data extracted for the harbor in this embodiment span the region (minimum longitude −123.93299, maximum longitude −112.64193; minimum latitude 48.10732, maximum latitude 48.50108), 104930 track points in total, visualized as shown in FIG. 6.
Deleting abnormal points. In this embodiment, ship tracks with fewer than 100 AIS track points are removed, since they form no obvious route. At the same time, based on the maximum distance between any two points within a track, tracks with obvious track point jumps are deleted; for example, in the last track shown in fig. 8 the time attribute suddenly jumps in the middle of the track, an obvious error that must be deleted.
See fig. 9. In this embodiment the number of pre-training epochs is set to 100, the data processing batch size to 8, the iteration stop condition to 2×10⁻³, and the maximum number of iterations to 2×10⁴; the number of clusters is initialized to 18 and the degree of freedom of the t-distribution is set to α = 1. Deep clustering is carried out on the preprocessed AIS track data, and the clustering results are shown in FIG. 9.
The foregoing is directed to the preferred embodiment of the present invention and it is noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design alternative embodiments without departing from the scope of the appended claims. It will be apparent to those skilled in the art that various modifications and improvements can be made without departing from the spirit and substance of the invention, and these modifications and improvements are also considered to be within the scope of the invention.

Claims (10)

1. A massive high-dimensional AIS track data clustering method is characterized by comprising the following steps:
1) Preprocessing AIS track data: extracting ship track data, taking track points with the same MMSI number as a track, dividing the track into a plurality of tracks according to course information, deleting abnormal points which belong to the track and deviate from all track points, calculating the number of track points which need to be inserted after the abnormal points are deleted, and performing linear interpolation filling and data completion on track point vacancies which can occur after the abnormal points are deleted and missing values existing in original AIS data; normalizing the AIS data of the automatic ship identification system after interpolation completion, and mapping each attribute component in the track points to a range of 0-1;
2) Pre-training the self-encoder network: pre-training a self-encoder network consisting of an encoder and a decoder, inputting the preprocessed AIS track data into the network for loop iteration; after multiple iterations the network completes the "input - dimension reduction - feature - dimension raising - reconstruction" process and, once part of the encoder's network parameters have been successfully initialized, outputs the dimension-reduced track feature embedding points z_i;
3) Initializing the clustering centers: clustering the low-dimensional track feature space set extracted by the encoder part of the autoencoder with a Euclidean-distance k-means algorithm to obtain the initialized cluster centers μ_j;
4) Constructing a deep clustering network: adding a clustering layer to the pre-trained encoder to construct a deep clustering network, constructing two distributions based on the t-SNE idea in machine learning, and realizing iterative clustering by shortening the distance between the two distributions;
firstly, the Euclidean distance between each data point and each cluster center is converted into a conditional probability representing the probability that the data point is assigned to that center, and the soft assignment probability of each track feature embedding point z_i to each initialized cluster center μ_j is computed as the initial cluster target distribution; an auxiliary target distribution is constructed to measure the cluster assignment of the samples, the KL divergence serves as the loss function of the deep clustering network, and a gradient descent algorithm computes the gradients of the loss function L with respect to each track feature embedding point z_i and each cluster center μ_j; when the change in cluster assignment between two consecutive iterations is smaller than a set value, the clustering process stops and the final clustering result is obtained.
2. The massive high-dimensional AIS trajectory data clustering method according to claim 1, characterized in that: in the AIS data preprocessing, when a ship repeatedly travels back and forth within an area, its points are divided into several tracks according to course information; with a track point p_i = (t, lon, lat, sog, head), the i-th track is expressed as

T_i = (p_i1, p_i2, p_i3, …, p_in)    (1)

where i = 1, 2, …, n; n is the number of track points contained in the track; t is the acquisition time of the track point; lon is the longitude; lat is the latitude; sog is the speed over ground; and head is the heading.
3. The massive high-dimensional AIS trajectory data clustering method according to claim 2, characterized in that: linear interpolation is performed on the track point gaps appearing after abnormal points are deleted and on the missing values present in the original AIS data, filling the missing values; when the time interval between two adjacent track points exceeds a given threshold, the number of track points to be inserted is computed and interpolation is performed; first, the time interval between the two track points b and a to be interpolated yields the number of points to insert:

N = ⌊t(p_b − p_a) / t_threshold⌋ − 1    (2)

after the number N of points to insert is obtained, the longitude, latitude, speed over ground and heading of the track are interpolated, and the missing ship track data p_i within the time period are computed as

p_i = p_a + (p_b − p_a) · i / (N + 1),  i = 1, 2, …, N    (3)

where t(p_b − p_a) is the time interval between track points p_b and p_a, and t_threshold is a predefined time threshold.
4. The massive high-dimensional AIS trajectory data clustering method according to claim 1, characterized in that: to accelerate network training and improve computational efficiency, given N high-dimensional data x_1, x_2, …, x_N, each attribute component of the track points is mapped into the range 0-1; the interpolated and completed AIS data are normalized, yielding the normalized longitude lon, latitude lat, speed over ground sog and heading head, with the normalized attribute value x' given by

x' = (x − x_min) / (x_max − x_min)    (4)

at this point the attribute values of all track points are mapped into the range 0-1;

where N is the number of data samples, x is the attribute value before normalization, x_max is the maximum attribute value, and x_min is the minimum attribute value.
5. The massive high-dimensional AIS trajectory data clustering method of claim 1, characterized in that: in pre-training the self-encoder network, track features are extracted by inputting the preprocessed track Trj_i into the autoencoder network for training; the original AIS track T_i, after preprocessing, has its track points p_i = (t, lon, lat, sog, head) trained through multiple loop iterations of the network, completing the "input - dimension reduction - feature - dimension raising - reconstruction" process and forming the track feature data

Trj_i = (p_i1, p_i2, …, p_im)    (5)

the autoencoder outputs the dimension-reduced feature of the track Trj_i, the encoder being a neural network mapping the high-dimensional data space to the low-dimensional data space:

f(Trj_i, θ) = z_i    (6)

where p_i is the i-th track point; i = 1, 2, …, m; m is the number of track points contained in the track; f is a nonlinear mapping function; t is the acquisition time of the track point; θ is a learnable nonlinear mapping parameter in the neural network; and z_i is the embedding point of track Trj_i in the low-dimensional feature space after mapping by the encoder network.
6. The massive high-dimensional AIS trajectory data clustering method of claim 1, characterized in that: the self-encoder comprises an encoder, which encodes the input track data and maps high-dimensional track data features to low-dimensional ones, and a decoder, which, conversely, recovers the original input data from the low-dimensional features of the autoencoder network; the autoencoder is a 9-layer network: layer 1 is the input feature dimension of a ship track, 682 dimensions; layers 2 and 3 are 500-dimensional; layer 4 is 200-dimensional; layer 5 is 10-dimensional; layer 6 is 200-dimensional; layers 7 and 8 are 500-dimensional; layer 9 restores the data feature dimension of 682; the intermediate layers use the ReLU function as the activation function, and the network outputs a 10-dimensional feature.
7. The massive high-dimensional AIS trajectory data clustering method of claim 1, characterized in that: to obtain the initialized cluster centers, Euclidean-distance K-means clustering is performed on the track feature set Z to obtain initial cluster centers; the number of clusters is K and the center of each cluster is μ_j, 1 ≤ j ≤ K; the track data set (Trj_1, Trj_2, …) yields, through autoencoder feature extraction, the low-dimensional feature space set Z = (z_1, z_2, …), and the initial center set is μ = (μ_1, μ_2, …, μ_K).
8. The massive high-dimensional AIS trajectory data clustering method of claim 1, characterized in that: using the dimension-reduction and reconstruction capability of the autoencoder network, dimension reduction and feature extraction are performed on the original high-dimensional data, the encoder part is then taken out, and a clustering layer is added to construct the deep embedded clustering network; based on the t-SNE idea in machine learning, the Euclidean distance between each data point and each cluster center is converted into a conditional probability representing the probability of assigning the data point to that center, and the soft assignment probability (initial cluster target distribution) of track feature embedding point z_i to initialized cluster center μ_j is computed as

q_ij = (1 + ||z_i − μ_j||² / α)^(−(α+1)/2) / Σ_j' (1 + ||z_i − μ_j'||² / α)^(−(α+1)/2)    (7)

where z_i is the embedding point of track Trj_i in the low-dimensional feature space after mapping by the encoder network, μ_j is the center of the j-th cluster, and α is the degree of freedom of the t-distribution, typically set to 1.
9. The massive high-dimensional AIS trajectory data clustering method according to claim 8, characterized in that: an auxiliary target distribution, represented by probability values p_ij, is constructed to measure the confidence with which each sample belongs to a cluster. From the soft assignment probabilities q_ij, the probability p_ij of assigning trajectory i to the cluster center μ_j is obtained as

$$p_{ij} = \frac{q_{ij}^2 / f_j}{\sum_{j'} q_{ij'}^2 / f_{j'}}$$

where the soft cluster frequency is

$$f_j = \sum_i q_{ij}.$$
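The auxiliary target of claim 9 sharpens the soft assignments by squaring them and renormalizing by the soft cluster frequency f_j. A short NumPy sketch with a hypothetical 3-trajectory, 2-cluster example:

```python
import numpy as np

def target_distribution(q):
    """Auxiliary target p_ij: square q_ij, divide by cluster frequency f_j, renormalize."""
    f = q.sum(axis=0)                        # soft cluster frequencies f_j = sum_i q_ij
    w = q ** 2 / f                           # q_ij^2 / f_j
    return w / w.sum(axis=1, keepdims=True)  # normalize over clusters j'

# toy soft assignments for 3 trajectories over 2 clusters (illustrative values)
q = np.array([[0.9, 0.1],
              [0.6, 0.4],
              [0.2, 0.8]])
p = target_distribution(q)
print(np.allclose(p.sum(axis=1), 1.0))  # True: rows stay probability distributions
```

Note how p pushes each row toward its dominant cluster (p[0, 0] > q[0, 0]); this is what makes P a useful self-training target for Q.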
10. The massive high-dimensional AIS trajectory data clustering method of claim 1, characterized in that: a loss function is constructed to train the clustering. To drive the soft assignment probabilities q_ij of the clustering layer close to the auxiliary target distribution p_ij, the difference between the two distributions is measured with the relative entropy (KL divergence), which serves as the loss function of the deep clustering network:

$$L = \mathrm{KL}(P \parallel Q) = \sum_i \sum_j p_{ij} \log \frac{p_{ij}}{q_{ij}}$$

During training, a gradient descent algorithm computes the gradients of the loss L with respect to each feature embedding point z_i and each cluster center μ_j, as shown in the following formulas:

$$\frac{\partial L}{\partial z_i} = \frac{\alpha+1}{\alpha} \sum_j \left(1 + \frac{\lVert z_i - \mu_j \rVert^2}{\alpha}\right)^{-1} (p_{ij} - q_{ij})(z_i - \mu_j)$$

$$\frac{\partial L}{\partial \mu_j} = -\frac{\alpha+1}{\alpha} \sum_i \left(1 + \frac{\lVert z_i - \mu_j \rVert^2}{\alpha}\right)^{-1} (p_{ij} - q_{ij})(z_i - \mu_j)$$

The nonlinear mapping parameters θ and the cluster centers μ_j are learned simultaneously.
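The loss and gradients of claim 10 can be checked numerically. The sketch below (function names and toy data are illustrative assumptions, not the patent's code) evaluates the KL loss and both gradient formulas; when P = Q the factor (p_ij − q_ij) vanishes, so both gradients must be zero, which makes a convenient sanity check.

```python
import numpy as np

def kl_loss(p, q):
    """KL(P || Q) between the auxiliary target and the soft assignment."""
    return float(np.sum(p * np.log(p / q)))

def kl_grads(Z, mu, p, q, alpha=1.0):
    """Gradients of the KL loss w.r.t. each embedding z_i and each center mu_j."""
    diff = Z[:, None, :] - mu[None, :, :]          # (N, K, D) pairwise z_i - mu_j
    d2 = (diff ** 2).sum(axis=2)
    w = (alpha + 1.0) / alpha * (p - q) / (1.0 + d2 / alpha)
    dz = (w[:, :, None] * diff).sum(axis=1)        # dL/dz_i, shape (N, D)
    dmu = -(w[:, :, None] * diff).sum(axis=0)      # dL/dmu_j, shape (K, D)
    return dz, dmu

rng = np.random.default_rng(0)
Z = rng.normal(size=(50, 5))                       # 50 embeddings, 5-D
mu = rng.normal(size=(3, 5))                       # K = 3 centers
d2 = ((Z[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
q = (1.0 + d2) ** -1.0
q /= q.sum(axis=1, keepdims=True)                  # soft assignment, alpha = 1
w = q ** 2 / q.sum(axis=0)
p = w / w.sum(axis=1, keepdims=True)               # auxiliary target distribution
dz, dmu = kl_grads(Z, mu, p, q)
print(dz.shape, dmu.shape)                         # (50, 5) (3, 5)
```

In a full implementation these gradients would be backpropagated through the encoder so that θ and μ_j are updated jointly, as the claim states.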
CN202111012775.7A 2021-08-31 2021-08-31 Mass high-dimensional AIS trajectory data clustering method Active CN113780395B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202111012775.7A CN113780395B (en) 2021-08-31 2021-08-31 Mass high-dimensional AIS trajectory data clustering method
PCT/CN2022/083839 WO2023029461A1 (en) 2021-08-31 2022-03-29 Massive high-dimensional ais trajectory data clustering method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111012775.7A CN113780395B (en) 2021-08-31 2021-08-31 Mass high-dimensional AIS trajectory data clustering method

Publications (2)

Publication Number Publication Date
CN113780395A CN113780395A (en) 2021-12-10
CN113780395B true CN113780395B (en) 2023-02-03

Family

ID=78840433

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111012775.7A Active CN113780395B (en) 2021-08-31 2021-08-31 Mass high-dimensional AIS trajectory data clustering method

Country Status (2)

Country Link
CN (1) CN113780395B (en)
WO (1) WO2023029461A1 (en)

Families Citing this family (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113780395B (en) * 2021-08-31 2023-02-03 西南电子技术研究所(中国电子科技集团公司第十研究所) Mass high-dimensional AIS trajectory data clustering method
CN114613037B (en) * 2022-02-15 2023-07-18 中国电子科技集团公司第十研究所 Prompt searching method and device for airborne fusion information guide sensor
CN114637931B (en) * 2022-03-29 2024-04-02 北京工业大学 Travel mode detection method based on manifold sequence subspace clustering
CN115730742B (en) * 2022-12-01 2024-01-16 中远海运科技股份有限公司 On-flight container airliner route prediction method and system
CN116160444B (en) * 2022-12-31 2024-01-30 中国科学院长春光学精密机械与物理研究所 Mechanical arm kinematics inverse solution optimization method and device based on clustering algorithm
CN115952364B (en) * 2023-03-07 2023-05-23 之江实验室 Route recommendation method and device, storage medium and electronic equipment
CN116342657B (en) * 2023-03-29 2024-04-26 西安电子科技大学 TCN-GRU ship track prediction method, system, equipment and medium based on coding-decoding structure
CN116522143B (en) * 2023-05-08 2024-04-05 深圳市大数据研究院 Model training method, clustering method, equipment and medium
CN116342915B (en) * 2023-05-30 2024-06-25 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) Depth image clustering method, system and storage medium
CN117523382B (en) * 2023-07-19 2024-06-04 石河子大学 Abnormal track detection method based on improved GRU neural network
CN117349688B (en) * 2023-12-01 2024-03-19 中南大学 Track clustering method, device, equipment and medium based on peak track
CN117491987B (en) * 2023-12-29 2024-04-09 海华电子企业(中国)有限公司 Ship track splicing method based on LSTM neural network and space-time motion distance algorithm
CN118017534B (en) * 2024-04-09 2024-06-04 国网山西省电力公司晋城供电公司 New energy plant station collaborative optimization voltage control method and equipment based on hierarchical clustering

Citations (2)

Publication number Priority date Publication date Assignee Title
CN111178427A (en) * 2019-12-27 2020-05-19 杭州电子科技大学 Depth self-coding embedded clustering method based on Sliced-Wasserstein distance
CN111694913A (en) * 2020-06-05 2020-09-22 海南大学 Ship AIS (automatic identification System) track clustering method and device based on convolution self-encoder

Family Cites Families (4)

Publication number Priority date Publication date Assignee Title
US11194331B2 (en) * 2018-10-30 2021-12-07 The Regents Of The University Of Michigan Unsupervised classification of encountering scenarios using connected vehicle datasets
CN112884010A (en) * 2021-01-25 2021-06-01 浙江师范大学 Multi-mode self-adaptive fusion depth clustering model and method based on self-encoder
CN113780395B (en) * 2021-08-31 2023-02-03 西南电子技术研究所(中国电子科技集团公司第十研究所) Mass high-dimensional AIS trajectory data clustering method
CN113988203A (en) * 2021-11-01 2022-01-28 之江实验室 Track sequence clustering method based on deep learning

Patent Citations (2)

Publication number Priority date Publication date Assignee Title
CN111178427A (en) * 2019-12-27 2020-05-19 杭州电子科技大学 Depth self-coding embedded clustering method based on Sliced-Wasserstein distance
CN111694913A (en) * 2020-06-05 2020-09-22 海南大学 Ship AIS (automatic identification System) track clustering method and device based on convolution self-encoder

Non-Patent Citations (1)

Title
A vehicle trajectory point clustering method based on kernel distance; Lu Chuanwei et al.; Geomatics and Information Science of Wuhan University; 2020-07-05 (Issue 07); full text *

Also Published As

Publication number Publication date
CN113780395A (en) 2021-12-10
WO2023029461A1 (en) 2023-03-09

Similar Documents

Publication Publication Date Title
CN113780395B (en) Mass high-dimensional AIS trajectory data clustering method
Gao et al. Ship-handling behavior pattern recognition using AIS sub-trajectory clustering analysis based on the T-SNE and spectral clustering algorithms
CN110033051B (en) Fishing trawler behavior discrimination method based on multi-step clustering
CN112883839B (en) Remote sensing image interpretation method based on adaptive sample set construction and deep learning
CN107609601A (en) A kind of ship seakeeping method based on multilayer convolutional neural networks
CN112906858A (en) Real-time prediction method for ship motion trail
CN116342657B (en) TCN-GRU ship track prediction method, system, equipment and medium based on coding-decoding structure
CN112465199B (en) Airspace situation assessment system
CN101710422B (en) Image segmentation method based on overall manifold prototype clustering algorithm and watershed algorithm
CN111694913A (en) Ship AIS (automatic identification System) track clustering method and device based on convolution self-encoder
CN115619963B (en) Urban building entity modeling method based on content perception
CN116563680B (en) Remote sensing image feature fusion method based on Gaussian mixture model and electronic equipment
CN114926469A (en) Semantic segmentation model training method, semantic segmentation method, storage medium and terminal
CN114373099A (en) Three-dimensional point cloud classification method based on sparse graph convolution
CN114386466B (en) Parallel hybrid clustering method for candidate signal mining in pulsar search
Xu et al. Improved Vessel Trajectory Prediction Model Based on Stacked‐BiGRUs
CN111242028A (en) Remote sensing image ground object segmentation method based on U-Net
CN117636183A (en) Small sample remote sensing image classification method based on self-supervision pre-training
CN116933947A (en) Landslide susceptibility prediction method based on soft voting integrated classifier
Aparna et al. Spatio-temporal data clustering using deep learning: A review
Li et al. Vessel trajectory similarity measure based on deep convolutional autoencoder
CN116664826A (en) Small sample point cloud semantic segmentation method
CN109558819A (en) A kind of depth network light weight method for Remote Sensing Target detection
Li et al. The parallel and precision adaptive method of marine lane extraction based on QuadTree
Wang et al. Self-Attentive Local Aggregation Learning With Prototype Guided Regularization for Point Cloud Semantic Segmentation of High-Speed Railways

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant