CN106452452A

CN106452452A - Full-pulse data lossless compression method based on K-means clustering

Info

Publication number: CN106452452A
Application number: CN201610809393.XA
Authority: CN
Inventors: 王宏; 巫忠书; 钟洪声; 唐广; 李廷军
Original assignee: University of Electronic Science and Technology of China
Current assignee: University of Electronic Science and Technology of China
Priority date: 2016-09-08
Filing date: 2016-09-08
Publication date: 2017-02-22

Abstract

The invention discloses a full-pulse data lossless compression method based on K-means clustering, and belongs to the field of data compression. In the technical scheme of the invention, the data are first processed by K-means clustering, so that points with high data similarity form the same cluster. The value of the center point of each cluster is kept, and the original data are replaced with the differences between the data points and the center points; after this processing the differences are much smaller than the original data. The differences are then run-length coded and finally range coded. The compression method has a good compression effect and high reliability, and compresses full-pulse data well without loss.

Description

A full-pulse data lossless compression method based on K-means clustering

Technical field

The invention belongs to the field of data compression, and in particular relates to a data compression method based on K-means clustering that achieves lossless compression of full-pulse data in the field of electronic countermeasures.

Background art

In modern warfare, electronic countermeasures play a vital role in strategic attack and defense. Electronic countermeasures are a mode of operation in which both sides take various electronic measures and actions to weaken or destroy the effective use of the opponent's electronic equipment while ensuring that their own electronic equipment remains effective. Electronic reconnaissance is an important component of electronic countermeasures, and good electronic reconnaissance technology is the key to seizing the initiative in electronic warfare.

Full-pulse data is a data type that stores, in binary form, the key characteristic parameters of the intermediate-frequency pulses received by a reconnaissance receiver. Full-pulse data has the following features: (1) each data point comprises five parameters, namely pulse time of arrival (TOA), pulse width (PW), pulse amplitude (PA), pulse carrier frequency (CF) and pulse direction of arrival (DOA); (2) the data are represented in binary form, and each parameter occupies 4 bytes; (3) the volume of full-pulse data is very large, which makes it inconvenient to store and transmit; (4) full-pulse data contains the characteristic parameter values of a large number of repeated pulses from the same radar emission source, so the data are strongly correlated and therefore highly redundant. Because the data volume is large, the data must be compressed to reduce its size for storage and transmission; because the data are redundant, compression is possible. Since full-pulse data carries the key characteristic information of the reconnaissance receiver's intermediate-frequency pulses, no information may be lost during compression and decompression, so lossless compression is required.
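To make the record layout described above concrete, the following minimal sketch parses a raw byte buffer into five-parameter records. The source states only that each parameter occupies 4 bytes; the little-endian unsigned 32-bit encoding, the on-disk field order (taken here from the order in which the parameters are listed), and the names `RECORD`, `Pulse` and `parse_pulses` are assumptions made for illustration.

```python
import struct
from typing import List, NamedTuple

# Assumption: five little-endian unsigned 32-bit fields per record (5 x 4 = 20 bytes).
RECORD = struct.Struct("<5I")


class Pulse(NamedTuple):
    """One full-pulse data point (hypothetical field names)."""
    toa: int  # pulse time of arrival
    pw: int   # pulse width
    pa: int   # pulse amplitude
    cf: int   # pulse carrier frequency
    doa: int  # pulse direction of arrival


def parse_pulses(raw: bytes) -> List[Pulse]:
    """Split a buffer whose length is a multiple of 20 bytes into Pulse records."""
    return [Pulse(*fields) for fields in RECORD.iter_unpack(raw)]


def pack_pulses(pulses: List[Pulse]) -> bytes:
    """Inverse of parse_pulses: serialize records back to raw bytes."""
    return b"".join(RECORD.pack(*p) for p in pulses)
```

Under these assumptions a round trip through `parse_pulses` and `pack_pulses` reproduces the input bytes exactly, which is the property any lossless scheme built on top of this layout has to preserve.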

Cluster analysis, a branch of statistics, is mainly used in data mining. Clustering algorithms include K-means clustering, FCM clustering, Canopy clustering and others. The K-means clustering algorithm is easy to describe, fast, and well suited to large-scale data. The present invention is the first to propose using the K-means clustering algorithm for lossless compression of full-pulse data. Full-pulse data records pulse parameters from multiple emission sources at different times. Pulses from the same emission source differ little in their characteristic parameter values and are strongly correlated, so K-means clustering assigns them to the same cluster. Within each cluster, the original data values are replaced with the differences between the data points and the center point, and these differences are small compared with the original values. After this differential coding, the output code stream occupies fewer bits, which achieves data compression.

Summary of the invention

In the field of data compression today, general-purpose lossless compression algorithms (such as the LZ family of coders) are usually designed for text data and perform poorly on binary data sources. The present invention proposes a full-pulse data lossless compression method based on K-means clustering. The method has a good compression effect and high reliability, and compresses full-pulse data well without loss.

In the technical solution of the invention, the data are first processed by K-means clustering, so that points with high data similarity form the same cluster. Each cluster retains the value of its center point, and the original data are replaced with the differences between the data points and the center point; after this processing the differences are much smaller than the original data values. The differences are then run-length coded and finally range coded. Because the coded stream occupies fewer bits, a good compression effect is obtained.

To achieve the above object, the invention mainly comprises the following steps:

Step 1: Perform K-means clustering on the full-pulse data containing the key parameter information of the reconnaissance receiver's intermediate-frequency data, obtaining multiple clusters and the center point of each cluster.

K-means clustering requires the number of clusters K to be specified in advance. In general, K is chosen within a range [formula not reproduced in the source text], where n is the number of samples in the data set. In practical applications the true number of clusters is unknown. Experience shows that when K is larger than the true value the compression effect changes little, whereas when K is smaller than the true value the compression effect is poor. In typical electronic warfare the number of targets (that is, the number of clusters) is within 20, so the present invention selects the number of clusters as [formula not reproduced in the source text]. This way of choosing K performs well in both time complexity and compression effect.

Step 2: Subtract the center point of each cluster from all data points within that cluster to obtain the difference data.

Step 3: Run-length code the differences.

Step 4: Range code the run-length coded data.

Step 5: Output the range-coded stream together with the center values to obtain the compression result.

The technical solution proposed by the invention provides the following beneficial effects. The invention compresses full-pulse data losslessly and obtains a compression factor of about 2 with a small time overhead. Compared with coding the data directly without K-means clustering preprocessing, the invention improves the compression factor by about 20%. This is because the proposed lossless compression method based on K-means clustering has the following characteristics: 1) K-means clustering exploits the correlation in high-dimensional data well and is fast to compute; 2) clustering with a preset number of clusters [formula not reproduced in the source text] obtains a good compression effect with a small time overhead; 3) coding the difference data improves coding efficiency compared with coding the original data directly.

Brief description of the drawings

Fig. 1 is a flow chart of full-pulse data lossless compression based on K-means clustering.

Fig. 2 is a flow chart of the K-means clustering algorithm.

Fig. 3 is a schematic diagram of the K-means clustering effect.

Detailed description of the embodiments

To make the object, technical solution and advantages of the invention clearer, the implementation steps of the full-pulse data lossless compression algorithm based on K-means clustering are explained in detail below.

As shown in Fig. 1, the full-pulse data lossless compression method based on K-means clustering comprises the following steps:

Step 1: Apply the K-means clustering transform to the input full-pulse data, reorganizing the data source X into a set of clusters $C = \{C_1, C_2, \dots, C_K\}$, where $C_1 \cup C_2 \cup \dots \cup C_K = X$ and $C_i \cap C_j = \varnothing$ for $i \neq j$; $i, j = 1, 2, \dots, K$.

The detailed flow of K-means clustering is shown in Fig. 2:

1) The input data source X comprises n data points $\{x_1, x_2, \dots, x_n\}$, each of which is a p-dimensional datum comprising p characteristic parameters.

2) Randomly select K data points as the initial cluster centers, and compute the Euclidean distance from each data point to each of the K cluster centers. If the distance between a data point and a given cluster center is smaller than its distance to every other cluster center, assign the data point to the cluster represented by that center. This yields the initial partition into K clusters;

3) Recompute the K new cluster centers:

$$\mu_i = \frac{1}{N_i} \sum_{j=1}^{N_i} x_{ij}, \quad i = 1, 2, \dots, K \qquad (1)$$

where $\mu_i$ denotes the center point of the i-th cluster, $N_i$ the number of data points in the i-th cluster, and $x_{ij}$ the j-th data point in the i-th cluster;

4) Compute the Euclidean distance between each data point and the center point of its cluster, and obtain the total distance sum over all clusters (that is, the deviation) J, computed as follows:

$$J = \sum_{i=1}^{K} \sum_{j=1}^{N_i} \lVert x_{ij} - \mu_i \rVert^2 \qquad (2)$$

5) Repeat the calculations of steps 3) and 4) and test for convergence. The goal of clustering is to minimize the total distance sum over all clusters, that is, the deviation J. If the change in J between successive iterations is less than a preset precision ε (assume ε = 10^-6), the algorithm has converged and the calculation ends; otherwise return to step 2) and recompute.
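The clustering steps 1) to 5) above can be sketched in a few lines of standard-library Python. This is an illustrative sketch rather than the patented implementation: the function names, the `max_iter` safety cap, the fixed random seed, and the choice to keep the old center when a cluster becomes empty are all assumptions not stated in the source.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two p-dimensional points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, eps=1e-6, max_iter=100, seed=0):
    """Cluster `points` (a list of equal-length tuples) into k clusters.

    Returns (labels, centers, J), where J is the deviation of formula (2).
    """
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # step 2): random initial centers
    prev_j = float("inf")
    labels, j = [], prev_j
    for _ in range(max_iter):  # assumption: cap on the number of iterations
        # Step 2): assign every point to its nearest center.
        labels = [min(range(k), key=lambda i: sq_dist(p, centers[i]))
                  for p in points]
        # Step 3): recompute each center as the cluster mean, formula (1).
        for i in range(k):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:  # assumption: an empty cluster keeps its old center
                centers[i] = tuple(sum(c) / len(members) for c in zip(*members))
        # Step 4): deviation J, formula (2).
        j = sum(sq_dist(p, centers[lab]) for p, lab in zip(points, labels))
        # Step 5): stop once J changes by less than eps.
        if abs(prev_j - j) < eps:
            break
        prev_j = j
    return labels, centers, j
```

Note that the centers returned here are floating-point means. A lossless scheme has to store center values from which the differences can be inverted exactly, for example by rounding them to integers before taking differences; the source does not say how this is handled.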

A schematic diagram of the K-means clustering effect is shown in Fig. 3. 150 two-dimensional data points are generated at random, with every 50 random points forming one class; the class means are [-1, -1], [1, 1] and [1, -1], and the variance is [1, 1]. After K-means clustering the data are correctly divided into three classes, the center point of each class is very close to its mean, and the clustering effect is good.

Step 2: The data processed by the K-means clustering algorithm form K clusters. For each cluster, compute the difference between each data point and the center point.
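As a hedged illustration of this differencing step, the sketch below replaces each point by its per-dimension difference from the center of its cluster and then inverts the operation. It assumes integer parameters and integer (for example rounded) center values so that the reconstruction is exact, and it assumes the decoder knows which cluster each point belongs to; the source specifies neither detail, and the function names are hypothetical.

```python
def to_residuals(points, labels, centers):
    """Replace every point by its per-dimension difference from its cluster center."""
    return [tuple(x - c for x, c in zip(p, centers[k]))
            for p, k in zip(points, labels)]


def from_residuals(residuals, labels, centers):
    """Inverse transform: add the cluster center back to every residual."""
    return [tuple(d + c for d, c in zip(r, centers[k]))
            for r, k in zip(residuals, labels)]
```

For example, with the center `(1001, 50)` the points `(1000, 50)` and `(1003, 49)` become the residuals `(-1, 0)` and `(2, -1)`: small magnitudes whose 4-byte representations consist mostly of repeated `0x00` or `0xFF` bytes.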

Step 3: Treat the difference data of each point as a byte stream and run-length code it.
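The source does not specify the run-length format. The sketch below assumes a simple scheme of (count, value) byte pairs with runs capped at 255, and assumes differences are serialized as little-endian signed 32-bit integers before coding; both choices, and the function names, are illustrative only.

```python
import struct


def rle_encode(data: bytes) -> bytes:
    """Encode a byte stream as (run length, byte value) pairs, run length <= 255."""
    out = bytearray()
    i = 0
    while i < len(data):
        run = 1
        while i + run < len(data) and data[i + run] == data[i] and run < 255:
            run += 1
        out += bytes((run, data[i]))
        i += run
    return bytes(out)


def rle_decode(encoded: bytes) -> bytes:
    """Expand (run length, byte value) pairs back into the original byte stream."""
    out = bytearray()
    for k in range(0, len(encoded), 2):
        out += bytes([encoded[k + 1]]) * encoded[k]
    return bytes(out)


# Assumption: differences are stored as little-endian signed 32-bit integers.
residual_bytes = struct.pack("<3i", 2, -1, 0)  # 12 bytes
coded = rle_encode(residual_bytes)             # 8 bytes: 01 02 03 00 04 ff 04 00
```

Small differences produce long runs of `0x00` (non-negative values) or `0xFF` (negative values) in the high-order bytes, which is what makes run-length coding pay off after the differencing step. On data without runs this pair format doubles the size, so a practical format would likely add an escape mechanism.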

Step 4: Range code the run-length coded data.

To obtain a better compression effect, the invention uses a range coding algorithm. Range coding maps the input data into an integer interval and finally outputs an integer belonging to that interval as the output code. Range coding can achieve a compression ratio higher than the limit of Huffman coding, which spends at least one bit per symbol.

Range coding mainly comprises the following steps:

1) Read the run-length coded data byte by byte, treating each byte as one symbol, and count the number N of distinct symbols; N is used as the initial value of the total frequency T of all symbols.

2) Set an initial integer interval [L, H] and initialize its lower and upper bounds: upper bound H = 0xf0000000 and lower bound L = 0x00000000, so that the initial range is R = 0xf0000000. In addition, set the minimum range for interval normalization, R_min = 0x00010000.

3) Compute the initial mapping interval corresponding to each symbol. From the current frequency $f_s$ of a symbol S, its cumulative frequency $F_s$, and the total frequency T of all symbols, compute the initial mapping interval [L', H'] of symbol S.

The cumulative frequency $F_s$ of symbol S is the sum of the frequencies of all other symbols x whose value is less than S (x < S), and can be computed with formula (3):

$$F_s = \sum_{x < S} f_x \qquad (3)$$

The range R' and the lower and upper bounds L', H' of the initial mapping interval [L', H'] are given by formulas (4), (5) and (6), where div denotes integer division.

$$R' = (R \,\mathrm{div}\, T) \times f_s \qquad (4)$$

$$L' = L + (R \,\mathrm{div}\, T) \times F_s \qquad (5)$$

$$H' = L' + R' - 1 = L + (R \,\mathrm{div}\, T) \times (F_s + f_s) - 1 \qquad (6)$$

4) When data containing multiple symbols are coded, the mapping interval is updated continually according to the current input symbol. The update principle is that the mapping interval of the next input symbol is computed from the mapping interval of the previous symbol. The calculation still uses formulas (4) to (6), with the parameters in the formulas updated adaptively: after the current symbol S is input, its frequency $f_s$ is increased by 1, and the cumulative frequency $F_s$ and the total frequency T of all symbols are recomputed accordingly. From the updated $f_s$, $F_s$ and T, the updated mapping interval [L', H'] of symbol S is obtained.

5) When the range R' of the updated mapping interval satisfies R' < R_min (R_min being the minimum interval range), or when a byte-by-byte comparison of the lower and upper bounds of the updated interval shows that their high-order bytes are equal, the identical high-order bytes are shifted out as output code stream and the interval is normalized.

6) All input data are coded according to the above steps. At the end of coding, all bits remaining in the mapping interval are shifted out as output code stream, and the result is saved as a binary file.
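The interval arithmetic of formulas (3) to (6) and the adaptive frequency update of step 4) can be sketched as follows. This is a deliberately simplified illustration, not the full coder described above: it omits the byte output and normalization of steps 5) and 6), so it only works for short inputs where the range never shrinks to zero. The initial frequency of 1 per symbol (so that T starts at N), the explicit `alphabet` argument, and all function names are assumptions.

```python
INITIAL_RANGE = 0xF0000000  # initial range R from step 2)


def narrow(low, rng, freqs, symbol):
    """Apply formulas (3)-(6) once; return (L', H', R') for `symbol`."""
    total = sum(freqs.values())                            # T
    f_s = freqs[symbol]
    cum = sum(f for x, f in freqs.items() if x < symbol)   # formula (3)
    step = rng // total                                    # R div T
    new_rng = step * f_s                                   # formula (4)
    new_low = low + step * cum                             # formula (5)
    new_high = new_low + new_rng - 1                       # formula (6)
    return new_low, new_high, new_rng


def encode(data, alphabet):
    """Return an integer inside the final interval (here its lower bound)."""
    freqs = {s: 1 for s in alphabet}  # assumption: each symbol starts at frequency 1
    low, rng = 0, INITIAL_RANGE
    for s in data:
        low, _, rng = narrow(low, rng, freqs, s)
        freqs[s] += 1                 # adaptive update, step 4)
    return low


def decode(value, length, alphabet):
    """Recover `length` symbols by replaying the same interval updates."""
    freqs = {s: 1 for s in alphabet}
    low, rng = 0, INITIAL_RANGE
    out = []
    for _ in range(length):
        total = sum(freqs.values())
        target = (value - low) // (rng // total)
        cum = 0
        for s in sorted(freqs):       # find s with F_s <= target < F_s + f_s
            if cum <= target < cum + freqs[s]:
                break
            cum += freqs[s]
        low, _, rng = narrow(low, rng, freqs, s)
        freqs[s] += 1
        out.append(s)
    return bytes(out)
```

Because the encoder and decoder apply identical integer updates, `decode(encode(data, alphabet), len(data), alphabet)` returns `data` for short inputs. A production coder would add the normalization of step 5) to keep the range large enough for inputs of arbitrary length.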

Step 5: Save the binary file produced by range coding, together with the center values, as the output file.

Application example:

A segment of full-pulse data is taken as the sample. Its size is 3721 KB, the number of data points is n = 10000, and each data point comprises 5 characteristic parameters. The data source contains 6 classes, and the number of clusters K is chosen as 50. After compression with K-means clustering the size is 1868 KB, a compression ratio of 50.2%. The lossless compression effect on full-pulse data depends closely on the specific data sample; for some full-pulse data the compression ratio can reach about 30%.
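The reported figures can be checked directly. The calculation below reads "compression ratio" as compressed size divided by original size, which is the reading consistent with the stated 50.2%; the second expression gives the corresponding compression factor.

```latex
\frac{\text{compressed size}}{\text{original size}}
  = \frac{1868\ \text{KB}}{3721\ \text{KB}} \approx 0.502,
\qquad
\frac{\text{original size}}{\text{compressed size}}
  = \frac{3721\ \text{KB}}{1868\ \text{KB}} \approx 1.99
```

The factor of roughly 1.99 agrees with the compression factor of about 2 stated among the beneficial effects.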

The method proposed by the invention is not limited to the example described in the detailed description. Other embodiments derived by those skilled in the art from the technical solution of the invention, including devices that implement the corresponding functions, likewise fall within the scope of the invention and are to be protected, provided that they losslessly compress full-pulse data using the K-means clustering transform.

Claims

1. A full-pulse data lossless compression method based on K-means clustering, comprising the following steps:

Step 1: Perform K-means clustering on the full-pulse data containing the key parameter information of the reconnaissance receiver's intermediate-frequency data, obtaining multiple clusters and the center point of each cluster;

Step 2: Subtract the center point of each cluster from all data points within that cluster to obtain the difference data;

Step 3: Run-length code the differences;

Step 4: Range code the run-length coded data;

2. The full-pulse data lossless compression method based on K-means clustering as claimed in claim 1, characterized in that the K-means clustering in Step 1 comprises the following steps:

1) The input data source X comprises n data points $\{x_1, x_2, \dots, x_n\}$, each of which is a p-dimensional datum comprising p characteristic parameters;

2) Randomly select K data points as the initial cluster centers, K being chosen within a range [formula not reproduced in the source text]; compute the Euclidean distance from each data point to each of the K cluster centers; if the distance between a data point and a given cluster center is smaller than its distance to every other cluster center, assign the data point to the cluster represented by that center, obtaining the initial partition into K clusters;

3) Recompute the K new cluster centers:

$$\mu_i = \frac{1}{N_i} \sum_{j=1}^{N_i} x_{ij}, \quad i = 1, 2, \dots, K \qquad (1)$$

where $\mu_i$ denotes the center point of the i-th cluster, $N_i$ the number of data points in the i-th cluster, and $x_{ij}$ the j-th data point in the i-th cluster;

4) Compute the Euclidean distance between each data point and the center point of its cluster, and obtain the total distance sum over all clusters, that is, the deviation J, computed as follows:

$$J = \sum_{i=1}^{K} \sum_{j=1}^{N_i} \lVert x_{ij} - \mu_i \rVert^2 \qquad (2)$$

Repeat the calculations of steps 3) and 4) and test for convergence: the goal of clustering is to minimize the total distance sum over all clusters, that is, the deviation J; if the change in J between successive iterations is less than a preset precision ε, the algorithm has converged and the calculation ends; otherwise return to step 2) and recompute.

3. The full-pulse data lossless compression method based on K-means clustering as claimed in claim 2, characterized in that: the number of clusters