CN106452452A - Full-pulse data lossless compression method based on K-means clustering - Google Patents

Info

Publication number
CN106452452A
Authority
CN
China
Prior art keywords
cluster
data
class
point
compression method
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610809393.XA
Other languages
Chinese (zh)
Inventor
王宏
巫忠书
钟洪声
唐广
李廷军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China
Priority to CN201610809393.XA
Publication of CN106452452A
Legal status: Pending

Classifications

    • H: ELECTRICITY
    • H03: ELECTRONIC CIRCUITRY
    • H03M: CODING; DECODING; CODE CONVERSION IN GENERAL
    • H03M7/00: Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
    • H03M7/30: Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction
    • H03M7/40: Conversion to or from variable length codes, e.g. Shannon-Fano code, Huffman code, Morse code
    • H03M7/4006: Conversion to or from arithmetic code
    • H03M7/4012: Binary arithmetic codes
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/23: Clustering techniques
    • G06F18/232: Non-hierarchical techniques
    • G06F18/2321: Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213: Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The invention discloses a lossless compression method for full-pulse data based on K-means clustering and belongs to the field of data compression. In the technical scheme, the data first undergoes K-means clustering so that points with high similarity form the same cluster; the value of each cluster's center point is retained, and the original data is replaced by the differences between the data points and the center points, which are much smaller than the original values. The differences are then run-length coded and subsequently range coded. The method compresses well, is highly reliable, and is better suited to lossless compression of full-pulse data.

Description

A full-pulse data lossless compression method based on K-means clustering
Technical field
The invention belongs to the field of data compression, and in particular relates to a data compression method based on K-means clustering that achieves lossless compression of full-pulse data in the field of electronic countermeasures.
Background
In modern warfare, electronic countermeasures play a vital role in strategic attack and defense. Electronic countermeasures are the electronic measures and actions that both sides take to weaken or destroy the effectiveness of the opponent's electronic equipment while ensuring that their own equipment performs effectively. Electronic reconnaissance is an important component of electronic countermeasures, and strong reconnaissance technology is key to seizing the initiative in electronic warfare.
Full-pulse data is a data type that stores, in binary form, the key characteristic parameters of the intermediate-frequency pulses of a reconnaissance aircraft. Its features are: (1) each data point comprises five parameters: pulse time of arrival (TOA), pulse width (PW), pulse amplitude (PA), pulse carrier frequency (CF) and pulse angle of arrival (DOA); (2) the data is represented in binary form, and each parameter occupies 4 bytes; (3) the data volume is very large, which makes storage and transmission difficult; (4) because full-pulse data contains the parameter values of many repeated pulses from the same radar emitter, the data is strongly correlated and therefore highly redundant. Because the data volume is large, the data must be compressed to reduce its size for storage and transmission; because the data is redundant, it can be compressed. Since full-pulse data carries the key characteristic information of the intermediate-frequency pulses, no information may be lost during compression and decompression, so lossless compression is required.
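The record layout described above (five parameters of 4 bytes each, 20 bytes per pulse) can be sketched as follows. The source does not state the numeric type, byte order or field order, so the unsigned 32-bit little-endian layout and the names `RECORD`, `FIELDS`, `pack_pulses` and `unpack_pulses` are illustrative assumptions only.

```python
import struct

# Assumed layout: five unsigned 32-bit little-endian fields per pulse.
# The source only says "five parameters, 4 bytes each"; the type,
# byte order and field order below are hypothetical.
RECORD = struct.Struct("<5I")
FIELDS = ("toa", "pw", "pa", "cf", "doa")

def pack_pulses(pulses):
    """Serialise an iterable of 5-tuples into one flat byte string."""
    return b"".join(RECORD.pack(*p) for p in pulses)

def unpack_pulses(blob):
    """Split a byte string back into a list of 5-tuples."""
    if len(blob) % RECORD.size:
        raise ValueError("blob length is not a multiple of the record size")
    return list(RECORD.iter_unpack(blob))

# Two pulses from the same hypothetical emitter: only TOA and PA differ.
pulses = [(1000, 50, 200, 9400, 45), (2000, 50, 201, 9400, 45)]
blob = pack_pulses(pulses)
```

Pulses from one emitter share most of their field values, which is the redundancy the method sets out to exploit.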
Cluster analysis, a branch of statistics, is used mainly in data mining. Clustering algorithms include K-means, FCM and Canopy clustering, among others. The K-means algorithm is easy to describe, fast, and well suited to large-scale data. The present invention proposes, for the first time, using the K-means clustering algorithm for lossless compression of full-pulse data. Full-pulse data records pulse parameters from multiple emitters at different times. Pulses from the same emitter differ little in their parameter values and are strongly correlated, so K-means clustering assigns them to the same cluster. Within each cluster, each original data value is replaced by the difference between the data point and the cluster center, and these differences are small compared with the original values. After the differences are encoded, the output code stream occupies fewer bits, which achieves data compression.
Summary of the invention
In the field of data compression, general-purpose lossless compression algorithms (such as the LZ family of coders) are usually designed for text data and perform poorly on data sources in binary format. The present invention proposes a full-pulse data lossless compression method based on K-means clustering that compresses well, is highly reliable, and is better suited to lossless compression of full-pulse data.
In the technical solution adopted by the present invention, the data first undergoes K-means clustering so that points with high similarity form the same cluster. Each cluster retains the value of its center point, and the original data is replaced by the differences between the data points and the center point; after this processing the differences are much smaller than the original values. The differences are then run-length coded and subsequently range coded. Because the encoded stream occupies fewer bits, good compression is obtained.
To achieve the above object, the invention mainly comprises the following steps:
Step 1: Perform K-means clustering on the full-pulse data containing the key parameter information of the reconnaissance aircraft's intermediate-frequency data, obtaining multiple clusters and the center point of each cluster.
K-means clustering requires the number of clusters K to be specified in advance. In general, K is chosen within a range that depends on n [expression missing in source], where n is the number of samples in the data set. In practice the true number of clusters is unknown. Experience shows that when K is larger than the true value the compression performance changes little, whereas when K is smaller than the true value the compression performance is poor. In typical electronic warfare the number of targets (that is, the number of clusters) is below 20, so the present invention selects the number of clusters as [expression missing in source]. This way of choosing K performs well in both time complexity and compression performance.
Step 2: Subtract the cluster's center point from every data point inside each cluster to obtain the difference data.
Step 3: Run-length code the differences.
Step 4: Range code the run-length-coded data.
Step 5: Output the range-coded stream together with the center values as the compression result.
The technical solution proposed by the present invention has the following beneficial effects: full-pulse data can be compressed losslessly, and a compression factor of about 2 is obtained with little time overhead. Compared with encoding the data directly without K-means preprocessing, the compression factor improves by about 20%. This is because the proposed lossless compression method based on K-means clustering has the following characteristics: 1) K-means clustering copes well with the correlation in high-dimensional data and is fast to compute; 2) clustering uses a predetermined number of clusters [expression missing in source], which yields good compression at low time cost; 3) encoding the difference data, rather than encoding the original data directly, improves coding efficiency.
Brief description of the drawings
Fig. 1 is a flow chart of full-pulse lossless data compression based on K-means clustering.
Fig. 2 is a flow chart of the K-means clustering algorithm.
Fig. 3 is a schematic diagram of the K-means clustering result.
Detailed description
To make the objects, technical solutions and advantages of the present invention clearer, the implementation steps of the full-pulse lossless data compression algorithm based on K-means clustering are explained in detail below.
As shown in Fig. 1, the full-pulse data lossless compression method based on K-means clustering comprises the following steps:
Step 1: Apply a K-means clustering transform to the input full-pulse data, reorganising the data source X into a set of clusters $C = \{C_1, C_2, \ldots, C_K\}$, where $C_1 \cup C_2 \cup \cdots \cup C_K = X$ and $C_i \cap C_j = \emptyset$ for $i \neq j$; $i, j = 1, 2, \ldots, K$.
The detailed K-means clustering procedure is shown in Fig. 2:
1) The input data source X contains n data points $\{x_1, x_2, \ldots, x_n\}$; each data point is p-dimensional data comprising p characteristic parameters.
2) Randomly select K data points as the initial cluster centers, then compute the Euclidean distance from each data point to each of the K cluster centers. If a data point is closer to one cluster center than to every other cluster center, assign it to the cluster represented by that center. This yields the initial partition into K clusters.
3) Recompute the K new cluster centers:
$$\mu_i = \frac{1}{N_i} \sum_{j=1}^{N_i} x_{ij}, \quad i = 1, 2, \ldots, K \qquad (1)$$
where $\mu_i$ is the center point of the i-th cluster, $N_i$ is the number of data points in the i-th cluster, and $x_{ij}$ is the j-th data point in the i-th cluster.
4) Compute the Euclidean distance between the data points of each cluster and its center point to obtain the total distance sum over all clusters (that is, the deviation) J. The formula is:
$$J = \sum_{i=1}^{K} \sum_{j=1}^{N_i} \lVert x_{ij} - \mu_i \rVert^2 \qquad (2)$$
5) Repeat the calculations of steps 3) and 4) and test for convergence. The goal of clustering is to minimise the total distance sum over all clusters, that is, the deviation J. If the change in J between iterations is less than a preset precision ε (for example, ε = 10^-6), the algorithm has converged and the calculation ends; otherwise return to step 2) and reassign the data points using the new cluster centers.
Fig. 3 shows a schematic K-means clustering result. 150 two-dimensional data points are generated at random in three classes of 50 points each, with means [-1, -1], [1, 1] and [1, -1] and variance [1, 1]. After K-means clustering, the data is divided accurately into three classes, the center point of each class lies very close to its mean, and the clustering result is good.
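The clustering loop of steps 1) to 5) can be sketched with NumPy as below. This is a minimal illustration under stated assumptions, not the patent's implementation: the function name `kmeans`, the `max_iter` cap, the fixed random seeds and the choice to leave an empty cluster's center unchanged are all hypothetical. The demo data follows the Fig. 3 description (three classes of 50 points, unit variance).

```python
import numpy as np

def kmeans(X, K, eps=1e-6, max_iter=100, seed=0):
    """Plain K-means; returns (centers, labels, J)."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    # Step 2: pick K distinct data points as the initial centers.
    centers = X[rng.choice(len(X), size=K, replace=False)]
    prev_J = np.inf
    for _ in range(max_iter):
        # Assign every point to its nearest center (squared Euclidean distance).
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Step 3, formula (1): each center becomes the mean of its members.
        for i in range(K):
            members = X[labels == i]
            if len(members):          # assumption: an empty cluster keeps its center
                centers[i] = members.mean(axis=0)
        # Step 4, formula (2): total squared distance to the assigned centers.
        J = ((X - centers[labels]) ** 2).sum()
        # Step 5: stop once J changes by less than eps.
        if abs(prev_J - J) < eps:
            break
        prev_J = J
    return centers, labels, J

# Synthetic data modelled on the Fig. 3 description.
gen = np.random.default_rng(1)
means = np.array([[-1.0, -1.0], [1.0, 1.0], [1.0, -1.0]])
X = np.vstack([gen.normal(m, 1.0, size=(50, 2)) for m in means])
centers, labels, J = kmeans(X, K=3)
```

Because the initial centers are random, different seeds can converge to different local minima of J.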
Step 2: After K-means clustering, the data forms K clusters. For each cluster, compute the difference between every data point and the cluster's center point.
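The difference step can be sketched as follows. Two details are assumptions added for illustration, since the source does not specify them: the centers are rounded to integers so that integer data is reconstructed exactly, and the per-point cluster labels are assumed to be available to the decoder. The function names are hypothetical.

```python
import numpy as np

def to_residuals(X, labels, centers):
    """Replace each integer data point by its difference from the cluster center."""
    # Assumption: round centers to integers so the round trip is exact.
    int_centers = np.rint(centers).astype(np.int64)
    residuals = X - int_centers[labels]
    return int_centers, residuals

def from_residuals(int_centers, labels, residuals):
    """Invert to_residuals: add each point's cluster center back."""
    return residuals + int_centers[labels]

# Two hypothetical emitters, two parameters per pulse.
X = np.array([[1000, 50], [1004, 51], [90000, 7], [90010, 9]], dtype=np.int64)
labels = np.array([0, 0, 1, 1])
centers = np.array([X[labels == i].mean(axis=0) for i in range(2)])

int_centers, residuals = to_residuals(X, labels, centers)
restored = from_residuals(int_centers, labels, residuals)
```

The residuals are single-digit values where the originals ran to tens of thousands, so most of their bytes are zero (or 0xFF for negative values), which is what the later coding stages exploit.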
Step 3: Treat the difference data of each point as a byte stream and run-length code it.
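A byte-oriented run-length coder can be sketched as below. The source does not give the run-length format, so the (count, value) pair layout with runs capped at 255 is an assumption, as are the function names.

```python
def rle_encode(data: bytes) -> bytes:
    """Encode as (count, value) byte pairs; runs are capped at 255."""
    out = bytearray()
    i = 0
    while i < len(data):
        run = 1
        while i + run < len(data) and data[i + run] == data[i] and run < 255:
            run += 1
        out += bytes((run, data[i]))
        i += run
    return bytes(out)

def rle_decode(encoded: bytes) -> bytes:
    """Expand (count, value) pairs back into the original bytes."""
    out = bytearray()
    for k in range(0, len(encoded), 2):
        out += bytes([encoded[k + 1]]) * encoded[k]
    return bytes(out)

# Two small residuals stored as 4-byte little-endian signed integers:
# 2 -> 02 00 00 00 and -1 -> ff ff ff ff.
sample = (2).to_bytes(4, "little", signed=True) + (-1).to_bytes(4, "little", signed=True)
packed = rle_encode(sample)
```

With this pair format, data that has no runs doubles in size, so it only pays off on streams such as small residuals where long runs of identical bytes are common.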
Step 4: Range code the run-length-coded data.
To obtain good compression, the present invention uses a range coding algorithm. Range coding maps the input data into an integer interval and finally outputs an integer belonging to that interval as the code. Range coding can exceed the compression limit of Huffman coding, which spends at least one bit on every symbol.
Range coding mainly comprises the following steps:
1) Read the run-length-coded data byte by byte, treating each byte as one symbol. Count the number N of distinct symbols and use it as the initial value of the total frequency T of all symbols.
2) Set an initial integer interval [L, H] and initialise its bounds: upper bound H = 0xf0000000 and lower bound L = 0x00000000, so the initial range is R = 0xf0000000. In addition, set the minimum range for interval normalisation to R_min = 0x00010000.
3) Compute the initial mapping interval for each symbol. From the current frequency $f_s$ of a symbol S, its cumulative frequency $F_s$ and the total frequency T of all symbols, compute the initial mapping interval [L', H'] of symbol S.
The cumulative frequency $F_s$ of symbol S is the sum of the frequencies of all symbols whose value is less than S (x < S), computed with formula (3):
$$F_s = \sum_{x < S} f_x \qquad (3)$$
The bounds H', L' and the range R' of the initial mapping interval [L', H'] are given by formulas (4), (5) and (6), where div denotes integer division.
$$R' = (R \,\mathrm{div}\, T) \times f_s \qquad (4)$$
$$L' = L + (R \,\mathrm{div}\, T) \times F_s \qquad (5)$$
$$H' = L' + R' - 1 = L + (R \,\mathrm{div}\, T) \times (F_s + f_s) - 1 \qquad (6)$$
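Formulas (4) to (6) can be checked with a small worked example. The function name `narrow` and the three-symbol alphabet are illustrative assumptions; the initial range is the one given in step 2).

```python
def narrow(L, R, f_s, F_s, T):
    """Apply formulas (4)-(6): map symbol S into a sub-interval of [L, L + R)."""
    step = R // T               # R div T (integer division)
    R_new = step * f_s          # formula (4)
    L_new = L + step * F_s      # formula (5)
    H_new = L_new + R_new - 1   # formula (6)
    return L_new, H_new, R_new

# Hypothetical alphabet of three symbols A < B < C, each with frequency 1,
# so T = 3. For B: f_s = 1 and F_s = 1 (only A is smaller).
L_b, H_b, R_b = narrow(L=0x00000000, R=0xF0000000, f_s=1, F_s=1, T=3)
print(hex(L_b), hex(H_b), hex(R_b))  # 0x50000000 0x9fffffff 0x50000000
```

Here R div T = 0x50000000, so symbol B receives the middle third of the initial interval.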
4) When data containing multiple symbols is encoded, the mapping interval is updated continually according to the current input symbol. The update principle is that the mapping interval of the next input symbol is computed from the mapping interval of the previous symbol. The calculation still uses formulas (4) to (6), with the parameters updated adaptively: after the current symbol S has been input, its frequency $f_s$ is incremented by 1, and the cumulative frequencies $F_s$ and the total frequency T of all symbols are updated accordingly. From the updated frequency $f_s$, cumulative frequency $F_s$ and total frequency T, the updated mapping interval [L', H'] of symbol S is obtained.
5) When the updated range satisfies R' < R_min (R_min being the minimum interval range), or when a byte-by-byte comparison of the bounds of the new interval shows that their high-order bytes are equal, shift out the identical high-order byte as output code stream and normalise the interval.
6) Encode all input data according to the steps above. When encoding ends, shift out all bits of the mapping interval as output code stream and save the result as a binary file.
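Steps 1) to 6) can be sketched as an adaptive byte-wise range coder. The sketch follows the widely used carry-less range coder design rather than the patent's exact procedure, and several details are assumptions: the initial range is 0xFFFFFFFF instead of 0xf0000000, all 256 byte values start with frequency 1 (so no alphabet has to be transmitted), frequencies are halved before the total reaches R_min, and the decoder is given the message length separately. All names are hypothetical.

```python
TOP, BOT, MASK = 1 << 24, 1 << 16, 0xFFFFFFFF   # BOT plays the role of R_min

class Model:
    """Adaptive order-0 byte model (assumption: all 256 symbols start at 1)."""
    def __init__(self):
        self.freq = [1] * 256
        self.total = 256
    def cum(self, s):
        return sum(self.freq[:s])        # F_s: frequencies of symbols below s
    def update(self, s):
        self.freq[s] += 1
        self.total += 1
        if self.total >= BOT:            # keep T below R_min by halving counts
            self.freq = [(f + 1) // 2 for f in self.freq]
            self.total = sum(self.freq)

def _shift_needed(low, rng):
    """Step 5): decide whether to shift out a byte; may truncate the range."""
    if (low ^ ((low + rng) & MASK)) < TOP:   # high bytes of both bounds equal
        return True, rng
    if rng < BOT:                            # range too small: clip to boundary
        return True, -low & (BOT - 1)
    return False, rng

def encode(data: bytes) -> bytes:
    model, out = Model(), bytearray()
    low, rng = 0, MASK
    for s in data:
        rng //= model.total                          # R div T
        low = (low + model.cum(s) * rng) & MASK      # formula (5)
        rng *= model.freq[s]                         # formula (4)
        shift, rng = _shift_needed(low, rng)
        while shift:
            out.append(low >> 24)
            low, rng = (low << 8) & MASK, (rng << 8) & MASK
            shift, rng = _shift_needed(low, rng)
        model.update(s)
    for _ in range(4):                               # step 6): flush the interval
        out.append(low >> 24)
        low = (low << 8) & MASK
    return bytes(out)

def decode(blob: bytes, n: int) -> bytes:
    model, out, stream = Model(), bytearray(), iter(blob)
    low, rng, code = 0, MASK, 0
    for _ in range(4):
        code = (code << 8) | next(stream, 0)
    for _ in range(n):
        rng //= model.total
        target = ((code - low) & MASK) // rng
        s, acc = 0, 0
        while acc + model.freq[s] <= target:         # find s with F_s <= target
            acc += model.freq[s]
            s += 1
        low = (low + acc * rng) & MASK
        rng *= model.freq[s]
        shift, rng = _shift_needed(low, rng)
        while shift:
            code = ((code << 8) | next(stream, 0)) & MASK
            low, rng = (low << 8) & MASK, (rng << 8) & MASK
            shift, rng = _shift_needed(low, rng)
        model.update(s)
        out.append(s)
    return bytes(out)
```

A round trip such as `decode(encode(msg), len(msg)) == msg` is the natural check. The linear cumulative-frequency scan keeps the sketch short; a practical coder would typically use a faster structure for it.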
Step 5: Save the binary file produced by range coding, together with the center values, as the output file.
Application example:
A section of full-pulse data is selected as the sample. Its size is 3721 KB, the number of data points is n = 10000, and each data point comprises 5 characteristic parameters. The data source contains 6 classes and the number of clusters K is set to 50. After K-means clustering and compression the size is 1868 KB, a compression ratio of 50.2%. The lossless compression performance on full-pulse data depends closely on the specific data sample; for some full-pulse data the compression ratio can reach about 30%.
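As a quick arithmetic check of the reported figures, the compression ratio (compressed size over original size) and the corresponding compression factor are:

```latex
\frac{1868\ \text{KB}}{3721\ \text{KB}} \approx 0.502 = 50.2\%,
\qquad
\frac{3721\ \text{KB}}{1868\ \text{KB}} \approx 1.99
```

A factor of about 1.99 agrees with the stated compression factor of about 2; by the same reasoning, a compression ratio of about 30% corresponds to a factor of roughly 3.3.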
The method proposed by the present invention is not limited to the example described in the detailed description. Other embodiments that those skilled in the art derive from the technical solution of the present invention, provided they use a K-means clustering transform for lossless compression of full-pulse data, including devices that realise the corresponding functions, likewise fall within the scope of the present invention and are to be protected.

Claims (3)

1. A full-pulse data lossless compression method based on K-means clustering, comprising the following steps:
Step 1: perform K-means clustering on the full-pulse data containing the key parameter information of the reconnaissance aircraft's intermediate-frequency data, obtaining multiple clusters and the center point of each cluster;
Step 2: subtract the cluster's center point from every data point inside each cluster to obtain the difference data;
Step 3: run-length code the differences;
Step 4: range code the run-length-coded data;
Step 5: output the range-coded stream together with the center values as the compression result.
2. The full-pulse data lossless compression method based on K-means clustering according to claim 1, characterized in that in Step 1 the K-means clustering comprises the following steps:
1) the input data source X contains n data points $\{x_1, x_2, \ldots, x_n\}$, each data point being p-dimensional data comprising p characteristic parameters;
2) randomly select K data points as the initial cluster centers, the value of K lying within a range that depends on n [expression missing in source]; compute the Euclidean distance from each data point to each of the K cluster centers; if a data point is closer to one cluster center than to every other cluster center, assign it to the cluster represented by that center, obtaining the initial partition into K clusters;
3) recompute the K new cluster centers:
$$\mu_i = \frac{1}{N_i} \sum_{j=1}^{N_i} x_{ij}, \quad i = 1, 2, \ldots, K \qquad (1)$$
where $\mu_i$ is the center point of the i-th cluster, $N_i$ is the number of data points in the i-th cluster, and $x_{ij}$ is the j-th data point in the i-th cluster;
4) compute the Euclidean distance between the data points of each cluster and its center point to obtain the total distance sum over all clusters, that is, the deviation J, by the formula:
$$J = \sum_{i=1}^{K} \sum_{j=1}^{N_i} \lVert x_{ij} - \mu_i \rVert^2 \qquad (2)$$
repeat the calculations of steps 3) and 4) and test for convergence: the goal of clustering is to minimise the total distance sum over all clusters, that is, the deviation J; if the change in J between iterations is less than a preset precision ε, the algorithm has converged and the calculation ends; otherwise return to step 2) and recompute.
3. The full-pulse data lossless compression method based on K-means clustering according to claim 2, characterized in that the number of clusters K is [expression missing in source].
CN201610809393.XA 2016-09-08 2016-09-08 Full-pulse data lossless compression method based on K-means clustering Pending CN106452452A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610809393.XA CN106452452A (en) 2016-09-08 2016-09-08 Full-pulse data lossless compression method based on K-means clustering

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610809393.XA CN106452452A (en) 2016-09-08 2016-09-08 Full-pulse data lossless compression method based on K-means clustering

Publications (1)

Publication Number Publication Date
CN106452452A true CN106452452A (en) 2017-02-22

Family

ID=58165400

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610809393.XA Pending CN106452452A (en) 2016-09-08 2016-09-08 Full-pulse data lossless compression method based on K-means clustering

Country Status (1)

Country Link
CN (1) CN106452452A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108062376A (en) * 2017-12-12 2018-05-22 清华大学 A kind of Time Series Compression storage method and system based on similar operating condition
CN109799483A (en) * 2019-01-25 2019-05-24 中国人民解放军空军研究院战略预警研究所 A kind of data processing method and device
CN109816029A (en) * 2019-01-30 2019-05-28 重庆邮电大学 High-order clustering algorithm based on military operations chain
CN111914923A (en) * 2020-07-28 2020-11-10 同济大学 Target distributed identification method based on clustering feature extraction
CN115622571A (en) * 2022-12-16 2023-01-17 电子科技大学 Radar target identification method based on data processing
CN116582133A (en) * 2023-07-12 2023-08-11 东莞市联睿光电科技有限公司 Intelligent management system for data in transformer production process

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7349914B1 (en) * 2004-05-04 2008-03-25 Ncr Corp. Method and apparatus to cluster binary data transactions
CN101894135A (en) * 2009-06-15 2010-11-24 复旦大学 Method for compressing and storing GPS data based on route clustering
CN103678500A (en) * 2013-11-18 2014-03-26 南京邮电大学 Data mining improved type K mean value clustering method based on linear discriminant analysis
CN104506752A (en) * 2015-01-06 2015-04-08 河海大学常州校区 Similar image compression method based on residual compression sensing
CN104883558A (en) * 2015-06-05 2015-09-02 太原科技大学 K-means clustering based depth image encoding method

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108062376A (en) * 2017-12-12 2018-05-22 清华大学 A kind of Time Series Compression storage method and system based on similar operating condition
CN109799483A (en) * 2019-01-25 2019-05-24 中国人民解放军空军研究院战略预警研究所 A kind of data processing method and device
CN109816029A (en) * 2019-01-30 2019-05-28 重庆邮电大学 High-order clustering algorithm based on military operations chain
CN109816029B (en) * 2019-01-30 2023-12-19 重庆邮电大学 High-order clustering division algorithm based on military operation chain
CN111914923A (en) * 2020-07-28 2020-11-10 同济大学 Target distributed identification method based on clustering feature extraction
CN111914923B (en) * 2020-07-28 2022-11-18 同济大学 Target distributed identification method based on clustering feature extraction
CN115622571A (en) * 2022-12-16 2023-01-17 电子科技大学 Radar target identification method based on data processing
CN116582133A (en) * 2023-07-12 2023-08-11 东莞市联睿光电科技有限公司 Intelligent management system for data in transformer production process
CN116582133B (en) * 2023-07-12 2024-02-23 东莞市联睿光电科技有限公司 Intelligent management system for data in transformer production process

Similar Documents

Publication Publication Date Title
CN106452452A (en) Full-pulse data lossless compression method based on K-means clustering
CN102694625B (en) Polarization code decoding method for cyclic redundancy check assistance
CN112953550B (en) Data compression method, electronic device and storage medium
CN105512289A (en) Image retrieval method based on deep learning and Hash
Vasuki et al. A review of vector quantization techniques
CN104348490A (en) Combined data compression algorithm based on effect optimization
Roychowdhury Quantization and centroidal Voronoi tessellations for probability measures on dyadic Cantor sets
CN113258934A (en) Data compression method, system and equipment
CN107273471A (en) A kind of binary electric power time series data index structuring method based on Geohash
Yang et al. One-dimensional deep attention convolution network (ODACN) for signals classification
CN116170027B (en) Data management system and processing method for poison detection equipment
CN101099669A (en) Electrocardiogram data compression method and decoding method based on optimum time frequency space structure code
CN107947803A (en) A kind of method for rapidly decoding of polarization code
CN109075805A (en) Realize the device and method of polarization code
CN113759323A (en) Signal sorting method and device based on improved K-Means combined convolution self-encoder
CN114665884B (en) Time sequence database self-adaptive lossy compression method, system and medium
CN108023597A (en) A kind of reliability of numerical control system data compression method
Huang et al. Latency reduced method for modified successive cancellation decoding of polar codes
CN105391455A (en) Return-to-zero Turbo code starting point and depth blind identification method
CN115567609B (en) Communication method of Internet of things for boiler
CN115622571B (en) Radar target identification method based on data processing
CN102571101A (en) Transmission line malfunction travelling wave data compression method
CN115883301A (en) Signal modulation classification model based on sample recall increment learning and learning method
CN111797991A (en) Deep network model compression system, method and device
CN108259515A (en) A kind of lossless source compression method suitable for transmission link under Bandwidth-Constrained

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20170222
